clulab / processors

Natural Language Processors
https://clulab.github.io/processors/
Apache License 2.0

processors hangs on sentences with long lists of numbers #813

Open kwalcock opened 6 days ago

kwalcock commented 6 days ago

I'm not yet sure where this happens. It may be that a rule includes a regular expression that can match in too many ways. The program usually stalls for a very long time and may eventually run out of memory and crash. Here are example "sentences":

Year/quarter expatriate permanent contract casual total   2005/1 0 1,565 25 183 1,773  2005/2 0 1,561 29 420 2,010  2005/3 0 1,549 36 178 1,763  2005/4 0 1,524 45 213 1,782  2006/1 0 1,471 61 347 1,879  2006/2 0 1,250 270 342 1,862  2006/3 3 1,177 281 237 1,698  2006/4 3 1,190 276 217 1,686  2007/1 3 1,198 301 237 1,739  2007/2 2 1,213 294 215 1,724  2007/3 1 1,253 329 274 1,857  2007/4 1 1,281 279 222 1,783  2008/1 1 1,277 290 249 1,817  2008/2 2 1,292 289 232 1,815  2008/3 2 1,299 265 237 1,803  2008/4 1 1,277 283 230 1,791  2009/1 3 1,288 249 232 1,772  2009/2 3 1,288 245 231 1,767  2009/3 3 1,291 247 239 1,780  2009/4 15 1,273 264 239 1,791  2010/1 15 1,234 251 244 1,744  2010/2 16 1,251 250 244 1,761  .
2006/1 289,874 24,109 812 126 328  2006/2 291,724 24,099 836 130 314  2006/3 295,418 24,870 859 134 314  2006/4 296,702 24,870 870 139 315  2007/1 275,947 23,980 899 143 299  2007/2 279,439 24,922 933 148 314  2007/3 274,855 24,715 936 152 312  2007/4 277,393 24,602 954 161 334  2008/1 274,106 23,553 1,017 166 361  2008/2 276,447 23,627 1,034 170 346  2008/3 275,599 23,331 1,056 169 342  2008/4 276,255 20,484 864 159 291  2009/1 282,520 22,382 884 174 323  2009/2 289,504 22,382 913 184 316  2009/3 282,194 22,243 918 189 233  .
0 11 481 339 263 188 404 210 182 137 785 1,052 449 977 523 1,308 1,043 1,180  273 1,519 754 3,852 1,820 1,150 421 419 138 3,320 1,228 924 1,263 1,474 4,008 6,196 1,390 3,625   98 100 107 240 1,590 37 67 114 31 325 459 1,694 1,858 4,982 3,427 2,859 2,386  .
Global Delegations Rice Total Rice Dagana Lake Podor Matam Bakel Total Winter 2019_20 difference 23,700 2,900 8,200 10,000 1330 46 130 35 345 10 785 23,700 800 6500 8,500 500 40,000 30,000 10,000 17,030 1,726 6,132 6 610 146 31 644 29 886 1,758 0 0 63 61 65 189 226 - 37 0 0 0 73 66 139 190 - 51 1 57 106 27 11 203 209 - 6 8 1,135 0 0 0 1,143 1,652 - 509 0 134 0 1 9 145 594 - 449 13 1,536 95 41 105 1,790 2 207 - 417 17,052 4,588 6,396 6 814 403 35,253 34 963 72% 158% 78% 68% 30% 76% 99% - 22.
Delegations Rice Okra Potato Peanut Others Total Global Of which rice Dagana Lake Podor Matam Bakel Total SSC 2019 difference 37,250 4,350 10,550 3,075 840 54 815 54,400 37,250 1,800 9,000 1,700 250 50,000 50,000 36,641 2 271 8,102 2 189 59 49,262 45 947 3,315 75 0 115 50 42 282 105 177 4 76 0 0 0 79 20 59 0 108 34 1 0 143 8 135 54 407 209 103 37 810 380 429 36,774 2,862 8,461 2 342 137 50,577 46,461 4,116 99% 78% 80% 84% 23% 92% 85% 7.

kwalcock commented 1 day ago

It looks like measurement-3 might be the problematic rule. I see *)+ in it, that is, a starred element inside a group that is itself repeated. I think that Marco's algorithm doesn't backtrack, but it must keep track of all the possibilities, and maybe too many are piling up.
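
To illustrate why a repeated group wrapped around a repeated element could pile up possibilities on a long run of numbers, here is a minimal sketch in Python. It is not the measurement-3 rule and not the matcher's actual algorithm. It assumes a hypothetical pattern shaped like (NUM+)+, where every iteration of the outer group consumes at least one number token, and simply counts how many distinct ways such a pattern can carve up a run of n tokens. The function name count_parses is made up for this example.

```python
def count_parses(n: int) -> int:
    """Count the ways a hypothetical pattern shaped like (NUM+)+ can cover
    a run of n consecutive number tokens.

    Assumption: each outer iteration consumes at least one token, so a
    "way" is an ordered split of the run into non-empty consecutive groups.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # ways[i] = number of distinct splits of the first i tokens
    ways = [1] + [0] * n  # ways[0] = 1: the empty prefix has one (empty) split
    for i in range(1, n + 1):
        # The last group takes k tokens (1 <= k <= i); the rest is split freely.
        ways[i] = sum(ways[i - k] for k in range(1, i + 1))
    return ways[n]


if __name__ == "__main__":
    for n in (5, 10, 20, 40):
        print(n, count_parses(n))
```

Under that assumption the count doubles with every additional token (it equals 2^(n-1)), so a matcher that has to keep every alternative alive, rather than backtracking through them one at a time, would plausibly run out of memory on sentences like the ones above. Whether this is what actually happens in measurement-3 still needs to be confirmed.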