clulab / reach

Reach Biomedical Information Extraction

Possible infinite loop in serial-json output #736

Open kwalcock opened 3 years ago

kwalcock commented 3 years ago

As noted in #723, reach seems to hang and never make it through this particular file with this output format, at least until some internal part overflows, positive numbers become negative, and an exception is thrown a day later.

| Exception | Format | Plan | PMCID | Sentence |
| --- | --- | --- | --- | --- |
| java.lang.NegativeArraySizeException | serial-json | unsolved | PMC7176272 | There seems to be an infinite loop somewhere. |

PMC7176272.nxml.txt

kwalcock commented 3 years ago

There seems to be something strange going on with the Antecedents related to the Anaphoric trait. There are very long chains of them. I stopped when they got to 100. When printing output for one such mention, all 100 of its antecedents must be printed. For the 99th, its 98 antecedents must be printed, for the 98th, its 97, etc. This quickly explodes. There is a loop detector now and, if it is trustworthy, there are no loops, but these long chains are probably causing problems. Who knows anything about them? I am counting them as they are being output here: https://github.com/clulab/reach/blob/3e632c5cad9ea93584a629611fde6ed5fe521d0c/main/src/main/scala/org/clulab/reach/mentions/serialization/json/package.scala#L251
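To put a number on "quickly explodes", here is a minimal sketch of the counting argument. It assumes a simplified model that is not confirmed against Reach's code: the k-th mention lists all k earlier mentions as antecedents, and every antecedent is written out in full, including its own antecedents. `Node`, `chain`, `nestedCount`, and `distinctCount` are hypothetical names for illustration, not Reach classes or methods.

```scala
import scala.collection.mutable

object AntecedentBlowup {
  // Hypothetical stand-in for a mention; NOT Reach's actual mention class.
  final class Node(val id: Int, val antecedents: Seq[Node])

  // Build a chain in which node k lists nodes 0 until k as its antecedents.
  // The returned node has id n and therefore n antecedents.
  def chain(n: Int): Node = {
    val nodes = (0 to n).foldLeft(Vector.empty[Node]) { (acc, k) =>
      acc :+ new Node(k, acc)
    }
    nodes.last
  }

  // Number of records a fully nested serialization would write for `node`.
  // The count is memoized by id so that computing it stays cheap, even
  // though the serialization it describes would not be.
  def nestedCount(node: Node, memo: mutable.Map[Int, BigInt] = mutable.Map.empty): BigInt =
    memo.get(node.id) match {
      case Some(count) => count
      case None =>
        val count = BigInt(1) + node.antecedents.map(nestedCount(_, memo)).sum
        memo(node.id) = count
        count
    }

  // Number of records if each mention were written once and antecedents
  // were referred to by id only (an assumed alternative, not Reach's format).
  def distinctCount(node: Node): Int = {
    val seen = mutable.Set.empty[Int]
    def visit(n: Node): Unit =
      if (seen.add(n.id)) n.antecedents.foreach(visit)
    visit(node)
    seen.size
  }

  def main(args: Array[String]): Unit = {
    val top = chain(100)
    println(s"nested records:   ${nestedCount(top)}")
    println(s"distinct records: ${distinctCount(top)}")
  }
}
```

Under this model a mention with n antecedents costs 2^n nested records, so a chain of 100 can never finish, while writing each mention once would need only 101 records. Whether the real chains have exactly this shape is an open question in this thread.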

MihaiSurdeanu commented 3 years ago

This was probably done by @danebell. @danebell: any chance you can look into this probable infinite loop thing? Thank you!

herongrove commented 3 years ago

Absolutely. Let me take a look at it and get back to you.

kwalcock commented 3 years ago

@danebell, please see branch kwalcock-loop and the test TestLoop (which I just updated). It should access PMC7176272.nxml, which is one of the test resources. It is a large file in which something like 4000 mentions are found. Finding them is not a big problem, but outputting them takes overnight and, it seems, breaks when a buffer exceeds 2 GB. I can figure it out eventually, but if someone has a large head start, they should be consulted.
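The 2 GB limit and the java.lang.NegativeArraySizeException reported at the top of this issue are consistent with a signed 32-bit size wrapping around, since JVM arrays are indexed by `Int`. The sketch below only illustrates that mechanism. The doubling growth policy and the names `grow` and `allocate` are assumptions for illustration; which buffer inside the serializer actually overflows has not been identified here.

```scala
object OverflowDemo {
  // Hypothetical growth policy: double the capacity each time the buffer fills.
  // Int arithmetic wraps silently, so doubling 2^30 yields a negative number.
  def grow(capacity: Int): Int = capacity * 2

  // Try to allocate a byte array of the requested size and report the
  // exception class name if the size is negative.
  def allocate(capacity: Int): Either[String, Array[Byte]] =
    try Right(new Array[Byte](capacity))
    catch { case e: NegativeArraySizeException => Left(e.getClass.getName) }

  def main(args: Array[String]): Unit = {
    val oneGiB = 1 << 30
    val next = grow(oneGiB) // wraps to Int.MinValue
    println(s"requested capacity: $next")
    println(s"allocation result:  ${allocate(next).left.getOrElse("ok")}")
  }
}
```

This would explain why the failure shows up as an exception only after a very long run: the output has to grow past what an `Int` can index before anything goes visibly wrong.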