clulab / reach

Reach Biomedical Information Extraction

RunReachCLI gets stuck with specific nxml files #783

Open guerrerosimonl opened 1 year ago

guerrerosimonl commented 1 year ago

I'm wondering if this is a bug. I have been running a batch of nxml files (1240 of them) with the standard procedure described at https://github.com/clulab/reach/wiki/Running-Reach. 1224 of the papers run successfully (I re-tested a couple of them), but the remaining 16 keep running for hours and never let the process finish (I also tried them individually).

The 16 files are here: https://we.tl/t-1DHGfBjKc9

kwalcock commented 1 year ago

Someone saw this and will try to replicate.

kwalcock commented 1 year ago

I took the smallest of the files (PMC8589633.nxml), divided it up into even smaller parts, and ran them. They all finished eventually, after something like 24 hours. That particular file did not get stuck in an infinite loop or anything; it's just slow. That's partly because around 7425 mentions are found, but mostly because of some inefficient code for the serial-json format. If you don't happen to need that format, removing it from the list is an easy solution. If that's not possible and you have time, you might wait it out. Be sure to allow Java lots of memory, probably more than 10GB, so that it doesn't spend too much time garbage collecting. A third alternative is to wait for some more efficient code, which is discussed below.

kwalcock commented 1 year ago

@enoriega, the main problem seems to be in org/clulab/reach/mentions/serialization/json/package.scala, where in the process of producing IDs, things like BioTextBoundMentionOps run TextBoundMentionOps(tb).jsonAST, which calls into processors code to calculate document.equivalenceHash. A hash for an entire Document is a major undertaking. The processors code also calculates the equivalenceHash of the Mention itself, which in turn calls document.equivalenceHash again. Then BioTextBoundMentionOps replaces the id in the json with yet another calculation of the Mention's ID, which again calculates the document's equivalenceHash. That means the same hash value is computed at least 3 times per Mention, and more if that Mention is related to other mentions as triggers or arguments. There are around 7500 Mentions in the Document, so this can take a very long time.

Processors can't in general know that the Document hasn't changed between serializations of Mentions, so the recalculation is partially justified. Reach, however, knows that all the mentions are being serialized at the end of processing, with no further changes expected to the Document. I believe a cache of document equivalenceHashes can be stored there so that values can be reused. Some code may have to be copied over from processors in order to achieve this. (Maybe not, if some related changes to processors go through.) I'll assign this to myself if nobody objects.
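A minimal sketch of the kind of cache I have in mind, keyed on Document identity. `computeEquivalenceHash` here is a hypothetical stand-in for whatever processors actually does internally, not an existing method:

```scala
import java.util.IdentityHashMap

import org.clulab.processors.Document

// Sketch only: memoize a Document's equivalence hash so that serializing
// thousands of Mentions from the same (now frozen) Document reuses one value.
object DocumentHashCache {
  private val cache = new IdentityHashMap[Document, Integer]()

  // computeEquivalenceHash is hypothetical; the real logic would be copied
  // from (or exposed by) processors.
  def equivalenceHash(doc: Document)(computeEquivalenceHash: Document => Int): Int =
    synchronized {
      Option(cache.get(doc)).map(_.intValue).getOrElse {
        val hash = computeEquivalenceHash(doc)
        cache.put(doc, hash)
        hash
      }
    }
}
```

The serialization code would then look the hash up through the cache instead of recomputing it for every Mention (and again for every trigger and argument).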

FYI @MihaiSurdeanu

kwalcock commented 1 year ago

Some of the larger files, if they don't hang, will eventually crash because they generate strings over 2GB in length. That is being looked into. A workaround is to divide the input files into smaller documents.

bgyori commented 1 year ago

Thanks @kwalcock for working on this! I just wanted to chime in and say that, based on my prior interactions with @guerrerosimonl, I suspect only the fries output is needed, so @kwalcock's remark that "If you don't happen to need that format, removing it from the list is an easy solution" (from this list specifically: https://github.com/clulab/reach/blob/master/main/src/main/resources/application.conf#L40) likely applies here.
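For reference, the change would look roughly like this in main/src/main/resources/application.conf; I'm assuming the linked list is the outputTypes setting, and the exact default entries may differ:

```hocon
# Keep only the fries output and drop the slow serial-json serializer
outputTypes = ["fries"]
```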

kwalcock commented 1 year ago

Thanks for the tip @bgyori. If the fries output suffices, that's the more expedient solution. Nevertheless, I hope to have the json output working efficiently before too long.