clulab / reach

Reach Biomedical Information Extraction

With the Arizona and CMU formats enabled, reach drops the ball #716

Closed kwalcock closed 3 years ago

kwalcock commented 3 years ago

Regarding #691, @enoriega reports

I think there is some sneaky bug in there.

After enabling the Arizona and CMU formats, ReachCLI is picking up papers, generating the fries output, and then apparently dropping the ball without producing the files for the other output formats. It doesn't report any exceptions or stack traces.

The log file reports starting the papers but not finishing them. Can someone double check this?

This is the line I have in application.conf in case having these formats simultaneously is what triggers the problem:

outputTypes = ["fries", "serial-json", "text", "indexcard", "arizona", "cmu"]

MihaiSurdeanu commented 3 years ago

Thanks!

enoriega commented 3 years ago

@kwalcock Let me know if I can provide data to reproduce this

kwalcock commented 3 years ago

You aren't running afoul of the restart mechanism, are you? It was turned on by default not too long ago (May, this year). It looks for restart.log in the output directory and skips files recorded there. If you ran once with the fries output successfully, added some formats, and ran again, the second run would not reprocess the files already done unless restart.log was first erased.

The json output is not working for me, but the other formats are generated if I'm sure to remove restart.log or turn that feature off in application.conf.
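The restart behavior described above can be sketched roughly as follows. This is only an illustration, not Reach's actual code: the object `RestartSketch`, its method names, and the assumption that restart.log holds one finished file name per line are all hypothetical.

```scala
import java.nio.file.{Files, Path}
import scala.io.Source

// Hypothetical sketch of a restart filter; not taken from the Reach code base.
object RestartSketch {
  // Read the names recorded in restart.log, if the file exists.
  // Assumption: one completed file name per line.
  def alreadyDone(outputDir: Path): Set[String] = {
    val log = outputDir.resolve("restart.log")
    if (!Files.exists(log)) Set.empty[String]
    else {
      val source = Source.fromFile(log.toFile, "UTF-8")
      try source.getLines().map(_.trim).filter(_.nonEmpty).toSet
      finally source.close()
    }
  }

  // Papers still to be processed: everything not recorded in restart.log.
  def remaining(papers: Seq[String], outputDir: Path): Seq[String] = {
    val done = alreadyDone(outputDir)
    papers.filterNot(done.contains)
  }
}
```

Under these assumptions, a paper recorded during a fries-only run is filtered out of the second run, so newly added formats are never written for it until restart.log is removed.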

kwalcock commented 3 years ago

It's really cool that we can output in Friesian!

enoriega commented 3 years ago

@kwalcock I am aware of the restart mechanism. So far I am reporting this issue only for files that haven't been processed yet.

After the fixes in kwalcock-fixes, I noticed that some nxml files are processed correctly and all of their output files are generated. Others crash at the time of JSON serialization (as in json4s, not necessarily REACH's JSON format) with match errors, like:

scala.MatchError: OEtrigger(org.clulab.reach.mentions.BioTextBoundMention@3b45b692) (of class org.clulab.reach.mentions.OEtrigger)

This is a very prevalent error in the log file. For the class of files that report this error, only the FRIES output is generated.

My educated guess about this is:

New trigger subclasses were added after the serialization code. A match statement somewhere is not comprehensive because it misses the patterns that correspond to the new sub-trigger types. Ergo, an exception is raised and propagates all the way up to ReachCLI, which recovers from it and moves on to the next paper without finishing writing the other output formats.
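A minimal sketch of the failure mode guessed at above. The types `Mod`, `PtmMod`, and `NewMod` and the object `MatchErrorSketch` are hypothetical stand-ins, not REACH's actual classes or serialization code.

```scala
// Hypothetical stand-ins for a modification hierarchy; names are illustrative only.
sealed trait Mod
case class PtmMod(label: String) extends Mod
case class NewMod(label: String) extends Mod // added later, never handled below

object MatchErrorSketch {
  // Serializer written before NewMod existed: the match is not exhaustive.
  // @unchecked silences the compiler warning that would otherwise flag this.
  def serialize(mod: Mod): String = (mod: @unchecked) match {
    case PtmMod(label) => s"PTM:$label"
  }

  // Mimics a per-paper step that recovers from the failure and moves on,
  // so outputs that depend on serialization are never written.
  def processPaper(mods: Seq[Mod]): Either[String, Seq[String]] =
    try Right(mods.map(serialize))
    catch { case e: MatchError => Left(s"scala.MatchError: ${e.getMessage}") }
}
```

In this sketch, a paper containing only `PtmMod` values serializes normally, while one containing a `NewMod` ends in a `scala.MatchError` that is caught one level up, which matches the pattern of some papers producing all outputs and others only the earlier ones.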

I attach the log and a file that triggers the error.

PMC5668567.nxml.txt error.log

kwalcock commented 3 years ago

Thanks for the files. I can confirm a crash.

MihaiSurdeanu commented 3 years ago

@kwalcock: please let me know if you need my help on this. A small reproducible test case would help.

kwalcock commented 3 years ago

The diagnosis seems to be correct. There were triggers/Modifications added as far back as July 2019 which do not have proper serialization support. IntelliJ does very poorly with multiple package objects with implicit conversions, so things were a little confusing. There should be a fix shortly.

kwalcock commented 3 years ago

The most recent change to #719 may take care of it. I'm still going to try for a less wordy version, though.

MihaiSurdeanu commented 3 years ago

Thanks @kwalcock !

enoriega commented 3 years ago

@kwalcock I can confirm I no longer see those errors in the log. I think this is safe to close. Thanks!