clulab / reach

Reach Biomedical Information Extraction

Multiple problems with the indexcard output format #723

Closed enoriega closed 3 years ago

enoriega commented 3 years ago

There are a few bugs in the indexcard output format, which I think stem from changes made to the data structures after the output format was created.

Below are all the exceptions that appear in the log after a few days running.

java.lang.IllegalArgumentException: requirement failed: Controllers of an Activation must be Entities!
java.lang.NegativeArraySizeException
java.lang.reflect.InvocationTargetException
java.lang.RuntimeException: ERROR: argument type 'event' not supported!
java.lang.RuntimeException: ERROR: event type conversion not supported!
java.lang.RuntimeException: ERROR: unknown event type: Disease in event:
java.lang.RuntimeException: ERROR: unknown event type: Family in event:
java.lang.RuntimeException: ERROR: unknown event type: Gene_or_gene_product in event:
java.lang.RuntimeException: ERROR: unknown event type: Simple_chemical in event:
java.util.NoSuchElementException: key not found: controlled
java.util.NoSuchElementException: key not found: controller
java.util.NoSuchElementException: key not found: theme
java.util.NoSuchElementException: next on empty iterator

Please find the log file and a couple papers to reproduce it attached to the issue. error.log PMC4543788.nxml.txt PMC5809884.nxml.txt PMC6086911.nxml.txt

@MihaiSurdeanu Since this output format doesn't seem relevant today, are the errors worth fixing?

MihaiSurdeanu commented 3 years ago

This is a format that really nobody uses anymore. I propose to remove it. @kwalcock : can you please do it when you get a chance?

kwalcock commented 3 years ago

Yes. Not having looked at it yet, I wonder whether it is easier to fix than to remove. Last time something was removed, it had to be added back. However, I can certainly follow instructions.

MihaiSurdeanu commented 3 years ago

Either way... But, for historic background, this was a format that was used in an early DARPA eval and was abandoned afterward.

kwalcock commented 3 years ago

Here are some more details about the exceptions that were thrown. Some don't seem connected to the output at all; they occurred earlier, during what I call "reading" here. Others seem to come from non-indexcard formats.

| Exception | Format | Plan |
| --- | --- | --- |
| java.lang.IllegalArgumentException: requirement failed: Controllers of an Activation must be Entities! | fries | |
| java.lang.NegativeArraySizeException | serial-json | |
| java.lang.reflect.InvocationTargetException | reading | |
| java.lang.RuntimeException: ERROR: argument type 'event' not supported! | indexcard | |
| java.lang.RuntimeException: ERROR: event type conversion not supported! | indexcard | Convert error to warning |
| java.lang.RuntimeException: ERROR: unknown event type: Disease in event: | indexcard | Convert error to warning |
| java.lang.RuntimeException: ERROR: unknown event type: Family in event: | indexcard | Convert error to warning |
| java.lang.RuntimeException: ERROR: unknown event type: Gene_or_gene_product in event: | indexcard | Convert error to warning |
| java.lang.RuntimeException: ERROR: unknown event type: Simple_chemical in event: | indexcard | Convert error to warning |
| java.util.NoSuchElementException: key not found: controlled | reading | |
| java.util.NoSuchElementException: key not found: controller | reading | |
| java.util.NoSuchElementException: key not found: theme | reading & fries | |
| java.util.NoSuchElementException: next on empty iterator | cmu | |

MihaiSurdeanu commented 3 years ago

Hmm. Some of these seem to be errors in the format code. Some are legit exceptions in the data that should be handled. To make this more manageable to fix, @enoriega: can you please create a unit test for each of these exceptions, ideally using a single sentence per test? Then I can take a look at each, and hopefully either fix them or tell you what needs to be done.

Thanks!

kwalcock commented 3 years ago

If I had had the corresponding files, I would have already volunteered to do it. Feel free to reassign.

enoriega commented 3 years ago

I updated the table with a reference to a file for each error kind. The attached zip contains all the referenced input files. I'll get some example sentences for each error.

| Exception | Format | Plan | PMCID | Sentence |
| --- | --- | --- | --- | --- |
| java.lang.IllegalArgumentException: requirement failed: Controllers of an Activation must be Entities! | fries | | PMC4265014 | N/A |
| java.lang.NegativeArraySizeException | serial-json | | PMC7176272 | |
| java.lang.reflect.InvocationTargetException | reading | | PMC7040422 | |
| java.lang.RuntimeException: ERROR: argument type 'event' not supported! | indexcard | | PMC3822968 | |
| java.lang.RuntimeException: ERROR: event type conversion not supported! | indexcard | Convert error to warning | PMC3822968 | |
| java.lang.RuntimeException: ERROR: unknown event type: Disease in event: | indexcard | Convert error to warning | PMC6539695 | |
| java.lang.RuntimeException: ERROR: unknown event type: Family in event: | indexcard | Convert error to warning | PMC5327768 | |
| java.lang.RuntimeException: ERROR: unknown event type: Gene_or_gene_product in event: | indexcard | Convert error to warning | PMC5985311 | |
| java.lang.RuntimeException: ERROR: unknown event type: Simple_chemical in event: | indexcard | Convert error to warning | PMC6213605 | |
| java.util.NoSuchElementException: key not found: controlled | reading | | PMC5504966 | Bacteria in the human gut can produce hydrogen gas , and hydrogen can be converted to methane in the gut by methane producing bacteria [ 15 ] . |
| java.util.NoSuchElementException: key not found: controller | reading | | PMC5809884 | ( 2 ) Noise exposure led to enhanced JNK phosphorylation and IRS1 serine phosphorylation as well as reduced Akt phosphorylation in skeletal muscles in response to exogenous insulin stimulation . |
| java.util.NoSuchElementException: key not found: theme | reading & fries | | PMC6940835 | Activated ANP is a peptide hormone consisting of 28 amino acids that binds to NPR1 , a receptor in target organs such as the kidneys and peripheral blood vessels , converting intracellular GTP into cGMP to promote the excretion of Na , inhibit Na reuptake , and induce vasodilation [ 16,17 ] . |
| java.util.NoSuchElementException: next on empty iterator | cmu | | PMC6681624 | N/A |

nxml.zip

kwalcock commented 3 years ago

Thank you. I'll get to them soon.

enoriega commented 3 years ago

Thanks @kwalcock. Some comments:

- The errors referred to by the rows with N/A in the sentence column are not triggered by a single sentence but by the assembly procedure, which I believe is a form of aggregation over multiple interactions. The corresponding documents as a whole trigger the error.
- For the rows where the plan is to convert the error to a warning, I didn't bother to find a sentence.
- For the rows where the sentence field is empty, I haven't been able to reproduce the error yet. Maybe some of the most recent changes fixed them, but I am still trying to locate a culprit.

kwalcock commented 3 years ago

I'll update this as they are figured out.

| Exception | Format | Plan | PMCID | Sentence |
| --- | --- | --- | --- | --- |
| java.lang.IllegalArgumentException: requirement failed: Controllers of an Activation must be Entities! | fries | Allow Regulations as controllers of an Activation | PMC4265014 | Related to these sentences: ADP promotes platelet activation through its receptors (P2Y1 and P2Y12. A novel finding is that nifedipine greatly inhibits the release of PPAR-β/-γ from activated platelets, thereby increasing the intracellular availability of PPAR-β/-γ which may enhance its cellular functions like the regulation of platelet activation. |
| java.util.NoSuchElementException: key not found: theme | fries | Just return the BioEventMention itself if there is no theme. | PMC6940835 | Activated ANP is a peptide hormone consisting of 28 amino acids that binds to NPR1 , a receptor in target organs such as the kidneys and peripheral blood vessels , converting intracellular GTP into cGMP to promote the excretion of Na , inhibit Na reuptake , and induce vasodilation [ 16,17 ] . |
| java.lang.NegativeArraySizeException | serial-json | unsolved | PMC7176272 | There seems to be an infinite loop somewhere. |
| java.lang.RuntimeException: ERROR: argument type 'event' not supported! | indexcard | Convert error to warning | PMC3822968 | The platelet glycoprotein Ibα (GPIbα) and P-selectin glycoprotein ligand (PSGL-1) receptors bind to the endothelial P-selectin initiating platelet rolling, whereas the subsequent firm adhesion is mediated through αIIbβ3 integrin and P-selectin. |
| java.lang.RuntimeException: ERROR: unknown event type: Disease in event: | indexcard | Convert error to warning | PMC6539695 | Among them, nucleotide anti-mir21 drugs inhibit colon cancer cell metastasis up-regulating PDCD4-protein levels in in vitro experiments [100]. |
| java.lang.RuntimeException: ERROR: unknown event type: Family in event: | indexcard | Convert error to warning | PMC5327768 | TGF-β signaling proceeds through two pathways, canonically through Smad7 to activate the Smad2/3 and Smad4 binding to activate transcription and a Smad independent pathway that proceeds through the p38 MAPK and JNK [52,53]. |
| java.lang.RuntimeException: ERROR: unknown event type: Gene_or_gene_product in event: | indexcard | Convert error to warning | PMC5985311 | Inhibition of α-klotho using a neutralizing antibody specifically blocks FGF23-mediated activation of AKT/eNOS and consequently the release of NO. |
| java.lang.RuntimeException: ERROR: unknown event type: Simple_chemical in event: | indexcard | Convert error to warning | PMC6213605 | |
| java.util.NoSuchElementException: next on empty iterator | cmu | Use "NONE" as mechanism type when there is no evidence | PMC6681624 | Ankrd2 expression in the heart is potentially regulated by cardiac specific transcription factors Nkx2.5 , Hand2 , and Ankrd1 as demonstrated by their interaction with the ANKRD2 promoter or by dual luciferase assay [ 20 , 21 ] . |
| java.lang.reflect.InvocationTargetException | reading | Catch exception generated when trigger has no head | PMC7040422 | Other mechanisms involved in asthma physiopathology are the inhalation of drugs , as well as respiratory viruses [ 8] , which promote an immune response mediated by IgG antibodies . |
| java.util.NoSuchElementException: key not found: controlled | reading | Account for missing controller and controlled | PMC5504966 | Bacteria in the human gut can produce hydrogen gas , and hydrogen can be converted to methane in the gut by methane-producing bacteria [ 15 ] . |
| java.util.NoSuchElementException: key not found: controller | reading | Account for missing controller and controlled | PMC5809884 | ( 2 ) Noise exposure led to enhanced JNK phosphorylation and IRS1 serine phosphorylation as well as reduced Akt phosphorylation in skeletal muscles in response to exogenous insulin stimulation . |

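
For illustration only, two of the plans in the table ("Account for missing controller and controlled" and "Use "NONE" as mechanism type when there is no evidence") might look roughly like the sketch below. This is not the Reach code: the class `DefensiveLookups`, both method names, and the map-of-lists shape for arguments are assumptions made for the example.

```java
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Optional;

// Hypothetical helpers; names and data shapes are assumptions, not Reach APIs.
final class DefensiveLookups {

    // Return the first value stored under a role such as "controller" or
    // "controlled", or an empty Optional when the role is missing, instead of
    // letting a failed lookup surface as a NoSuchElementException.
    static Optional<String> argument(Map<String, List<String>> arguments, String role) {
        List<String> values = arguments.get(role); // null when the key is absent
        if (values == null || values.isEmpty()) {
            return Optional.empty();
        }
        return Optional.ofNullable(values.get(0));
    }

    // Fall back to "NONE" rather than calling next() on an empty iterator.
    static String mechanismType(Iterator<String> evidence) {
        return evidence.hasNext() ? evidence.next() : "NONE";
    }
}
```

Callers then have to decide explicitly what to do with an empty result, for example skipping the mention and logging a warning, which is in line with the "convert error to warning" plan for the indexcard rows.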
kwalcock commented 3 years ago

This java.lang.NegativeArraySizeException for PMC7176272 is very suspicious. It doesn't occur anywhere near any of our code that could be subtracting wrong. The input file of 300KB takes a very, very long time to process. My computer ran overnight and I see in the log that Enrique worked on it for 22 hours. When I paused it periodically I noticed that the stack was very, very long. It looked like there was about one stack frame for every single one of some 4000+ mentions and it was building up some monster json structure. I couldn't easily tell if there was some kind of loop, but I wonder if there are some Mentions linked to each other in a circle. In generating the output there are buffers involved which are resizing. If something is resized to Integer.MAX_VALUE + 1, which is only 2,147,483,648 or 2GB, this exception can be thrown. I think the program is trying to build 2GB of json output in a string. It might take all night to do that. Has something like this happened before? I'll accept hints that anyone can offer before looking again.
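
The arithmetic behind that guess can be reproduced in isolation. The sketch below is not taken from Reach or from any JSON library; `BufferOverflowDemo` and its doubling rule are assumptions chosen only to show how a growing `int` capacity wraps around to a negative number and how a negative size then produces a `NegativeArraySizeException`.

```java
// Hypothetical demo class; the doubling rule is an assumption about how a
// growable buffer might compute its next capacity.
final class BufferOverflowDemo {

    // Doubling an int capacity wraps silently once the result exceeds
    // Integer.MAX_VALUE, yielding a negative number.
    static int doubledCapacity(int capacity) {
        return capacity * 2;
    }

    // Try to allocate an array of the requested size and report whether the
    // JVM rejected it because the size was negative.
    // Only call this with small or negative sizes.
    static boolean allocationFails(int size) {
        try {
            byte[] buffer = new byte[size];
            return buffer.length != size; // always false for a valid size
        } catch (NegativeArraySizeException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        int oneGiB = 1 << 30;                   // 1,073,741,824
        int next = doubledCapacity(oneGiB);     // wraps to a negative value
        System.out.println("next capacity: " + next);
        System.out.println("allocation fails: " + allocationFails(next));
    }
}
```

Because the wrapped value is negative, the allocation is rejected before any memory is requested, so the exception can show up far from the code that produced the oversized output.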

enoriega commented 3 years ago

What you say sounds plausible, and I think this corner case is too bizarre to be worth fixing. We can instead keep this note in a "Knowledge Base" somewhere in the wiki in case it happens again.

kwalcock commented 3 years ago

I haven't yet noticed in the serialization code anything that is looking out for loops, like a list of already visited Mentions being passed around. Perhaps a short unit test can at least show what would result if that were ever to happen.
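
As a rough idea of what such a test could exercise, here is a minimal sketch of a serializer that passes a set of already visited objects around. Everything in it is hypothetical: `Node` is a stand-in for a Mention, and the output format is invented for the example rather than taken from the serial-json format.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

// Hypothetical sketch; not the Reach serialization code.
final class CycleSafeSerializer {

    // Stand-in for a Mention: a label plus the other nodes it points to.
    static final class Node {
        final String label;
        final List<Node> arguments = new ArrayList<>();
        Node(String label) { this.label = label; }
    }

    static String serialize(Node root) {
        // Identity-based set, so two distinct nodes with equal labels are
        // still treated as different objects.
        Set<Node> visited = Collections.newSetFromMap(new IdentityHashMap<>());
        StringBuilder out = new StringBuilder();
        write(root, visited, out);
        return out.toString();
    }

    private static void write(Node node, Set<Node> visited, StringBuilder out) {
        // add() returns false if the node was already written: emit a short
        // reference instead of recursing again.
        if (!visited.add(node)) {
            out.append("{\"ref\":\"").append(node.label).append("\"}");
            return;
        }
        // Labels are assumed to need no JSON escaping in this sketch.
        out.append("{\"label\":\"").append(node.label).append("\",\"arguments\":[");
        for (int i = 0; i < node.arguments.size(); i++) {
            if (i > 0) out.append(',');
            write(node.arguments.get(i), visited, out);
        }
        out.append("]}");
    }
}
```

With two nodes that point at each other, the recursion stops at the second visit and the output stays finite. Since the set is shared across the whole traversal, a node reachable along several paths is also written in full only once, which bounds the output size even when there is no true cycle.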

MihaiSurdeanu commented 3 years ago

I have seen this in the past, but very infrequently... I agree that this sounds like an infinite loop, but I'm not sure where it's coming from...

kwalcock commented 3 years ago

Addressed by PR #724 with one moved to issue #736.