clulab / reach

Reach Biomedical Information Extraction
Other
97 stars 39 forks source link

Catch errors #724

Closed kwalcock closed 3 years ago

kwalcock commented 3 years ago

Don't allow error in one format to block other formats

MihaiSurdeanu commented 3 years ago

A step in the right direction 👍

kwalcock commented 3 years ago

What should happen with a "conversion" event type or a "Gene_or_gene_product" label? The former is with the obsolete indexcard output but the other is not. The complete label is

"labels" : [ "Gene_or_gene_product", "MacroMolecule", "Equivalable", "BioChemicalEntity", "BioEntity", "Entity", "PossibleController" ],

and the JSON looks like

{
  "type" : "TextBoundMention",
  "id" : "T:-371853549",
  "text" : "MCP-4",
  "labels" : [ "Gene_or_gene_product", "MacroMolecule", "Equivalable", "BioChemicalEntity", "BioEntity", "Entity", "PossibleController" ],
  "tokenInterval" : {
    "start" : 0,
    "end" : 1
  },
  "characterStartOffset" : 22662,
  "characterEndOffset" : 22667,
  "sentence" : 126,
  "document" : "-1204715480",
  "keep" : true,
  "foundBy" : "ner-gene_or_gene_product-entities"
}
MihaiSurdeanu commented 3 years ago

What's the input text for this output?

kwalcock commented 3 years ago

The particular sentence is "MCP-4 signaling induces the release of histamine from IL-3-primed basophils and activates eosinophil respiratory burst." The JsonOutputter doesn't know how to convert the Gene_or_gene_product into an event type and I don't either.

MihaiSurdeanu commented 3 years ago

There are 3 events in this output:

EVENTS:   3

MENTION TEXT:  MCP-4 signaling induces the release of histamine from IL-3-primed basophils and activates eosinophil respiratory burst
LABELS:        List(Positive_activation, ActivationEvent, ComplexEvent, Event, PossibleController)
DISPLAY LABEL: Positive_activation
    ------------------------------
    RULE => Positive_activation_syntax_1_verb
    TYPE => CorefEventMention
    ------------------------------

    ------------------------------
    TRIGGER => activates
    controlled (List(BioProcess, BioEntity, Entity, PossibleController)) => respiratory burst
    controller (List(Positive_regulation, Regulation, ComplexEvent, Event, PossibleController)) => MCP-4 signaling induces the release of histamine
    CONTEXT: TissueType => List(tissuelist:TS-0279)
    CONTEXT: CellType => List(cl:CL:0000767)
    ------------------------------

MENTION TEXT:  MCP-4 signaling induces the release of histamine
LABELS:        List(Positive_regulation, Regulation, ComplexEvent, Event, PossibleController)
DISPLAY LABEL: Positive_regulation
    ------------------------------
    RULE => Positive_regulation_syntax_1_verb
    TYPE => CorefEventMention
    ------------------------------

    ------------------------------
    TRIGGER => induces
    controlled (List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController)) => MCP-4
    controller (List(Secretion, IncreaseAmount, Amount, SimpleEvent, Event, PossibleController)) => release of histamine
    CONTEXT: TissueType => List(tissuelist:TS-0279)
    CONTEXT: CellType => List(cl:CL:0000767)
    ------------------------------

MENTION TEXT:  release of histamine
LABELS:        List(Secretion, IncreaseAmount, Amount, SimpleEvent, Event, PossibleController)
DISPLAY LABEL: Secretion
    ------------------------------
    RULE => secretion_1
    TYPE => CorefEventMention
    ------------------------------

    ------------------------------
    TRIGGER => release
    theme (List(Simple_chemical, BioChemicalEntity, BioEntity, Entity, PossibleController)) => histamine
    CONTEXT: TissueType => List(tissuelist:TS-0279)
    CONTEXT: CellType => List(cl:CL:0000767)
    ------------------------------

do you know which one crashes the output JSON?

kwalcock commented 3 years ago

It must be working on the second one with

controlled (List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController)) => MCP-4

and it looks like this Gene_or_gene_product isn't something new, which I was thinking, but something with a history like

// FIXME: Why is this Gene_or_gene_product?  What about Protein, GENE, etc.?

None of this completely crashes reach. The particular output format is skipped for the particular paper.

MihaiSurdeanu commented 3 years ago

Right. But there are 3 events that are not included in the output? There is a Positive_activation, Positive_regulation, and Secretion that are not included? The latter event may be new, so I can see why it would not. But the first two are old ones, so they should be captured. Can you please attach the index_card output here?

kwalcock commented 3 years ago

It looks like the JsonOutputter which is having the problem is running in support the indexcard output in this case, so maybe it doesn't even matter. That JsonOutputter is also used by FriesOutput and SerialJsonOutput, and perhaps they never get into the situation.

I'm not sure what "Not included in the output" means. I'm looking at output from ReachCLI and the different outputters seem to be interested in different things. The message copied from the exception,

{
  "type" : "TextBoundMention",
  "id" : "T:-371853549",
  "text" : "MCP-4",
  "labels" : [ "Gene_or_gene_product", "MacroMolecule", "Equivalable", "BioChemicalEntity", "BioEntity", "Entity", "PossibleController" ],
  "..."

is just the local situation in which a RegulationIndexCard is trying to figure out its controlled and can't. Other mentions and other formats shouldn't be affected. It looks like this sentence results in two index cards, which are attached.

PMC4543788-UAZ-r1-1.json.txt PMC4543788-UAZ-r1-2.json.txt

kwalcock commented 3 years ago

Regarding "A step in the right direction", I am looking through the other exceptions which were not exemplified in the three sample files. I propose to either convert them to warnings like these first two or to leave them as exceptions if necessary, but not prevent the output in the other formats when they are thrown. The latter is what some of the code changes were for. If for some reason the little maintained indexcard format fails on some file, only the index cards will be missing and output in other formats is not affected. The index cards will degrade slowly.

MihaiSurdeanu commented 3 years ago

Nice. Thank you @kwalcock !

kwalcock commented 3 years ago

Here's the first draft of the tests. Many of the files are very large and processing them takes too long for a unit test. I'm narrowing things down and then looking for causes.

MihaiSurdeanu commented 3 years ago

Thanks!

On January 20, 2021 at 12:38:28 PM, Keith Alcock (notifications@github.com) wrote:

Here's the first draft of the tests. Many of the files are very large and processing them takes too long for a unit test. I'm narrowing things down and then looking for causes.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/clulab/reach/pull/724#issuecomment-763884931, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TXX34BYOYWZBEAWJGLS24WLJANCNFSM4WDDPYAA .

kwalcock commented 3 years ago

With these changes the unit tests are all passing except for the one involving a possible infinite loop. However, they were made to pass using the simplest and dumbest change that I could think of. That may not be ideal. Better solutions would be linguistically motivated, but I don't know the situation well enough to do that. The code should be double checked.

MihaiSurdeanu commented 3 years ago

Thanks @kwalcock ! I think these need to be merged, even if less than ideal. Maybe create a separate issue for the infinite loop problem (if there isn't one)? I think this does happen because of the nested events that may exist in the extraction, even if it's not common.

kwalcock commented 3 years ago

@enoriega, it would be reassuring to know that you are aware of the changes and don't notice any blatant problems. I'm the newbie here.

enoriega commented 3 years ago

@kwalcock I read and the changes, I don't see anything that may be problematic, they LGTM. Thanks!

kwalcock commented 3 years ago

Thanks, all.