Closed kwalcock closed 3 years ago
A step in the right direction 👍
What should happen with a "conversion" event type or a "Gene_or_gene_product" label? The former is with the obsolete indexcard output but the other is not. The complete label is
"labels" : [ "Gene_or_gene_product", "MacroMolecule", "Equivalable", "BioChemicalEntity", "BioEntity", "Entity", "PossibleController" ],
and the JSON looks like
{
"type" : "TextBoundMention",
"id" : "T:-371853549",
"text" : "MCP-4",
"labels" : [ "Gene_or_gene_product", "MacroMolecule", "Equivalable", "BioChemicalEntity", "BioEntity", "Entity", "PossibleController" ],
"tokenInterval" : {
"start" : 0,
"end" : 1
},
"characterStartOffset" : 22662,
"characterEndOffset" : 22667,
"sentence" : 126,
"document" : "-1204715480",
"keep" : true,
"foundBy" : "ner-gene_or_gene_product-entities"
}
What's the input text for this output?
The particular sentence is "MCP-4 signaling induces the release of histamine from IL-3-primed basophils and activates eosinophil respiratory burst." The JsonOutputter doesn't know how to convert the Gene_or_gene_product into an event type and I don't either.
There are 3 events in this output:
EVENTS: 3
MENTION TEXT: MCP-4 signaling induces the release of histamine from IL-3-primed basophils and activates eosinophil respiratory burst
LABELS: List(Positive_activation, ActivationEvent, ComplexEvent, Event, PossibleController)
DISPLAY LABEL: Positive_activation
------------------------------
RULE => Positive_activation_syntax_1_verb
TYPE => CorefEventMention
------------------------------
------------------------------
TRIGGER => activates
controlled (List(BioProcess, BioEntity, Entity, PossibleController)) => respiratory burst
controller (List(Positive_regulation, Regulation, ComplexEvent, Event, PossibleController)) => MCP-4 signaling induces the release of histamine
CONTEXT: TissueType => List(tissuelist:TS-0279)
CONTEXT: CellType => List(cl:CL:0000767)
------------------------------
MENTION TEXT: MCP-4 signaling induces the release of histamine
LABELS: List(Positive_regulation, Regulation, ComplexEvent, Event, PossibleController)
DISPLAY LABEL: Positive_regulation
------------------------------
RULE => Positive_regulation_syntax_1_verb
TYPE => CorefEventMention
------------------------------
------------------------------
TRIGGER => induces
controlled (List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController)) => MCP-4
controller (List(Secretion, IncreaseAmount, Amount, SimpleEvent, Event, PossibleController)) => release of histamine
CONTEXT: TissueType => List(tissuelist:TS-0279)
CONTEXT: CellType => List(cl:CL:0000767)
------------------------------
MENTION TEXT: release of histamine
LABELS: List(Secretion, IncreaseAmount, Amount, SimpleEvent, Event, PossibleController)
DISPLAY LABEL: Secretion
------------------------------
RULE => secretion_1
TYPE => CorefEventMention
------------------------------
------------------------------
TRIGGER => release
theme (List(Simple_chemical, BioChemicalEntity, BioEntity, Entity, PossibleController)) => histamine
CONTEXT: TissueType => List(tissuelist:TS-0279)
CONTEXT: CellType => List(cl:CL:0000767)
------------------------------
do you know which one crashes the output JSON?
It must be working on the second one with
controlled (List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController)) => MCP-4
and it looks like this Gene_or_gene_product isn't something new, which I was thinking, but something with a history like
// FIXME: Why is this Gene_or_gene_product? What about Protein, GENE, etc.?
None of this completely crashes reach. The particular output format is skipped for the particular paper.
Right. But there are 3 events that are not included in the output? There is a Positive_activation, Positive_regulation, and Secretion that are not included? The latter event may be new, so I can see why it would not. But the first two are old ones, so they should be captured. Can you please attach the index_card output here?
It looks like the JsonOutputter which is having the problem is running in support the indexcard output in this case, so maybe it doesn't even matter. That JsonOutputter is also used by FriesOutput and SerialJsonOutput, and perhaps they never get into the situation.
I'm not sure what "Not included in the output" means. I'm looking at output from ReachCLI and the different outputters seem to be interested in different things. The message copied from the exception,
{
"type" : "TextBoundMention",
"id" : "T:-371853549",
"text" : "MCP-4",
"labels" : [ "Gene_or_gene_product", "MacroMolecule", "Equivalable", "BioChemicalEntity", "BioEntity", "Entity", "PossibleController" ],
"..."
is just the local situation in which a RegulationIndexCard is trying to figure out its controlled
and can't. Other mentions and other formats shouldn't be affected. It looks like this sentence results in two index cards, which are attached.
Regarding "A step in the right direction", I am looking through the other exceptions which were not exemplified in the three sample files. I propose to either convert them to warnings like these first two or to leave them as exceptions if necessary, but not prevent the output in the other formats when they are thrown. The latter is what some of the code changes were for. If for some reason the little maintained indexcard format fails on some file, only the index cards will be missing and output in other formats is not affected. The index cards will degrade slowly.
Nice. Thank you @kwalcock !
Here's the first draft of the tests. Many of the files are very large and processing them takes too long for a unit test. I'm narrowing things down and then looking for causes.
Thanks!
On January 20, 2021 at 12:38:28 PM, Keith Alcock (notifications@github.com) wrote:
Here's the first draft of the tests. Many of the files are very large and processing them takes too long for a unit test. I'm narrowing things down and then looking for causes.
— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/clulab/reach/pull/724#issuecomment-763884931, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TXX34BYOYWZBEAWJGLS24WLJANCNFSM4WDDPYAA .
With these changes the unit tests are all passing except for the one involving a possible infinite loop. However, they were made to pass using the simplest and dumbest change that I could think of. That may not be ideal. Better solutions would be linguistically motivated, but I don't know the situation well enough to do that. The code should be double checked.
Thanks @kwalcock ! I think these need to be merged, even if less than ideal. Maybe create a separate issue for the infinite loop problem (if there isn't one)? I think this does happen because of the nested events that may exist in the extraction, even if it's not common.
@enoriega, it would be reassuring to know that you are aware of the changes and don't notice any blatant problems. I'm the newbie here.
@kwalcock I read and the changes, I don't see anything that may be problematic, they LGTM. Thanks!
Thanks, all.
Don't allow error in one format to block other formats