clulab / bioresources

Data resources from the biomedical domain
Apache License 2.0

Update FamPlex groundings and overrides #27

Closed bgyori closed 4 years ago

bgyori commented 4 years ago

This PR addresses #22 as follows:

Question/issue for @MihaiSurdeanu: the "FamilyOrComplex" label is new. I found that the KBLoader doesn't recognize it, see its output below:

[info] Running org.clulab.processors.bionlp.ner.KBLoader ../bioresources/src/main/resources/org/clulab/reach/kb/ner/model.ser.gz
23:24:54.092 [run-main-0] DEBUG o.c.processors.bionlp.ner.KBLoader - Loading LexiconNER from knowledge bases...
23:24:54.097 [run-main-0] INFO  org.clulab.sequences.LexiconNER - Beginning to load the KBs for the rule-based bio NER...
23:24:54.210 [run-main-0] INFO  org.clulab.sequences.LexiconNER - Loaded OVERRIDE matchers for all labels.  The number of entries added to the first layer was 425.
23:24:54.763 [run-main-0] INFO  org.clulab.sequences.LexiconNER - Loaded matcher for label Gene_or_gene_product. The number of entries added to the first layer was 62280.
23:24:55.126 [run-main-0] INFO  org.clulab.sequences.LexiconNER - Loaded matcher for label Family. The number of entries added to the first layer was 22655.
23:24:55.138 [run-main-0] INFO  org.clulab.sequences.LexiconNER - Loaded matcher for label Cellular_component. The number of entries added to the first layer was 588.
23:24:56.986 [run-main-0] INFO  org.clulab.sequences.LexiconNER - Loaded matcher for label Simple_chemical. The number of entries added to the first layer was 74691.
23:24:57.022 [run-main-0] INFO  org.clulab.sequences.LexiconNER - Loaded matcher for label Site. The number of entries added to the first layer was 551.
23:24:57.023 [run-main-0] INFO  org.clulab.sequences.LexiconNER - Loaded matcher for label BioProcess. The number of entries added to the first layer was 61.
23:24:57.030 [run-main-0] INFO  org.clulab.sequences.LexiconNER - Loaded matcher for label Species. The number of entries added to the first layer was 1027.
23:24:57.225 [run-main-0] INFO  org.clulab.sequences.LexiconNER - Loaded matcher for label CellLine. The number of entries added to the first layer was 66279.
23:24:57.230 [run-main-0] INFO  org.clulab.sequences.LexiconNER - Loaded matcher for label TissueType. The number of entries added to the first layer was 681.
23:24:57.253 [run-main-0] INFO  org.clulab.sequences.LexiconNER - Loaded matcher for label CellType. The number of entries added to the first layer was 1127.
23:24:57.367 [run-main-0] INFO  org.clulab.sequences.LexiconNER - Loaded matcher for label Organ. The number of entries added to the first layer was 4375.
23:24:57.367 [run-main-0] INFO  org.clulab.sequences.LexiconNER - KB loading completed.

and doesn't produce a ner/FamilyOrComplex.tsv.gz as expected. What needs to be changed for it to do that?
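As a quick check on output like the above, the loaded labels can be pulled out of the log and compared against the ones expected. This is only a minimal Python sketch that relies on the "Loaded matcher for label ..." wording shown in the log; the helper names `loaded_labels` and `missing_labels` are made up for illustration and are not part of processors.

```python
import re

# Matches the "Loaded matcher for label X." lines printed by LexiconNER above.
LABEL_RE = re.compile(r"Loaded matcher for label (\w+)\.")


def loaded_labels(log_text):
    """Return the labels for which a matcher was reported, in log order."""
    return LABEL_RE.findall(log_text)


def missing_labels(log_text, expected):
    """Return the expected labels that never show up in the log."""
    seen = set(loaded_labels(log_text))
    return [label for label in expected if label not in seen]


# Two lines copied from the log above, used as sample input.
sample = (
    "23:24:54.763 [run-main-0] INFO  org.clulab.sequences.LexiconNER - "
    "Loaded matcher for label Gene_or_gene_product. The number of entries "
    "added to the first layer was 62280.\n"
    "23:24:55.126 [run-main-0] INFO  org.clulab.sequences.LexiconNER - "
    "Loaded matcher for label Family. The number of entries added to the "
    "first layer was 22655.\n"
)

print(missing_labels(sample, ["Family", "FamilyOrComplex"]))
```

Run against the full log, this would list FamilyOrComplex as missing, which matches the absent ner/FamilyOrComplex.tsv.gz.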

MihaiSurdeanu commented 4 years ago

This list will have to be adjusted in processors: https://github.com/clulab/processors/blob/master/corenlp/src/main/resources/reference.conf#L67

MihaiSurdeanu commented 4 years ago

Can you please run the reach unit tests with this branch?

bgyori commented 4 years ago

First, I made modifications in processors to regenerate the NER files with ner_kb.sh. I then made a number of updates in Reach to remove the old BE files and use the new FamPlex file with the new FamilyOrComplex label. I got to a point where the tests run without compilation errors. But unsurprisingly, due to the relabeling of a lot of entities that were previously Family or Complex, many tests are failing (179 of them). It looks like certain rules or actions also make assumptions about the type of entity, and unless FamilyOrComplex is propagated into those places and handled, they will fail. One example is "We analyze the Mek-Ras-Akt1 complex", which is not extracted since Mek and Ras now resolve to FamilyOrComplex.

I am wondering whether it's worth going through and handling FamilyOrComplex everywhere necessary in Reach, or, alternatively, putting all of FamPlex under an existing label like Gene_or_gene_product (since I don't think the distinction between Gene_or_gene_product and Complex/Family matters mechanistically).
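To make the second option concrete, here is a minimal Python sketch of folding the protein-like labels into Gene_or_gene_product. The label names are the ones mentioned in this thread, but the mapping and the `fold_label` helper are assumptions for illustration only; this is not how bioresources or Reach implements it.

```python
# Hypothetical sketch: the label names come from this thread, but the set
# below and the helper name are assumptions, not bioresources or Reach code.
FOLD_INTO_GGP = {"Family", "Complex", "FamilyOrComplex"}


def fold_label(label):
    """Map protein-like labels onto Gene_or_gene_product; leave others alone."""
    return "Gene_or_gene_product" if label in FOLD_INTO_GGP else label


labels = ["FamilyOrComplex", "Simple_chemical", "Family", "Gene_or_gene_product"]
print([fold_label(label) for label in labels])
```

With a mapping like this, rules that already handle Gene_or_gene_product would not need a separate branch for FamilyOrComplex, at the cost of losing the family/complex distinction at the label level.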

MihaiSurdeanu commented 4 years ago

If there is no mechanistic difference, I would be in favor of folding all of protein/family/complex under GGP.

If we're all Ok with this plan, I propose:

  1. @bgyori adjusts the bioresources project along these lines.
  2. @MihaiSurdeanu changes processors and reach to account for this. In fact, @bgyori, please do not modify processors any more. I would like to move BioNLPProcessor as a subproject in reach today, to get rid of at least one unnecessary dependency.

bgyori commented 4 years ago

Okay, that sounds good, I will do that as soon as I can today.

bgyori commented 4 years ago

I tried a couple of options and settled on one in which:

This results in a relatively small number of test failures in Reach that are hopefully not too much work to sort out:

[info] *** 27 TESTS FAILED ***
[error] Failed tests:
[error]     org.clulab.reach.TestGrounding
[error]     org.clulab.reach.TestComplexResolutions
[error]     org.clulab.reach.TestFamilyResolutions
[error]     org.clulab.reach.TestOverrides

@MihaiSurdeanu what do you think of this as an interim solution? For a subsequent release, we could try to do a more comprehensive refactoring of protein-like entities but I think this should be good for now.

MihaiSurdeanu commented 4 years ago

Sounds good. I'll work on this branch then.

MihaiSurdeanu commented 4 years ago

I am working on this. But I am coupling this fix with a software restructuring of processors and reach. It looks like it will take more than 1 day. Hopefully not much more...

MihaiSurdeanu commented 4 years ago

I merged this into master, and released 1.1.30.