clulab / bioresources

Data resources from the biomedical domain
Apache License 2.0
3 stars 1 forks source link

Improve overrides and fix outdated GO IDs #30

Closed bgyori closed 4 years ago

bgyori commented 4 years ago

This PR makes several changes to the NER-Grounding-Override.tsv:

The PR also updates bio_process.tsv and GO-subcellular-locations.tsv to replace deprecated GO IDs with current ones. (Note that I have been planning to implement a script to fully update these resource files from GO, not just replace the outdated IDs in the old file but this is a good first step for the next release).

@MihaiSurdeanu if you see any surprising effects on the Reach tests you have been updating, let me know.

bgyori commented 4 years ago

Actually, this might change the NER files so I'll add those

bgyori commented 4 years ago

Here is the summary of the results and my analysis. Let me try to fix what I can and I'll comment if there are any left that I'm unsure how to fix.

[info] *** 9 TESTS FAILED ***
[error] Failed tests:
[error]     org.clulab.reach.TestModelEntities
[error]     org.clulab.reach.TestFamilyResolutions
[error]     org.clulab.reach.TestOverrides
[info] - should fail for protein complex Bioentities *** FAILED *** (2 milliseconds)
[info]   Some(ArrayBuffer(<KBResolution: Complex IV, be, COX, , <IMKBMetaInfo: be, famplex.tsv.gz, https://identifiers.org/fplx/, , sp=false, f=true, p=false>>, <KBResolution: cytochrome c oxidase, be, COX, , <IMKBMetaInfo: be, famplex.tsv.gz, https://identifiers.org/fplx/, , sp=false, f=true, p=false>>, <KBResolution: COX, be, COX, , <IMKBMetaInfo: be, famplex.tsv.gz, https://identifiers.org/fplx/, , sp=false, f=true, p=false>>)) was not empty (TestFamilyResolutions.scala:325)

-> this test is trying to make a distinction between FamPlex families and complexes but there's no need to do so, labeling them as Family is good.

[info] - should have labeled all mentions as Family in the text ACOX, BMP, Cadherin, CRISP, DDR,
[info]                 COX4, COX6a, COX6b, COX7a, COX7b,
[info]                 COX8, DVL, ETS, FGF, FLOT,
[info]                 GATA, HSP90, IGFBP, IL1, IRS,
[info]                 MAF, NOTCH, PKI, RAS, SAA,
[info]                 and TGFB are unchanged Families. *** FAILED *** (0 milliseconds)
[info]   25 was not equal to 26 (TestOverrides.scala:170)
[info] - should have display labeled all mentions as Some(Family) *** FAILED *** (0 milliseconds)
[info]   25 was not equal to 26 (TestOverrides.scala:175)
[info] - should match expected grounding IDs in the text ACOX, BMP, Cadherin, CRISP, DDR,
[info]                 COX4, COX6a, COX6b, COX7a, COX7b,
[info]                 COX8, DVL, ETS, FGF, FLOT,
[info]                 GATA, HSP90, IGFBP, IL1, IRS,
[info]                 MAF, NOTCH, PKI, RAS, SAA,
[info]                 and TGFB are unchanged Families. *** FAILED *** (0 milliseconds)
[info]   false was not true (TestOverrides.scala:188)

-> Here I reviewed MAF yesterday and it is the symbol of a gene so I think it will currently resolve to that instead of a family. There are also changes to the groundings here, for instance, COX4 will now resolve to FamPlex's COX4.

[info] - should match expected grounding IDs in the text SMAD, SMAD2/3, SMAD 2/3, and TGFB are important Families. *** FAILED *** (0 milliseconds)
[info]   false was not true (TestOverrides.scala:188)

-> TGFB is a FamPlex family

[info] - should match expected grounding IDs in the text Cas8, EGF, EGFR, ErbB, ERK5, GSK3beta are GGPs. *** FAILED *** (1 millisecond)
[info]   false was not true (TestOverrides.scala:188)

-> ERbB is actually a family

[info] - should have labeled all mentions as Gene_or_gene_product in the text Cas8, EGF, EGFR, ErbB, ERK5, GSK3beta are GGPs. *** FAILED *** (0 milliseconds)
[info]   5 was not equal to 6 (TestOverrides.scala:170)
[info] - should have display labeled all mentions as Some(Protein) *** FAILED *** (0 milliseconds)
[info]   5 was not equal to 6 (TestOverrides.scala:175)

-> Same as the one before, ERbB is a family

[info] - should have GPP label *** FAILED *** (32 milliseconds)
[info]   0 was not equal to 1 (TestModelEntities.scala:87)

-> SAPK is actually a family

bgyori commented 4 years ago

I fixed almost all the tests and pushed the changes to the bio31 branch of Reach. I have one question though (for the one remaining failure): there is a test for the string "FLOT" with the expectation that it should ground to FamPlex FLOT. The current grounding entries are relevant here: In famplex.tsv

485 FLOT»···FLOT
1757 flotillin»··FLOT

In PFAM-families.tsv

12927 Flot»···PF15975                                                                 
12928 Flotillin»··PF15975                                                             

There is currently nothing for this entity in NER-Grounding-Override.tsv.

Why could it be that "FLOT" gets grounded to PF15975 under these circumstances (unless I'm missing something unrelated to the grounding logic)?

MihaiSurdeanu commented 4 years ago

There were a couple of issues:

--- a/main/src/main/scala/org/clulab/reach/grounding/ReachEntityLookup.scala
+++ b/main/src/main/scala/org/clulab/reach/grounding/ReachEntityLookup.scala
@@ -106,9 +106,9 @@ class ReachEntityLookup {
   )

   val familySeq: KBSearchSequence = extraKBs ++ Seq(
+    staticProteinFamilyOrComplex,           // FamPlex families and complexes
     StaticProteinFamily,                    // PFAM families
     StaticProteinFamily2,                   // InterPro families
-    staticProteinFamilyOrComplex,           // FamPlex families and complexes
     ModelGendProteinAndFamily
   )

diff --git a/processors/src/main/resources/reference.conf b/processors/src/main/resources/reference.conf
index edf821d76..cf84b9c50 100644
--- a/processors/src/main/resources/reference.conf
+++ b/processors/src/main/resources/reference.conf
@@ -8,6 +8,7 @@ kbloader: {
   # NB: Order is important as it indicates priority!
   nerKBs: [
     "org/clulab/reach/kb/ner/Gene_or_gene_product.tsv.gz",
+    "org/clulab/reach/kb/ner/FamilyOrComplex.tsv.gz",
     "org/clulab/reach/kb/ner/Family.tsv.gz",
     "org/clulab/reach/kb/ner/Cellular_component.tsv.gz",
     "org/clulab/reach/kb/ner/Simple_chemical.tsv.gz",

I had to prioritize FamPlex over the other Family KBs in ReachEntityLookup. Add, a hidden issue, add FamilyOrComplex to the NER list of KBs. Note that these are configured separately because grounding is a different component. Also, the last issue did not impact FLOT because PFAM had the string. But it was a bug nonetheless.

FLOT is now correctly grounded. Please check. Is my assumption that FamPlex should be prioritized over PFAM for grounding correct?

bgyori commented 4 years ago

I see, I don't fully understand why FamilyOrComplex comes into the picture though since we decided to put FamPlex entries under Family, and in ner_kb.config, we have:

famplex»···Family

I thought that what that meant was that ner/Family.tsv.gz would contain entries from both FamPlex and Pfam. So where does ner/FamilyOrComplex.tsv.gz come into the picture?

MihaiSurdeanu commented 4 years ago

You're right. The change in reference.conf is not needed. I suspect FamilyOrComplex.tsv.gz was a left over from a previous version, which confused me. I'll revert that change. But the grounding change is what was needed for this issue.

bgyori commented 4 years ago

I just ran the tests again, and I think the reordering of the priority now broke another override test:

[info] - should match expected grounding IDs in the text Activin A, Activin AB, Inhibin A, Inhibin B,
[info]                 AMPK alpha1beta1gamma1, AMPK alpha1beta1gamma2, AMPK alpha1beta1gamma3,
[info]                 AMPK alpha1beta2gamma1, AMPK alpha1beta2gamma2, AMPK alpha2beta1gamma1,
[info]                 AMPK alpha2beta2gamma1, AMPK alpha2beta2gamma2, AMPK alpha2beta2gamma3,
[info]                 AMPK a1b1g1, AMPK a1b1g2, AMPK a1b1g3,
[info]                 AMPK a1b2g1, AMPK a1b2g2, AMPK a2b1g1,
[info]                 AMPK a2b2g1, AMPK a2b2g2, AMPK a2b2g3,
[info]                 alpha1beta1gamma1, alpha1beta1gamma2, alpha1beta1gamma3,
[info]                 alpha1beta2gamma1, alpha1beta2gamma2, alpha2beta1gamma1,
[info]                 alpha2beta2gamma1, alpha2beta2gamma2, and alpha2beta2gamma3
[info]                 are important complexes. *** FAILED *** (10 milliseconds)
[info]   false was not true (TestOverrides.scala:188)

I'll look into this and see what needs to be changed.

MihaiSurdeanu commented 4 years ago

Thank you!

On Mon, Apr 13, 2020 at 11:51 Benjamin M. Gyori notifications@github.com wrote:

I just ran the tests again, and I think the reordering of the priority now broke another override test:

[info] - should match expected grounding IDs in the text Activin A, Activin AB, Inhibin A, Inhibin B, [info] AMPK alpha1beta1gamma1, AMPK alpha1beta1gamma2, AMPK alpha1beta1gamma3, [info] AMPK alpha1beta2gamma1, AMPK alpha1beta2gamma2, AMPK alpha2beta1gamma1, [info] AMPK alpha2beta2gamma1, AMPK alpha2beta2gamma2, AMPK alpha2beta2gamma3, [info] AMPK a1b1g1, AMPK a1b1g2, AMPK a1b1g3, [info] AMPK a1b2g1, AMPK a1b2g2, AMPK a2b1g1, [info] AMPK a2b2g1, AMPK a2b2g2, AMPK a2b2g3, [info] alpha1beta1gamma1, alpha1beta1gamma2, alpha1beta1gamma3, [info] alpha1beta2gamma1, alpha1beta2gamma2, alpha2beta1gamma1, [info] alpha2beta2gamma1, alpha2beta2gamma2, and alpha2beta2gamma3 [info] are important complexes. FAILED (10 milliseconds) [info] false was not true (TestOverrides.scala:188)

I'll look into this and see what needs to be changed.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/clulab/bioresources/pull/30#issuecomment-613039024, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TUKSMN6UBDKP5I7OB3RMNNJRANCNFSM4MGISRGQ .

bgyori commented 4 years ago

Okay, I pushed a change for that test and now all the Reach tests are passing.

MihaiSurdeanu commented 4 years ago

I confirm that tests pass. I'll release next.

MihaiSurdeanu commented 4 years ago

Bioresources has been released, and the reach branch was merged in master. Thank you for your help @bgyori !