clulab / reach

Reach Biomedical Information Extraction

Update key named entity resource files to latest state #804

Closed bgyori closed 6 months ago

bgyori commented 6 months ago

This PR updates some of the key named entity resource files. In the process, I updated the scripts I wrote previously to do the updates automatically, e.g., to adapt to a new UniProt API. The immediate motivation was to recognize some drug names that appear in recent releases of ChEBI, but the updates are more generally useful. It's worth noting that some resource files (e.g., Uberon.tsv) haven't been updated for many years; updating them would require some additional work.

MihaiSurdeanu commented 6 months ago

Thanks @bgyori ! LGTM. @kwalcock ?

kwalcock commented 6 months ago

I can press the merge button, but won't know all the consequences. I'll tag @enoriega to make sure he's aware of the changes and then wait for the tests to pass. Thanks, @bgyori.

kwalcock commented 6 months ago

@bgyori, if it isn't obvious how the failures below relate to the recent changes, I can look into it. Perhaps these tests were already failing before somehow.

[error] Failed tests:
[error]     org.clulab.reach.TestMentionSerialization
[error]     org.clulab.reach.PolaritySuite
[error]     org.clulab.reach.TestRegulationEvents

bgyori commented 6 months ago

Named entity resource changes can often break tests like the above so I would definitely investigate. I don't think I have access to Jenkins to look at the detailed logs. Could you copy/paste the errors for these three tests?

kwalcock commented 6 months ago

[info] TestMentionSerialization:
[info] Mek was not phosphorylized by AKT1
[info] - should produce 2 mentions (4 milliseconds)
[info] Serializer
[info] - should write serialized non-context mentions (96 milliseconds)
[info] Serializer
[info] - should load serialized non-context mentions (13 milliseconds)
[info] Mouse AKT2 phosphorylates PTHR2 in chicken adenoid.
[info] - should produce 7 mentions (0 milliseconds)
[info] Serializer
[info] - should write serialized context mentions (36 milliseconds)
[info] Serializer
[info] - should load serialized context mentions (17 milliseconds)
[info] Tbet Rag2 mice (Garrett et al., 2010) as well as Bacteroides spp. (Bloom et al., 2011), Helicobacter spp. (Fox et al., 2011), and Bilophila wadsworthia (Devkota et al., 2012) in Il10 have been shown to enhance intestinal inflammation.The acute dextran sulfate sodium
[info] - should produce 8 mentions and 4 triggers *** FAILED *** (32 milliseconds)
[info]   Vector(org.clulab.reach.mentions.CorefTextBoundMention@eb0144cc, org.clulab.reach.mentions.CorefTextBoundMention@6d043956, org.clulab.reach.mentions.CorefTextBoundMention@679c6d36, org.clulab.reach.mentions.CorefTextBoundMention@23e44370, org.clulab.reach.mentions.CorefEventMention@31df2986, org.clulab.reach.mentions.CorefEventMention@cd15dbea) had size 6 instead of expected size 8 (TestMentionSerialization.scala:64)
[info] Serializer
[info] - should write serialized modifications (26 milliseconds)
[info] Serializer
[info] - should load serialized modifications *** FAILED *** (35 milliseconds)
[info]   Vector(org.clulab.reach.mentions.CorefTextBoundMention@eb0144cc, org.clulab.reach.mentions.CorefTextBoundMention@6d043956, org.clulab.reach.mentions.CorefTextBoundMention@679c6d36, org.clulab.reach.mentions.CorefTextBoundMention@23e44370, org.clulab.reach.mentions.CorefEventMention@31df2986, org.clulab.reach.mentions.CorefEventMention@cd15dbea) had size 6 instead of expected size 8 (TestMentionSerialization.scala:75)

kwalcock commented 6 months ago

[info] TestRegulationEvents:
[info] Phosphorylation of ASPP2 by MAPK is required for RAS induced increased binding to p53 and increased transactivation of pro-apoptotic genes.
[info] - should have an up-regulated phosphorylation (284 milliseconds)
[info] The ubiquitinated Ras protein phosphorylates AKT.
[info] - should contain a regulation (75 milliseconds)
[info] Interestingly, we observed two conserved putative MAPK phosphorylation sites in ASPP1 and ASPP2
[info] - should contain 2 phosphorylations and 2 regulations (320 milliseconds)
[info] We thus tested whether RAS activation may regulate ASPP2 phosphorylation
[info] - should contain 1 phosphorylation and no regulation (100 milliseconds)
[info] MAPK1 was clearly able to phosphorylate the ASPP2 fragment in vitro
[info] - should contain 1 regulation (109 milliseconds)
[info] Under the same conditions, ASPP2 (693-1128) fragment phosphorylated by AKT1 had very low levels of incorporated 32P
[info] - should contain 1 regulation (236 milliseconds)
[info] The phosphorylated ASPP2 fragment by MAPK1 was digested by trypsin and fractioned on a high performance liquid chromatography.
[info] - should contain 1 regulation (220 milliseconds)
[info] Hence ASPP2 can be phosphorylated at serine 827 by MAPK1 in vitro.
[info] - should contain 1 regulation (131 milliseconds)
[info] ASPP1 fails to upregulate the phosphorylation of ASPP2.
[info] - should contains 1 regulation and 1 phosphorylation event (105 milliseconds)
[info] ASPP1 fails to downregulate the phosphorylation of ASPP2.
[info] - should contains 1 downregulation and 1 phosphorylation event (102 milliseconds)
[info] ASPP1 downregulates the phosphorylation of ASPP2.
[info] - should contains 1 downregulation and 1 phosphorylation event (73 milliseconds)
[info] The inhibition of ASPP1 increases the phosphorylation of ASPP2.
[info] - should contain 1 downregulation and NO upregulation events (109 milliseconds)
[info] the phosphorylation of ASPP2 is increased by the inhibition of ASPP1.
[info] - should contain 1 downregulation and NO upregulation events (130 milliseconds)
[info] We observed increased ERBB3 binding to PI3K following MEK inhibition (Figure 1D).
[info] - should contain 1 negative regulation and NO positive activation or regulation events (159 milliseconds)
[info] the inhibition of ASPP1 decreases ASPP2 phosphorylation.
[info] - should contain 1 positive regulation, and NO negative regulations or activations (84 milliseconds)
[info] ASPP1 is an activator of the ubiquitination of ASPP2
[info] - should contain 1 positive regulation, and NO negative regulations or activations (96 milliseconds)
[info] ASPP1 is an inhibitor of the ubiquitination of ASPP2
[info] - should contain 1 negative regulation, and NO positive regulations or activations (106 milliseconds)
[info] The phosphorylation of ASPP1 inhibits the ubiquitination of ASPP2
[info] - should contain a controller with a PTM (110 milliseconds)
[info] The binding of ASPP1 and ASPP2 promotes the phosphorylation of MEK
[info] - should contain a controller with a complex (198 milliseconds)
[info] Human deoxycytidine kinase is phosphorylated by ASPP2 on serine 128.
[info] - should contain exactly one positive regulation and one phosphorylation with site (115 milliseconds)
[info] Human deoxycytidine kinase is phosphorylated on serine 128 by ASPP2.
[info] - should contain exactly one positive regulation and one phosphorylation with site (121 milliseconds)
[info] histone 2B phosphorylated by AKT1 had high levels of incorporated 32P, suggesting that AKT1 was active; while under the same conditions, ASPP2 (693-1128) fragment
[info] - should contain 1 phosphorylation and 1 positive regulation (409 milliseconds)
[info] The binding of BS1 and BS2 promotes the phosphorylation of MEK
[info] - should contain one positive regulation (200 milliseconds)
[info] ASPP1 aids in the translocation of Kras to the membrane
[info] - should contain one positive regulation (110 milliseconds)
[info] rapamycin blocked the serum-stimulated phosphorylation of ERK
[info] - should contain one regulation controlled by rapamycin (113 milliseconds)
[info] rapamycin inhibition of the phosphorylation of ERK
[info] - should contain one regulation controlled by rapamycin (72 milliseconds)
[info] B-Raf phosphorylates MEK2 and MEK1 on Ser221 and Ser217
[info] - should contain 4 phosphorylations and 4 regulations (GUS) (106 milliseconds)
[info] Note that only K650M and K650E-FGFR3 mutants cause STAT1 phosphorylation
[info] - should contain 1 phospho and 2 pos reg (150 milliseconds)
[info] Note that only K650M, K660M, and K650E-FGFR3 mutants cause STAT1 phosphorylation on Y123 and T546
[info] - should contain 2 phospho and 6 pos reg (469 milliseconds)
[info] p53-phosphorylation of ERK
[info] - should contain 1 phospho and 1 pos reg (44 milliseconds)
[info] p53 can be acetylated by p300 and CBP at multiple lysine residues ( K164 , 370 , 372 , 373 , 381 , 382 and 386 ) .
[info] - should contain 16 positive regulations due to the multiple controllers and multiple sites (434 milliseconds)
[info] Taken together , these data suggest that decreased PTPN13 expression enhances EphrinB1 and Erk1 and phosphorylation in epithelial cells .
[info] - should contain 2 negative regulations (not positive) (456 milliseconds)
[info] These data are consistent with EphrinB1 being a PTPN13 phosphatase substrate and suggest that decreased PTPN13 expression in BL breast cancer cell lines increases phosphorylation of EphrinB1 .
[info] - should contain 1 negative regulation (not positive) (518 milliseconds)
[info] - should contain 1 positive regulation and 1 phosphorylation !!! IGNORED !!!
[info] Our model, in which E2-induced SRC-3 phosphorylation occurs in a complex with ER
[info] - should contain 1 positive regulation and 1 phosphorylation (168 milliseconds)
[info] Cells expressing ErbB3 show tyrosine phosphorylation in response to treatment with RAS
[info] - should contain 1 positive regulation and 1 phosphorylation (128 milliseconds)
[info] Cells expressing ErbB3 show tyrosine phosphorylation in response to RAS treatment
[info] - should contain 1 positive regulation and 1 phosphorylation (114 milliseconds)
[info] Cells expressing ErbB3 show tyrosine phosphorylation in response to RAS inhibition
[info] - should contain 1 negative regulation and 1 phosphorylation (115 milliseconds)
[info] Together these data demonstrate that E2-induced SRC-3 phosphorylation is dependent on a direct interaction between SRC-3 and ERalpha and can occur outside of the nucleus.
[info] - should contain 1 phosphorylation, 1 positive regulation, and 1 binding (375 milliseconds)
[info] Akt inhibits the phosphorylation of AFT by BEF.
[info] - should contain a regulation of a regulation (103 milliseconds)
[info] The phosphorylation of AFT by BEF is inhibited by the ubiquitination of Akt.
[info] - should contain a regulation of a regulation (200 milliseconds)
[info] We first assayed the ability of the endogenous EGFR to be tyrosine autophosphorylated in response to EGF
[info] - should contain 1 PosReg of a phosphorylation (195 milliseconds)
[info] the ability of the exogenous ErbB3 receptor to be tyrosine phosphorylated in response to stimulation with either EGF or neuregulin (NRG)
[info] - should contain 2 PosReg of a phosphorylation (326 milliseconds)
[info] Both Gab1 and Gab1 F446/472/589 are tyrosine phosphorylated in response to EGF treatment
[info] - should contain 2 PosReg of 2 phosphorylation (167 milliseconds)
[info] The endogenous EGFR is tyrosine phosphorylated in response to EGF in all cell lines.
[info] - should contain 1 PosReg of 1 phosphorylation (182 milliseconds)
[info] As shown in Figure, the endogenous Gab1 present in WT MEFs is tyrosine phosphorylated in response to EGF treatment.
[info] - should contain 1 PosReg of 1 phosphorylation (260 milliseconds)
[info] We first assayed the ability of the mutant Gab1 proteins to become tyrosine phosphorylated in response to EGF.
[info] - should contain 1 PosReg of 1 phosphorylation (236 milliseconds)
[info] The phosphorylation of AKT1 following MEK activation.
[info] - should contain 1 positive regulation (81 milliseconds)
[info] We observed the phosphorylation of AKT1 following activation by MEK.
[info] - should contain 1 positive regulation (127 milliseconds)
[info] The phosphorylation of AKT1 following inhibition of MEK.
[info] - should contain 1 negative regulation (92 milliseconds)
[info] p53–ASPP2 complex in these cells following RAS activation
[info] - should contain 1 binding and 1 positive regulation event (102 milliseconds)
[info] Apoptosis promotes the phosphorylation of p53.
[info] - should contain no regulations (67 milliseconds)
[info] RAS1 activates AKT-induced apoptosis
[info] - should contain 1 activation and 1 positive regulation of that activation (62 milliseconds)
[info] Indeed, expression of RARbeta2 has been shown to restore retinoic acid induced apoptosis
[info] - should contain 1 Transcription and 1 positive activation, and 1 positive regulation *** FAILED *** (155 milliseconds)
[info]   Vector() had size 0 instead of expected size 1 (TestRegulationEvents.scala:536)
[info] We observed increased ERBB3 binding to PI3K following MEK inhibition (Figure 1D), and accordingly, MEK inhibition substantially increased tyrosine phosphorylated ERBB3 levels (Figure 1A).
[info] - should contain 1 amount, 1 binding, and 2 negative regulations (678 milliseconds)
[info] Up-regulation of MKP3 expression by active Ras expression
[info] - should contain 1 positive regulation and 2 transcriptions (146 milliseconds)
[info] ATP reduced GSH depletion
[info] - should recognize depletion as a positive activation (42 milliseconds)
[info] ATP can deplete GSH in cells
[info] - should recognize deplete as a negative activation (58 milliseconds)
[info] ATP depletes GSH rapidly in cells
[info] - should recognize depletes as a negative activation (65 milliseconds)
[info] glucose triggers insulin release
[info] - should recognize as a secretion (52 milliseconds)
[info] SRF induces TAZ transcription
[info] - should contain an EventMention but no RelationMention (57 milliseconds)
[info] TestKBKeyTransforms:
[info] canonicalKey(identical)
[info] - should return identical string (0 milliseconds)
[info] canonicalKey(a non-identical)
[info] - should return a non-identical string (2 milliseconds)
[info] canonicalKey(A-B and/or C)
[info] - should return abandorc (0 milliseconds)
[info] canonicalKey(MAN_human)
[info] - should return man (0 milliseconds)
[info] canonicalKey(WO-MAN_HUMAN)
[info] - should return woman (0 milliseconds)
[info] stripAllSuffixes(seq0, string one)
[info] - should return string one (1 millisecond)
[info] stripAllSuffixes(seq0, a string/one-two)
[info] - should return a string/one-two (0 milliseconds)
[info] stripAllSuffixes(seq1, string one)
[info] - should return string (1 millisecond)
[info] stripAllSuffixes(seq2, string two)
[info] - should return string (0 milliseconds)
[info] stripAllSuffixes(seq2, string one two one two)
[info] - should return string (0 milliseconds)
[info] stripAllSuffixes(seq2, string one one one)
[info] - should return string (0 milliseconds)
[info] stripAllKeysSuffixes
[info] - should strip the right suffixes (0 milliseconds)
[info] toKeyCandidates(string)
[info] - should return correct results (1 millisecond)
[info] toKeyCandidates(sequences[string])
[info] - should return correct results (0 milliseconds)
[info] applyAllTransforms(XXX, identityKT)
[info] - should return identical strings (1 millisecond)
[info] applyAllTransforms(XXX, various KTs)
[info] - should do the right things separately (2 milliseconds)
[info] applyAllTransforms(XXX, multiple KTs)
[info] - should do the right things (2 milliseconds)
[info] applyAllTransforms(stripFamilyPostAttributives, various strings)
[info] - should return stems (1 millisecond)
[info] applyAllTransforms(stripFamilyPostAttributives, _family)
[info] - should not return stems (0 milliseconds)
[info] applyAllTransforms(stripGeneNameAffixes, various strings)
[info] - should strip suffixes (2 milliseconds)
[info] applyAllTransforms(stripGeneNameAffixes, various strings)
[info] - should strip prefixes (1 millisecond)
[info] applyAllTransforms(stripGeneNameAffixes, various strings)
[info] - should strip prefixes too (1 millisecond)
[info] applyAllTransforms(stripGeneNameAffixes, various strings)
[info] - should handle multiple affixes (1 millisecond)
[info] applyAllTransforms(stripMutantProtein, various strings)
[info] - should return stems (2 milliseconds)
[info] applyAllTransforms(stripOrganPostAttributives, various strings)
[info] - should return stems (1 millisecond)
[info] applyAllTransforms(stripOrganPostAttributives, various strings)
[info] - should work un-cased (0 milliseconds)
[info] applyAllTransforms(stripProteinDomainKey, various strings)
[info] - should return stems (1 millisecond)
[info] applyAllTransforms(stripProteinPostAttributives, various strings)
[info] - should return stems (1 millisecond)

kwalcock commented 6 months ago

The PolaritySuite failure may follow from failures of individual tests. Please search the output above for "fail".
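For a log this long, a small filter can pull out just the failures. The sketch below assumes ScalaTest-style sbt output like the lines pasted above, where a failing line carries `*** FAILED ***` and the sentence under test sits on an earlier `[info]` line. `FailedTests` and `failures` are hypothetical names for illustration, not part of Reach.

```scala
object FailedTests {
  // Hypothetical helper, not part of Reach: pairs each failing test line
  // with the nearest preceding subject line (the sentence under test).
  def failures(log: Seq[String]): Seq[(String, String)] = {
    val lines = log.map(_.stripPrefix("[info] "))
    lines.zipWithIndex.collect {
      case (line, i) if line.contains("*** FAILED ***") =>
        // Walk backwards, skipping "- should ..." lines and indented detail lines.
        val subject = lines.take(i).reverseIterator
          .find(l => !l.startsWith("- ") && !l.startsWith(" "))
          .getOrElse("")
        (subject, line)
    }
  }

  // Three lines copied from the TestRegulationEvents output above.
  val sample: Seq[String] = Seq(
    "[info] Indeed, expression of RARbeta2 has been shown to restore retinoic acid induced apoptosis",
    "[info] - should contain 1 Transcription and 1 positive activation, and 1 positive regulation *** FAILED *** (155 milliseconds)",
    "[info]   Vector() had size 0 instead of expected size 1 (TestRegulationEvents.scala:536)"
  )
}
```

Calling `FailedTests.failures(FailedTests.sample)` returns a single pair: the RARbeta2 sentence and its failing `should contain ...` line.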

bgyori commented 6 months ago

The first failure is interesting:

[info] Indeed, expression of RARbeta2 has been shown to restore retinoic acid induced apoptosis
[info] - should contain 1 Transcription and 1 positive activation, and 1 positive regulation *** FAILED *** (155 milliseconds)
[info]   Vector() had size 0 instead of expected size 1 (TestRegulationEvents.scala:536)

What happens here is that after the resource update, "retinoic acid induced apoptosis" is recognized as an atomic named entity since it is a Gene Ontology term. This is correct, though one could argue that the granular semantics internal to this term (i.e., that it represents a positive regulation of apoptosis by retinoic acid) are lost:

MENTION TEXT:  retinoic acid induced apoptosis
LABELS:        List(BioProcess, BioEntity, Entity, PossibleController)
DISPLAY LABEL: BioProcess
    ------------------------------
    RULE => ner-bioprocess-entities
    TYPE => CorefTextBoundMention
    ------------------------------
    GROUNDING: <KBResolution: programmed cell death in response to retinoic acid, go, GO:0160059, human, <IMKBMetaInfo: uaz, bio_process.tsv, , , sp=false, f=false, p=false>>

CONTEXT: NONE
    ------------------------------

With this one, I think the test could be modified: instead of "should contain 1 Transcription and 1 positive activation, and 1 positive regulation" we could make it "should contain 1 Transcription and 1 positive regulation".
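To illustrate why a single new resource entry can change the extracted events, here is a toy longest-match lookup. It is only a sketch under stated assumptions, not Reach's actual NER code: `LongestMatchSketch`, `tag`, and both lexicons are hypothetical, and the "before" entries are guesses made for illustration. The point is that once the full phrase is in the lexicon, it absorbs "induced", so that word is no longer free to act as an event trigger.

```scala
object LongestMatchSketch {
  // Hypothetical greedy longest-match tagger over whitespace tokens.
  // Returns (matched text, label) pairs in left-to-right order.
  def tag(tokens: List[String], lexicon: Map[List[String], String]): Vector[(String, String)] = {
    val maxLen = if (lexicon.isEmpty) 0 else lexicon.keys.map(_.length).max

    @annotation.tailrec
    def loop(i: Int, acc: Vector[(String, String)]): Vector[(String, String)] =
      if (i >= tokens.length) acc
      else {
        // Try the longest candidate span starting at i first, then shorter ones.
        val hit = (math.min(maxLen, tokens.length - i) to 1 by -1).iterator
          .map(n => tokens.slice(i, i + n))
          .find(lexicon.contains)
        hit match {
          case Some(span) => loop(i + span.length, acc :+ (span.mkString(" ") -> lexicon(span)))
          case None       => loop(i + 1, acc)
        }
      }

    loop(0, Vector.empty)
  }

  val tokens: List[String] = "restore retinoic acid induced apoptosis".split(" ").toList

  // Assumed lexicon before the update (illustrative entries only).
  val before: Map[List[String], String] = Map(
    List("retinoic", "acid") -> "Simple_chemical",
    List("apoptosis") -> "BioProcess"
  )

  // Assumed lexicon after the update: the whole phrase is now one entry.
  val after: Map[List[String], String] =
    before + (List("retinoic", "acid", "induced", "apoptosis") -> "BioProcess")
}
```

With `before`, `tag` finds "retinoic acid" and "apoptosis" separately and leaves "induced" untagged; with `after`, it returns the whole phrase as one BioProcess.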

enoriega commented 6 months ago

With this change, it looks like "sulfate" by itself is no longer recognized as a participant in two relations. If this looks correct, we can update the test to expect fewer events.

TestMentionSerialization

Before: [image]

After: [image]

bgyori commented 6 months ago

I also diagnosed the second failure, on the sentence (actually, one and a half sentences...):

Tbet Rag2 mice (Garrett et al., 2010) as well as Bacteroides spp. (Bloom et al., 2011), Helicobacter spp. (Fox et al., 2011), and Bilophila wadsworthia (Devkota et al., 2012) in Il10 have been shown to enhance intestinal inflammation.The acute dextran sulfate sodium

I fixed one issue on this branch related to FamPlex and NER overrides. Beyond that, the result after this PR is actually an improvement: we now correctly recognize "dextran sulfate sodium" as a single named entity (previously this was incorrectly broken up into two), and two previously incorrectly extracted events are now not there. I will update the test accordingly.

bgyori commented 6 months ago

Looks like something is still failing after the changes - could someone with access check what failed this time? Thanks!

enoriega commented 6 months ago

@bgyori @kwalcock I will give it a look this afternoon

kwalcock commented 6 months ago

@bgyori

[info] Indeed, expression of RARbeta2 has been shown to restore retinoic acid induced apoptosis
[info] - should contain 1 Transcription, and 1 positive regulation *** FAILED *** (226 milliseconds)
[info]   Vector() had size 0 instead of expected size 1 (TestRegulationEvents.scala:535)

enoriega commented 6 months ago

@bgyori @kwalcock It appears that the KB update causes "retinoic acid induced apoptosis" to be labeled as a BioProcess, so it is no longer picked up as part of a positive regulation. On the master branch, it is labeled as a positive activation, which results in the positive regulation.

enoriega commented 6 months ago

This is the test's code:

val sent57 = "Indeed, expression of RARbeta2 has been shown to restore retinoic acid induced apoptosis"
sent57 should "contain 1 Transcription and 1 positive activation, and 1 positive regulation" in {
  val mentions = getBioMentions(sent57)
  mentions.filter(_ matches "Transcription") should have size (1)
  mentions.filter(_ matches "Positive_activation") should have size (1)
  mentions.filter(_ matches "Positive_regulation") should have size (1)
}

The expected positive regulation that is missing in the PR's version is: "expression of RARbeta2 has been shown to restore retinoic acid induced apoptosis".

bgyori commented 6 months ago

Thanks, I looked at this test earlier, let me check again.

bgyori commented 6 months ago

@enoriega I actually changed that test earlier, but I accidentally left the regulation there instead of the activation event. I have fixed it now in the latest commit.

kwalcock commented 6 months ago

Thanks for keeping the project up-to-date and healthy!