clulab / bioresources

Data resources from the biomedical domain
Apache License 2.0
3 stars 1 forks source link

Integrate UniProt fragments #42

Closed bgyori closed 3 years ago

bgyori commented 3 years ago

This PR extends the update_uniprot_proteins.py script to download and process protein chains and peptides into grounding entries. The approach taken here is to put these into the existing uniprot-proteins.tsv file with IDs formatted as [UniProtID]#[FragmentID] which is now "officially" supported by UniProt. Examples:

Angiotensin-2   P01019#PRO_0000032458   Homo sapiens
Angiotensin-2   P11859#PRO_0000032462   Mus musculus

This adds a total of around 50k new rows to the grounding file.

This PR does not yet touch the NER files and I have not yet written any tests and tried this against Reach. @JakeWolfe and @MihaiSurdeanu would you be able to pick this up from here?

MihaiSurdeanu commented 3 years ago

Thank you @bgyori !

@JakeWolfe : can you please generate the NER files in this branch? And then publishLocal, and write a few unit tests in Reach to make sure that this new entities are visible?

@bgyori : should we keep the Protein Ontology fragments then?

bgyori commented 3 years ago

I suggest we keep the Protein Ontology for now, we can test how Reach works after these additions and see if we need to remove or tweak anything further.

MihaiSurdeanu commented 3 years ago

@JakeWolfe : any update on this? I'd like to release soon. Thank you!

JakeWolfe commented 3 years ago

@MihaiSurdeanu I will get to this by the end of tonight.

JakeWolfe commented 3 years ago

Testing in reach after generating the new NER files and publishing locally is showing failed unit tests in TestEntities, specifically:

"ADAMTS18/ClvPrd is a protein fragment": (2 was not equal to 1) in reference to the number of mentions found "CCL3L/ClvPrd is a protein fragment": (2 was not equal to 1) in reference to the number of mentions found

It seems to be splitting those mentions into two mentions instead of the expected one previously, this can be verified by the MetaInfo of the mention of ADAMTS18 and CCL3L referencing uniprot-proteins.tsv.gz when this was a protein ontology test. This would seem to indicate that there is overlap between the two files.

@MihaiSurdeanu, @bgyori : Please let me know how to proceed

MihaiSurdeanu commented 3 years ago

Thanks @JakeWolfe!

To understand this problem please check and confirm that:

bgyori commented 3 years ago

I think ClvPrd is just a code for "cleavage product" used in the Protein Ontology and doesn't really have any "real" significance when it comes to recognizing entities in text. For instance "ADAMTS18/ClvPrd" is just a synthetic name given to an arbitrary "ADAMTS18 cleavage product".

MihaiSurdeanu commented 3 years ago

Thanks @bgyori !

@JakeWolfe: can you please check why we're identifying ClvPrd as proteins? Which KB contains it? @bgyori: we should remove it from there?

JakeWolfe commented 3 years ago

@MihaiSurdeanu I am now looking into this.

MihaiSurdeanu commented 3 years ago

Thanks!

On Tue, Sep 22, 2020 at 11:00 Jake Wolfe notifications@github.com wrote:

@MihaiSurdeanu https://github.com/MihaiSurdeanu I am now looking into this.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/clulab/bioresources/pull/42#issuecomment-696884846, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TRDBBRBPR2FMCJLOALSHDQ3JANCNFSM4RD7QWGQ .

bgyori commented 3 years ago

Did you run into any issues with the changes? I'm happy to help, just let me know!

MihaiSurdeanu commented 3 years ago

@kwalcock: can you please pick up this thread? In particular, we want to merge this in master, but a couple of test in Reach are failing. To understand this problem, we need to check and confirm that:

Can you please double check this?

kwalcock commented 3 years ago

If case anyone else is listening, here's the written acknowledgement.

MihaiSurdeanu commented 3 years ago

I appreciate it @kwalcock !

MihaiSurdeanu commented 3 years ago

From @JakeWolfe:

ADAMSTS18 was found in Gene_or_gene_product and Gene_or_gene_product_OLD ClvPrd is in neither Gene_or_gene_product and Gene_or_gene_product_OLD

ADAMTS18/ClvPrd is in neither

CCL3L1 and CCL3L3 was found in Gene_or_gene_product and Gene_or_gene_product_OLD, but not CCL3L

CCL3L/ClvPrd is in neither

MihaiSurdeanu commented 3 years ago

@JakeWolfe: if you run the reach shell (the "shell" script), what is the actual output on the two failing sentences? Thanks!

JakeWolfe commented 3 years ago

@MihaiSurdeanu TEXT: ADAMTS18 / ClvPrd is a protein fragment TOKENS: (0,ADAMTS18,NN), (1,and,CC), (2,ClvPrd,NN), (3,is,VBZ), (4,a,DT), (5,protein,NN), (6,fragment,NN) ENTITY LABELS: (ADAMTS18,B-Gene_or_gene_product), (and,O), (ClvPrd,B-Gene_or_gene_product), (is,O), (a,O), (protein,O), (fragment,O)

LEMMAS: adamts18 and clvprd be a protein fragment roots: 6 outgoing: 0: (1,cc) (2,conj_and) 1: 2: 3: 4: 5: 6: (0,nsubj) (2,nsubj) (3,cop) (4,det) (5,compound) incoming: 0: (6,nsubj) 1: (0,cc) 2: (0,conj_and) (6,nsubj) 3: (6,cop) 4: (6,det) 5: (6,compound) 6:

ENTITIES: 2

MENTION TEXT: ADAMTS18 LABELS: List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController) DISPLAY LABEL: Protein

    RULE => ner-gene_or_gene_product-entities
    TYPE => CorefTextBoundMention
    ------------------------------
    GROUNDING: <KBResolution: ADAMTS21, uniprot, Q8TE60, homo sapiens, <IMKBMetaInfo: uniprot, uniprot-proteins.tsv.gz, http://identifiers.org/uniprot/, MIR:00100164, sp=true, f=false, p=true>>

CONTEXT: NONE

MENTION TEXT: ClvPrd LABELS: List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController) DISPLAY LABEL: Protein

    RULE => ner-gene_or_gene_product-entities
    TYPE => CorefTextBoundMention
    ------------------------------
    GROUNDING: <KBResolution: ClvPrd, uaz, UAZ00001, , <IMKBMetaInfo: uaz, , , , sp=false, f=false, p=false>>

CONTEXT: NONE

EVENTS: 0

==================================================

TEXT: CCL3L / ClvPrd is a protein fragment TOKENS: (0,CCL3L,NN), (1,and,CC), (2,ClvPrd,NN), (3,is,VBZ), (4,a,DT), (5,protein,NN), (6,fragment,NN) ENTITY LABELS: (CCL3L,B-Gene_or_gene_product), (and,O), (ClvPrd,B-Gene_or_gene_product), (is,O), (a,O), (protein,O), (fragment,O)

LEMMAS: ccl3l and clvprd be a protein fragment roots: 6 outgoing: 0: (1,cc) (2,conj_and) 1: 2: 3: 4: 5: 6: (0,nsubj) (2,nsubj) (3,cop) (4,det) (5,compound) incoming: 0: (6,nsubj) 1: (0,cc) 2: (0,conj_and) (6,nsubj) 3: (6,cop) 4: (6,det) 5: (6,compound) 6:

ENTITIES: 2

MENTION TEXT: CCL3L LABELS: List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController) DISPLAY LABEL: Protein

    RULE => ner-gene_or_gene_product-entities
    TYPE => CorefTextBoundMention
    ------------------------------
    GROUNDING: <KBResolution: CCL3L, uaz, UAZ00002, , <IMKBMetaInfo: uaz, , , , sp=false, f=false, p=false>>

CONTEXT: NONE

MENTION TEXT: ClvPrd LABELS: List(Gene_or_gene_product, MacroMolecule, Equivalable, BioChemicalEntity, BioEntity, Entity, PossibleController) DISPLAY LABEL: Protein

    RULE => ner-gene_or_gene_product-entities
    TYPE => CorefTextBoundMention
    ------------------------------
    GROUNDING: <KBResolution: ClvPrd, uaz, UAZ00001, , <IMKBMetaInfo: uaz, , , , sp=false, f=false, p=false>>

CONTEXT: NONE

EVENTS: 0

==================================================

kwalcock commented 3 years ago

FWIW, the failing tests that I see when running reach on a locally published version of bioresources from the uniprot_fragments branch are these in TestHyphenedEvents and TestApi:

[info] - should have a positive activation of levels of EM by TFs, TWIST1, SNAIL1, SLUG, ZEB1, FOXC2 and CD45 *** FAILED *** (238 milliseconds)
[info]   false was not true (TestHyphenedEvents.scala:17)

[info] - should return 9 positive activation and 3 phosphorylation results from NXML test *** FAILED *** (10 seconds, 407 milliseconds)
[info]   Vector(org.clulab.reach.mentions.CorefEventMention@ed6620fa, org.clulab.reach.mentions.CorefEventMention@1aa155ea, org.clulab.reach.mentions.CorefEventMention@15fde8d6, org.clulab.reach.mentions.CorefEventMention@f71c0834, org.clulab.reach.mentions.CorefEventMention@32c124b2, org.clulab.reach.mentions.CorefEventMention@efacda5c, org.clulab.reach.mentions.CorefEventMention@31d5ece8) had size 7 instead of expected size 9 (TestApi.scala:124)

In reach it is processors/build.sbt that was changed to use the local copy of bioprocessors. Perhaps I've missed some detail.

MihaiSurdeanu commented 3 years ago

Hmm... @JakeWolfe: can you please list here the steps you used to test this PR?

JakeWolfe commented 3 years ago

@MihaiSurdeanu @kwalcock In the uniprot_fragments branch of bioresources I built the KBs using sbt publishLocal, I went to reach and changed the bioresources version to the snapshot produced by publishLocal, then ran the shell commands. In addition I run the TestEntites unit test.

kwalcock commented 3 years ago

So may things could go wrong. I wonder about "I built the KBs using sbt publishLocal". I didn't rebuild anything in the kb directory that I know of. There are instructions for rebuilding the ner directory using reach and the ner_kb.sh script at https://github.com/clulab/bioresources. I assume that's how the new files in ner got to github. Are those the same as your local version (i.e., are we using the same files)?

bioresources/master is set to version 1.1.34-SNAPSHOT so in the uniprot_fragments I used 1.1.35-SNAPSHOT just in case. That could get messed up.

On one or the other of these projects when testing with Windows, it's important to add -Dfile.encoding=UTF-8. I'm not sure which project it was.

I do notice that some CR/LFs are creeping into our files and hope that's not somehow involved.

JakeWolfe commented 3 years ago

I am running on windows so these tests may have been affected. The NER files in the uniprot_fragments repository is the same as the the one that I am testing on.

How should we move forward with testing?

kwalcock commented 3 years ago

I reran the tests without the -Dfile.encoding=UTF-8 and there was no change. I suppose it could have been important when generating the files in ner.

Tomorrow I'm going to run the tests on a pristine git clone and on Linux and see if it makes a difference. I want to make sure that we're fixing the right problem and not some strange configuration issue. Chances are I can't make the exact same mistakes twice.

MihaiSurdeanu commented 3 years ago

Every time a file in the src/main/resources/org/clulab/reach/kb/ folder changes, we need to rerun the ner_kb.sh script to regenerate the files under .../kb/ner, which is what processors actually uses. Has this been done for this PR?

JakeWolfe commented 3 years ago

Yes, the KBs were updated in push 7900aff

kwalcock commented 3 years ago

This is completely weird. It sure looks like KBgenerator.scala (https://github.com/clulab/reach/blob/master/processors/src/main/scala/org/clulab/processors/bionlp/ner/KBGenerator.scala) expects the knowledge bases to have three columns:

https://github.com/clulab/reach/blob/5777d66448f1ccb737e69bedae5f8f8073e562b7/processors/src/main/scala/org/clulab/processors/bionlp/ner/KBGenerator.scala#L28-L30

https://github.com/clulab/reach/blob/5777d66448f1ccb737e69bedae5f8f8073e562b7/processors/src/main/scala/org/clulab/processors/bionlp/ner/KBGenerator.scala#L151-L152

but the file I see (protein-ontology-fragments.tsv after gunzip) has only two columns:

14-3-3 protein gamma proteolytic cleavage product   PR:000021868
155 kDa platelet multimerin (human) PR:000050084
2K  PR:000036831
2K fragment PR:000036831

This results in exceptions for both Linux and Windows:

17:31:11.709 [run-main-0] INFO  o.c.p.bionlp.BioNLPProcessor - Converting protein-ontology-fragments...
[error] (run-main-0) java.lang.ArrayIndexOutOfBoundsException: 2
[error] java.lang.ArrayIndexOutOfBoundsException: 2
[error]     at org.clulab.processors.bionlp.ner.KBGenerator$.containsValidSpecies(KBGenerator.scala:151)
[error]     at org.clulab.processors.bionlp.ner.KBGenerator$.convertKB(KBGenerator.scala:98)
[error]     at org.clulab.processors.bionlp.ner.KBGenerator$$anonfun$main$2.apply(KBGenerator.scala:53)
[error]     at org.clulab.processors.bionlp.ner.KBGenerator$$anonfun$main$2.apply(KBGenerator.scala:52)
[error]     at scala.collection.immutable.List.foreach(List.scala:392)
[error]     at org.clulab.processors.bionlp.ner.KBGenerator$.main(KBGenerator.scala:52)
[error]     at org.clulab.processors.bionlp.ner.KBGenerator.main(KBGenerator.scala)
[error]     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
[error]     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
[error]     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
[error]     at java.lang.reflect.Method.invoke(Method.java:498)
[error] stack trace is suppressed; run last Compile / bgRunMain for the full output
[error] Nonzero exit code: 1
[error] (Compile / runMain) Nonzero exit code: 1
[error] Total time: 434 s (07:14), completed Oct 8, 2020 5:31:11 PM
sbt:reach-exe> 

It looks like this KBGenerator file was copied in from somewhere else on 4/6/2020 with this commit: https://github.com/clulab/reach/commit/3df3e696e3ce435af0071bca56c899dcf6fd012c, and I don't know what it looked like before.

However, it also looks like this protein-ontology-fragments.tsv.gz was also added recently, with https://github.com/clulab/bioresources/commit/a43b98c4d8dff3c133a1bfbdd2e5ec1d086f56c2 and that it wasn't previously compressed which might imply that it didn't even go through this process. I don't know.

Something is mixed up, maybe me.

kwalcock commented 3 years ago

I get it. There's also a PR on reach, https://github.com/clulab/reach/pull/693, that contains the updated code needed to process the file.

kwalcock commented 3 years ago

For the record, I'm trying first with this proonto branch merged into master in reach. This seems to make a difference because proonto was using procVer 8.1.1 while master had 8.2.1 (and I specified 8.2.2, which didn't make a difference over 8.2.1). The change from 8.1.1 to 8.2.1 affects the tokenization. The number of moving parts is more than optimal.

kwalcock commented 3 years ago

On Linux, I get these errors, which are close to what I got with Windows. The first might be explained by tokenization differences in processors.

[info] TestHyphenedEvents:
[info] The EM-inducing TFs (TWIST1, SNAIL1, SLUG, ZEB1, and FOXC2) in the CD45 - cells were determined using qRT-PCR.
[info] - should have a positive activation of levels of EM by TFs, TWIST1, SNAIL1, SLUG, ZEB1, FOXC2 and CD45 *** FAILED *** (350 milliseconds)
[info]   false was not true (TestHyphenedEvents.scala:17)

This one might just need more time:

[info] TestApiServer:
[info] The class under test
[info] - should return correct JSON version string (268 milliseconds)
[info] - should GET text, default output (28 seconds, 705 milliseconds)
...
[info] - should POST upload nxml, default output *** FAILED *** (1 second, 38 milliseconds)
[info]   Request was neither completed nor rejected within 1 second (DynamicVariable.scala:58)

This is similar to before:

[info] TestApi:
[info] - should return Reach results from a FriesEntry (21 seconds, 978 milliseconds)
...
[info] - should return 9 positive activation and 3 phosphorylation results from NXML test *** FAILED *** (5 seconds, 220 milliseconds)
[info]   Vector(org.clulab.reach.mentions.CorefEventMention@ed6620fa, org.clulab.reach.mentions.CorefEventMention@1aa155ea, org.clulab.reach.mentions.CorefEventMention@15fde8d6, org.clulab.reach.mentions.CorefEventMention@f71c0834, org.clulab.reach.mentions.CorefEventMention@32c124b2, org.clulab.reach.mentions.CorefEventMention@efacda5c, org.clulab.reach.mentions.CorefEventMention@31d5ece8) had size 7 instead of expected size 9 (TestApi.scala:124)

None of these is TestEntities, which notes no problems:

[info] TestEntities:
[info] ReachSystem
[info] - should extract mentions from FriesEntry (1 minute, 23 seconds)
[info] - should extract mentions from text (218 milliseconds)
[info] - should extract mentions from document (10 seconds, 926 milliseconds)
[info] - should not extract mentions from document without id (173 milliseconds)
[info] - should not extract mentions from document without original text (88 milliseconds)
[info] - should extract grounded entities only (82 milliseconds)
[info] - should extract an empty list without entities (102 milliseconds)
[info] It has recently been shown that oncogenic RAS can enhance the apoptotic function of p53 via ASPP1 and ASPP2
[info] - should contain 4 entities (636 milliseconds)
[info] We hypothesized that MEK inhibition activates AKT by inhibiting ERK activity, which blocks an inhibitory threonine phosphorylation on the JM domains of EGFR and HER2, thereby increasing ERBB3 phosphorylation.
[info] - should contain at least 4 entities (2 seconds, 291 milliseconds)
[info] To test this hypothesis, we transiently transfected CHO-KI cells, which do not express ERBB receptors endogenously, with wildtype ERBB3 with either wild-type EGFR or EGFR T669A.
[info] - should contain at least 3 entities (508 milliseconds)
[info] See Figure S31 and Table R15
[info] - should not contain any sites (83 milliseconds)
[info] The K-Ras substrate and mTOR substrates shouldn't be found.
[info] - should not contain any entities (because of substrate constraint) (94 milliseconds)
[info] In some cases, the presence of Ras inhibits autophagy.
[info] - should contain 1 BioProcess entity ("autophagy") (111 milliseconds)
[info] the MEK family
[info] - should contain 1 Family entity for "MEK" even if the entity tag is B-Gene_or_gene_product (73 milliseconds)
[info] the MEK protein family
[info] - should contain 1 Family entity for "MEK" even if the entity tag is B-Gene_or_gene_product (50 milliseconds)
[info] Our model, in which E2-induced SRC-3 phosphorylation occurs in a complex with ER
[info] - should not contain any sites and it should have 1 simple chemical (234 milliseconds)
[info] Ras inhibitor was added to the solution.
[info] - should contain a Simple_chemical and nothing else (104 milliseconds)
[info] Akt inhibitor was added to the solution.
[info] - should contain a Simple_chemical and nothing else (75 milliseconds)
[info] Adenylate cyclase inhibitor was added to the solution.
[info] - should contain a BioProcess and nothing else (86 milliseconds)
[info] Vascular endothelial cell growth inhibitor was added to solution.
[info] - should contain a Gene_or_gene_product and nothing else (160 milliseconds)
[info] p13 BID is a protein fragment
[info] - should contain a p13 BID and nothing else (65 milliseconds)
[info] Abeta is a protein fragment
[info] - should contain a Abeta and nothing else (58 milliseconds)
[info] inactivated P-factor is a protein fragment
[info] - should contain a inactivated P-factor and nothing else (98 milliseconds)
[info] interleukin-1 alpha proteolytic cleavage product is a protein fragment
[info] - should contain a interleukin-1 alpha proteolytic cleavage product and nothing else (107 milliseconds)
[info] Non-structural protein 5 is a protein fragment
[info] - should contain a Non-structural protein 5 and nothing else (80 milliseconds)
[info] preM is a protein fragment
[info] - should contain a preM and nothing else (66 milliseconds)
[info] Run completed in 1 minute, 41 seconds.
[info] Total number of tests run: 26
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 26, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
kwalcock commented 3 years ago

In https://github.com/clulab/reach/pull/693/commits/cb7170848621bae9855e158ee9173f7b1519b9b7, Ben made changes that seem to apply to the very tests that @JakeWolf reports as failing. Jake, did your test include these changes? I don't know why any of this was done, but will ask Ben if necessary. They might explain the different results that Jake and I are seeing. I am using his changes.

JakeWolfe commented 3 years ago

I have been using the current master version of reach. I can rerun these tests on that branch if needed, assuming that both will be pushed at the same time.

kwalcock commented 3 years ago

There are two things changing, but I'm not sure how they will be coordinated myself. Maybe Mihai will weigh in. It would be great to see if you using that branch* would get us to the same results, which is still a couple of tests failing, but with less weirdness.

By that branch I mean proonto merged locally with reach master. which will also get the processors update. That processors update may be causing problems. I also used this merged version to generate the ner files (for the local publish of bioresources) before using it again to do the testing.

So, proonto is headed to master, so it needs to work together with that eventually.

MihaiSurdeanu commented 3 years ago

@kwalcock: I suggest the following sequence of events:

  1. Release bioresources with these latest additions. I think the failing tests are reach or processors issues.
  2. Have a reach branch that includes: (a) both clulab/reach#693 and clulab/reach@cb71708, (b) update to the latest bioresources above.

Then we can work in this branch to understand what is left to be fixed. I am happy to look at the code then. Thank you for looking into this complicated situation!

bgyori commented 3 years ago

Hopefully, I'm not confusing things further by saying that I made a number of changes to the Protein Ontology resource file here in bioresources before it was merged in https://github.com/clulab/bioresources/pull/41, and pushed the commit https://github.com/clulab/reach/pull/693/commits/cb7170848621bae9855e158ee9173f7b1519b9b7 to the PR https://github.com/clulab/reach/pull/693 to adapt the tests to these changes. In addition to the possible issue with the processor versions, it's possible @JakeWolfe didn't use the latest state of the PR branch in reach for testing, causing these failures.

kwalcock commented 3 years ago

There are unpublished branches of bioresources and reach called proonto. These steps can be followed to reproduce the failing tests:

  1. git clone http://github.com/clulab/bioresources
  2. cd bioresources
  3. git checkout proonto
  4. cd ..
  5. git clone http://github.com/clulab/reach
  6. cd reach
  7. git checkout proonto
  8. cd ../bioresources/
  9. ./ner_kb.sh
  10. sbt publishLocal
  11. cd ../reach
  12. vim processors/build.sbt # I corrected a mistake here
  13. <change bioresources dependency to 1.1.34-SNAPSHOT>
  14. sbt main/test
  15. sbt export/test

Fixes can be pushed back to these same branches.

kwalcock commented 3 years ago

BTW these are the errors in TestHyphenedEvents and TestApi, not the ones related to TestEntities.

JakeWolfe commented 3 years ago

@MihaiSurdeanu I am not familiar with that part of reach, how should we approach fixing this?

MihaiSurdeanu commented 3 years ago

I will take over from here. Thanks Keith!

On Mon, Oct 12, 2020 at 23:13 Jake Wolfe notifications@github.com wrote:

@MihaiSurdeanu https://github.com/MihaiSurdeanu I am not familiar with that part of reach, how should we approach fixing this?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/clulab/bioresources/pull/42#issuecomment-707474084, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TXQWGBWTL7SEMO7BXDSKPHWZANCNFSM4RD7QWGQ .

kwalcock commented 3 years ago

Anyone, I do have a bunch of outstanding commits for reach that update sbt and the plugin dependencies to new versions and make changes to the scripts. If they are needed sooner, please let me know, but I didn't want to make the situation more complicated by trying to merge them now.

MihaiSurdeanu commented 3 years ago

Please wait on those. But did you merge Ben's other branch in here?

On Tue, Oct 13, 2020, 5:13 PM Keith Alcock notifications@github.com wrote:

Anyone, I do have a bunch of outstanding commits for reach that update sbt and the plugin dependencies to new versions and make changes to the scripts. If they are needed sooner, please let me know, but I didn't want to make the situation more complicated by trying to merge them now.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/clulab/bioresources/pull/42#issuecomment-708038014, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TTFU7ZBH4U7NHUWYRLSKTGJNANCNFSM4RD7QWGQ .

kwalcock commented 3 years ago

Both Ben's changes to bioresources and the ones to reach are in the proonto branches. The ones to reach are a little suspect because they seem to solve the problems (with things like ClvPrd) by changing the tests, but they are there.

bgyori commented 3 years ago

Note that I took out synonyms containing ClvPrd in bioresources first (this is merged into master) and then updated the corresponding Reach tests to remove checking for these. I'm pretty sure the only issue is that Jake was not using the commit in which I updated these tests.

On Tue, Oct 13, 2020, 7:30 PM Keith Alcock notifications@github.com wrote:

Both Ben's changes to bioresources and the ones to reach are in the proonto branches. The ones to reach are a little suspect because they seem to solve the problems (with things like ClvPrd) by changing the tests, but they are there.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/clulab/bioresources/pull/42#issuecomment-708064223, or unsubscribe https://github.com/notifications/unsubscribe-auth/ABNJ6SFY7KGAFWYKRQ5O3TTSKTPIPANCNFSM4RD7QWGQ .

bgyori commented 3 years ago

To be more specific and hopefully avoid further confusion... Here is the specific commit that makes a change to the script generating the protein-ontology-fragments.tsv.gz file:

https://github.com/clulab/bioresources/pull/41/commits/2e6cb2a1e94ca8cdd4aae46a52b9ee311489fca2#diff-688598b0a184d86ab4292c4b56c1fa02b4704b102f672c14e51c3e8dad69308bR69-R73

    # Remove entries like "YWHAB/ClvPrd", these are not useful
    # synonyms
    if re.match(r'^([^/]+)/(ClvPrd|UnMod|iso:\d+/UnMod)$', synonym):
        return False

As you can see here, I "Remove entries like "YWHAB/ClvPrd". Given this change, I don't see why my change in Reach in https://github.com/clulab/reach/pull/693/commits/cb7170848621bae9855e158ee9173f7b1519b9b7 to replace tests checking for ClvPrd might be considered suspect.

MihaiSurdeanu commented 3 years ago

Also @kwalcock, When is this file generated: src/main/resources/org/clulab/reach/kb/ner/model.ser.gz This is a CompactLexicon that you implemented. But it is not refreshed when runing ner_kb.sh. Do we need it?

MihaiSurdeanu commented 3 years ago

To be more specific and hopefully avoid further confusion... Here is the specific commit that makes a change to the script generating the protein-ontology-fragments.tsv.gz file:

2e6cb2a#diff-688598b0a184d86ab4292c4b56c1fa02b4704b102f672c14e51c3e8dad69308bR69-R73

    # Remove entries like "YWHAB/ClvPrd", these are not useful
    # synonyms
    if re.match(r'^([^/]+)/(ClvPrd|UnMod|iso:\d+/UnMod)$', synonym):
        return False

As you can see here, I "Remove entries like "YWHAB/ClvPrd". Given this change, I don't see why my change in Reach in clulab/reach@cb71708 to replace tests checking for ClvPrd might be considered suspect.

I agree.

MihaiSurdeanu commented 3 years ago

Hey @kwalcock: the shell script in this branch fails with this error:

[info] Running org.clulab.reach.ReachShell
Loading ReachSystem ...
[dynet] random seed: 4052734219
[dynet] allocating memory: 512,512,512,512MB
[dynet] memory allocation done.
19:09:03.452 [run-main-0] INFO  o.c.p.m.DeepLearningPolarityClassifier - Loading saved model SavedLSTM_WideBound_u_tag ...
19:09:04.524 [run-main-0] INFO  o.c.p.m.DeepLearningPolarityClassifier - Loading model finished!
[error] (run-main-0) java.lang.ExceptionInInitializerError
java.lang.ExceptionInInitializerError
    at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile$1.apply(ReachEntityLookup.scala:35)
    at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile$1.apply(ReachEntityLookup.scala:33)
    at scala.Option.map(Option.scala:146)
    at org.clulab.reach.grounding.ReachEntityLookup.org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile(ReachEntityLookup.scala:33)
    at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$1.apply(ReachEntityLookup.scala:80)
    at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$1.apply(ReachEntityLookup.scala:80)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
    at org.clulab.reach.grounding.ReachEntityLookup.<init>(ReachEntityLookup.scala:80)
    at org.clulab.reach.ReachSystem.<init>(ReachSystem.scala:37)
    at org.clulab.reach.ReachShell$.delayedEndpoint$org$clulab$reach$ReachShell$1(ReachShell.scala:27)
    at org.clulab.reach.ReachShell$delayedInit$body.apply(ReachShell.scala:15)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at org.clulab.reach.ReachShell$.main(ReachShell.scala:15)
    at org.clulab.reach.ReachShell.main(ReachShell.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
Caused by: java.io.IOException: Stream closed
    at java.io.BufferedInputStream.getInIfOpen(BufferedInputStream.java:159)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
    at java.util.zip.CheckedInputStream.read(CheckedInputStream.java:59)
    at java.util.zip.GZIPInputStream.readUByte(GZIPInputStream.java:266)
    at java.util.zip.GZIPInputStream.readUShort(GZIPInputStream.java:258)
    at java.util.zip.GZIPInputStream.readHeader(GZIPInputStream.java:164)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:79)
    at java.util.zip.GZIPInputStream.<init>(GZIPInputStream.java:91)
    at org.clulab.reach.grounding.ReachKBUtils$.sourceFromResource(ReachKBUtils.scala:67)
    at org.clulab.reach.grounding.TsvIMKBFactory.org$clulab$reach$grounding$TsvIMKBFactory$$loadFromKBDir(TsvIMKBFactory.scala:45)
    at org.clulab.reach.grounding.TsvIMKBFactory$$anonfun$make$1.apply(TsvIMKBFactory.scala:23)
    at org.clulab.reach.grounding.TsvIMKBFactory$$anonfun$make$1.apply(TsvIMKBFactory.scala:22)
    at scala.Option.foreach(Option.scala:257)
    at org.clulab.reach.grounding.TsvIMKBFactory.make(TsvIMKBFactory.scala:22)
    at org.clulab.reach.grounding.ReachIMKBMentionLookups$.staticProteinFragmentKBML(ReachIMKBMentionLookups.scala:181)
    at org.clulab.reach.grounding.ReachIMKBMentionLookups$.<init>(ReachIMKBMentionLookups.scala:37)
    at org.clulab.reach.grounding.ReachIMKBMentionLookups$.<clinit>(ReachIMKBMentionLookups.scala)
    at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile$1.apply(ReachEntityLookup.scala:35)
    at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile$1.apply(ReachEntityLookup.scala:33)
    at scala.Option.map(Option.scala:146)
    at org.clulab.reach.grounding.ReachEntityLookup.org$clulab$reach$grounding$ReachEntityLookup$$addAdHocFile(ReachEntityLookup.scala:33)
    at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$1.apply(ReachEntityLookup.scala:80)
    at org.clulab.reach.grounding.ReachEntityLookup$$anonfun$1.apply(ReachEntityLookup.scala:80)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
    at scala.collection.Iterator$class.foreach(Iterator.scala:891)
    at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
    at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
    at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
    at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
    at scala.collection.AbstractTraversable.flatMap(Traversable.scala:104)
    at org.clulab.reach.grounding.ReachEntityLookup.<init>(ReachEntityLookup.scala:80)
    at org.clulab.reach.ReachSystem.<init>(ReachSystem.scala:37)
    at org.clulab.reach.ReachShell$.delayedEndpoint$org$clulab$reach$ReachShell$1(ReachShell.scala:27)
    at org.clulab.reach.ReachShell$delayedInit$body.apply(ReachShell.scala:15)
    at scala.Function0$class.apply$mcV$sp(Function0.scala:34)
    at scala.runtime.AbstractFunction0.apply$mcV$sp(AbstractFunction0.scala:12)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.App$$anonfun$main$1.apply(App.scala:76)
    at scala.collection.immutable.List.foreach(List.scala:392)
    at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:35)
    at scala.App$class.main(App.scala:76)
    at org.clulab.reach.ReachShell$.main(ReachShell.scala:15)
    at org.clulab.reach.ReachShell.main(ReachShell.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)

It works in master. Can you please take a look?

kwalcock commented 3 years ago

Are you using proonto on both bioresources and reach? Between git and sbt it is extremely easy for things to get mixed up. It was necessary to use "sbt reload", "sbt clean", and "git status" several times when I experimented and to even remove the .ivy2 directory. "git status" on jenny where I was working shows only the *.gz files having been modified and processors/build.sbt. Plus there would be the SNAPSHOT bioresources in .ivy2 local. I would remove the 1.1.33 version from cache just to be safe.

I think the error message is actually complaining that a file is missing. I got it once with a mismatched pair of bioresources and reach. Not much help, I know.

kwalcock commented 3 years ago

Thanks, @bgyori. I suspect myself more than anyone or any thing else.

MihaiSurdeanu commented 3 years ago

Thanks @kwalcock: I'll start from scratch...

MihaiSurdeanu commented 3 years ago

Thanks @kwalcock: I'll start from scratch...

Thanks! That fixed it.