Closed bgyori closed 4 years ago
Awesome. Looks great @bgyori !
And, yes, I know that the last bullet here is very convoluted... I am thinking of ways to simplify it. I welcome suggestions :)
So I ran ner_kb.sh, technically it shows that all the NER gz files have changed. I assume that that is just a diff of some metadata baked into the gz files and not a diff in the actual TSV files themselves.
This might be caused by my recent changes (improvements hopefully) in the processors tokenizer. Are you using the latest master?
Also, if you want to control what you change, you can edit ner_kb.config and comment out the other KBs.
I believe there is a date stamp on the top of the tsv files that is updated with every run. To be extra sure you can also ungzip all the files and compare the before and after.
I'm almost finished with an automated pipeline to run REACH tests based on modified bioresources. One question came up which might explain errors I'm getting: does REACH strictly require Scala 2.11 (i.e., 2.12 won't work)?
Yes, there is a requirement for scala 2.11, but I don't remember why. @marcovzla, @myedibleenso, @kwalcock : do you remember why?
Okay, here's the actual issue I ran into: as recommended, I am using the latest master of processors
, but unless I'm missing something, I don't think that's fully compatible with the latest reach
.
Namely, in this commit https://github.com/clulab/processors/commit/c2f535d84130d5c007fc9cde553a8a24119cf060
modelscorenlp
is removed, however, REACH is still looking for this as a dependency, see https://github.com/clulab/reach/blob/master/main/build.sbt#L22, and errors when I try to build with the locally published latest processors version. Should I post this as a REACH issue to be resolved? Or should I revert to an older version of processors?
Oh, good catch. We changed the packaging of processors (the lib names are different). This is easy to fix, but we need to release processors 8.0.0 for this. @kwalcock : Ok to release processors with you?
Let's continue with the assumption that Reach will use the latest processors for the bioresources changes.
Let me know how to proceed - is a new release going to be possible? Or is there a workaround? For now, I can continue working on updating resources but can't test the changes end-to-end.
@kwalcock : when do you think we can release processors? As far as I'm concerned, this can be today. And i can do it, if you're too busy.
Meanwhile, @bgyori: you can publish processors master locally (sbt publishLocal), and change the processors version number to the locally published snapshot, so you can test locally.
Ah, I think there may be a misunderstanding: there is actually a code-level incompatibility between the processors master and the reach master branches as they currently stand. I have reproduced this by building and publishing processors locally and then updating the version to the snapshot in the reach build configurations. So in itself, releasing processors won't solve the problem - some of the REACH code will also have to be adapted. My comment above, as far as I understand, points out the main issue which is that REACH is looking for modelscorenlp
which processors doesn't build anymore.
Sorry, I forgot about this! I'll fix this today.
Hi @bgyori: this is almost fixed (see below why I say almost here). To test your changes:
@kwalcock : I think we should wait with releasing processors 8.0.0 until after Ben integrates his changes into bioresources, so we can upgrade the version number in processors.
Acknowledged. I has almost responded to the previous messages but then saw that there were complications.
Yes, sometimes the path to a resource needs to be absolute and therefore requires the added "/". I don't quite understand when this is necessary. It may depend on where the resource is located relative to the class that is asking to load it.
On Mon, Mar 2, 2020 at 9:31 PM Mihai Surdeanu notifications@github.com wrote:
Hi @bgyori https://github.com/bgyori: this is almost fixed (see below why I say almost here). To test your changes:
- checkout the very latest processors master; run "sbt +publishLocal" (the + is important to cross-compile for scala 2.11)
- in reach, checkout the branch "proc800", which has been adjusted to work with the latest processors. When you run the tests, you'll see that 2 currently fail: TestPolarity and TestCoreference. Please ignore these for now, while we look at them.
@kwalcock https://github.com/kwalcock : I think we should wait with releasing processors 8.0.0 until after Ben integrates his changes into bioresources, so we can upgrade the version number in processors.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/clulab/bioresources/pull/24?email_source=notifications&email_token=ACCHCORWOGVMAMTRLBFT3HDRFSB2XA5CNFSM4K5R4YX2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENSBQVI#issuecomment-593762389, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACCHCOVGOIIWQ56EQK4VNTTRFSB2XANCNFSM4K5R4YXQ .
@kwalcock: FYI, I think the older getResourceAsStream requires the "/". The newer (since 2.12) fromResource does not.
@bgyori: all tests should now pass in this branch, using processors 8.0.0. Please use it to test your changes to bioresources.
I'm now picking this up again and working on the latest processors master and the proc800 branch of reach. I've now gotten back to my earlier issue with Scala 2.11 vs 2.12. The container I'm working in has Scala 2.11.12 installed. When I do publishLocal on processors, it produces folders like .ivy2/local/org.clulab/processors-corenlp_2.12
, and when I then try to do publishLocal in reach, I get errors like
[error] (main/*:update) sbt.ResolveException: unresolved dependency: org.clulab#processors-main_2.11;8.0.0-SNAPSHOT: not found
[error] unresolved dependency: org.clulab#processors-corenlp_2.11;8.0.0-SNAPSHOT: not found
Notice that reach is looking for 2.11 here. Do I need to change a setting somewhere to make them compatible?
Sorry, I seem to have missed the comment on +publishLocal
. I'm looking into that now.
Yeah, that's it.
On Wed, Mar 4, 2020 at 20:19 Benjamin M. Gyori notifications@github.com wrote:
Sorry, I seem to have missed the comment on +publishLocal. I'm looking into that now.
— You are receiving this because you commented.
Reply to this email directly, view it on GitHub https://github.com/clulab/bioresources/pull/24?email_source=notifications&email_token=AAI75TQJO4CSGGEF7AJCFHDRF4K5XA5CNFSM4K5R4YX2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEN3RRUY#issuecomment-595007699, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TV3V6GIPTIGUGPLYMLRF4K5XANCNFSM4K5R4YXQ .
There seems to be one test that fails, it looks like because one more entry matches compared to the number expected before:
[info] ProteinKBL resolveBySpecies
[info] - should work with alternate lookups *** FAILED *** (19 milliseconds)
[info] ArrayBuffer(<KBResolution: F42G4.3, uniprot, Q9U3F4, caenorhabditis elegans, <IMKBMetaInfo: uniprot, uniprot-proteins.tsv.gz, http://identifiers.org/uniprot/, MIR:00100164, sp=true, f=false, p=true>>, <KBResolution: zyx-1, uniprot, Q9U3F4, caenorhabditis elegans, <IMKBMetaInfo: uniprot, uniprot-proteins.tsv.gz, http://identifiers.org/uniprot/, MIR:00100164, sp=true, f=false, p=true>>, <KBResolution: Zyxin, uniprot, Q9U3F4, caenorhabditis elegans, <IMKBMetaInfo: uniprot, uniprot-proteins.tsv.gz, http://identifiers.org/uniprot/, MIR:00100164, sp=true, f=false, p=true>>) had size 3 instead of expected size 2 (TestProteinResolutions.scala:287)
It seems to find one more species for that string? Do you know what is that?
I'm not totally sure what function being tested (IMKBLookup.resolveBySpecies
) is expected to do but my interpretation is that the protein Q9U3F4 (https://www.uniprot.org/uniprot/Q9U3F4) now simply has 3 synonyms available compared to 2 before. I confirmed this at the level of the old and new uniprot-proteins.tsv files, as follows.
Old file:
zyx-1 Q9U3F4 Caenorhabditis elegans
Zyxin Q9U3F4 Caenorhabditis elegans
New file:
F42G4.3 Q9U3F4 Caenorhabditis elegans
zyx-1 Q9U3F4 Caenorhabditis elegans
Zyxin Q9U3F4 Caenorhabditis elegans
Thanks for checking! Then it's fine to adjust the condition in the test, so it passes.
Okay, I pushed a branch called biores_uniprot_update
in the reach repo with the test change. Though I suppose that shouldn't be merged until there's a new release of bioresources (and processors probably). From what I can tell, this branch should be ready to go otherwise.
I think this branch can be merged, what do you think @MihaiSurdeanu? As noted above, there is a corresponding open branch in the reach repo which, when a new release of bioresources is out, can be merged to fix the test condition that changed.
This is a first attempt to update one of the key resources (uniprot-proteins.tsv.gz) and add a script to make the update process reproducible. I tried to match all the conventions in the old file as closely as possible and the diff (at the level of tsv files, not gz) looked reasonable. Couple of questions: