clulab / bioresources

Data resources from the biomedical domain
Apache License 2.0
3 stars 1 forks source link

Uniprot update #24

Closed bgyori closed 4 years ago

bgyori commented 4 years ago

This is a first attempt to update one of the key resources (uniprot-proteins.tsv.gz) and add a script to make the update process reproducible. I tried to match all the conventions in the old file as closely as possible and the diff (at the level of tsv files, not gz) looked reasonable. Couple of questions:

MihaiSurdeanu commented 4 years ago

Awesome. Looks great @bgyori !

And, yes, I know that the last bullet here is very convoluted... I am thinking of ways to simplify it. I welcome suggestions :)

bgyori commented 4 years ago

So I ran ner_kb.sh, technically it shows that all the NER gz files have changed. I assume that that is just a diff of some metadata baked into the gz files and not a diff in the actual TSV files themselves.

MihaiSurdeanu commented 4 years ago

This might be caused by my recent changes (improvements hopefully) in the processors tokenizer. Are you using the latest master?

Also, if you want to control what you change, you can edit ner_kb.config and comment out the other KBs.

kwalcock commented 4 years ago

I believe there is a date stamp on the top of the tsv files that is updated with every run. To be extra sure you can also ungzip all the files and compare the before and after.

bgyori commented 4 years ago

I'm almost finished with an automated pipeline to run REACH tests based on modified bioresources. One question came up which might explain errors I'm getting: does REACH strictly require Scala 2.11 (i.e., 2.12 won't work)?

MihaiSurdeanu commented 4 years ago

Yes, there is a requirement for scala 2.11, but I don't remember why. @marcovzla, @myedibleenso, @kwalcock : do you remember why?

bgyori commented 4 years ago

Okay, here's the actual issue I ran into: as recommended, I am using the latest master of processors, but unless I'm missing something, I don't think that's fully compatible with the latest reach.

Namely, in this commit https://github.com/clulab/processors/commit/c2f535d84130d5c007fc9cde553a8a24119cf060 modelscorenlp is removed, however, REACH is still looking for this as a dependency, see https://github.com/clulab/reach/blob/master/main/build.sbt#L22, and errors when I try to build with the locally published latest processors version. Should I post this as a REACH issue to be resolved? Or should I revert to an older version of processors?

MihaiSurdeanu commented 4 years ago

Oh, good catch. We changed the packaging of processors (the lib names are different). This is easy to fix, but we need to release processors 8.0.0 for this. @kwalcock : Ok to release processors with you?

Let's continue with the assumption that Reach will use the latest processors for the bioresources changes.

bgyori commented 4 years ago

Let me know how to proceed - is a new release going to be possible? Or is there a workaround? For now, I can continue working on updating resources but can't test the changes end-to-end.

MihaiSurdeanu commented 4 years ago

@kwalcock : when do you think we can release processors? As far as I'm concerned, this can be today. And i can do it, if you're too busy.

Meanwhile, @bgyori: you can publish processors master locally (sbt publishLocal), and change the processors version number to the locally published snapshot, so you can test locally.

bgyori commented 4 years ago

Ah, I think there may be a misunderstanding: there is actually a code-level incompatibility between the processors master and the reach master branches as they currently stand. I have reproduced this by building and publishing processors locally and then updating the version to the snapshot in the reach build configurations. So in itself, releasing processors won't solve the problem - some of the REACH code will also have to be adapted. My comment above, as far as I understand, points out the main issue which is that REACH is looking for modelscorenlp which processors doesn't build anymore.

MihaiSurdeanu commented 4 years ago

Sorry, I forgot about this! I'll fix this today.

MihaiSurdeanu commented 4 years ago

Hi @bgyori: this is almost fixed (see below why I say almost here). To test your changes:

@kwalcock : I think we should wait with releasing processors 8.0.0 until after Ben integrates his changes into bioresources, so we can upgrade the version number in processors.

kwalcock commented 4 years ago

Acknowledged. I has almost responded to the previous messages but then saw that there were complications.

Yes, sometimes the path to a resource needs to be absolute and therefore requires the added "/". I don't quite understand when this is necessary. It may depend on where the resource is located relative to the class that is asking to load it.

On Mon, Mar 2, 2020 at 9:31 PM Mihai Surdeanu notifications@github.com wrote:

Hi @bgyori https://github.com/bgyori: this is almost fixed (see below why I say almost here). To test your changes:

  • checkout the very latest processors master; run "sbt +publishLocal" (the + is important to cross-compile for scala 2.11)
  • in reach, checkout the branch "proc800", which has been adjusted to work with the latest processors. When you run the tests, you'll see that 2 currently fail: TestPolarity and TestCoreference. Please ignore these for now, while we look at them.

@kwalcock https://github.com/kwalcock : I think we should wait with releasing processors 8.0.0 until after Ben integrates his changes into bioresources, so we can upgrade the version number in processors.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/clulab/bioresources/pull/24?email_source=notifications&email_token=ACCHCORWOGVMAMTRLBFT3HDRFSB2XA5CNFSM4K5R4YX2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOENSBQVI#issuecomment-593762389, or unsubscribe https://github.com/notifications/unsubscribe-auth/ACCHCOVGOIIWQ56EQK4VNTTRFSB2XANCNFSM4K5R4YXQ .

MihaiSurdeanu commented 4 years ago

@kwalcock: FYI, I think the older getResourceAsStream requires the "/". The newer (since 2.12) fromResource does not.

MihaiSurdeanu commented 4 years ago

@bgyori: all tests should now pass in this branch, using processors 8.0.0. Please use it to test your changes to bioresources.

bgyori commented 4 years ago

I'm now picking this up again and working on the latest processors master and the proc800 branch of reach. I've now gotten back to my earlier issue with Scala 2.11 vs 2.12. The container I'm working in has Scala 2.11.12 installed. When I do publishLocal on processors, it produces folders like .ivy2/local/org.clulab/processors-corenlp_2.12, and when I then try to do publishLocal in reach, I get errors like

[error] (main/*:update) sbt.ResolveException: unresolved dependency: org.clulab#processors-main_2.11;8.0.0-SNAPSHOT: not found

[error] unresolved dependency: org.clulab#processors-corenlp_2.11;8.0.0-SNAPSHOT: not found

Notice that reach is looking for 2.11 here. Do I need to change a setting somewhere to make them compatible?

bgyori commented 4 years ago

Sorry, I seem to have missed the comment on +publishLocal. I'm looking into that now.

MihaiSurdeanu commented 4 years ago

Yeah, that's it.

On Wed, Mar 4, 2020 at 20:19 Benjamin M. Gyori notifications@github.com wrote:

Sorry, I seem to have missed the comment on +publishLocal. I'm looking into that now.

— You are receiving this because you commented.

Reply to this email directly, view it on GitHub https://github.com/clulab/bioresources/pull/24?email_source=notifications&email_token=AAI75TQJO4CSGGEF7AJCFHDRF4K5XA5CNFSM4K5R4YX2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEN3RRUY#issuecomment-595007699, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI75TV3V6GIPTIGUGPLYMLRF4K5XANCNFSM4K5R4YXQ .

bgyori commented 4 years ago

There seems to be one test that fails, it looks like because one more entry matches compared to the number expected before:

[info] ProteinKBL resolveBySpecies
[info] - should work with alternate lookups *** FAILED *** (19 milliseconds)
[info]   ArrayBuffer(<KBResolution: F42G4.3, uniprot, Q9U3F4, caenorhabditis elegans, <IMKBMetaInfo: uniprot, uniprot-proteins.tsv.gz, http://identifiers.org/uniprot/, MIR:00100164, sp=true, f=false, p=true>>, <KBResolution: zyx-1, uniprot, Q9U3F4, caenorhabditis elegans, <IMKBMetaInfo: uniprot, uniprot-proteins.tsv.gz, http://identifiers.org/uniprot/, MIR:00100164, sp=true, f=false, p=true>>, <KBResolution: Zyxin, uniprot, Q9U3F4, caenorhabditis elegans, <IMKBMetaInfo: uniprot, uniprot-proteins.tsv.gz, http://identifiers.org/uniprot/, MIR:00100164, sp=true, f=false, p=true>>) had size 3 instead of expected size 2 (TestProteinResolutions.scala:287)

see https://github.com/clulab/reach/blob/master/main/src/test/scala/org/clulab/reach/TestProteinResolutions.scala#L287.

MihaiSurdeanu commented 4 years ago

It seems to find one more species for that string? Do you know what is that?

bgyori commented 4 years ago

I'm not totally sure what function being tested (IMKBLookup.resolveBySpecies) is expected to do but my interpretation is that the protein Q9U3F4 (https://www.uniprot.org/uniprot/Q9U3F4) now simply has 3 synonyms available compared to 2 before. I confirmed this at the level of the old and new uniprot-proteins.tsv files, as follows.

Old file:

zyx-1   Q9U3F4  Caenorhabditis elegans
Zyxin   Q9U3F4  Caenorhabditis elegans

New file:

F42G4.3 Q9U3F4  Caenorhabditis elegans
zyx-1   Q9U3F4  Caenorhabditis elegans
Zyxin   Q9U3F4  Caenorhabditis elegans
MihaiSurdeanu commented 4 years ago

Thanks for checking! Then it's fine to adjust the condition in the test, so it passes.

bgyori commented 4 years ago

Okay, I pushed a branch called biores_uniprot_update in the reach repo with the test change. Though I suppose that shouldn't be merged until there's a new release of bioresources (and processors probably). From what I can tell, this branch should be ready to go otherwise.

bgyori commented 4 years ago

I think this branch can be merged, what do you think @MihaiSurdeanu? As noted above, there is a corresponding open branch in the reach repo which, when a new release of bioresources is out, can be merged to fix the test condition that changed.