Integrate UniProt fragments

bgyori commented 3 years ago

This PR extends the update_uniprot_proteins.py script to download and process protein chains and peptides into grounding entries. The approach taken here is to put these into the existing uniprot-proteins.tsv file with IDs formatted as [UniProtID]#[FragmentID] which is now "officially" supported by UniProt. Examples:

Angiotensin-2   P01019#PRO_0000032458   Homo sapiens
Angiotensin-2   P11859#PRO_0000032462   Mus musculus

This adds a total of around 50k new rows to the grounding file.
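
To illustrate the ID scheme described above, here is a minimal sketch of how a consumer might split such an ID back into its parts. The helper names (`GroundingId`, `parse_grounding_id`, `parse_row`) are hypothetical and not part of update_uniprot_proteins.py, and the three-column, tab-separated layout (name, ID, species) is an assumption based on the examples.

```python
from typing import NamedTuple, Optional, Tuple


class GroundingId(NamedTuple):
    """A UniProt grounding ID, optionally qualified by a fragment ID."""
    uniprot_id: str
    fragment_id: Optional[str]


def parse_grounding_id(raw: str) -> GroundingId:
    """Split '[UniProtID]#[FragmentID]' into its parts.

    Plain protein IDs without a '#' are returned with fragment_id=None.
    """
    base, sep, fragment = raw.strip().partition("#")
    if not base or (sep and not fragment):
        raise ValueError(f"malformed grounding ID: {raw!r}")
    return GroundingId(base, fragment or None)


def parse_row(line: str) -> Tuple[str, GroundingId, str]:
    """Parse one row, assuming the columns are name, ID, species (tab-separated)."""
    name, raw_id, species = line.rstrip("\n").split("\t")
    return name, parse_grounding_id(raw_id), species
```

With this, `parse_row("Angiotensin-2\tP01019#PRO_0000032458\tHomo sapiens")` yields the protein ID `P01019` and the fragment ID `PRO_0000032458` separately, while existing rows without a `#` keep working unchanged.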

This PR does not yet touch the NER files, and I have not yet written any tests or tried this against Reach. @JakeWolfe and @MihaiSurdeanu would you be able to pick this up from here?

MihaiSurdeanu commented 3 years ago

@bgyori: the failing test passes when I replace "EM" with a known protein such as "KRas". So, "EM" is no longer recognized as a GGP in this branch. Is this on purpose? Thanks!

kwalcock commented 3 years ago

It looks like model.ser.gz should be refreshed by ner_kb.sh with:

# generate the serialized LexiconNER model now
sbt 'runMain org.clulab.processors.bionlp.ner.KBLoader ../bioresources/src/main/resources/org/clulab/reach/kb/ner/model.ser.gz'

I think I've noticed it sometimes not being refreshed and hope it was because the files used to build it hadn't changed. That it doesn't change might be a sign of something wrong. Here's what I think happens:

The first time reach is called, it depends on the published bioresources 1.1.33. This seems to result in an unchanged model.ser.gz. After that, reach is updated to depend on bioresources 1.1.34-SNAPSHOT. If ner_kb.sh is called again, the file model.ser.gz will change. It is probably this version that should be published for real. I'm assuming that another round doesn't result in any more changes, but I'm not yet sure that's the case. Perhaps something can be done about the circle.
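
One way to check whether a given round actually changed the file is to compare a digest of its contents before and after running ner_kb.sh. The following is only a sketch: the function name `gz_payload_digest` is hypothetical, and it hashes the decompressed payload rather than the raw file, because a gzip header can carry a modification time that differs between otherwise identical builds.

```python
import gzip
import hashlib


def gz_payload_digest(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of the decompressed contents of a .gz file."""
    digest = hashlib.sha256()
    with gzip.open(path, "rb") as stream:
        # Read in chunks so a large serialized model is never fully in memory.
        for chunk in iter(lambda: stream.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Recording `gz_payload_digest(".../model.ser.gz")` before each round and comparing it afterwards would show whether the second round (against 1.1.34-SNAPSHOT) changes the model and whether a third round is stable.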

bgyori commented 3 years ago

> @bgyori: the failing test passes when I replace "EM" with a known protein such as "KRas". So, "EM" is no longer recognized as a GGP in this branch. Is this on purpose? Thanks!

I looked into this, and found that "EM" was previously incorrectly grounded to PUBCHEM:6426949, and is one of a group of two-letter acronyms that are listed by CHEBI and PubChem as synonyms for pairs of amino acids. Since these are virtually never correct as synonyms for the purposes of text mining, I removed them in https://github.com/clulab/bioresources/pull/36.
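
For readers who want a concrete picture of that kind of filter, here is a rough sketch. It is not the code from the linked PR: the function name, the restriction to the 20 standard one-letter amino acid codes, and the `NAMESPACE:ID` layout (modeled on PUBCHEM:6426949) are all assumptions made for illustration.

```python
# One-letter codes of the 20 standard amino acids.
AMINO_ACID_CODES = frozenset("ACDEFGHIKLMNPQRSTVWY")

# Namespaces assumed to list dipeptide acronyms as synonyms.
DIPEPTIDE_NAMESPACES = frozenset({"CHEBI", "PUBCHEM"})


def is_dipeptide_acronym(synonym: str, grounding_id: str) -> bool:
    """Return True for two-letter synonyms that spell a pair of amino acids.

    Example: "EM" (Glu-Met) grounded to a PubChem or ChEBI entry.
    """
    namespace = grounding_id.partition(":")[0].upper()
    return (
        len(synonym) == 2
        and synonym.isupper()
        and set(synonym) <= AMINO_ACID_CODES
        and namespace in DIPEPTIDE_NAMESPACES
    )
```

Under these assumptions, `is_dipeptide_acronym("EM", "PUBCHEM:6426949")` is true, while a name like "KRas" or a two-letter synonym from another namespace is left alone.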

MihaiSurdeanu commented 3 years ago

Thanks! Then I think we can finally merge these branches into their respective masters. @kwalcock: can you please do the honors?

Thank you @bgyori, @kwalcock, and @JakeWolfe for your help with this thorny branch!

kwalcock commented 3 years ago

I haven't noticed any updates getting as far as github that address the failing tests.

MihaiSurdeanu commented 3 years ago

The update is in the proonto branch of Reach. This bioresources branch is fine as is.

kwalcock commented 3 years ago

My bad. Thanks.

kwalcock commented 3 years ago

I'm planning to merge this even though I'm not completely sure the .gz files will be the right ones in the end. They were created using the reach branch proonto, but we'd probably rather have the current reach master create them. In addition, I need to make some changes to files in both projects. I doubt that everything can be done in a single commit anyway. This particular master won't be published, however, until bioresources and reach are synchronized. I will edit the CHANGES file when we're ready to publish. I shouldn't be too long.

MihaiSurdeanu commented 3 years ago

I think the .gz files are fine, as nothing changed in the generation part of the code. Agree with everything else!

kwalcock commented 3 years ago

> Also @kwalcock, when is this file generated: src/main/resources/org/clulab/reach/kb/ner/model.ser.gz? This is a CompactLexicon that you implemented, but it is not refreshed when running ner_kb.sh. Do we need it?

I too wonder whether we need it and would like to get rid of it. It is a serialized object, so it depends on the Java and Scala versions, even though it is saved in bioresources, which is independent of those versions. The code that does the serialization lives in reach, and the object that is serialized is in processors. The chances that these all line up so that it is a usable resource seem very slim. I added a test for the file, and it will fail if the Scala version is changed. I don't know whether anything actually manages to use it, though, and so might break without it.

It seems like the file Gene_or_gene_produce-OLD.tsv.gz is something better for github history than for maven and should be deleted.

MihaiSurdeanu commented 3 years ago

Agree, let's remove both?

clulab / bioresources

Integrate UniProt fragments #42