Load all Swiss-Prot entries in NEO

pgaudet commented 2 years ago

Hi @kltm

The 'ultimate' goal is to have all Swiss-Prot (reviewed) entries. The file is in the same GOA ftp, it's called uniprot_reviewed.gpi.gz

The bacteria and viruses file was to test a smaller set, but we'll need everything. This file is about double the size of uniprot_reviewed_virus_bacteria.gpi.gz.

Thanks, Pascale

pgaudet commented 2 years ago

Full URL is ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed.gpi.gz

cmungall commented 2 years ago

See my latest comments in https://github.com/geneontology/go-site/issues/1431

I think loading the reviewed file for SARS-CoV-2 is a bad idea as we lose the important proteins that do important work

I suspect this problem would remain for other viruses too, I have no idea how we would do useful annotation of them without entries for the polyproteins.

We have fixed the problem for SARS2 with my curated file. However, if we are serious about doing other viruses that have similar genomes then I think we need to programmatically extract the correct entries. This would be a project:

write a python script that takes a GPI file that has polyprotein entries (PRO IDs) and takes the longest protein for each bona-fide pp
(optional) map the interpro function predictions to the polyprotein level

pgaudet commented 2 years ago

I suspect this problem would remain for other viruses too,

@pmasson55 says that this is not typical for all viruses. With Patrick we should look at which viruses need this special processing.

Thanks Pascale

gillespm commented 2 years ago

Hi All, I was talking with Peter D'eustachio about this and have two comments that hopefully will be of use.

Lots of viruses that humans care about (infect humans) use the polyprotein strategy, there are a number of papers out there, mostly written from a drug targeting protease point of view. Here are two examples:
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7150265/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3660988/
There is another class of viral polyprotein that you wouldn't really call a polyprotein, but operates in the same way or at least has the same "subfragment" problem. Influenza is an example of this, where host proteases are used to activate viral proteins. In fact this later mechanism, cleavage of the HA protein intracellulary is one of the things that made the 1918 influenza virus so pathogenic. These generally use host proteases.

pmasson55 commented 2 years ago

Hi All,

Concerning SwissProt viral entries, I would say it concerns about 10% of the total viral entries ( about 1500 out of 15 000 approximately). They are not as complex as SARS-COV-2 entries. Most of the time there is only one polyprotein and not a long and a short version of the same polyprotein. So I think that if we can handle protein processing (being able to annotate chains inside polyproteins) I guess we cover 99% of the viruses....

kltm commented 2 years ago

Okay, picking up work from #77 here, where there are a few more details. Noting that the working branch is now: https://github.com/geneontology/neo/tree/issue-82-add-all-reviewed .

The current blocking issue is that while we were hoping to have a drop-in replacement work, there is some issue with the owltools solr loader that is preventing a load completion. Essentially, after somewhere between \~500k-\~1m documents loaded, we get an error like:

[2022-02-04T23:45:34.383Z] 2022-02-04 23:45:29,869 INFO  (FlexSolrDocumentLoader:47) Processed 1000 flex ontology docs at 674000 and committing...
[2022-02-04T23:45:37.612Z] 2022-02-04 23:45:36,895 INFO  (FlexCollection:253) Loaded: 675000 of 1520950, elapsed: 2:23:28.058, eta: 2:49:11.400
[2022-02-04T23:45:37.612Z] 2022-02-04 23:45:36,896 INFO  (FlexSolrDocumentLoader:47) Processed 1000 flex ontology docs at 675000 and committing...
[2022-02-04T23:46:33.662Z] Exception in thread "main" org.apache.solr.common.SolrException: [was class java.io.IOException] java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms??java.lang.RuntimeException: [was class java.io.IOException] java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms?   at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)?    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)?    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)?  at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)?   at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)?   at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)? at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)?   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)?    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)?   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)?   at org.apa
[2022-02-04T23:46:33.662Z]
[2022-02-04T23:46:33.662Z] [was class java.io.IOException] java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms??java.lang.RuntimeException: [was class java.io.IOException] java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms?    at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)?    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)?    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)?  at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)?   at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)?   at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)? at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)?   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)?    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)?   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)?   at org.apa
[2022-02-04T23:46:33.662Z]
[2022-02-04T23:46:33.662Z] request: http://localhost:8080/solr/update?wt=javabin&version=2
[2022-02-04T23:46:33.662Z]  at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:427)

After running this several times, the error occurs usually between two and three hours in to what should be an approximately five hour load, given the number of documents. Note that these initial numbers are from #77, where the full number of documents would have been 1520942 (compared to our current load of 1168920 documents).

Given that we know that solr can typically handle many more documents (in the main go pipeline) and is being loaded in batch anyways, it feels to me unlikely that it is solr choking out directly. I suspect that there is some kind of memory handling issue or incorrectly passed parameter to the owltools loader that eventually causes memory thrashing and then the error. As a next step, I'll rerun this and make note of memory and disk usage as it approaches the limit. If it is not in owltools directly, this should still give us information about where to look next.

kltm commented 2 years ago

Talking to @pgaudet we'll be asking upstream to filter out the sars-cov-2 entries.

kltm commented 2 years ago

Okay, I'm managed to spend a little time with this and have some observations:

I actually managed to load the entire reviewed file when it was the only thing I loaded.
owltools seems to be the weak link, with memory
- peaking at 211G, over 192G on CLI (don't know what it "normally" looks like)
- solr never got over 82G, over 128G on CLI
- loading it took about same amount of time as the rest: ~3hrs vs ~3hrs+
- total number of entities about the same

All told (unless I just happened to be stupendously lucky this time), I think that the issue is that owltools can do one or the other with the memory given, but will eventually thrash out if it tries to do both. I think the most expedient next steps would be:

[ ] try it with a full load but with more memory
[ ] see how hard it would be to set it up to load: first the reviewed, breathe, then the rest
worst case, a new docker image that just them separately (although I'd rather keep this in the pipeline if at all possible and not hide this weirdness layers down)

kltm commented 2 years ago

Okay, I'm trying to just add in again the uniprot_reviewed to what we have (bumping ecocyc out for the moment). With that, we're still having problems like we've had before (i.e. #80 ) with:

15:31:31  Exception in thread "main" org.semanticweb.owlapi.model.OWLOntologyStorageException: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(UniProtKB:Q8IUB2 id( UniProtKB:Q8IUB2)synonym( WFDC3 BROAD)xref( Ensembl:ENSG00000124116)xref( HGNC:HGNC:15957)synonym( WFDC3 RELATED)synonym( WAP14 RELATED)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/Protein)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine)xref( EMBL:AL050348)synonym( Q8IUB2 RELATED)synonym( NP_542181.1 RELATED)synonym( 15957 RELATED)synonym( AL050348 RELATED)name( WFDC3 Hsap)xref( RefSeq:NP_542181.1)synonym( ENSG00000124116 RELATED)xref( HGNC:15957)name( WFDC3 NCBITaxon:9606)synonym( HGNC:15957 RELATED)is_a( CHEBI:36080)relationship( in_taxon NCBITaxon:9606))
15:31:31    at org.semanticweb.owlapi.oboformat.OBOFormatRenderer.render(OBOFormatRenderer.java:90)
15:31:31    at org.semanticweb.owlapi.oboformat.OBOFormatStorer.storeOntology(OBOFormatStorer.java:42)
15:31:31    at org.semanticweb.owlapi.util.AbstractOWLStorer.storeOntology(AbstractOWLStorer.java:155)
15:31:31    at org.semanticweb.owlapi.util.AbstractOWLStorer.storeOntology(AbstractOWLStorer.java:119)
15:31:31    at uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.saveOntology(OWLOntologyManagerImpl.java:1525)
15:31:31    at uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.saveOntology(OWLOntologyManagerImpl.java:1502)
15:31:31    at owltools.io.ParserWrapper.saveOWL(ParserWrapper.java:289)
15:31:31    at owltools.io.ParserWrapper.saveOWL(ParserWrapper.java:209)
15:31:31    at owltools.cli.CommandRunner.runSingleIteration(CommandRunner.java:3712)
15:31:31    at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:76)
15:31:31    at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:68)
15:31:31    at owltools.cli.CommandLineInterface.main(CommandLineInterface.java:12)
15:31:31  Caused by: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(UniProtKB:Q8IUB2 id( UniProtKB:Q8IUB2)synonym( WFDC3 BROAD)xref( Ensembl:ENSG00000124116)xref( HGNC:HGNC:15957)synonym( WFDC3 RELATED)synonym( WAP14 RELATED)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/Protein)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine)xref( EMBL:AL050348)synonym( Q8IUB2 RELATED)synonym( NP_542181.1 RELATED)synonym( 15957 RELATED)synonym( AL050348 RELATED)name( WFDC3 Hsap)xref( RefSeq:NP_542181.1)synonym( ENSG00000124116 RELATED)xref( HGNC:15957)name( WFDC3 NCBITaxon:9606)synonym( HGNC:15957 RELATED)is_a( CHEBI:36080)relationship( in_taxon NCBITaxon:9606))
15:31:31    at org.obolibrary.oboformat.model.Frame.checkMaxOneCardinality(Frame.java:424)
15:31:31    at org.obolibrary.oboformat.model.Frame.check(Frame.java:405)
15:31:31    at org.obolibrary.oboformat.model.OBODoc.check(OBODoc.java:390)
15:31:31    at org.obolibrary.oboformat.writer.OBOFormatWriter.write(OBOFormatWriter.java:183)
15:31:31    at org.semanticweb.owlapi.oboformat.OBOFormatRenderer.render(OBOFormatRenderer.java:88)
15:31:31    ... 11 more
15:31:32  Makefile:30: recipe for target 'neo.obo' failed
15:31:32  make: *** [neo.obo] Error 1

Taking a look at the files:

bbop@wok:/var/lib/jenkins/workspace/peline_issue-neo-82-all-reviewed/neo/mirror$ zgrep Q8IUB2 *.gz
goa_human.gpi.gz:UniProtKB  Q8IUB2  WFDC3   WAP four-disulfide core domain protein 3    WFDC3|WAP14 protein taxon:9606      HGNC:15957  db_subset=Swiss-Prot
goa_human_isoform.gpi.gz:UniProtKB  F2Z2G4  WFDC3   WAP four-disulfide core domain protein 3    WFDC3   protein taxon:9606  UniProtKB:Q8IUB2    HGNC:15957  db_subset=TrEMBL
goa_human_isoform.gpi.gz:UniProtKB  F2Z2G5  WFDC3   WAP domain-containing protein   WFDC3   protein taxon:9606  UniProtKB:Q8IUB2    HGNC:15957  db_subset=TrEMBL
goa_human_isoform.gpi.gz:UniProtKB  H0Y2V5  WFDC3   WAP four-disulfide core domain protein 3    WFDC3   protein taxon:9606  UniProtKB:Q8IUB2    HGNC:15957  db_subset=TrEMBL
uniprot_reviewed.gpi.gz:UniProtKB   Q8IUB2  WFDC3   WAP four-disulfide core domain protein 3    WFDC3|WAP14 protein taxon:9606      EMBL:AL050348|RefSeq:NP_542181.1|HGNC:HGNC:15957|Ensembl:ENSG00000124116    db_subset=Swiss-Prot|taxon_name=Homo sapiens|taxon_common_name=Human|proteome=gcrpCan

@balhoff I'm betting there will be a lot of collisions like this and getting them on a one-by-one basis will take a long time. Is there a way to just have these clobber or skip, or do we need to write a filter script to take care of these up front?

cmungall commented 2 years ago

I suggest making a new issue for this and coordinating with Alex

For the goa_human vs goa_human_isoform issue:

the uniprot files are a bit different from the rest, the GPI specs are AFAIK silent on the matter of how a set of GPs should be partitioned across files, but I would strongly recommend making it a requirement that for GPIs loaded into Neo that uniqueness should be guaranteed. For uniprot this means

EITHER

goa_X_isoform includes BOTH isoforms AND all reference entities
goa_X_isoform includes ONLY isoforms AND no reference entities

My preference would for 2

I suggest a uniprot-specific one-line script up front that reports and filters any line in goa_X_isoform that does not follow \w+\-\d+ in col2

For uniprot_reviewed, I think the easiest thing is to filter out any already-covered taxon

kltm commented 2 years ago

Apparently a lot of overlap in the first pass with species we already have:

   567013 /tmp/uniprot_reviewed.gpi
   388714 /tmp/naively_filtered_file.gpi

Will bolt this in and see if there are any collisions left.

kltm commented 2 years ago

Breakup pipeline command from make clean all to make clean and make all to get around an ordering issue.

touch trigger
wget http://s3.amazonaws.com/go-build/metadata/datasets.json -O datasets.json && touch datasets.json
--2022-04-05 15:33:57--  http://s3.amazonaws.com/go-build/metadata/datasets.json
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.48.62
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.48.62|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81874 (80K) [application/json]
Saving to: ‘datasets.json’

datasets.json       100%[===================>]  79.96K   406KB/s    in 0.2s    

2022-04-05 15:33:58 (406 KB/s) - ‘datasets.json’ saved [81874/81874]

./build-neo-makefile.py -i datasets.json > Makefile-gafs.tmp && mv Makefile-gafs.tmp Makefile-gafs
rm trigger datasets.json mirror/*gz target/*.obo || echo "not all files present, perhaps last build did not complete"

kltm commented 2 years ago

Okay, I think we're getting a little further along with the collisions. Added an additional manual filter list to pick up the things that are "manual" in the Makefile (not datasets.json). Temporary; seeing if that can get us through the owltools conversion.

kltm commented 2 years ago

@pgaudet @vanaukenk Okay, we have had some success with the new NEO load with more entities. The formula for this is, similar to how we handle things in the main pipeline:

(all currently loaded files: sgd pombase mgi zfin rgd dictybase fb tair wb goa_human goa_human_complex goa_human_rna goa_human_isoform goa_pig xenbase pseudocap ecocyc goa_sars-cov-2)
+
(uniprot_reviewed - (lines with taxa represented in what we currently load above))

To see how this looks, I've put it on to amigo-staging: https://amigo-staging.geneontology.io/amigo/search/ontology

The load we currently have, for comparison, is here: http://noctua-amigo.berkeleybop.org/amigo/search/ontology

vanaukenk commented 2 years ago

Thanks for the update @kltm Is the goal to ultimately have a four-letter abbreviation for each of the taxa? Some still just show the NCBITaxon id. (I searched on sod1 as an example).

pgaudet commented 2 years ago

I dont understand where these links go - did you want to show entities? I dont know how to get to entities from there.

pgaudet commented 2 years ago

@vanaukenk Are we going to make our own 4-letter taxa public? Should we not show something more standard?

kltm commented 2 years ago

@vanaukenk My understanding for the moment was that we were going to start out initially with the taxon id and then iterate from there.

@pgaudet Those links go to the two NEO loads, as seen through the AmiGO ontology interface; one for the newer load we're experimenting with and one for the current load. Remember to remove the "GO" filter to see all the entities available.

kltm commented 2 years ago

Shout out to @cmungall for finding this. In the newest NEO load (and maybe some of these are in the older one), at the bottom is a list of kinds of entities that were not correctly converted to CURIEs--1350337 in total. Some of those are probably not practically important as nobody would be curating to them, but some seem important:

http://purl.obolibrary.org/obo/AGI_LocusCode_XYZ: 28986 http://identifiers.org/wormbase/XYZ : 152 http://identifiers.org/uniprot/XYZ : 49 http://purl.bioontology.org/ontology/provisional/XYZ : 17 http://identifiers.org/mgi/MGI:XYZ : 4

Samples of complete list:

alters_location_of
anastomoses_with
anteriorly_connected_to
attached_to
channel_for
channels_from
...
synapsed_by
Tmp_new_group
transitively_anteriorly_connected_to
...
transitively_proximally_connected_to
trunk_part_of
TS01
...
TS28
xunion_of
http://identifiers.org/mgi/MGI:106910
http://identifiers.org/uniprot/A0A5F9CQZ0
http://identifiers.org/wormbase/B0035.8%7CWB%3AF54E12.4%7CWB%3AF55G1.3%7CWB%3AH02I12.6
http://purl.bioontology.org/ontology/provisional/1ddd2e2d-2ace-4c87-8ec6-d3b5730b3e7c
http://purl.obolibrary.org/obo/D96882F1-8709-49AB-BCA9-772A67EA6C33
http://semanticscience.org/resource/SIO_000658
http://www.geneontology.org/formats/oboInOwl#Subset
http://www.w3.org/2002/07/owl#topObjectProperty
http://xmlns.com/foaf/0.1/image

@balhoff @cmungall Is this something where owltools needs a different CURIE map? Post filter? Or is this better handed by circling back to #83?

kltm commented 2 years ago

go forward with what we have--there should be no "blockers" for our current use cases
iterate on things like species code; outlier non-compacting identifiers (trace back to source)
add some minimal tests to the project; @pgaudet @vanaukenk, could I get some help with test identifiers?

kltm commented 2 years ago

Now have https://github.com/geneontology/go-annotation/issues/4105 and #88 to trace entities. For QC: #89

kltm commented 2 years ago

From managers' discussion, this is now live.

geneontology / neo

Load all Swiss-Prot entries in NEO #82