pgaudet closed this issue 2 years ago
Full URL is ftp://ftp.ebi.ac.uk/pub/contrib/goa/uniprot_reviewed.gpi.gz
See my latest comments in https://github.com/geneontology/go-site/issues/1431
I think loading the reviewed file for SARS-CoV-2 is a bad idea, as we lose the important proteins that do the actual work.
I suspect this problem would remain for other viruses too, I have no idea how we would do useful annotation of them without entries for the polyproteins.
We have fixed the problem for SARS2 with my curated file. However, if we are serious about doing other viruses that have similar genomes then I think we need to programmatically extract the correct entries. This would be a project:
> I suspect this problem would remain for other viruses too,
@pmasson55 says that this is not typical for all viruses. With Patrick we should look at which viruses need this special processing.
Thanks Pascale
Hi All, I was talking with Peter D'Eustachio about this and have two comments that hopefully will be of use.
Lots of viruses that humans care about (infect humans) use the polyprotein strategy, there are a number of papers out there, mostly written from a drug targeting protease point of view. Here are two examples:
There is another class of viral polyprotein that you wouldn't really call a polyprotein, but that operates in the same way, or at least has the same "subfragment" problem. Influenza is an example of this, where host proteases are used to activate viral proteins. In fact this latter mechanism, cleavage of the HA protein intracellularly, is one of the things that made the 1918 influenza virus so pathogenic. These generally use host proteases.
Hi All,
Concerning Swiss-Prot viral entries, I would say this concerns about 10% of the total viral entries (about 1,500 out of approximately 15,000). They are not as complex as the SARS-CoV-2 entries: most of the time there is only one polyprotein, not a long and a short version of the same polyprotein. So I think that if we can handle protein processing (being able to annotate chains inside polyproteins), we cover 99% of the viruses.
Okay, picking up work from #77 here, where there are a few more details. Noting that the working branch is now: https://github.com/geneontology/neo/tree/issue-82-add-all-reviewed .
The current blocking issue is that, while we were hoping a drop-in replacement would work, there is some issue with the owltools solr loader that is preventing the load from completing. Essentially, after somewhere between ~500k and ~1m documents loaded, we get an error like:
[2022-02-04T23:45:34.383Z] 2022-02-04 23:45:29,869 INFO (FlexSolrDocumentLoader:47) Processed 1000 flex ontology docs at 674000 and committing...
[2022-02-04T23:45:37.612Z] 2022-02-04 23:45:36,895 INFO (FlexCollection:253) Loaded: 675000 of 1520950, elapsed: 2:23:28.058, eta: 2:49:11.400
[2022-02-04T23:45:37.612Z] 2022-02-04 23:45:36,896 INFO (FlexSolrDocumentLoader:47) Processed 1000 flex ontology docs at 675000 and committing...
[2022-02-04T23:46:33.662Z] Exception in thread "main" org.apache.solr.common.SolrException: [was class java.io.IOException] java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
java.lang.RuntimeException: [was class java.io.IOException] java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
	at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
	at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
	at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
	at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
	at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:315)
	at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
	at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
	at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:58)
	at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
	at org.apache.solr.core.SolrCore.execute(SolrCore.java:1376)
	at org.apa

[2022-02-04T23:46:33.662Z] [was class java.io.IOException] java.util.concurrent.TimeoutException: Idle timeout expired: 30000/30000 ms
(same RuntimeException and stack trace repeated)

[2022-02-04T23:46:33.662Z] request: http://localhost:8080/solr/update?wt=javabin&version=2
[2022-02-04T23:46:33.662Z] at org.apache.solr.client.solrj.impl.CommonsHttpSolrServer.request(CommonsHttpSolrServer.java:427)
After running this several times, the error usually occurs between two and three hours into what should be an approximately five-hour load, given the number of documents. Note that these initial numbers are from #77, where the full number of documents would have been 1520942 (compared to our current load of 1168920 documents).
Given that we know solr can typically handle many more documents (in the main GO pipeline) and is being loaded in batches anyway, it feels unlikely to me that it is solr choking directly. I suspect there is some kind of memory-handling issue or an incorrectly passed parameter to the owltools loader that eventually causes memory thrashing and then the error. As a next step, I'll rerun this and note memory and disk usage as it approaches the limit. Even if the problem is not in owltools directly, this should give us information about where to look next.
Talking to @pgaudet we'll be asking upstream to filter out the sars-cov-2 entries.
Okay, I've managed to spend a little time with this and have some observations:
All told (unless I just happened to be stupendously lucky this time), I think that the issue is that owltools can do one or the other with the memory given, but will eventually thrash out if it tries to do both. I think the most expedient next steps would be:
Okay, I'm trying to just add in again the uniprot_reviewed to what we have (bumping ecocyc out for the moment). With that, we're still having problems like we've had before (i.e. #80 ) with:
15:31:31 Exception in thread "main" org.semanticweb.owlapi.model.OWLOntologyStorageException: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(UniProtKB:Q8IUB2 id( UniProtKB:Q8IUB2)synonym( WFDC3 BROAD)xref( Ensembl:ENSG00000124116)xref( HGNC:HGNC:15957)synonym( WFDC3 RELATED)synonym( WAP14 RELATED)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/Protein)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine)xref( EMBL:AL050348)synonym( Q8IUB2 RELATED)synonym( NP_542181.1 RELATED)synonym( 15957 RELATED)synonym( AL050348 RELATED)name( WFDC3 Hsap)xref( RefSeq:NP_542181.1)synonym( ENSG00000124116 RELATED)xref( HGNC:15957)name( WFDC3 NCBITaxon:9606)synonym( HGNC:15957 RELATED)is_a( CHEBI:36080)relationship( in_taxon NCBITaxon:9606))
15:31:31 at org.semanticweb.owlapi.oboformat.OBOFormatRenderer.render(OBOFormatRenderer.java:90)
15:31:31 at org.semanticweb.owlapi.oboformat.OBOFormatStorer.storeOntology(OBOFormatStorer.java:42)
15:31:31 at org.semanticweb.owlapi.util.AbstractOWLStorer.storeOntology(AbstractOWLStorer.java:155)
15:31:31 at org.semanticweb.owlapi.util.AbstractOWLStorer.storeOntology(AbstractOWLStorer.java:119)
15:31:31 at uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.saveOntology(OWLOntologyManagerImpl.java:1525)
15:31:31 at uk.ac.manchester.cs.owl.owlapi.OWLOntologyManagerImpl.saveOntology(OWLOntologyManagerImpl.java:1502)
15:31:31 at owltools.io.ParserWrapper.saveOWL(ParserWrapper.java:289)
15:31:31 at owltools.io.ParserWrapper.saveOWL(ParserWrapper.java:209)
15:31:31 at owltools.cli.CommandRunner.runSingleIteration(CommandRunner.java:3712)
15:31:31 at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:76)
15:31:31 at owltools.cli.CommandRunnerBase.run(CommandRunnerBase.java:68)
15:31:31 at owltools.cli.CommandLineInterface.main(CommandLineInterface.java:12)
15:31:31 Caused by: org.obolibrary.oboformat.model.FrameStructureException: multiple name tags not allowed. in frame:Frame(UniProtKB:Q8IUB2 id( UniProtKB:Q8IUB2)synonym( WFDC3 BROAD)xref( Ensembl:ENSG00000124116)xref( HGNC:HGNC:15957)synonym( WFDC3 RELATED)synonym( WAP14 RELATED)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/Protein)property_value( https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine)xref( EMBL:AL050348)synonym( Q8IUB2 RELATED)synonym( NP_542181.1 RELATED)synonym( 15957 RELATED)synonym( AL050348 RELATED)name( WFDC3 Hsap)xref( RefSeq:NP_542181.1)synonym( ENSG00000124116 RELATED)xref( HGNC:15957)name( WFDC3 NCBITaxon:9606)synonym( HGNC:15957 RELATED)is_a( CHEBI:36080)relationship( in_taxon NCBITaxon:9606))
15:31:31 at org.obolibrary.oboformat.model.Frame.checkMaxOneCardinality(Frame.java:424)
15:31:31 at org.obolibrary.oboformat.model.Frame.check(Frame.java:405)
15:31:31 at org.obolibrary.oboformat.model.OBODoc.check(OBODoc.java:390)
15:31:31 at org.obolibrary.oboformat.writer.OBOFormatWriter.write(OBOFormatWriter.java:183)
15:31:31 at org.semanticweb.owlapi.oboformat.OBOFormatRenderer.render(OBOFormatRenderer.java:88)
15:31:31 ... 11 more
15:31:32 Makefile:30: recipe for target 'neo.obo' failed
15:31:32 make: *** [neo.obo] Error 1
Taking a look at the files:
bbop@wok:/var/lib/jenkins/workspace/peline_issue-neo-82-all-reviewed/neo/mirror$ zgrep Q8IUB2 *.gz
goa_human.gpi.gz:UniProtKB Q8IUB2 WFDC3 WAP four-disulfide core domain protein 3 WFDC3|WAP14 protein taxon:9606 HGNC:15957 db_subset=Swiss-Prot
goa_human_isoform.gpi.gz:UniProtKB F2Z2G4 WFDC3 WAP four-disulfide core domain protein 3 WFDC3 protein taxon:9606 UniProtKB:Q8IUB2 HGNC:15957 db_subset=TrEMBL
goa_human_isoform.gpi.gz:UniProtKB F2Z2G5 WFDC3 WAP domain-containing protein WFDC3 protein taxon:9606 UniProtKB:Q8IUB2 HGNC:15957 db_subset=TrEMBL
goa_human_isoform.gpi.gz:UniProtKB H0Y2V5 WFDC3 WAP four-disulfide core domain protein 3 WFDC3 protein taxon:9606 UniProtKB:Q8IUB2 HGNC:15957 db_subset=TrEMBL
uniprot_reviewed.gpi.gz:UniProtKB Q8IUB2 WFDC3 WAP four-disulfide core domain protein 3 WFDC3|WAP14 protein taxon:9606 EMBL:AL050348|RefSeq:NP_542181.1|HGNC:HGNC:15957|Ensembl:ENSG00000124116 db_subset=Swiss-Prot|taxon_name=Homo sapiens|taxon_common_name=Human|proteome=gcrpCan
@balhoff I'm betting there will be a lot of collisions like this and getting them on a one-by-one basis will take a long time. Is there a way to just have these clobber or skip, or do we need to write a filter script to take care of these up front?
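If we do end up writing an up-front filter script, one minimal sketch is below: merge GPI files in priority order and let the first file win for each (DB, DB_Object_ID) key, which is the "skip" behaviour asked about above. The function name and the inline sample lines are hypothetical; it assumes tab-separated GPI with the DB and object ID in the first two columns.

```python
def dedupe_gpi_lines(lines):
    """Keep only the first GPI line seen for each (DB, DB_Object_ID) key.

    Later duplicates (e.g. the same UniProtKB accession appearing in both
    goa_human.gpi and uniprot_reviewed.gpi) are skipped, i.e. the first
    file in the merge order wins.
    """
    seen = set()
    out = []
    for line in lines:
        if line.startswith("!"):          # GPI comment/header lines pass through
            out.append(line)
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 2:
            continue
        key = (cols[0], cols[1])          # e.g. ("UniProtKB", "Q8IUB2")
        if key in seen:
            continue
        seen.add(key)
        out.append(line)
    return out

# Hypothetical two-file merge: goa_human lines first, then uniprot_reviewed,
# so the GOA entry for Q8IUB2 wins and the uniprot_reviewed duplicate is dropped.
sample = [
    "UniProtKB\tQ8IUB2\tWFDC3\t...\n",        # from goa_human.gpi
    "UniProtKB\tP12345\tXYZ\t...\n",          # from uniprot_reviewed.gpi
    "UniProtKB\tQ8IUB2\tWFDC3\t...extra\n",   # duplicate from uniprot_reviewed.gpi
]
merged = dedupe_gpi_lines(sample)
```

This would run before the owltools conversion, so the "multiple name tags" frame error never arises.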
I suggest making a new issue for this and coordinating with Alex
For the goa_human vs goa_human_isoform issue:
The UniProt files are a bit different from the rest. The GPI specs are, AFAIK, silent on how a set of GPs should be partitioned across files, but I would strongly recommend making it a requirement that uniqueness be guaranteed for GPIs loaded into NEO. For UniProt this means
EITHER
My preference would be for 2.
I suggest a uniprot-specific one-line script up front that reports and filters any line in goa_X_isoform whose col2 does not match \w+-\d+
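That filter could be sketched as follows (in Python rather than a shell one-liner, for readability); the function name and sample lines are illustrative, and col2 here means the second tab-separated GPI column:

```python
import re

# An isoform accession looks like Q8IUB2-2; the \w+-\d+ pattern suggested above.
ISOFORM_ID = re.compile(r"^\w+-\d+$")

def filter_isoform_lines(lines, report=None):
    """Pass through goa_X_isoform GPI lines whose col2 looks like a real
    isoform accession; report (and drop) anything else, e.g. TrEMBL
    entries like F2Z2G4 that collide with canonical accessions."""
    kept = []
    for line in lines:
        if line.startswith("!"):
            kept.append(line)
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > 1 and ISOFORM_ID.match(cols[1]):
            kept.append(line)
        elif report is not None:
            report.append(line)   # surfaced for inspection, not loaded
    return kept

# Hypothetical sample based on the zgrep output above:
sample = [
    "UniProtKB\tQ8IUB2-2\tWFDC3\t...\n",   # proper isoform id, kept
    "UniProtKB\tF2Z2G4\tWFDC3\t...\n",     # no -N suffix, reported and dropped
]
bad = []
kept = filter_isoform_lines(sample, report=bad)
```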
For uniprot_reviewed, I think the easiest thing is to filter out any already-covered taxon
Apparently a lot of overlap in the first pass with species we already have:
567013 /tmp/uniprot_reviewed.gpi
388714 /tmp/naively_filtered_file.gpi
Will bolt this in and see if there are any collisions left.
Break up the pipeline command from "make clean all" into "make clean" and "make all" to get around an ordering issue.
touch trigger
wget http://s3.amazonaws.com/go-build/metadata/datasets.json -O datasets.json && touch datasets.json
--2022-04-05 15:33:57-- http://s3.amazonaws.com/go-build/metadata/datasets.json
Resolving s3.amazonaws.com (s3.amazonaws.com)... 52.217.48.62
Connecting to s3.amazonaws.com (s3.amazonaws.com)|52.217.48.62|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 81874 (80K) [application/json]
Saving to: ‘datasets.json’
datasets.json 100%[===================>] 79.96K 406KB/s in 0.2s
2022-04-05 15:33:58 (406 KB/s) - ‘datasets.json’ saved [81874/81874]
./build-neo-makefile.py -i datasets.json > Makefile-gafs.tmp && mv Makefile-gafs.tmp Makefile-gafs
rm trigger datasets.json mirror/*gz target/*.obo || echo "not all files present, perhaps last build did not complete"
Okay, I think we're getting a little further along with the collisions. Added an additional manual filter list to pick up the things that are "manual" in the Makefile (not datasets.json). Temporary; seeing if that can get us through the owltools conversion.
@pgaudet @vanaukenk Okay, we have had some success with the new NEO load with more entities. The formula for this is, similar to how we handle things in the main pipeline:
(all currently loaded files: sgd pombase mgi zfin rgd dictybase fb tair wb goa_human goa_human_complex goa_human_rna goa_human_isoform goa_pig xenbase pseudocap ecocyc goa_sars-cov-2)
+
(uniprot_reviewed - (lines with taxa represented in what we currently load above))
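The set difference in that formula can be sketched as follows. This is a minimal illustration, assuming the taxon lives in the seventh tab-separated GPI column (as in the lines quoted earlier); the function names and mini-files are hypothetical:

```python
TAXON_COL = 6  # 0-based index of the taxon:NNNN column in these GPI files (an assumption)

def taxa_in(lines):
    """Collect the taxon IDs present in an already-loaded GPI file."""
    taxa = set()
    for line in lines:
        if line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) > TAXON_COL:
            taxa.add(cols[TAXON_COL])
    return taxa

def subtract_covered_taxa(reviewed_lines, covered):
    """uniprot_reviewed minus lines whose taxon is already represented."""
    return [l for l in reviewed_lines
            if l.startswith("!")
            or (len(l.split("\t")) > TAXON_COL
                and l.split("\t")[TAXON_COL] not in covered)]

# Hypothetical mini-files standing in for goa_human.gpi and uniprot_reviewed.gpi:
goa_human = ["UniProtKB\tQ8IUB2\tWFDC3\t\t\tprotein\ttaxon:9606\t\t\t\n"]
reviewed = [
    "UniProtKB\tQ8IUB2\tWFDC3\t\t\tprotein\ttaxon:9606\t\t\t\n",  # human: covered, dropped
    "UniProtKB\tP0DTC2\tS\t\t\tprotein\ttaxon:2697049\t\t\t\n",   # taxon not covered here, kept
]
kept = subtract_covered_taxa(reviewed, taxa_in(goa_human))
```

In the real pipeline the covered set would be the union of taxa over all of the currently loaded files listed above.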
To see how this looks, I've put it on to amigo-staging: https://amigo-staging.geneontology.io/amigo/search/ontology
The load we currently have, for comparison, is here: http://noctua-amigo.berkeleybop.org/amigo/search/ontology
Thanks for the update @kltm Is the goal to ultimately have a four-letter abbreviation for each of the taxa? Some still just show the NCBITaxon id. (I searched on sod1 as an example).
I don't understand where these links go; did you want to show entities? I don't know how to get to entities from there.
@vanaukenk Are we going to make our own 4-letter taxa public? Should we not show something more standard?
@vanaukenk My understanding for the moment was that we were going to start out initially with the taxon id and then iterate from there.
@pgaudet Those links go to the two NEO loads, as seen through the AmiGO ontology interface; one for the newer load we're experimenting with and one for the current load. Remember to remove the "GO" filter to see all the entities available.
Shout out to @cmungall for finding this. In the newest NEO load (and maybe some of these are in the older one too), at the bottom is a list of kinds of entities that were not correctly converted to CURIEs (1,350,337 in total). Some of those are probably not practically important, as nobody would be curating to them, but some seem important:
http://purl.obolibrary.org/obo/AGI_LocusCode_XYZ : 28986
http://identifiers.org/wormbase/XYZ : 152
http://identifiers.org/uniprot/XYZ : 49
http://purl.bioontology.org/ontology/provisional/XYZ : 17
http://identifiers.org/mgi/MGI:XYZ : 4
Samples from the complete list:
alters_location_of
anastomoses_with
anteriorly_connected_to
attached_to
channel_for
channels_from
...
synapsed_by
Tmp_new_group
transitively_anteriorly_connected_to
...
transitively_proximally_connected_to
trunk_part_of
TS01
...
TS28
xunion_of
http://identifiers.org/mgi/MGI:106910
http://identifiers.org/uniprot/A0A5F9CQZ0
http://identifiers.org/wormbase/B0035.8%7CWB%3AF54E12.4%7CWB%3AF55G1.3%7CWB%3AH02I12.6
http://purl.bioontology.org/ontology/provisional/1ddd2e2d-2ace-4c87-8ec6-d3b5730b3e7c
http://purl.obolibrary.org/obo/D96882F1-8709-49AB-BCA9-772A67EA6C33
http://semanticscience.org/resource/SIO_000658
http://www.geneontology.org/formats/oboInOwl#Subset
http://www.w3.org/2002/07/owl#topObjectProperty
http://xmlns.com/foaf/0.1/image
@balhoff @cmungall Is this something where owltools needs a different CURIE map? A post filter? Or is this better handled by circling back to #83?
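For the post-filter option, one shape it could take is below: contract known IRI patterns to CURIEs with a prefix map and report anything no prefix matches. The prefix map entries here are illustrative guesses, not owltools' actual map:

```python
# Illustrative prefix map (not owltools' real one): longest-prefix entries
# keyed by IRI prefix, mapped to the CURIE prefix to emit.
PREFIX_MAP = {
    "http://identifiers.org/mgi/MGI:": "MGI:",
    "http://identifiers.org/uniprot/": "UniProtKB:",
    "http://identifiers.org/wormbase/": "WB:",
    "http://purl.obolibrary.org/obo/AGI_LocusCode_": "AGI_LocusCode:",
}

def contract(iri):
    """Return (curie, None) on success, or (None, iri) so that unmatched
    IRIs can be collected into a report like the list above."""
    for prefix, curie_prefix in PREFIX_MAP.items():
        if iri.startswith(prefix):
            return curie_prefix + iri[len(prefix):], None
    return None, iri

curie, leftover = contract("http://identifiers.org/mgi/MGI:106910")
# curie == "MGI:106910"
```

Running every entity IRI through this and counting the leftovers would reproduce the per-pattern tallies above and show which prefixes still need map entries.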
Now have https://github.com/geneontology/go-annotation/issues/4105 and #88 to trace entities. For QC: #89
From managers' discussion, this is now live.
Hi @kltm
The 'ultimate' goal is to have all Swiss-Prot (reviewed) entries. The file is in the same GOA FTP directory; it's called uniprot_reviewed.gpi.gz
The bacteria and viruses file was to test a smaller set, but we'll need everything. The full uniprot_reviewed.gpi.gz is about double the size of uniprot_reviewed_virus_bacteria.gpi.gz.
Thanks, Pascale