geneontology / minerva

BSD 3-Clause "New" or "Revised" License

Investigate Scov2 polyprotein listed as a species in Noctua #486

Open vanaukenk opened 2 years ago

vanaukenk commented 2 years ago

During the QC checks for bringing Noctua up after the 2022-05-26 outage, I noticed a suspicious entry, pp1ab Scov2, in the list of species:

image

I thought pp1ab was a polyprotein and that's how it looks in noctua-amigo:

image

@balhoff @tmushayahama - can you take a look to see why this entry is included as a species? Thanks.

Also tagging @kltm

balhoff commented 2 years ago

@tmushayahama how is that list created? (What service does it call to get it?) 'pp1ab Scov2' does not look like a taxon, at least in the latest NEO file.

vanaukenk commented 2 years ago

@balhoff, @tmushayahama uses the taxon API from minerva (/taxa).

kltm commented 2 years ago

As a hint, noting that the /taxa API is returning: { id: "http://identifiers.org/uniprot/P0DTD1", label: "pp1ab Scov2" }

kltm commented 2 years ago

Noting that this is found in neo.obo:

[Term]
id: UniProtKB:P0DTD1-PRO_0000449619
name: nsp1 Scov2
synonym: "nsp1" BROAD []
synonym: "P0DTD1-PRO_0000449619" RELATED []
synonym: "protein" RELATED []
is_a: CHEBI:33695
relationship: has_gene_template PR:000050270%7CUniProtKB%3AP0DTD1-PRO_0000449635%7CPRO_0000449635
relationship: in_taxon UniProtKB:P0DTD1 ! pp1ab Scov2
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/GeneProduct
property_value: https://w3id.org/biolink/vocab/category https://w3id.org/biolink/vocab/MacromolecularMachine

I don't believe in_taxon is supposed to work like that.
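
One way to look for other stanzas with the same problem, as a minimal sketch: it assumes that every in_taxon target in neo.obo should be an NCBITaxon CURIE (an assumption, not something confirmed here), and the function name `suspicious_in_taxon` is made up for illustration.

```python
import re

# Matches lines such as: relationship: in_taxon UniProtKB:P0DTD1 ! pp1ab Scov2
IN_TAXON = re.compile(r"^relationship:\s+in_taxon\s+(\S+)")


def suspicious_in_taxon(obo_text, expected_prefix="NCBITaxon:"):
    """Return (term_id, target) pairs whose in_taxon target lacks expected_prefix.

    Assumption: in_taxon targets are expected to be NCBITaxon CURIEs.
    """
    hits = []
    term_id = None
    for raw in obo_text.splitlines():
        line = raw.strip()
        if line.startswith("["):
            # A new stanza begins; forget the previous term id.
            term_id = None
        elif line.startswith("id: "):
            term_id = line[len("id: "):]
        else:
            match = IN_TAXON.match(line)
            if match and not match.group(1).startswith(expected_prefix):
                hits.append((term_id, match.group(1)))
    return hits


stanza = """[Term]
id: UniProtKB:P0DTD1-PRO_0000449619
name: nsp1 Scov2
relationship: in_taxon UniProtKB:P0DTD1 ! pp1ab Scov2
"""
print(suspicious_in_taxon(stanza))
```

Run over the full file, this would list every term whose in_taxon points at something other than a taxon, which might show whether nsp1 is the only affected entry.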

kltm commented 2 years ago

Origin seems to be here: https://raw.githubusercontent.com/Knowledge-Graph-Hub/kg-covid-19/master/curated/ORFs/uniprot_sars-cov-2.gpi

kltm commented 2 years ago

It looks like the taxon column is off by one for GPI 1.2?

UniProtKB   P0DTD1-PRO_0000449619   nsp1    Host translation inhibitor nsp1|P0DTD1(1-180)|rep/Clv:nsp1 (SARS2)|PRO_0000449619|nsp1 (SARS2)|UniProtKB:P0DTD1, 1-180|leader protein (SARS2)|UniProtKB:P0DTC1, 1-180|non-structural protein 1 (SARS2)|nsp-1|ns1|ns-1|host translation inhibitor nsp1|Severe acute respiratory syndrome (SARS) coronavirus nonstructural protein 1  protein taxon:2697049   UniProtKB:P0DTD1    PR:000050270|UniProtKB:P0DTD1-PRO_0000449635|PRO_0000449635

http://geneontology.org/docs/gene-product-information-gpi-format/
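
A possible way to check the column hypothesis, as a minimal sketch: it assumes the file is tab-separated and that GPI 1.2 puts the taxon in column 7, per the format page linked above. The helper name `check_taxon_column` and the shortened sample row are illustrative only, and the tab positions in the sample are inferred from the whitespace runs in the pasted line.

```python
# GPI 1.2 column order, as assumed from the format documentation.
GPI_1_2_COLUMNS = [
    "DB", "DB_Object_ID", "DB_Object_Symbol", "DB_Object_Name",
    "DB_Object_Synonym", "DB_Object_Type", "Taxon",
    "Parent_Object_ID", "DB_Xref", "Properties",
]
TAXON_INDEX = GPI_1_2_COLUMNS.index("Taxon")  # zero-based 6, i.e. column 7


def check_taxon_column(line):
    """Return (ok, value_in_taxon_column, index_where_a_taxon_was_found)."""
    fields = line.rstrip("\n").split("\t")
    value = fields[TAXON_INDEX] if len(fields) > TAXON_INDEX else ""
    found = next(
        (i for i, field in enumerate(fields) if field.startswith("taxon:")),
        None,
    )
    return value.startswith("taxon:"), value, found


# Shortened version of the row above (synonym list abbreviated).
sample = "\t".join([
    "UniProtKB",
    "P0DTD1-PRO_0000449619",
    "nsp1",
    "Host translation inhibitor nsp1|nsp-1",
    "protein",
    "taxon:2697049",
    "UniProtKB:P0DTD1",
    "PR:000050270|UniProtKB:P0DTD1-PRO_0000449635|PRO_0000449635",
])
print(check_taxon_column(sample))  # (False, 'UniProtKB:P0DTD1', 5)
```

If the tab positions are as assumed, the taxon sits one column early, so a GPI 1.2 reader would pick up UniProtKB:P0DTD1 as the taxon, which matches the in_taxon value seen in neo.obo.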

kltm commented 2 years ago

Related to https://github.com/geneontology/go-site/issues/1431

balhoff commented 2 years ago

@kltm it seems like you found the problem. But in the neo.owl I downloaded yesterday I saw in_taxon NCBITaxon:2697049. I wonder why the discrepancy?

kltm commented 2 years ago

@balhoff Yeah, there's some stuff I'm not sure about here, especially as that file has not been touched in years, so I'm not sure why it's a problem now. I'm tagging upstream contributors @cmungall and @justaddcoffee to confirm format for GPI 1.2.

kltm commented 2 years ago

Per @cmungall, we can go ahead and manually fix this file ourselves upstream.

kltm commented 2 years ago

Working branch at https://github.com/Knowledge-Graph-Hub/kg-covid-19/tree/issue_geneontology_minerva-486-fix-sars-cov-2-gpi

kltm commented 2 years ago

If we understand this correctly, this should be fixed in the next NEO release.

kltm commented 2 years ago

Hm. Apparently not. It is still appearing on the Noctua landing page.