geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.
BSD 3-Clause "New" or "Revised" License

GO-CAM taxon IDs are possibly incorrect for 4 models #4248

Closed nmarkari closed 1 year ago

nmarkari commented 2 years ago

The following are all cases of production models where a single GO-CAM has two different taxon IDs assigned to it. Most of them represent processes from viral infections, where one taxon ID is for Homo sapiens and the other is for the virus, but 4 cases look like possible mistakes.

In two cases (62183af000000536, 6205c24300000880), however, Sus scrofa (wild boar) and Homo sapiens are both listed, but the GO-CAM title itself says it is for human, and all the proteins appear to be human gene products based on the Noctua visualization. I looked in the ttl file and the taxon ID for Sus scrofa is indeed included. I'm not sure whether these were errors during curation, whether one of the papers used as evidence used Sus scrofa, or whether the model really was intended to represent both the human and boar versions of the same pathway.

There's also one case where both human and mouse are listed: 60ad85f700000058. I'm not sure about this one.

Lastly, there's an odd case with 5e72450500004019, which represents a COVID pathway but has two taxon IDs presumably representing the virus: one is the ID for the virus, and the other is a UniProt ID for one of the viral proteins that is listed as a taxon ID for some reason. I'm unsure whether that is a way of specifying a specific variant of COVID. Naively, these 4 cases seem like mistakes, but I just wanted to pass this information along to the curation team to take a look! @vanaukenk @pgaudet

pgaudet commented 2 years ago

Hi @nmarkari

Sorry about the long delay in responding to this. I don't see any Sus scrofa protein IDs in 62183af000000536 or 6205c24300000880. Can you check?

Likewise for 60ad85f700000058 - I only see mouse IDs.

For the SARS-CoV-2 pathway, 5e72450500004019, I also don't see which two taxon IDs you are referring to.

Can you please have another look? It'd be great to understand what is not clear, if only to add documentation about this.

Thanks, Pascale

nmarkari commented 2 years ago

Hi @pgaudet

The GO-CAM itself, not the proteins, is assigned to both Homo sapiens and Sus scrofa. See line 46 in the ttl file: https://github.com/geneontology/noctua-models/blob/2ada32d7bfbc6afe8df0821713b1ade01ab7d41e/models/62183af000000536.ttl

<https://w3id.org/biolink/vocab/in_taxon> <http://purl.obolibrary.org/obo/NCBITaxon_9606> , <http://purl.obolibrary.org/obo/NCBITaxon_9823>

Or see the following query on Noctua, which retrieves that model as the 7th result when filtering by "Sus scrofa" for organism: http://noctua.geneontology.org/workbench/noctua-landing-page/?offset=0&limit=50&taxon=NCBITaxon:9823&expand&debug

The other examples I listed are all of the same nature.
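For anyone who wants to check other models the same way, here is a minimal sketch of pulling the model-level taxa out of a ttl file's text. It assumes the `in_taxon` statement and its objects sit on a single line, as in the line quoted above, and it uses a plain regular expression rather than a real Turtle parser, so treat it as an illustration only. The function name `model_taxa` is made up for this example.

```python
import re

# Predicate and object patterns copied from the statement quoted in this thread.
IN_TAXON = "<https://w3id.org/biolink/vocab/in_taxon>"
TAXON_IRI = re.compile(r"<http://purl\.obolibrary\.org/obo/NCBITaxon_(\d+)>")


def model_taxa(ttl_text):
    """Return NCBITaxon CURIEs that appear as objects of in_taxon statements.

    Assumption: each in_taxon statement and its object list fit on one line.
    """
    taxa = set()
    for line in ttl_text.splitlines():
        # Only inspect the part of the line that follows the in_taxon predicate.
        _, found, objects = line.partition(IN_TAXON)
        if found:
            taxa.update("NCBITaxon:" + number for number in TAXON_IRI.findall(objects))
    return taxa


example = (
    "<https://w3id.org/biolink/vocab/in_taxon> "
    "<http://purl.obolibrary.org/obo/NCBITaxon_9606> , "
    "<http://purl.obolibrary.org/obo/NCBITaxon_9823>"
)
print(sorted(model_taxa(example)))
```

A model whose result contains more than one ID is a candidate for the kind of check described in this issue.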

pgaudet commented 2 years ago

Thanks for the clarification @nmarkari. This is really strange: I don't see any pig sequences in that model or anywhere else in the .ttl file, yet that species is incorrectly mentioned in the model.

@kltm or @balhoff Can you look into where that data may be coming from?

Thanks, Pascale

kltm commented 1 year ago

Looking at example 62183af000000536 (http://noctua.geneontology.org/editor/graph/gomodel:62183af000000536), it is marked by two model-level annotations:

https://w3id.org/biolink/vocab/in_taxon: NCBITaxon:9606
https://w3id.org/biolink/vocab/in_taxon: NCBITaxon:9823

This is a hand-created model (https://github.com/geneontology/noctua-models/blob/master/models/62183af000000536.ttl). This annotation has existed in all versions, so it was either added manually by the curator (probably not) or added automatically by Minerva (likely). So I guess the questions are:

1) how did minerva make this mistake
2) how frequent was this mistake
3) how to fix the mistake (fix going forward)
4) how to bulk update to clear the mistake (historical fix)

We'll need feedback from @balhoff here.

kltm commented 1 year ago

http://identifiers.org/uniprot/Q8N6T7 http://identifiers.org/uniprot/Q96EB6

kltm commented 1 year ago

@balhoff Casting a wider net:

sjcarbon@moiraine:~/local/src/git/noctua-models/models[master]$:) grep -o "obo/NCBITaxon_[0-9]*" *.ttl | sort | uniq | cut -d ':' -f 1 | uniq -c | grep -v "1 " | wc -l
223

Sampling these, some are intended (multi-species/gut bacteria); some are internal tests; some are as above; some seem random.
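A rough Python equivalent of that count, in case it is easier to extend for sampling. Like the grep, it counts any `obo/NCBITaxon_` mention anywhere in a file, not only model-level `in_taxon` statements. Unlike the pipeline, it compares the number of distinct taxa directly instead of text-matching the `uniq -c` output, so a file with, say, 11 distinct taxa would not be dropped by the `grep -v "1 "` step. The function names and the demo data are invented for illustration.

```python
import re
from pathlib import Path

# Same idea as the grep pattern, but requiring at least one digit.
TAXON = re.compile(r"obo/NCBITaxon_([0-9]+)")


def taxa_per_model(texts):
    """Map each model name to the set of distinct taxon numbers found in its text."""
    return {name: set(TAXON.findall(text)) for name, text in texts.items()}


def multi_taxon_models(texts):
    """Return sorted names of models that mention more than one distinct taxon."""
    return sorted(name for name, taxa in taxa_per_model(texts).items() if len(taxa) > 1)


def load_models(directory):
    """Read every .ttl file in a directory into a {filename: text} mapping."""
    return {p.name: p.read_text(encoding="utf-8") for p in Path(directory).glob("*.ttl")}


# Tiny in-memory stand-ins for model files; names and contents are made up.
demo = {
    "one_taxon.ttl": "obo/NCBITaxon_9606 ... obo/NCBITaxon_9606",
    "two_taxa.ttl": "obo/NCBITaxon_9606 ... obo/NCBITaxon_9823",
    "no_taxon.ttl": "nothing relevant here",
}
print(multi_taxon_models(demo))
```

Against a checkout, something like `len(multi_taxon_models(load_models("models")))` should give a comparable number, assuming the working directory is the root of a noctua-models checkout.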

balhoff commented 1 year ago

There is some code added by Ben which looks at the in_taxon links in NEO and inserts the taxon annotations when a model is saved:

Have we ever had bugs in the taxon assignments in NEO?

balhoff commented 1 year ago

The way Minerva works, I think that if the wrong version of a gene is ever entered in a model and the model is then saved, its taxon will be added; even if the gene is later corrected, the taxon will never be removed. This could happen without leaving any evidence in the git history. This is just a guess at what could have happened.

kltm commented 1 year ago

@balhoff Hm. Something that is likely to keep happening then. I guess either a one-off cleanup periodically or code that purges, recalculates, and re-adds. I'm not sure what the overhead of something like that would be. As far as I know, we've never had issues with bad taxon assignments.

balhoff commented 1 year ago

I think it's recalculating on every save, so maybe it should just purge existing triples each time.
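To make the difference concrete, here is a toy sketch of the two save behaviours being discussed. This is not Minerva code; the gene names, the `GENE_TAXON` lookup (standing in for the in_taxon links in NEO), and both functions are hypothetical, and the "additive" behaviour is only the guess described above.

```python
# Hypothetical gene-to-taxon lookup standing in for the in_taxon links in NEO.
GENE_TAXON = {
    "human_gene": "NCBITaxon:9606",
    "pig_gene": "NCBITaxon:9823",
}


def save_additive(model_taxa, genes):
    """Add taxa for the current genes but never remove earlier ones."""
    return set(model_taxa) | {GENE_TAXON[g] for g in genes}


def save_purge_and_recompute(model_taxa, genes):
    """Discard the existing taxa and derive them only from the current genes."""
    return {GENE_TAXON[g] for g in genes}


# A curator enters the pig version of a gene by mistake, saves, then corrects it.
additive = save_additive(set(), ["pig_gene"])
additive = save_additive(additive, ["human_gene"])

purged = save_purge_and_recompute(set(), ["pig_gene"])
purged = save_purge_and_recompute(purged, ["human_gene"])

print("additive:", sorted(additive))
print("purge and recompute:", sorted(purged))
```

In the additive case the pig taxon lingers after the correction; in the purge-and-recompute case only the taxon of the gene currently in the model remains.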

kltm commented 1 year ago

Ah, yeah, that sounds like the right path then. We'll still need to do a one-time bulk cleanup, but that makes it easier.

balhoff commented 1 year ago

I made a Minerva issue: https://github.com/geneontology/minerva/issues/503

pgaudet commented 1 year ago

Can we close the issue here then ?

kltm commented 1 year ago

@pgaudet We still need to get a fix into production and purge the current model set.

pgaudet commented 1 year ago

Sorry, I meant, the issue is now in the Minerva tracker, not here?

kltm commented 1 year ago

The fix in Minerva is currently being tested and we'll be talking to @vanaukenk soon. Otherwise, we'll still need to have a plan for updating the bad data, which is potentially separate from anything in Minerva.

kltm commented 1 year ago

@vanaukenk I have a reduced list, looking at "production" and iteratively filtering out:

NCBITaxon:10299
NCBITaxon:2697049 
NCBITaxon:471871
NCBITaxon:301447
NCBITaxon:10254
NCBITaxon:265872
NCBITaxon:83334
NCBITaxon:83333
NCBITaxon:243365

I am down to 81 models for examination (from 224). If you'd like to sample them, I can continue filtering. Let me know the URL you'd like links to resolve to and I can create a list for you on another channel.

vanaukenk commented 1 year ago

@kltm I've looked at the first 25 models on the list of 81 models and the species assignments for those models are all legit, i.e. the taxa listed with the model are all represented in existing annotations.
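For the remaining models, the same "are all listed taxa represented in existing annotations" check could be sketched roughly as below. It assumes that, for each model, both the model-level taxa and the taxa of the annotated gene products have already been collected; the function name, the model names, and the data are placeholders, not real model IDs.

```python
def unsupported_taxa(model_taxa, annotated_taxa):
    """Return model-level taxa that no annotation in the model accounts for."""
    return set(model_taxa) - set(annotated_taxa)


# Invented example data; the model names are placeholders.
models = {
    "model_a": {
        "model_taxa": {"NCBITaxon:9606", "NCBITaxon:9823"},
        "annotated_taxa": {"NCBITaxon:9606"},
    },
    "model_b": {
        "model_taxa": {"NCBITaxon:9606", "NCBITaxon:2697049"},
        "annotated_taxa": {"NCBITaxon:9606", "NCBITaxon:2697049"},
    },
}

for name, info in models.items():
    extra = unsupported_taxa(info["model_taxa"], info["annotated_taxa"])
    status = "ok" if not extra else "check: " + ", ".join(sorted(extra))
    print(name, status)
```

An empty result for a model corresponds to the "legit" case described above; a non-empty one flags a taxon worth a manual look.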

ValWood commented 1 year ago

@vanaukenk close?

ValWood commented 1 year ago

@vanaukenk @pgaudet is this still current?

kltm commented 1 year ago

Okay, in discussion with @pgaudet , it seems not to be a current issue, closing for now. Reopen if it comes up again.