Closed nmarkari closed 1 year ago
Hi @nmarkari
Sorry about the long delay in responding to this. I don't see any Sus scrofa protein IDs in 62183af000000536 or 6205c24300000880. Can you check?
Likewise for 60ad85f700000058 - I only see mouse IDs.
For the SARS-CoV-2 pathway, 5e72450500004019, I also don't see which two taxon IDs you are referring to.
Can you please have another look? It'd be great to understand what is not clear, if only to add documentation about this.
Thanks, Pascale
Hi @pgaudet
The gocam itself, not the proteins, is assigned to both Homo sapiens and Sus scrofa. See line 46 in the ttl file: https://github.com/geneontology/noctua-models/blob/2ada32d7bfbc6afe8df0821713b1ade01ab7d41e/models/62183af000000536.ttl
<https://w3id.org/biolink/vocab/in_taxon> <http://purl.obolibrary.org/obo/NCBITaxon_9606> , <http://purl.obolibrary.org/obo/NCBITaxon_9823>
Or, see the following query on noctua which retrieves that model as the 7th result when filtering by "Sus scrofa" for organism http://noctua.geneontology.org/workbench/noctua-landing-page/?offset=0&limit=50&taxon=NCBITaxon:9823&expand&debug
The other examples I listed are all of the same nature.
Thanks for the clarification, @nmarkari. This is really strange: I don't see any pig sequences in that model or anywhere else in the .ttl file, yet that species is incorrectly listed in the model.
@kltm or @balhoff Can you look into where that data may be coming from?
Thanks, Pascale
Looking at example 62183af000000536 (http://noctua.geneontology.org/editor/graph/gomodel:62183af000000536). It is marked by two model-level annotations:
https://w3id.org/biolink/vocab/in_taxon: NCBITaxon:9606
https://w3id.org/biolink/vocab/in_taxon: NCBITaxon:9823
This is a hand-created model: https://github.com/geneontology/noctua-models/blob/master/models/62183af000000536.ttl. This annotation has existed in all versions, so it was either added manually by the curator (probably not) or added automatically by minerva (likely). So I guess the questions are:
1) how did minerva make this mistake
2) how frequent was this mistake
3) how to fix the mistake (fix going forward)
4) how to bulk update to clear the mistake (historical fix)
We'll need feedback from @balhoff here.
@balhoff Casting a wider net:
sjcarbon@moiraine:~/local/src/git/noctua-models/models[master]$:) grep -o "obo/NCBITaxon_[0-9]*" *.ttl | sort | uniq | cut -d ':' -f 1 | uniq -c | grep -v "1 " | wc -l
223
Sampling these, some are intended (multi-species/gut bacteria); some are internal tests; some are as above; some seem random.
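For anyone who wants to reproduce the count without the shell pipeline, here is a minimal Python sketch of the same idea: collect the distinct `obo/NCBITaxon_<digits>` references in each `.ttl` file and keep the files that mention more than one. The function names are made up for illustration. Like the grep above, it counts any taxon mention in the file, not only model-level `in_taxon` annotations, and it requires at least one digit after `NCBITaxon_`, so the totals may differ slightly from the number reported here.

```python
import re
from pathlib import Path

# Same pattern as the grep above, but requiring at least one digit.
TAXON_RE = re.compile(r"obo/NCBITaxon_([0-9]+)")


def taxa_in_text(ttl_text):
    """Return the set of distinct NCBITaxon numeric IDs mentioned in a TTL string."""
    return set(TAXON_RE.findall(ttl_text))


def multi_taxon_models(models_dir):
    """Map file name -> sorted taxon IDs, for .ttl files that mention more than one taxon."""
    result = {}
    for path in sorted(Path(models_dir).glob("*.ttl")):
        taxa = taxa_in_text(path.read_text(encoding="utf-8"))
        if len(taxa) > 1:
            result[path.name] = sorted(taxa, key=int)
    return result
```

Calling `multi_taxon_models("models")` from a checkout of noctua-models (path assumed) would return a dictionary whose length corresponds to the `wc -l` figure, and whose values show which taxa each model carries.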
There is some code added by Ben which looks at the in_taxon links in NEO and inserts the taxon annotations when a model is saved:
Have we ever had bugs in the taxon assignments in NEO?
The way Minerva works, I think if the wrong version of a gene is ever entered in a model, then saved, its taxon will be added, then even if the gene is corrected, the taxon will never be removed. This could happen without git history evidence. This is just a guess at what could have happened.
@balhoff Hm. Something that is likely to keep happening, then. I guess we need either a periodic cleanup or code that purges, recalculates, and re-adds. I'm not sure what the overhead of something like that would be. As far as I know, we've never had issues with bad taxon assignments.
I think it's recalculating on every save, so maybe it should just purge existing triples each time.
Ah, yeah, that sounds like the right path then. We'll still need to do a one-time bulk cleanup, but that makes it easier.
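To make the proposed behaviour concrete, here is a rough Python sketch of "purge, recalculate, re-add" on save. It is not Minerva's code: the model is reduced to a set of `(subject, predicate, object)` tuples, `gene_to_taxon` is a hypothetical stand-in for the NEO lookup, and all identifiers in the example are invented.

```python
IN_TAXON = "https://w3id.org/biolink/vocab/in_taxon"


def refresh_model_taxa(triples, model_iri, gene_iris, gene_to_taxon):
    """Return a new triple set with model-level in_taxon annotations recomputed.

    Existing in_taxon triples on the model are dropped first, so a taxon that
    came from a gene no longer in the model cannot linger.
    """
    # Purge: drop every existing model-level in_taxon triple.
    kept = {t for t in triples if not (t[0] == model_iri and t[1] == IN_TAXON)}
    # Recalculate: taxa of the genes currently in the model (hypothetical lookup).
    current = {gene_to_taxon[g] for g in gene_iris if g in gene_to_taxon}
    # Re-add: one in_taxon triple per current taxon.
    return kept | {(model_iri, IN_TAXON, taxon) for taxon in current}


# Invented example: a pig gene was entered by mistake, saved, then replaced.
model = "gomodel:example"
lookup = {"gene:human_1": "NCBITaxon:9606", "gene:pig_1": "NCBITaxon:9823"}
saved = {
    (model, IN_TAXON, "NCBITaxon:9606"),
    (model, IN_TAXON, "NCBITaxon:9823"),  # stale: left over from the old gene
    (model, "dc:title", "example model"),
}
refreshed = refresh_model_taxa(saved, model, ["gene:human_1"], lookup)
```

After the refresh, only the human taxon remains on the model and unrelated triples such as the title are untouched. An add-only update would have kept the stale pig annotation.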
I made a Minerva issue: https://github.com/geneontology/minerva/issues/503
Can we close the issue here then?
@pgaudet We still need to get a fix into production and purge the current model set.
Sorry, I meant, the issue is now in the Minerva tracker, not here?
The fix in minerva is currently being tested and we'll be talking to @vanaukenk soon. Otherwise, we'll still need to have a plan for updating the bad data, which is potentially separate from anything in minerva.
@vanaukenk I have a reduced list, looking at "production" and iteratively filtering out:
NCBITaxon:10299
NCBITaxon:2697049
NCBITaxon:471871
NCBITaxon:301447
NCBITaxon:10254
NCBITaxon:265872
NCBITaxon:83334
NCBITaxon:83333
NCBITaxon:243365
I am down to 81 models for examination (from 224). If you'd like to sample them, I can continue filtering. Let me know the URL you'd like links to resolve to and I can create a list for you on another channel.
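The filtering step could look roughly like the following Python sketch. It assumes that "filtering out" means dropping any multi-taxon model that mentions one of the taxa listed above, which is my reading rather than something confirmed in this thread. The input mapping and the model names in the example are hypothetical.

```python
# Taxa filtered out in the comment above.
EXCLUDED_TAXA = frozenset({
    "NCBITaxon:10299", "NCBITaxon:2697049", "NCBITaxon:471871",
    "NCBITaxon:301447", "NCBITaxon:10254", "NCBITaxon:265872",
    "NCBITaxon:83334", "NCBITaxon:83333", "NCBITaxon:243365",
})


def models_to_review(model_taxa, excluded=EXCLUDED_TAXA):
    """Keep models with more than one taxon and none of the excluded taxa.

    model_taxa maps a model ID to an iterable of taxon CURIEs.
    """
    return {
        model: sorted(set(taxa))
        for model, taxa in model_taxa.items()
        if len(set(taxa)) > 1 and not (set(taxa) & excluded)
    }


# Hypothetical input; the model names are placeholders, not real model IDs.
example = {
    "model_a": ["NCBITaxon:9606", "NCBITaxon:2697049"],  # contains an excluded taxon
    "model_b": ["NCBITaxon:9606", "NCBITaxon:9823"],     # kept for review
    "model_c": ["NCBITaxon:9606"],                       # single taxon, ignored
}
remaining = models_to_review(example)
```

Here only `model_b` is left for manual examination.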
@kltm I've looked at the first 25 models on the list of 81 models and the species assignments for those models are all legit, i.e. the taxa listed with the model are all represented in existing annotations.
@vanaukenk close?
@vanaukenk @pgaudet is this still current?
Okay, in discussion with @pgaudet , it seems not to be a current issue, closing for now. Reopen if it comes up again.
The following are all cases of production models where one gocam has two different taxon IDs assigned to it. Most of them represent processes from viral infections, with one taxon ID for Homo sapiens and the other for the virus, but 4 cases naively look like possible mistakes.
In two cases (62183af000000536, 6205c24300000880), however, Sus scrofa (wild boar) and Homo sapiens are both listed, but the gocam title itself says it is for human, and all the proteins appear to be human gene products based on the noctua visualization. I looked in the ttl file and the taxon ID for Sus scrofa is indeed included. I'm not sure if these were errors during curation, or if one of the papers used as evidence utilized Sus scrofa, or if the model really was intended to represent both human and boar versions of the same pathway.
There's also one case where both human and mouse are listed: 60ad85f700000058. I'm not sure about this one.
Lastly, there's an odd case with 5e72450500004019, which represents a covid pathway but has two taxon IDs presumably representing the virus: one is the ID for the virus, and the other is a UniProt ID for one of the viral proteins that is listed as a taxon ID for some reason. I'm unsure whether that is a way of specifying a specific variant of covid. Naively, these 4 cases seem like mistakes, but I just wanted to pass this information along to the curation team to take a look! @vanaukenk @pgaudet