Open ukemi opened 1 year ago
The most convenient fix would be to adjust https://github.com/geneontology/noctua-models/blob/master/util/collate-gpads.pl to group outputs by our intent here rather than by resource ID. That hides a little mechanism in this script, but means we don't have to tweak things like naming rules in the main Makefile (always a chore) and can test/iterate quickly.
@ukemi What I'll need, however, would be to get an exact rule that we could run. How does this sound:
If the primary ID is PR or RefSeq and col 9 is MGI, we bin with mgi.
I think that would work for now as long as other groups don't make annotations to mouse proteoforms using PR identifiers. Eventually, a 'species'-specific annotation file based on a GPI would completely solve this.
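For discussion, here is a minimal Python sketch of the rule proposed above. It is not the actual collate-gpads.pl logic (that script is Perl); the function name and constant are hypothetical. It assumes tab-separated GPAD 1.x lines where Assigned_by is the 10th column (index 9 counting from zero), and it is exercised on a PR line reported in this issue.

```python
# Hypothetical sketch of the proposed binning rule; not the real collate-gpads.pl code.
MOUSE_PROTEIN_PREFIXES = {"PR", "RefSeq"}


def bin_for_gpad_line(line: str) -> str:
    """Return the output bin (lowercased resource name) for one GPAD line."""
    cols = line.rstrip("\n").split("\t")
    db = cols[0]
    # Assumption: GPAD 1.x layout, Assigned_by is the 10th column (index 9).
    assigned_by = cols[9] if len(cols) > 9 else ""
    if db in MOUSE_PROTEIN_PREFIXES and assigned_by == "MGI":
        return "mgi"
    # Default behaviour assumed here: bin by the resource in column 1.
    return db.lower()


# A PR line from this issue, rebuilt with explicit (empty) columns 7, 8 and 11.
example = "\t".join([
    "PR", "Q9QWY8-2", "enables", "GO:0005096", "PMID:9819391", "ECO:0005801",
    "", "", "20160923", "MGI", "",
    "contributor=https://orcid.org/0000-0001-7476-6306"
    "|noctua-model-id=gomodel:5745387b00001874|model-state=production",
])
print(bin_for_gpad_line(example))  # mgi
```

A PR or RefSeq line assigned by any other group would still fall into its own bin under this sketch, which is the limitation noted above.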
Commands to simulate GPAD production (on a large enough machine):
~/local/src/git/minerva/minerva-cli/bin/minerva-cli.sh --lego-to-gpad-sparql --ontology http://skyhook.berkeleybop.org/master/ontology/extensions/go-lego.owl --ontojournal /tmp/blazegraph-onto.jnl -i /tmp/blazegraph-2023-01-26.jnl --gpad-output /tmp/legacy/gpad
perl ~/local/src/git/go-site/script/collate-gpads.pl /tmp/legacy/gpad/*.gpad
@pgaudet I'm going to assume we're holding off on the release for this.
Testing on master. @ukemi, I'll ask you for confirmation that we have it set up more correctly once we get a batch out of the oven.
Are we? This is not a new problem.
I don't mind waiting a few days. As we had agreed before, we can give groups a week to resolve issues; otherwise the fix will be in the next-next release. OK?
Thanks, Pascale
@ukemi Okay, we have some initial results on this:
bbop@wok:/home/skyhook/master/annotations$ zcat mgi.gaf.gz | grep -v "^!" | cut -f 1 | less | sort | uniq -c
514875 MGI
2139 PR
1 RefSeq
379 UniProtKB
The files are located at http://skyhook.berkeleybop.org/master as they are in a release. Is this looking right (or more right) to you?
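If anyone wants to repeat this check without a shell, the same tally can be sketched in Python. This is only an illustrative equivalent of the pipeline above; the function name is hypothetical, and it assumes a gzipped, tab-separated GAF whose header lines start with "!".

```python
import gzip
from collections import Counter


def count_prefixes(path: str) -> Counter:
    """Count column-1 values in a gzipped GAF, skipping '!' header lines."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip headers and blank lines.
            if not line or line.startswith("!"):
                continue
            counts[line.split("\t", 1)[0]] += 1
    return counts


# Usage (path is an assumption; point it at a local copy of the file):
# for prefix, n in sorted(count_prefixes("mgi.gaf.gz").items()):
#     print(n, prefix)
```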
Thanks @kltm. @LiNiMGI and I will have to take a closer look. I'm wondering where the UniProt annotations are originating. We use the Protein Ontology for proteins and proteoforms. If the UniProt identifiers are GCRPs, then they should be converted to MGI gene identifiers. If they are really proteins, they should all have the PR prefix.
@kltm, @LiNiMGI and I went through the GAF and everything looks OK except for the annotations to UniProt identifiers. Those annotations are coming from PAINT (@pgaudet). For example:
UniProtKB P01729 P01729 involved_in GO:0002377 PMID:21873635 IBA PANTHER:PTN000587099|MGI:MGI:98936 P Ig lambda-2 chain V region MOPC 315 UniProtKB:P01729|PTN002537241 protein taxon:10090 20170228 GO_Central
They shouldn't be in the file because they aren't valid annotation objects in our GPI file, for example P01637. I looked at a couple of them and they appear to be protein fragments from things like immunoglobulin regions that aren't associated with a gene at MGI. https://www.informatics.jax.org/sequence/P01637
@ukemi P01729 is in UniProt but has no MGI mapping. I wonder if that relates to the discussion we had at the last consortium meeting (Fall 2022), where Maria explained that some mappings are missing. Since PAINT takes all the reference proteome entries, any discrepancies may mean that the mappings are not being provided to UniProt by the MOD, or (I suppose) that the MOD and UniProt disagree as to whether the protein exists, or that the entry should be in SP, not in TrEMBL.
My understanding from the discussion at the GOC meeting is that the MODs would work with UniProt to align these files. Is this OK with MGI?
It is my understanding that the MGI group works with UniProt continuously on these alignments. But the bottom line here is that these annotations should not be in the file until any issues are resolved. They are not 'official' annotatable objects according to our GPI and are a dead end if they don't map to a gene.
OK, great. I'll check with @kltm at which point of the pipeline this should be handled.
@pgaudet IIRC, there is no point in the pipeline where a by-line GPI cross check occurs. Anything like that would have to be new functionality. Our baseline ingest is GAF-oriented where things like that are not considered.
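To make the scope of such new functionality concrete, a by-line GPI cross check could look roughly like the Python sketch below. Nothing like this exists in the pipeline today; the function names are hypothetical, the input rows are toy data, and it assumes GPI 1.x and GAF files that both carry the resource in column 1 and the object ID in column 2 (GPI 2.0, which uses a single CURIE column, would need a different key).

```python
def load_gpi_ids(gpi_lines):
    """Collect 'DB:ID' keys from GPI lines (assumes DB in col 1, ID in col 2)."""
    ids = set()
    for line in gpi_lines:
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            ids.add(f"{cols[0]}:{cols[1]}")
    return ids


def split_by_gpi(gaf_lines, gpi_ids):
    """Split GAF lines into (kept, rejected) by whether the object is in the GPI."""
    kept, rejected = [], []
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            kept.append(line)  # headers and blank lines pass through untouched
            continue
        cols = line.rstrip("\n").split("\t")
        key = f"{cols[0]}:{cols[1]}" if len(cols) >= 2 else ""
        (kept if key in gpi_ids else rejected).append(line)
    return kept, rejected


# Toy rows for illustration only; not taken from a real GPI or GAF file.
gpi_ids = load_gpi_ids(["!gpi-version: 1.2", "MGI\tMGI:98936\tsymbol", "PR\tQ9QWY8-2\tsymbol"])
gaf = [
    "!gaf-version: 2.2",
    "MGI\tMGI:98936\tsymbol\tinvolved_in\tGO:0002377",
    "UniProtKB\tP01729\tP01729\tinvolved_in\tGO:0002377",
]
kept, rejected = split_by_gpi(gaf, gpi_ids)
print(len(kept), len(rejected))  # 2 1
```

Returning the rejected lines rather than silently dropping them would let them go into a report for the contributing group.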
@pgaudet I'd vote to close this out (annotations are now merged in with new sorting) and open a new ticket looking at cross checking (unrelated to original and will have a separate fix).
@LiNiMGI recently noticed that all of the mouse proteoform annotations are missing from GOC resources. In the GOC mouse GAF release (http://current.geneontology.org/products/pages/downloads.html) they are missing. The header of that file says that the mouse annotations are: collated from production models in https://github.com/geneontology/noctua-models/ where col1 matches mgi. However, the MGI proteoform annotations are collated into a separate file: http://snapshot.geneontology.org/products/annotations/noctua_pr.gpad.gz. We also have a single annotation in http://snapshot.geneontology.org/products/annotations/noctua_refseq.gpad.gz.
Here is an example of a line from the pr file: PR Q9QWY8-2 enables GO:0005096 PMID:9819391 ECO:0005801 20160923 MGI contributor=https://orcid.org/0000-0001-7476-6306|noctua-model-id=gomodel:5745387b00001874|model-state=production
You can see that the line does not have MGI in column 1 and would therefore not be collated. We need to modify the pipeline so that the annotations from the pr and refseq file are collated into the mouse file. Note that looking for MGI in column 10 will also be problematic since this will drop annotations made by other groups.
A knock-on effect of this is that the proteoform annotations are not available in AmiGO2.