Proteoform annotations are missing from the mouse gaf

ukemi commented 1 year ago

@LiNiMGI recently noticed that all of the mouse proteoform annotations are missing from GOC resources, including the GOC mouse gaf release (http://current.geneontology.org/products/pages/downloads.html). The header of that file says that the mouse annotations are collated from production models in https://github.com/geneontology/noctua-models/ where col1 matches mgi. However, the MGI proteoform annotations are collated into a separate file: http://snapshot.geneontology.org/products/annotations/noctua_pr.gpad.gz. We also have a single annotation in http://snapshot.geneontology.org/products/annotations/noctua_refseq.gpad.gz.

Here is an example of a line from the pr file: PR Q9QWY8-2 enables GO:0005096 PMID:9819391 ECO:0005801 20160923 MGI contributor=https://orcid.org/0000-0001-7476-6306|noctua-model-id=gomodel:5745387b00001874|model-state=production

You can see that the line does not have MGI in column 1 and would therefore not be collated into the mouse file. We need to modify the pipeline so that the annotations from the pr and refseq files are collated into the mouse file. Note that looking for MGI in column 10 will also be problematic, since this will drop annotations made by other groups.

A knock-on effect of this is that the proteoform annotations are not available in AmiGO2.

kltm commented 1 year ago

The most convenient fix would be to adjust https://github.com/geneontology/noctua-models/blob/master/util/collate-gpads.pl to group outputs by what our intent here is, rather than the resource ID. That hides a little mechanism in this script, but means we don't have to tweak things like naming rules in the main Makefile (always a chore) and can test/iterate quickly. @ukemi What I'll need, however, would be to get an exact rule that we could run. How does this sound: If the primary ID is PR or RefSeq and col 9 is MGI, we bin with mgi.
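The proposed rule can be sketched in a few lines of Python. This is only an illustration and not the logic in collate-gpads.pl: the function name, the lowercase bin names, and the column position are assumptions. It assumes a GPAD 1.x layout in which assigned-by is the tenth tab-separated column, which is index 9 when counting from zero. The example line is the one quoted above, rebuilt with tabs on the assumption that the with/from, interacting taxon and annotation extension columns are empty.

```python
# Hypothetical sketch of the proposed binning rule; not the real collate-gpads.pl.
ASSIGNED_BY_INDEX = 9  # assumption: GPAD 1.x, tenth column, zero-based index 9


def bin_for_gpad_line(line):
    """Return the output bin for one GPAD line, or None for header/blank lines."""
    if not line.strip() or line.startswith("!"):
        return None
    cols = line.rstrip("\n").split("\t")
    db = cols[0]
    assigned_by = cols[ASSIGNED_BY_INDEX] if len(cols) > ASSIGNED_BY_INDEX else ""
    # The rule under discussion: PR or RefSeq primary IDs assigned by MGI go to mgi.
    if db in ("PR", "RefSeq") and assigned_by == "MGI":
        return "mgi"
    # Assumption: everything else keeps being binned by its resource prefix.
    return db.lower()


# Example line from this thread, with the empty columns assumed as described above.
example = "\t".join([
    "PR", "Q9QWY8-2", "enables", "GO:0005096", "PMID:9819391", "ECO:0005801",
    "", "", "20160923", "MGI", "",
    "contributor=https://orcid.org/0000-0001-7476-6306"
    "|noctua-model-id=gomodel:5745387b00001874|model-state=production",
])
print(bin_for_gpad_line(example))  # mgi
```

A PR line assigned by another group would still land in the pr bin under this sketch, which is the limitation of keying on assigned-by.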

ukemi commented 1 year ago

I think that would work for now as long as other groups don't make annotations to mouse proteoforms using PR identifiers. Eventually, a 'species'-specific annotation file based on a GPI would completely solve this.

kltm commented 1 year ago

Commands to simulate GPAD production (on a large enough machine):

~/local/src/git/minerva/minerva-cli/bin/minerva-cli.sh --lego-to-gpad-sparql --ontology http://skyhook.berkeleybop.org/master/ontology/extensions/go-lego.owl --ontojournal /tmp/blazegraph-onto.jnl -i /tmp/blazegraph-2023-01-26.jnl --gpad-output /tmp/legacy/gpad
perl ~/local/src/git/go-site/script/collate-gpads.pl /tmp/legacy/gpad/*.gpad

kltm commented 1 year ago

@pgaudet I'm going to assume we're holding off on the release for this.

kltm commented 1 year ago

Testing on master. @ukemi, I'll ask you for confirmation that we have it set up more correctly once we get a batch out of the oven.

pgaudet commented 1 year ago

Are we? This is not a new problem.

I don't mind waiting a few days. As we had agreed before, we can give groups a week to resolve issues; otherwise the fix will be in the next-next release. OK?

Thanks, Pascale

kltm commented 1 year ago

@ukemi Okay, we have some initial results on this:

bbop@wok:/home/skyhook/master/annotations$ zcat mgi.gaf.gz | grep -v "^!" | cut -f 1 | less | sort | uniq -c
 514875 MGI
   2139 PR
      1 RefSeq
    379 UniProtKB

The files are located at http://skyhook.berkeleybop.org/master as they are in a release. Is this looking right (or more right) to you?
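For anyone wanting to repeat the tally without shell access, the count of column 1 prefixes could be reproduced with a short Python sketch along these lines. The function names are illustrative, and the file path in the usage note is an assumption about where a local copy of mgi.gaf.gz would sit.

```python
import gzip
from collections import Counter


def count_prefixes(lines):
    """Count the first tab-separated column of every non-header, non-blank line."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and "!" header lines
        counts[line.rstrip("\n").split("\t", 1)[0]] += 1
    return counts


def count_prefixes_in_gaf(path):
    """Open a gzipped GAF as text and tally its column 1 prefixes."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_prefixes(handle)
```

Calling count_prefixes_in_gaf("mgi.gaf.gz") on a downloaded copy should give the same kind of breakdown as the zcat pipeline above.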

ukemi commented 1 year ago

Thanks @kltm. @LiNiMGI and I will have to take a closer look. I'm wondering where the UniProt annotations are originating. We use the protein ontology for proteins and proteoforms. If UniProt identifiers are GCRPs, then they should be converted to MGI gene identifiers. If they are really proteins, they should all have the PR prefix.

ukemi commented 1 year ago

@kltm, @LiNiMGI and I went through the gaf and everything looks OK except for the annotations to UniProt identifiers. Those annotations are coming from PAINT (@pgaudet). For example:

UniProtKB P01729 P01729 involved_in GO:0002377 PMID:21873635 IBA PANTHER:PTN000587099|MGI:MGI:98936 P Ig lambda-2 chain V region MOPC 315 UniProtKB:P01729|PTN002537241 protein taxon:10090 20170228 GO_Central

They shouldn't be in the file because they aren't valid annotation objects in our GPI file, for example P01637. I looked at a couple of them and they appear to be protein fragments from things like immunoglobulin regions that aren't associated with a gene at MGI. https://www.informatics.jax.org/sequence/P01637

pgaudet commented 1 year ago

@ukemi P01729 is in UniProt but has no MGI mapping. I wonder if that relates to the discussion we had at the last consortium meeting (Fall 2022), where Maria explained that some mappings are missing. Since PAINT takes all the reference proteome entries, any discrepancies may mean that the mappings are not being provided to UniProt by the MOD, or (I suppose) that the MOD and UniProt disagree as to whether the protein exists, or that the entry should be in SP, not in TrEMBL.

My understanding from the discussion at the GOC meeting is that the MODs would work with UniProt to align these files. Is this OK with MGI?

ukemi commented 1 year ago

It is my understanding that the MGI group works with UniProt continuously on these alignments. But the bottom line here is that these annotations should not be in the file until any issues are resolved. They are not 'official' annotatable objects according to our GPI and are a dead end if they don't map to a gene.

pgaudet commented 1 year ago

OK, great. I'll check with @kltm at which point of the pipeline this should be handled.

kltm commented 1 year ago

@pgaudet IIRC, there is no point in the pipeline where a by-line GPI cross check occurs. Anything like that would have to be new functionality. Our baseline ingest is GAF-oriented where things like that are not considered.
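To make concrete what such new functionality might look like, here is a minimal Python sketch of a by-line GPI cross check. It is not an existing pipeline step. The function names are hypothetical, the column positions assume a GPI layout with the DB prefix in column 1 and the object ID in column 2 (as in GPI 1.2), and the rows in the usage part are invented and truncated purely for illustration, apart from the P01729 identifier mentioned above.

```python
def load_gpi_ids(gpi_lines, db_col=0, id_col=1):
    """Collect (DB, object ID) pairs from GPI lines; column positions are assumptions."""
    ids = set()
    for line in gpi_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and "!" header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) > max(db_col, id_col):
            ids.add((cols[db_col], cols[id_col]))
    return ids


def cross_check(gaf_lines, valid_ids):
    """Split GAF lines into those whose object is in the GPI and those that are not."""
    kept, rejected = [], []
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        key = (cols[0], cols[1]) if len(cols) > 1 else (cols[0], "")
        (kept if key in valid_ids else rejected).append(line)
    return kept, rejected


# Hypothetical, truncated rows for illustration only.
gpi = ["!gpi-version: 1.2", "MGI\tMGI:0000000\tExample1"]
gaf = [
    "MGI\tMGI:0000000\tExample1\tinvolved_in\tGO:0002377",
    "UniProtKB\tP01729\tP01729\tinvolved_in\tGO:0002377",
]
kept, rejected = cross_check(gaf, load_gpi_ids(gpi))
print(len(kept), len(rejected))  # 1 1
```

Returning the rejected lines rather than silently dropping them would let a report go back to the contributing group.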

kltm commented 1 year ago

@pgaudet I'd vote to close this out (annotations are now merged in with new sorting) and open a new ticket looking at cross checking (unrelated to original and will have a separate fix).

geneontology/pipeline #313