Closed ukemi closed 1 year ago
I think we have a folder for obsolete entities, but maybe this is just for UniProt?
@dustine32
@pgaudet Yep, we just check against the UniProt IDs using UniProt's GPI file to see what IDs are still valid. Since this UniProtKB:D3Z423 is still active in UniProt, it is in their GPI and so our PAINT IBA script doesn't filter it out of the generated GAF.
To fix in PAINT we would need to also factor MGI's GPI into the obsoletion process but we don't currently handle any GPIs other than UniProt's. I'm actually thinking we could set this "must be in GPI" constraint up as a GO Rule in the GO pipeline as it currently handles GPIs somewhat, though I'm not sure how good it currently is in pairing submitted GAFs with submitted GPIs (as opposed to generated GAF-derived GPIs). @dougli1sqrd or @kltm would maybe be able to quickly answer? Anyhow, I think this would be a good rule to propose and discuss.
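A minimal sketch of what such a "must be in GPI" check could look like, not the actual pipeline or PAINT code. It assumes a GPI 1.x-style layout where columns 1 and 2 hold the DB and the object ID, matching GAF columns 1 and 2; the function names `load_gpi_ids` and `filter_gaf` are hypothetical.

```python
from typing import Iterable, List, Set, Tuple


def load_gpi_ids(gpi_lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect (DB, DB_Object_ID) pairs from GPI lines (assumed 1.x column layout)."""
    ids: Set[Tuple[str, str]] = set()
    for line in gpi_lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            ids.add((cols[0], cols[1]))
    return ids


def filter_gaf(
    gaf_lines: Iterable[str], valid_ids: Set[Tuple[str, str]]
) -> Tuple[List[str], List[str]]:
    """Split GAF lines into (kept, dropped) by whether the entity is in the GPI."""
    kept: List[str] = []
    dropped: List[str] = []
    for line in gaf_lines:
        # Header lines pass through untouched.
        if line.startswith("!") or not line.strip():
            kept.append(line)
            continue
        cols = line.split("\t")
        if len(cols) >= 2 and (cols[0], cols[1]) in valid_ids:
            kept.append(line)
        else:
            dropped.append(line)
    return kept, dropped
```

Returning the dropped lines separately, rather than discarding them silently, would make it possible to report them back to the annotation source.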
Since the MODs are the ultimate authority on what constitutes annotatable objects for their organisms, you should probably work out how to use the MOD GPIs in factoring the UniProtKB to gene assignments.
@thomaspd Thoughts?
I think we need an easily maintainable solution to this problem. The main action is to encourage each MOD to coordinate with UniProt to make sure the ID mapping is complete and correct. This mapping is updated every year, in the UniProt QFO Reference Proteomes release. It is possible for additional discrepancies to arise during that year, but I think we expect this to be small. If I understand correctly, the PAINT annotations for unmapped IDs are being dropped (currently by MGI, but they would also be filtered out by GO Central if they were run through GORules). This is probably acceptable only if the number is small. If the number is large, then we can look into repairing the unmapped IDs, but this will take effort and should be done as a pipeline step so it can also be applied to other annotation streams besides PAINT, such as SynGO, UniProt, etc.
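To get a rough sense of the size of the problem per annotation stream, something like the following could tally annotations whose entity is missing from the GPI, grouped by the GAF Assigned_By column (column 15). This is only a sketch: `count_unmatched_by_source` is a hypothetical name, and `valid_ids` is assumed to be a set of (DB, DB_Object_ID) pairs already read from the relevant GPI.

```python
from collections import Counter
from typing import Iterable, Set, Tuple

ASSIGNED_BY_COL = 14  # GAF column 15 (Assigned_By), zero-based index


def count_unmatched_by_source(
    gaf_lines: Iterable[str], valid_ids: Set[Tuple[str, str]]
) -> Counter:
    """Count annotations to entities absent from valid_ids, keyed by Assigned_By."""
    counts: Counter = Counter()
    for line in gaf_lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("!") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= ASSIGNED_BY_COL:
            continue  # too few columns to read Assigned_By; ignore the row
        if (cols[0], cols[1]) not in valid_ids:
            counts[cols[ASSIGNED_BY_COL]] += 1
    return counts
```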
@hdrabkin @ukemi Looking into the size of the problem, can you provide some numbers?
We're baby-stepping through this process and will look at incorporating the GPI next. Thanks!
@dustine32 This is the list of original withdrawn markers that we detected:
1700101G07Rik | MGI:3588248
Cypt10 | MGI:3616452
Cypt7 | MGI:3616446
Cypt8 | MGI:3616448
Cypt9 | MGI:3616450
Gm10325 | MGI:3642320
Gm10332 | MGI:3642276
Gm16503 | MGI:3642127
But only one of them had PAINT annotations; I believe it was Gm10332.
Thanks for the quick response @hdrabkin and @ukemi !
The mapping to Gm10332/MGI:3642276 (UniProtKB:D3Z423) is still in UniProt, so we'll still be pulling this into PANTHER17.0 unless you have it removed from UniProt. This could just be a sync lag issue and it'll "clear up" soon. How recently was Gm10332 | MGI:3642276 withdrawn at MGI?
According to our EI, the last modification date for the record was 2016-11-01. However, we noticed that it was still kept as a synonym, so we're really not sure. Also, the modified dates might not be triggered if the withdrawal was done en masse by a script.
Verified the deletion/withdrawal date was 2016!
All the more reason to check against the markers in the GPI file.
@dustine32 @kltm Can this be moved to the pipeline repo and added to pipeline maintenance?
@pgaudet What exactly is the ask here again for the GO pipeline? Since we are not GPAD/GPI-driven yet, I suspect we'd have to do some conditional work in older code. But even that is later than source, which is now: noctua-models + PAINT + remainder_upstreams. Depending on the origin of the issue, some places may make things easier to filter than others.
It looks like we could add a check in the pipeline to remove any annotated entity that is not in the GPI of the upstream.
@ukemi Did I understand correctly what you meant?
Yes, that would work but we should also put something in place that doesn't allow annotation to entities that are not in the GPI in the first place. Clearly this one has been withdrawn for a while.
Sure, these should not be loaded in Noctua, but I think the entities come from the GPI files, right @kltm?
Looks like they (i.e., this retired MGI marker) are also loaded in PANTHER, @dustine32.
AFAIK, they are already not loaded into Noctua. But they shouldn't be allowed other places either.
Where does PANTHER get its list of MGI genes? Is a drop/reload done?
MGI provides two reports (http://www.informatics.jax.org/downloads/reports/index.html#marker):
MRK_List1.rpt (including withdrawn marker symbols)
MRK_List2.rpt (excluding withdrawn marker symbols)
The bottom line here is that the GPI is the source of truth for annotatable objects at MGI. If it isn't in the GPI it should not be allowed. If it is in the GPI and shouldn't be, MGI should fix it on our end.
@hdrabkin For PANTHER, we start from UniProtKB IDs (for mouse, UP000000589_10090.fasta, available here) and then use UniProt's .idmapping and .gene2acc files (at the same link) to map UniProt IDs to MGI IDs. It seems like a roundabout way to get MOD mappings (vs. pulling the MGI GPI), but this ensures we stay in sync with the Reference Proteome data used to build PANTHER trees. Some more info in these slides.
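As an illustration of that mapping step and of how a GPI cross-check could sit on top of it, here is a rough sketch rather than the actual PANTHER code. It assumes the .idmapping file uses UniProt's three-column, tab-separated layout (accession, ID type, ID) and that MGI rows carry the ID type "MGI"; `uniprot_to_mgi` and `stale_mappings` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def uniprot_to_mgi(idmapping_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map UniProt accessions to MGI IDs from three-column idmapping rows."""
    mapping: Dict[str, Set[str]] = defaultdict(set)
    for line in idmapping_lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            continue  # not an (accession, ID type, ID) row
        acc, id_type, value = parts
        if id_type == "MGI":  # assumed ID-type label for MGI rows
            mapping[acc].add(value)
    return dict(mapping)


def stale_mappings(
    mapping: Dict[str, Set[str]], valid_mgi_ids: Set[str]
) -> Dict[str, List[str]]:
    """Return accessions whose mapped MGI IDs are absent from valid_mgi_ids."""
    stale: Dict[str, List[str]] = {}
    for acc, mgi_ids in mapping.items():
        missing = sorted(mgi_ids - valid_mgi_ids)
        if missing:
            stale[acc] = missing
    return stale
```

Here `valid_mgi_ids` would be the set of marker IDs taken from the MGI GPI, so any accession returned by `stale_mappings` points at a marker MGI no longer considers annotatable.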
@dustine32 it looks like all of the ids listed above are withdrawn markers but UniProt still has them associated with a sequence and an MGI id. Our GPI file would be what is used I think.
Right, @hdrabkin! I see that Gm10332 | MGI:3642276 still maps to UniProtKB:D3Z423 via the UniProt ID mapper, and the D3Z423 page even links out to an MGI gene page that doesn't exist.
If this example MGI:3642276 was withdrawn at MGI in 2016, we should check with @alexsign what file exactly is used to source MOD->UniProt mappings at UniProt. I just found this https://github.com/geneontology/go-annotation/issues/3282#issuecomment-920938722 by @alexsign that points to an MGI file, MRK_SwissProt_TrEMBL.rpt. When I check this file for MGI:3642276 I get this line:
$ curl -L http://www.informatics.jax.org/downloads/reports/MRK_SwissProt_TrEMBL.rpt | grep MGI:3642276
MGI:3642276 Gm10332 W withdrawn N/A 14 D3Z423
Obviously, it says this MGI ID is "withdrawn", but I don't know if UniProt's ingest is parsing this, and the MGI:3642276 -> D3Z423 mapping is still there.
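For anyone wanting to list every such case rather than grepping one ID at a time, a sketch along these lines might work. It assumes the report is tab-separated with seven columns in the order suggested by the line above (MGI ID, symbol, status, name, cM position, chromosome, protein accessions) and that a status of "W" marks a withdrawn marker; `withdrawn_with_accessions` is a hypothetical name.

```python
from typing import Iterable, List, Tuple


def withdrawn_with_accessions(
    rpt_lines: Iterable[str],
) -> List[Tuple[str, str, List[str]]]:
    """Find withdrawn markers that still list UniProt accessions in the report."""
    hits: List[Tuple[str, str, List[str]]] = []
    for line in rpt_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # row does not match the assumed seven-column layout
        mgi_id, symbol, status = cols[0], cols[1], cols[2]
        accessions = cols[6].split()  # accessions assumed space-separated
        if status == "W" and accessions:
            hits.append((mgi_id, symbol, accessions))
    return hits
```

Feeding it the lines of MRK_SwissProt_TrEMBL.rpt would give the set of withdrawn markers that a downstream consumer could still pick up if it ignores the status column.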
This is why I was suggesting that the pipeline add a step where the GPIs are checked. Or maybe only the PAINT pipeline needs to add that.
Another option, of course, is for UniProt to do that; we can ask Maria.
Hi @ukemi @pgaudet, is this still an issue? Does it need to be escalated to UniProt? It currently has no assignee.
Hi @ukemi, this is from 2022; I presume it is fixed or somebody would have complained/escalated, so I'm closing it. Reopen with an appropriate assignee if the ticket is still required.
We recently noticed that we are picking up annotations from the GOC master file (http://snapshot.geneontology.org/annotations/mgi.gaf.gz) that are to withdrawn markers in MGI. We have put in a work order to fix this on our end, but it should be done at the annotation source. Since the plan is to have the GOC be the official supplier of mouse annotations at some point, we need to be sure that there are no annotations to withdrawn MGI markers. Here is an example of the gaf2.2 line to a withdrawn marker from the file:
MGI MGI:3642276 Rnf212b enables GO:0019789 PMID:21873635 IBA PANTHER:PTN001099470|SGD:S000004386 F RING finger protein 212B UniProtKB:D3Z423|PTN001862305 protein taxon:10090 20170228 GO_Central
Here is the header of the GOC gaf file:
!gaf-version: 2.2
!
!generated-by: GOC
!
!date-generated: 2021-03-18T23:15
!
!Header from source association file:
!=================================
!
!generated-by: GOC
!
!date-generated: 2021-03-18T10:12
!
!Header from mgi source association file:
!=================================
!generated-by: MGI
!date-generated: 2021-03-18
!=================================
!
!Header copied from paint_mgi_valid.gaf
!=================================
!Created on Sun Feb 28 15:42:27 2021.
!generated-by: PANTHER
!date-generated: 2021-02-28
!PANTHER version: v.15.0.
!GO version: 2021-02-01.
!
!=================================