geneontology / go-annotation

This repository hosts the tracker for issues pertaining to GO annotations.
BSD 3-Clause "New" or "Revised" License

Withdrawn MGI Markers should not have PAINT annotations #3702

Closed ukemi closed 1 year ago

ukemi commented 3 years ago

We recently noticed that we are picking up annotations from the GOC master file (http://snapshot.geneontology.org/annotations/mgi.gaf.gz) to markers that have been withdrawn in MGI. We have put in a work order to fix this on our end, but it should be fixed at the annotation source. Since the plan is for the GOC to become the official supplier of mouse annotations at some point, we need to be sure that there are no annotations to withdrawn MGI markers. Here is an example GAF 2.2 line from the file that annotates a withdrawn marker:

MGI MGI:3642276 Rnf212b enables GO:0019789 PMID:21873635 IBA PANTHER:PTN001099470|SGD:S000004386 F RING finger protein 212B UniProtKB:D3Z423|PTN001862305 protein taxon:10090 20170228 GO_Central

Here is the header of the GOC gaf file:

!gaf-version: 2.2
!
!generated-by: GOC
!
!date-generated: 2021-03-18T23:15
!
!Header from source association file:
!=================================
!
!generated-by: GOC
!
!date-generated: 2021-03-18T10:12
!
!Header from mgi source association file:
!=================================
!generated-by: MGI
!date-generated: 2021-03-18
!=================================
!
!Header copied from paint_mgi_valid.gaf
!=================================
!Created on Sun Feb 28 15:42:27 2021.
!generated-by: PANTHER
!date-generated: 2021-02-28
!PANTHER version: v.15.0.
!GO version: 2021-02-01.
!
!=================================

pgaudet commented 3 years ago

I think we have a folder for obsolete entities, but maybe this is just for UniProt?

@dustine32

dustine32 commented 3 years ago

@pgaudet Yep, we just check against the UniProt IDs using UniProt's GPI file to see what IDs are still valid. Since this UniProtKB:D3Z423 is still active in UniProt, it is in their GPI and so our PAINT IBA script doesn't filter it out of the generated GAF.

To fix in PAINT we would need to also factor MGI's GPI into the obsoletion process but we don't currently handle any GPIs other than UniProt's. I'm actually thinking we could set this "must be in GPI" constraint up as a GO Rule in the GO pipeline as it currently handles GPIs somewhat, though I'm not sure how good it currently is in pairing submitted GAFs with submitted GPIs (as opposed to generated GAF-derived GPIs). @dougli1sqrd or @kltm would maybe be able to quickly answer? Anyhow, I think this would be a good rule to propose and discuss.
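For illustration, a minimal sketch of what a "must be in GPI" check could look like. This is not the GO pipeline's or PAINT's actual code. It assumes a GPI 1.x-style file (tab-separated, `!` comment lines, DB in column 1 and DB object ID in column 2) and the standard GAF column order; the function names `gpi_entities` and `filter_gaf` are hypothetical.

```python
from typing import Iterable, Iterator, Set, Tuple


def gpi_entities(gpi_lines: Iterable[str]) -> Set[Tuple[str, str]]:
    """Collect (DB, DB object ID) pairs from GPI lines.

    Assumption: GPI 1.x-style layout, i.e. tab-separated with the DB in
    column 1 and the object ID in column 2. GPI 2.0 uses a single CURIE
    column instead and would need different parsing.
    """
    entities: Set[Tuple[str, str]] = set()
    for line in gpi_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2:
            entities.add((cols[0], cols[1]))
    return entities


def filter_gaf(gaf_lines: Iterable[str],
               valid: Set[Tuple[str, str]]) -> Iterator[str]:
    """Yield header lines unchanged, plus annotation lines whose
    (DB, DB object ID) pair (GAF columns 1 and 2) is present in `valid`."""
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            yield line  # keep headers and blank lines as they are
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 2 and (cols[0], cols[1]) in valid:
            yield line
```

With MGI's GPI as the source of `valid`, an annotation line for a marker missing from that GPI (such as MGI:3642276 in the example above) would be dropped rather than passed through.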

ukemi commented 3 years ago

Since the MODs are the ultimate authority on what constitutes annotatable objects for their organisms, you should probably work out how to use the MOD GPIs in factoring the UniProtKB to gene assignments.

dustine32 commented 3 years ago

@thomaspd Thoughts?

thomaspd commented 3 years ago

I think we need an easily maintainable solution to this problem. The main action is to encourage each MOD to coordinate with UniProt to make sure the ID mapping is complete and correct. This mapping is updated every year, in the UniProt QFO Reference Proteomes release. It is possible for additional discrepancies to arise during that year, but I think we expect this to be small. If I understand correctly, the PAINT annotations for unmapped IDs are being dropped (currently by MGI, but they would also be filtered out by GO Central if they were run through GORules). This is probably acceptable only if the number is small. If the number is large, then we can look into repairing the unmapped IDs, but this will take effort and should be done as a pipeline step so it can also be applied to other annotation streams besides PAINT, such as SynGO, UniProt, etc.

dustine32 commented 3 years ago

@hdrabkin @ukemi Looking into the size of the problem, can you provide some numbers?

  1. Number of IBA annotations to obsoleted MGI IDs
  2. Number of unique, obsoleted MGI IDs in the IBA GAFs

We're baby-stepping through this process and will look at incorporating the GPI next. Thanks!
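The two numbers requested above could be computed along these lines. This is only a sketch: it assumes the standard GAF column order (DB object ID in column 2, evidence code in column 7) and that the set of withdrawn MGI IDs has already been obtained from MGI; `count_iba_to_withdrawn` is a hypothetical helper name.

```python
from typing import Iterable, Set, Tuple


def count_iba_to_withdrawn(gaf_lines: Iterable[str],
                           withdrawn_ids: Set[str]) -> Tuple[int, int]:
    """Return (number of IBA annotation lines to withdrawn IDs,
    number of unique withdrawn IDs among those lines)."""
    n_annotations = 0
    seen_ids: Set[str] = set()
    for line in gaf_lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and GAF header lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # too short to carry an evidence code
        object_id, evidence = cols[1], cols[6]
        if evidence == "IBA" and object_id in withdrawn_ids:
            n_annotations += 1
            seen_ids.add(object_id)
    return n_annotations, len(seen_ids)
```

The first element of the result answers question 1 and the second answers question 2.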

ukemi commented 3 years ago

@dustine32 This is the list of original withdrawn markers that we detected:

1700101G07Rik | MGI:3588248
Cypt10 | MGI:3616452
Cypt7 | MGI:3616446
Cypt8 | MGI:3616448
Cypt9 | MGI:3616450
Gm10325 | MGI:3642320
Gm10332 | MGI:3642276
Gm16503 | MGI:3642127

hdrabkin commented 3 years ago

But only one of them had PAINT annotations. I believe it was Gm10332 that was getting them.

dustine32 commented 3 years ago

Thanks for the quick response @hdrabkin and @ukemi !

The mapping to Gm10332/MGI:3642276 (UniProtKB:D3Z423) is still in UniProt, so we'll still be pulling this into PANTHER17.0 unless you have it removed from UniProt. This could just be a sync lag issue and it'll "clear up" soon. How recently was Gm10332 | MGI:3642276 withdrawn at MGI?

hdrabkin commented 3 years ago

According to our EI, the last modification date for the record was 2016-11-01. However, we noticed that it was still kept as a synonym, so we're really not sure. Also, the modified date might not be triggered if the withdrawal was done en masse by a script.

hdrabkin commented 3 years ago

Verified the deletion/withdrawal date was 2016!

ukemi commented 3 years ago

Even more argument to check against the markers in the GPI file.

pgaudet commented 2 years ago

@dustine32 @kltm Can this be moved to the pipeline repo and added to pipeline maintenance?

kltm commented 2 years ago

@pgaudet What exactly is the ask here again for the GO pipeline? Since we are not GPAD/GPI-driven yet, I suspect we'd have to do some conditional work in older code. But even that is later than source, which is now: noctua-models + PAINT + remainder_upstreams. Depending on the origin of the issue, some places may make things easier to filter than others.

pgaudet commented 2 years ago

It looks like we could add a check in the pipeline to remove any annotated entity that is not in the GPI of the upstream.

@ukemi Did I understand correctly what you meant?

ukemi commented 2 years ago

Yes, that would work but we should also put something in place that doesn't allow annotation to entities that are not in the GPI in the first place. Clearly this one has been withdrawn for a while.

pgaudet commented 2 years ago

Sure - these should not be loaded in Noctua, but I think the entities come from the GPI files, right @kltm?

Looks like they (i.e. this retired MGI marker) are also loaded in Panther @dustine32

ukemi commented 2 years ago

AFAIK, they are already not loaded into Noctua. But they shouldn't be allowed other places either.

hdrabkin commented 2 years ago

Where does Panther get its list of MGI genes? Is a drop/reload done?

hdrabkin commented 2 years ago

MGI provides two reports (http://www.informatics.jax.org/downloads/reports/index.html#marker):

MRK_List1.rpt (including withdrawn marker symbols)
MRK_List2.rpt (excluding withdrawn marker symbols)

ukemi commented 2 years ago

The bottom line here is that the GPI is the source of truth for annotatable objects at MGI. If it isn't in the GPI it should not be allowed. If it is in the GPI and shouldn't be, MGI should fix it on our end.

dustine32 commented 2 years ago

@hdrabkin For PANTHER, we start from UniProtKB IDs (For mouse, UP000000589_10090.fasta available here) and then use UniProt's .idmapping and .gene2acc files (in same above link) to map UniProtIDs to MGI IDs. It seems like a roundabout way to get MOD mappings (vs pulling MGI GPI) but this ensures being in sync with the Reference Proteome data used to build PANTHER trees. Some more info in these slides.
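To make the mapping step concrete, here is a rough sketch of reading UniProt-to-MGI pairs out of an idmapping file. It is not PANTHER's actual code. It assumes the usual UniProt idmapping layout of three tab-separated columns (UniProtKB accession, ID type, external ID) and that MGI cross-references use the ID type `MGI`; `uniprot_to_mgi` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def uniprot_to_mgi(idmapping_lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each UniProtKB accession to the set of MGI IDs it is linked to.

    Assumption: each line is 'accession<TAB>id_type<TAB>external_id' and
    MGI cross-references carry the id_type 'MGI'.
    """
    mapping: Dict[str, Set[str]] = defaultdict(set)
    for line in idmapping_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 3 and cols[1] == "MGI":
            mapping[cols[0]].add(cols[2])
    return dict(mapping)
```

Because this mapping comes from UniProt rather than from MGI's GPI, any MGI ID that UniProt still lists will be carried through, whether or not MGI has withdrawn it.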

hdrabkin commented 2 years ago

@dustine32 it looks like all of the ids listed above are withdrawn markers but UniProt still has them associated with a sequence and an MGI id. Our GPI file would be what is used I think.

dustine32 commented 2 years ago

Right, @hdrabkin! I see for Gm10332 | MGI:3642276, it still maps to UniProtKB:D3Z423 via the UniProt ID mapper and the D3Z423 page even links out to an MGI gene page that doesn't exist.

If this example MGI:3642276 was withdrawn at MGI in 2016 we should check with @alexsign what file exactly is used to source MOD->UniProt mappings at UniProt. I just found this https://github.com/geneontology/go-annotation/issues/3282#issuecomment-920938722 by @alexsign that points to an MGI file MRK_SwissProt_TrEMBL.rpt. When I check this file for MGI:3642276 I get this line:

$ curl -L http://www.informatics.jax.org/downloads/reports/MRK_SwissProt_TrEMBL.rpt | grep MGI:3642276
MGI:3642276     Gm10332 W       withdrawn       N/A     14      D3Z423

The file clearly marks this MGI ID as "withdrawn", but I don't know whether UniProt's ingest parses that status, and the MGI:3642276 -> D3Z423 mapping is still there.
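As a sketch of how an ingest step might honor that status, the following reads the report and lists withdrawn markers that still carry UniProt accessions. The column layout is inferred from the single line shown above (MGI ID, symbol, status, name, cM position, chromosome, accessions) and is an assumption, as are tab separation, space-separated accessions in the last column, and the helper name `withdrawn_with_uniprot`.

```python
from typing import Dict, Iterable, List


def withdrawn_with_uniprot(rpt_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Return {MGI ID: [UniProt accessions]} for rows whose status is 'W'.

    Assumption: tab-separated rows with the status in column 3 and the
    protein accessions (space-separated) in column 7.
    """
    result: Dict[str, List[str]] = {}
    for line in rpt_lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # not a data row in the assumed layout
        mgi_id = cols[0].strip()
        status = cols[2].strip()
        accessions = cols[6].split()
        if status == "W" and accessions:
            result[mgi_id] = accessions
    return result
```

Any accession returned here is one whose MGI cross-reference points at a withdrawn marker, so it is a candidate for exclusion from the mapping.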

pgaudet commented 2 years ago

This is why I was suggesting that the pipeline adds a step where the GPIs are checked. Or maybe only the PAINT pipeline needs to add that.

pgaudet commented 2 years ago

Another option of course is for UniProt to do that; we can ask Maria

ValWood commented 1 year ago

Hi @ukemi @pgaudet is this still an issue? Does it need to be escalated to UniProt? Currently has no assignee

ValWood commented 1 year ago

Hi @ukemi this is from 2022, I presume it is fixed or somebody would have complained/escalated, so closing. Reopen with appropriate assignee if the ticket is still required.