geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

An unexpected increase in MGI annotations (duplicates showing up) #294

Closed kltm closed 2 years ago

kltm commented 2 years ago

Looking at MGI annotations moving through the pipeline we see...

Incoming upstream:

skyhook/release$ zcat products/annotations/mgi-src.gaf.gz | grep Utf1 | grep 17025
MGI MGI:1276125 Utf1    enables GO:0017025  MGI:MGI:1303208|PMID:9748258    IPI UniProtKB:P29037    F   undifferentiated embryonic cell transcription factor 1      protein_coding_gene taxon:10090 20210527MGI     

Rules applied valid:

bbop@wok:/home/skyhook/release$ zcat products/annotations/mgi_valid.gaf.gz | grep Utf1 | grep 17025
MGI MGI:1276125 Utf1    enables GO:0017025  MGI:MGI:1303208|PMID:9748258    IPI UniProtKB:P29037    F   undifferentiated embryonic cell transcription factor 1      protein_coding_gene taxon:10090 20210527MGI     

Re-filtered and assembled:

bbop@wok:/home/skyhook/release$ zcat annotations/mgi.gaf.gz | grep Utf1 | grep 17025
MGI MGI:1276125 Utf1    enables GO:0017025  MGI:MGI:1303208|PMID:9748258    IPI UniProtKB:P29037    F   undifferentiated embryonic cell transcription factor 1      protein_coding_gene taxon:10090 20210527MGI     
MGI MGI:1276125 Utf1    enables GO:0017025  PMID:9748258    IPI UniProtKB:P29037    F   undifferentiated embryonic cell transcription factor 1  protein_coding_gene taxon:10090 20210527    MGI     

In that last step there is now a duplicate annotation that differs only in that the MGI:MGI:1303208 publication has been removed. Weird. There haven't been many software changes recently, so my first instinct is that this is somehow related to the increase in entities from MGI (https://github.com/geneontology/go-site/issues/1876). There is also something in the back of my head about normalizing publication identifiers, but I'm not finding a ticket.
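A quick way to see how widespread this is would be to group GAF lines on everything except the non-PMID references. The following is only a minimal sketch, not part of the pipeline: the function names are hypothetical, the column positions assume the standard GAF 2.x layout, and the sample values are copied from the lines above.

```python
from collections import defaultdict

# 0-based positions of the GAF 2.x columns used to build the comparison key.
DB, OBJECT_ID, QUALIFIER, GO_ID, REFERENCE, EVIDENCE, WITH_FROM = 0, 1, 3, 4, 5, 6, 7


def normalize_references(reference_field):
    """Reduce a pipe-separated reference field to its sorted PMIDs.

    Falls back to all references when there is no PMID, so that
    annotations with unrelated MGI-only references are not merged.
    """
    refs = reference_field.split("|")
    pmids = tuple(sorted(r for r in refs if r.startswith("PMID:")))
    return pmids or tuple(sorted(refs))


def find_reference_duplicates(gaf_lines):
    """Return groups of annotations that differ only in their non-PMID references."""
    groups = defaultdict(set)
    for line in gaf_lines:
        if line.startswith("!") or not line.strip():
            continue  # skip GAF header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= WITH_FROM:
            continue  # not a full annotation line
        key = (cols[DB], cols[OBJECT_ID], cols[QUALIFIER], cols[GO_ID],
               normalize_references(cols[REFERENCE]), cols[EVIDENCE], cols[WITH_FROM])
        groups[key].add(cols[REFERENCE])
    # A group is suspicious when the same key was seen with different raw reference fields.
    return {key: sorted(refs) for key, refs in groups.items() if len(refs) > 1}


def gaf_line(reference):
    """Build a sample GAF line (hypothetical helper; values taken from this thread)."""
    return "\t".join([
        "MGI", "MGI:1276125", "Utf1", "enables", "GO:0017025", reference, "IPI",
        "UniProtKB:P29037", "F",
        "undifferentiated embryonic cell transcription factor 1", "",
        "protein_coding_gene", "taxon:10090", "20210527", "MGI", "", "",
    ])


sample = [gaf_line("MGI:MGI:1303208|PMID:9748258"), gaf_line("PMID:9748258")]
for key, refs in find_reference_duplicates(sample).items():
    print(key[1], key[3], refs)  # object ID, GO ID, and the differing reference fields
```

To run this over a real file, the lines could be read with `gzip.open(path, "rt")` and passed in directly.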

The fact that this happens post-valid means that it's happening in sh "make -f /opt/go-site/scripts/Makefile-gaf-reprocess all"?

Tagging @dustine32

Reported by @pgaudet

dustine32 commented 2 years ago

The extra PMID-only annotation is in noctua_mgi.gpad, meaning it's coming from a Noctua model:

$ curl -L http://skyhook.berkeleybop.org/release/products/annotations/noctua_mgi-src.gpad.gz | zgrep MGI:1276125 | grep GO:0017025
MGI MGI:1276125 enables GO:0017025  PMID:9748258    ECO:0000353 UniProtKB:P2903720210527    MGI     model-state=production|contributor=https://orcid.org/0000-0003-3394-9805%7Cnoctua-model-id=gomodel:60ad85f700000058

This is getting mixed with the upstream MGI GAF, which has:

$ grep MGI:1276125 products/annotations/mgi-src.gaf | grep GO:0017025
MGI MGI:1276125 Utf1    enables GO:0017025  MGI:MGI:1303208|PMID:9748258    IPI UniProtKB:P29037    F   undifferentiated embryonic cell transcription factor 1      protein_coding_gene taxon:10090 20210527    MGI

Checking the upstream MGI GAF used for the last release (2022-07-01), I see this annotation wasn't there before:

$ curl -L http://current.geneontology.org/products/annotations/mgi-src.gaf.gz | zgrep MGI:1276125 | grep GO:0017025
(nothing)

So I'm thinking the upstream MGI GAF only recently started including Noctua annotations.
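The same cross-check could be done in bulk rather than one gene at a time. The sketch below is an illustration under assumptions, not existing pipeline code: the function names are hypothetical, the column positions assume the standard GAF 2.x and GPAD 1.x layouts, and the evidence column is left out of the key because the GAF carries evidence codes (IPI) while the GPAD carries ECO identifiers.

```python
def pmids(reference_field):
    """Return the PMID entries of a pipe-separated reference field."""
    return [r for r in reference_field.split("|") if r.startswith("PMID:")]


def annotation_keys(lines, go_col, ref_col, with_col):
    """Build a set of (db, object_id, go_id, pmid, with_from) keys from tab-separated lines."""
    keys = set()
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) <= max(go_col, ref_col, with_col):
            continue  # not a full annotation line
        for pmid in pmids(cols[ref_col]):
            keys.add((cols[0], cols[1], cols[go_col], pmid, cols[with_col]))
    return keys


def gaf_keys(lines):
    # GAF 2.x (0-based): GO ID = 4, reference = 5, with/from = 7.
    return annotation_keys(lines, go_col=4, ref_col=5, with_col=7)


def gpad_keys(lines):
    # GPAD 1.x (0-based): GO ID = 3, reference = 4, with/from = 6.
    return annotation_keys(lines, go_col=3, ref_col=4, with_col=6)


def shared_annotations(gaf_lines, gpad_lines):
    """Annotations present in both the upstream GAF and the Noctua GPAD."""
    return gaf_keys(gaf_lines) & gpad_keys(gpad_lines)


# Sample lines built from the values quoted in this thread.
gaf_sample = ["\t".join([
    "MGI", "MGI:1276125", "Utf1", "enables", "GO:0017025",
    "MGI:MGI:1303208|PMID:9748258", "IPI", "UniProtKB:P29037", "F",
    "undifferentiated embryonic cell transcription factor 1", "",
    "protein_coding_gene", "taxon:10090", "20210527", "MGI", "", "",
])]
gpad_sample = ["\t".join([
    "MGI", "MGI:1276125", "enables", "GO:0017025", "PMID:9748258",
    "ECO:0000353", "UniProtKB:P29037", "", "20210527", "MGI", "",
    "model-state=production",
])]

for key in sorted(shared_annotations(gaf_sample, gpad_sample)):
    print("\t".join(key))
```

A non-empty result means the upstream GAF already contains annotations that the pipeline will also receive from Noctua, which is the doubling seen above.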

kltm commented 2 years ago

Tagging on @ukemi and @hdrabkin

hdrabkin commented 2 years ago

So I just realized that SOME of the PMIDs without an MGI ID are from IEP annotations that we append to the GAF, since we currently do not load these into MGI (but we will soon). This is about 500 annotations. However, in no case will there be an IEP annotation with BOTH an MGI ID|PMID reference and a separate PMID without an MGI ID, using the same reference and coming from UniProt/GOA. This will be fixed ASAP, as we have decided to allow IEP annotations to load into our db, so those will no longer be appended to our GAF.

kltm commented 2 years ago

@hdrabkin From what @dustine32 is reporting about the MGI upstream GAF that we're getting (https://github.com/geneontology/pipeline/issues/294#issuecomment-1198507064), it looks like MGI is taking in Noctua annotations and reproducing them in the GAF we take. That causes a doubling of annotations, which are the same except for the additional reference. A way to look at this more closely would be to track the following annotation back through MGI and see where it originally comes from: MGI MGI:1276125 Utf1 enables GO:0017025 MGI:MGI:1303208|PMID:9748258 IPI UniProtKB:P29037 F undifferentiated embryonic cell transcription factor 1 protein_coding_gene taxon:10090 20210527 MGI

ukemi commented 2 years ago

Actually, I think what might be happening here is that some annotations have been imported from existing MGI annotations and there is no way of telling that they were imported. However, I have only done this a few times so there shouldn't be that many. We may be looking at a needle in a haystack of other issues. Maybe it's time to resurrect this: https://github.com/orgs/geneontology/projects/91

ukemi commented 2 years ago

@dustine32 and @kltm are you sure you are taking the file from MGI that has only the loads for the pipeline and not the file that has ALL the mouse annotations? The one that is the loads only is gene_association_nonnoctua.mgi.gz. This file filters out annotations from Noctua. I am checking to see if we propagated our change in annotation object types to this file.

kltm commented 2 years ago

@ukemi Okay, that's maybe the source of the confusion then; I updated according to the previous ticket: https://github.com/geneontology/go-site/issues/1876#issuecomment-1191309181 Checking, neither http://www.informatics.jax.org/downloads/reports/gene_association_nonnoctua.mgi nor http://www.informatics.jax.org/downloads/reports/gene_association_nonnoctua.mgi.gz currently seems to exist. Is that file maybe available somewhere else?
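An existence check like this can be scripted when there are several candidate locations to try. This is a minimal sketch using only the Python standard library; `url_exists` and `check_candidates` are hypothetical names, and it assumes the server answers HEAD requests.

```python
import urllib.request


def url_exists(url, timeout=10):
    """Return True if a HEAD request for the URL succeeds with a 2xx status."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError):
        # OSError covers URLError/HTTPError and timeouts; ValueError covers malformed URLs.
        return False


def check_candidates(urls, probe=url_exists):
    """Map each candidate URL to whether the probe considers it reachable."""
    return {url: probe(url) for url in urls}
```

Calling `check_candidates([...])` with the candidate URLs returns a dict of URL to True/False; the `probe` parameter exists so the check can be exercised without network access.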

ukemi commented 2 years ago

As far as I know, this is the only place we post the file that strips out all the Noctua annotations. I hypothesize that when we modified our GAF to be consistent with the object types in the GPI file, we (GO) inadvertently switched to the GAF that does not filter out the Noctua annotations. The unfiltered file is actually the one we (MGI) QC checked. I am checking to make sure that the changes to the object types in the full GAF also propagated to the one where we filter out the Noctua annotations (nonoctua). I'm not sure whether Lori filters the full GAF or makes the non-noctua file 'de novo' from the database. At least I think we have identified the problem correctly now. Previous to this current load, did you guys use the nonoctua file? If so, and you are now using mgi.gaf, that's the problem.

kltm commented 2 years ago

Looking through the commit history of mgi.yaml (https://github.com/geneontology/go-site/commits/master/metadata/datasets/mgi.yaml), I believe we used

    source: http://www.informatics.jax.org/downloads/custom/noctua/gene_association_nonoctua.mgi

since the import until the most recent changes. That file still seems to be active, but I was under the impression from the other thread we were to be using the one in reports, not custom. We're good changing it to whatever it should be.

ukemi commented 2 years ago

I just checked with Lori. The file you should be using is: http://www.informatics.jax.org/downloads/reports/gene_association_nonoctua.mgi.gz

The one in the custom directory was just for initial testing.

Note that I just checked the file in reports and the annotation reported by @dustine32 (https://github.com/geneontology/pipeline/issues/294#issuecomment-1198507064) is not there.

kltm commented 2 years ago

@ukemi Great! I've updated the metadata and we'll hopefully start seeing the right things out of snapshot later this week.

kltm commented 2 years ago

@ukemi @pgaudet @dustine32 There was a bit more work to get the pipelines ticking over again, but I think we're good again looking at the snapshot results:

bbop@wok:/home/skyhook/snapshot$ zcat annotations/mgi.gaf.gz | grep Utf1 | grep 17025
MGI MGI:1276125 Utf1    enables GO:0017025  PMID:9748258    IPI UniProtKB:P29037    F   undifferentiated embryonic cell transcription factor 1  protein_coding_gene taxon:10090 20210527    MGI     

kltm commented 2 years ago

Moving to clearing as I assume this will get checked again when we attempt release.

kltm commented 2 years ago

release passed.