Closed by kltm 2 years ago
The extra PMID-only annotation is in the noctua_mgi.gpad, meaning it's coming from a Noctua model:
$ curl -L http://skyhook.berkeleybop.org/release/products/annotations/noctua_mgi-src.gpad.gz | zgrep MGI:1276125 | grep GO:0017025
MGI MGI:1276125 enables GO:0017025 PMID:9748258 ECO:0000353 UniProtKB:P29037 20210527 MGI model-state=production|contributor=https://orcid.org/0000-0003-3394-9805%7Cnoctua-model-id=gomodel:60ad85f700000058
This is getting mixed with the upstream MGI GAF, which has:
$ grep MGI:1276125 products/annotations/mgi-src.gaf | grep GO:0017025
MGI MGI:1276125 Utf1 enables GO:0017025 MGI:MGI:1303208|PMID:9748258 IPI UniProtKB:P29037 F undifferentiated embryonic cell transcription factor 1 protein_coding_gene taxon:10090 20210527 MGI
Checking the upstream MGI GAF used for the last release (2022-07-01), I see this annotation wasn't there before:
$ curl -L http://current.geneontology.org/products/annotations/mgi-src.gaf.gz | zgrep MGI:1276125 | grep GO:0017025
(nothing)
So I'm thinking the upstream MGI GAF only recently started including Noctua annotations.
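The grep chains above match substrings anywhere on a line. A column-aware variant can make the same comparison a little less error-prone. The following is only a sketch: `annotation_refs` is a hypothetical helper name, and it assumes tab-separated GAF input with the object ID in column 2, the GO ID in column 5, and the reference in column 6.

```shell
# annotation_refs GENE_ID GO_ID
# Hypothetical helper: reads GAF lines on stdin and prints the distinct
# reference columns (column 6) of annotations for the given object ID
# (column 2) and GO term (column 5).
annotation_refs() {
  awk -F'\t' -v id="$1" -v go="$2" '
    # skip "!" header lines, then match the two columns exactly
    !/^!/ && $2 == id && $5 == go { print $6 }
  ' | sort -u
}

# Example usage (not run here), with the release URL quoted in this thread:
#   curl -L http://current.geneontology.org/products/annotations/mgi-src.gaf.gz \
#     | gunzip -c | annotation_refs MGI:1276125 GO:0017025
```

No output means the annotation is absent from that file. More than one line means the same gene and term pair carries several distinct reference strings.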
Tagging on @ukemi and @hdrabkin
So I just realized that SOME of the PMIDs without an MGI ID are from IEP annotations that we append to the GAF, since we currently do not load these into MGI (but we will soon). This is about 500 annotations. However, in no case will there be an IEP annotation with BOTH an MGI ID|PMID reference and a separate PMID-only reference for the same paper coming from UniProt/GOA. This will be fixed ASAP, as we have decided to allow IEP annotations to load into our database, so those will no longer be appended to our GAF.
@hdrabkin From what @dustine32 is reporting from the MGI upstream GAF that we're getting (https://github.com/geneontology/pipeline/issues/294#issuecomment-1198507064), it looks like MGI is taking in Noctua annotations and reproducing them in the GAF we take, causing a doubling of annotations, which are the same except for the additional reference.
A way to look at this more closely would be to track back the annotation
MGI MGI:1276125 Utf1 enables GO:0017025 MGI:MGI:1303208|PMID:9748258 IPI UniProtKB:P29037 F undifferentiated embryonic cell transcription factor 1 protein_coding_gene taxon:10090 20210527 MGI
through MGI and to see where it originally comes from.
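To get a sense of how widespread the doubling is, one could scan an assembled GAF for annotation pairs that agree on everything except the reference column, where one reference is a single ID contained in the other's pipe-separated list. This is a rough sketch, not part of the pipeline: `find_ref_doubles` is a hypothetical name, and the choice of columns 2, 4, 5, 7, and 8 (object ID, qualifier, GO ID, evidence, with/from) as the comparison key is an assumption.

```shell
# find_ref_doubles
# Hypothetical helper: reads tab-separated GAF lines on stdin and prints
#   key <TAB> single-reference <TAB> multi-reference
# whenever an annotation with a single reference also appears, with the same
# key, carrying that reference inside a pipe-separated reference list.
find_ref_doubles() {
  awk -F'\t' '
    /^!/ { next }                                 # skip GAF header lines
    {
      # assumed comparison key: object ID, qualifier, GO ID, evidence, with/from
      key = $2 "\t" $4 "\t" $5 "\t" $7 "\t" $8
      n = split($6, refs, "[|]")
      if (n == 1) {
        single[key, $6] = 1                       # annotation with one reference
      } else {
        for (i = 1; i <= n; i++)
          multi[key, refs[i]] = $6                # remember the full reference list
      }
    }
    END {
      for (k in multi) {
        if (k in single) {
          split(k, parts, SUBSEP)                 # parts[1] = key, parts[2] = shared reference
          print parts[1] "\t" parts[2] "\t" multi[k]
        }
      }
    }
  ' | sort
}

# Example usage (not run here):
#   zcat annotations/mgi.gaf.gz | find_ref_doubles | wc -l
```

Pairs that legitimately cite two different papers are not reported, since the match requires the single reference to be a member of the longer list.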
Actually, I think what might be happening here is that some annotations have been imported from existing MGI annotations and there is no way of telling that they were imported. However, I have only done this a few times so there shouldn't be that many. We may be looking at a needle in a haystack of other issues. Maybe it's time to resurrect this: https://github.com/orgs/geneontology/projects/91
@dustine32 and @kltm are you sure you are taking the file from MGI that has only the loads for the pipeline and not the file that has ALL the mouse annotations? The one that is the loads only is gene_association_nonnoctua.mgi.gz. This file filters out annotations from Noctua. I am checking to see if we propagated our change in annotation object types to this file.
@ukemi Okay, that's maybe the source of the confusion then; I updated according to the previous ticket (https://github.com/geneontology/go-site/issues/1876#issuecomment-1191309181). Checking, there does not currently seem to be a http://www.informatics.jax.org/downloads/reports/gene_association_nonnoctua.mgi or a http://www.informatics.jax.org/downloads/reports/gene_association_nonnoctua.mgi.gz there. Is that file maybe available somewhere else?
As far as I know, this is the only place we post the file that strips out all the Noctua annotations. I hypothesize that when we modified our GAF to be consistent with the object types in the GPI file, we (GO) inadvertently switched to the GAF that does not filter out the Noctua annotations. The unfiltered file is actually the one we (MGI) QC checked. I am checking to make sure that the changes to the object types in the full GAF also propagated to the one where we filter out the Noctua annotations (nonoctua). I'm not sure whether Lori filters the full GAF or makes the non-Noctua file 'de novo' from the database. At least I think we have identified the problem correctly now. Before this current load, did you guys use the nonoctua file? If so, and you are now using mgi.gaf, that's the problem.
Looking through the commit history of mgi.yaml (https://github.com/geneontology/go-site/commits/master/metadata/datasets/mgi.yaml), I believe we used
source: http://www.informatics.jax.org/downloads/custom/noctua/gene_association_nonoctua.mgi
since the import until the most recent changes. That file still seems to be active, but I was under the impression from the other thread that we were to be using the one in reports, not custom. We're good changing it to whatever it should be.
I just checked with Lori. The file you should be using is: http://www.informatics.jax.org/downloads/reports/gene_association_nonoctua.mgi.gz
The one in the custom directory was just for initial testing.
Note that I just checked the file in reports and the annotation reported by @dustine32 (https://github.com/geneontology/pipeline/issues/294#issuecomment-1198507064) is not there.
@ukemi Great! I've updated the metadata and we'll hopefully start seeing the right things out of snapshot later this week.
@ukemi @pgaudet @dustine32 There was a bit more work to get the pipelines ticking over again, but I think we're good again looking at the snapshot results:
bbop@wok:/home/skyhook/snapshot$ zcat annotations/mgi.gaf.gz | grep Utf1 | grep 17025
MGI MGI:1276125 Utf1 enables GO:0017025 PMID:9748258 IPI UniProtKB:P29037 F undifferentiated embryonic cell transcription factor 1 protein_coding_gene taxon:10090 20210527 MGI
Moving to clearing as I assume this will get checked again when we attempt release.
release passed.
Looking at MGI annotations moving through the pipeline we see...
Incoming upstream:
Rules applied valid:
Re-filtered and assembled
In the last step there, there is now a duplicate annotation with just the MGI:MGI:1303208 publication removed. Weird. There haven't been a lot of software changes recently, so my first instinct is that this is somehow related to the increase in entities from MGI (https://github.com/geneontology/go-site/issues/1876)...somehow. There is also something in the back of my head about normalizing publication identifiers, but I'm not finding a ticket.
The fact that this happens post-valid means that it's happening in sh "make -f /opt/go-site/scripts/Makefile-gaf-reprocess all"?
Tagging @dustine32
Reported by @pgaudet