geneontology / helpdesk

The Gene Ontology Helpdesk
http://help.geneontology.org

Confused about the contents of the goa_uniprot_all.gaf and goa_uniprot_all_noiea.gaf files available in /annotations in the downloads #432

Closed kltm closed 1 year ago

kltm commented 1 year ago

There is some confusion about the contents of the goa_uniprot_all.gaf and goa_uniprot_all_noiea.gaf files available in /annotations in the archives. Can this be clarified?

kltm commented 1 year ago

Good question, @kltm !

The files in question, using the current archive as an example, are:

The upstream file for both of these is currently https://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz. We'll refer to this as the "upstream" file.

goa_uniprot_all.gaf.gz

For this file, the GO takes the upstream file and removes all of the "canonical" species from it; that is, it filters out every species whose annotations are provided by another resource included in the GO annotation dataset. For example, we ingest and process SGD's data (https://sgd-prod-upload.s3.amazonaws.com/latest/gene_association.sgd.gaf.gz) for NCBITaxon:285006, NCBITaxon:307796, NCBITaxon:41870, NCBITaxon:4932, and NCBITaxon:559292, so we filter these five taxa out of the upstream file when producing our own version of it. This is repeated for every resource we ingest, leaving us with a file that contains only the remainder: annotations for taxa not otherwise spoken for in the GO annotation set.
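As a rough illustration of the filtering described above (not the actual GO pipeline code, which is not shown in this thread), here is a minimal Python sketch. It assumes GAF 2.x tab-separated lines, where column 13 holds the taxon as `taxon:<id>` and a second, pipe-separated taxon may follow for multi-organism annotations. Filtering on the first taxon only is an assumption, and `drop_taxa` and `SGD_TAXA` are hypothetical names.

```python
# Taxon IDs supplied by SGD, as listed in the comment above.
SGD_TAXA = {"285006", "307796", "41870", "4932", "559292"}


def primary_taxon(line):
    """Return the first taxon ID in GAF column 13, without its prefix.

    Returns None if the line has fewer than 13 columns.
    """
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 13:
        return None
    # Column 13 looks like "taxon:559292" or "taxon:4932|taxon:9606".
    return cols[12].split("|")[0].split(":", 1)[-1]


def drop_taxa(lines, excluded):
    """Yield header lines and annotations whose primary taxon is not excluded."""
    for line in lines:
        if line.startswith("!"):
            # GAF header/comment lines are passed through unchanged.
            yield line
        elif primary_taxon(line) not in excluded:
            yield line
```

Repeating `drop_taxa` with the taxon set of each ingested resource (or with the union of all such sets) would leave only the annotations not covered elsewhere.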

goa_uniprot_all_noiea.gaf.gz

This is the same as the goa_uniprot_all.gaf.gz file, except that annotations with the IEA (Inferred from Electronic Annotation) evidence code are filtered out. This is the file that is included in the AmiGO load and available at http://amigo.geneontology.org.
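For readers who want to reproduce this kind of filter locally, here is a minimal Python sketch. It is not the GO pipeline's own code; it assumes GAF 2.x tab-separated lines with the evidence code in column 7, and the names `drop_iea` and `write_noiea` are hypothetical.

```python
import gzip


def drop_iea(lines):
    """Yield every line except annotations whose evidence code is IEA."""
    for line in lines:
        if line.startswith("!"):
            # GAF header/comment lines are passed through unchanged.
            yield line
            continue
        cols = line.split("\t")
        # Column 7 (index 6) holds the evidence code.
        if len(cols) > 6 and cols[6] == "IEA":
            continue
        yield line


def write_noiea(src_path, dst_path):
    """Stream a gzipped GAF file to a new gzipped file without IEA rows."""
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
         gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        dst.writelines(drop_iea(src))
```

Because both functions stream line by line, the whole file never has to be held in memory, which matters for a file as large as the UniProt GAF.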

We are currently working to make this more clear to our end users with:

https://github.com/geneontology/go-site/issues/1971
https://github.com/geneontology/go-site/issues/1753
https://github.com/geneontology/pipeline/issues/178

pgaudet commented 1 year ago

@suzialeksander Should this go in the FAQ?

suzialeksander commented 1 year ago

Note: on 13 April 2023, the files mentioned in this ticket were renamed:

kltm commented 1 year ago

Noting: https://github.com/geneontology/go-site/issues/1984

kltm commented 1 year ago

@suzialeksander I think this is being addressed in the above issues. Can this be closed?

suzialeksander commented 1 year ago

Ah, I was keeping it open until https://github.com/geneontology/geneontology.github.io/pull/447, but yes otherwise this has been thoroughly addressed

suzialeksander commented 1 year ago

Actually, closing makes sense, as these are no longer the file names.