Good question, @kltm !
The files in question, using the current archive as an example, are:
The upstream file for both of these is currently https://ftp.ebi.ac.uk/pub/databases/GO/goa/UNIPROT/goa_uniprot_all.gaf.gz. We'll refer to this as the "upstream" file.
For goa_uniprot_all.gaf.gz, the GO takes the upstream file and removes all of the "canonical" species from it. That is, we filter out every species whose annotations are provided by another resource included in the GO annotation dataset. For example, we ingest and process SGD's data (https://sgd-prod-upload.s3.amazonaws.com/latest/gene_association.sgd.gaf.gz) for NCBITaxon:285006, NCBITaxon:307796, NCBITaxon:41870, NCBITaxon:4932, and NCBITaxon:559292, so we filter out these five taxa from the upstream file to produce our own version of this file. This is repeated for every resource we ingest, leaving us with a file that contains only what is not otherwise spoken for in the GO annotation set.
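As a rough illustration of the taxon filtering described above (a minimal sketch, not the actual GO pipeline code): the function name `drop_taxa` and the constant names are hypothetical. It assumes GAF 2.x layout, where column 13 holds the taxon as `taxon:<id>` (optionally `taxon:<id>|taxon:<id>`), and it assumes that only the first taxon in that column decides whether a line is dropped.

```python
# Taxa covered by SGD, taken from the comment above.
# Assumption: GAF lines write these as "taxon:<id>" in column 13.
SGD_TAXA = {"285006", "307796", "41870", "4932", "559292"}

TAXON_COLUMN = 12  # column 13 of a GAF 2.x line, zero-based


def drop_taxa(lines, excluded_taxa):
    """Yield GAF lines whose primary taxon is not in excluded_taxa.

    Hypothetical helper: header/comment lines (starting with "!") and
    lines too short to have a taxon column are passed through unchanged.
    """
    for line in lines:
        if line.startswith("!"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) <= TAXON_COLUMN:
            yield line
            continue
        # Assumption: the first taxon is the annotated gene product's organism.
        primary = fields[TAXON_COLUMN].split("|")[0]
        taxon_id = primary.split(":", 1)[-1]
        if taxon_id not in excluded_taxa:
            yield line
```

In practice the same pass would be run with the union of the taxa claimed by every ingested resource, reading the upstream file with something like `gzip.open(path, "rt")`.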
goa_uniprot_all_noiea.gaf.gz is the same as the goa_uniprot_all.gaf.gz file, except with electronically inferred (IEA) annotations filtered out. This is the file that is included in the AmiGO load and available at http://amigo.geneontology.org.
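The IEA filtering can be sketched the same way. This is only an illustration under the assumption of GAF 2.x layout, where column 7 holds the evidence code; `drop_iea` is a hypothetical name, not a function from the GO pipeline.

```python
EVIDENCE_COLUMN = 6  # column 7 of a GAF 2.x line, zero-based


def drop_iea(lines):
    """Yield GAF lines whose evidence code is not IEA.

    Hypothetical helper: header/comment lines (starting with "!") are
    passed through unchanged.
    """
    for line in lines:
        if line.startswith("!"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) > EVIDENCE_COLUMN and fields[EVIDENCE_COLUMN] == "IEA":
            continue  # skip electronically inferred annotations
        yield line
```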
We are currently working to make this clearer to our end users with:
https://github.com/geneontology/go-site/issues/1971 https://github.com/geneontology/go-site/issues/1753 https://github.com/geneontology/pipeline/issues/178
@suzialeksander Should this go in the FAQ?
Note: on 13 April 2023, the files mentioned in this ticket were renamed:
@suzialeksander I think this is being addressed in the above issues. Can this be closed?
Ah, I was keeping it open until https://github.com/geneontology/geneontology.github.io/pull/447, but yes, otherwise this has been thoroughly addressed.
Actually, closing makes sense, as these are no longer the file names.
There is some confusion about the contents of the goa_uniprot_all.gaf and goa_uniprot_all_noiea.gaf files available in /annotations in the archives. Can this be clarified?