Closed pgaudet closed 5 months ago
'authoritative sources' metadata files
@pgaudet This refers specifically to the go-reference-species.yaml file in go-site, right?
Neither S. japonicus (NCBITaxon:402676) nor Xenopus laevis (NCBITaxon:8355) is in go-reference-species.yaml. S. japonicus will be added when I update go-reference-species.yaml to 17.0, but X. laevis won't be added until PANTHER 19.0 (due in a month or two). So, I believe it is correct for the X. laevis TreeGrafter IEAs to be present (not filtered out) in this release.
For S. pombe, there is the species (NCBITaxon:4896) vs. strain (NCBITaxon:284812) taxon issue, and the new GORULE:0000064 matches on taxon_id (currently NCBITaxon:284812 in go-reference-species.yaml). The PAINT IBAs for S. pombe use the strain taxon for now (we are changing this to the species taxon soon; ticket here), so I would assume that all TreeGrafter IEAs using strain taxon 284812 should currently be getting filtered out of the release. But it sounds like TreeGrafter IEAs to both taxa 4896 and 284812 are observed in the release, correct @pgaudet? If yes, we'll need to debug why, @mugitty.
Hi @dustine32 No, I was referring to the data in the metadata.yaml file for each contributing group: here's the pombe file, and the xenopus file. I thought the taxa in these files were excluded from the goa-uniprot-all load.
However, now I see that this data is in the PomBase upstream file, and likewise for Xenbase, so it is the upstream groups that decided to load them.
So, this is not a GO Central issue. Sorry about the false alarm!
Thanks, Pascale
@pgaudet Oh, ok! So this does not have anything to do with the new GORULE:0000064?
> So this does not have anything to do with the new GORULE:0000064?
No: GORULE:0000064 should exclude any annotations from the NCBITaxon:284812 reference species, which it does.
However, the file for pombe that we produce (http://release.geneontology.org/2024-03-28/annotations/pombase.gaf.gz) does have TreeGrafter annotations, while the source file doesn't (http://release.geneontology.org/2024-03-28/products/upstream_and_raw_data/paint_pombase-src.gaf.gz). I thought that the GO parser should exclude TreeGrafter annotations, since PomBase is listed as the authority on both NCBITaxon:284812 and NCBITaxon:4896 (see http://release.geneontology.org/2024-03-28/metadata/datasets/pombase.yaml), so all IEA annotations should be coming from the PomBase source file, i.e. http://release.geneontology.org/2024-03-28/products/upstream_and_raw_data/paint_pombase-src.gaf.gz
I don't see any TreeGrafter annotations in this file. How did we get them?
> so all IEA annotations should be coming from the PomBase source file, i.e. http://release.geneontology.org/2024-03-28/products/upstream_and_raw_data/paint_pombase-src.gaf.gz
@pgaudet Do you mean pombase-src.gaf.gz, not paint_pombase-src.gaf.gz, is the PomBase source file? I checked the file coming from upstream PomBase (https://release.geneontology.org/2024-03-28/products/upstream_and_raw_data/pombase-src.gaf.gz) and did indeed find it contained those TreeGrafter IEA annotations:
$ grep GO_REF:0000118 pombase-src.gaf | wc -l
1461
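For anyone who wants to see which taxon IDs those rows carry (species 4896 vs. strain 284812), here is a minimal Python sketch that does the same count as the grep but broken down by taxon. It assumes the standard GAF column layout (reference in column 6, taxon in column 13); the function names are made up for illustration and are not part of any GO tooling.

```python
import gzip
from collections import Counter

# Reference ID used for TreeGrafter annotations (the same one grepped above).
TREEGRAFTER_REF = "GO_REF:0000118"


def count_treegrafter_by_taxon(lines):
    """Count rows citing GO_REF:0000118, keyed by the row's first taxon ID."""
    counts = Counter()
    for line in lines:
        if line.startswith("!"):  # GAF header/comment lines
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:  # too short to have a taxon column
            continue
        # Column 6 (index 5) can hold several pipe-separated references.
        if TREEGRAFTER_REF in cols[5].split("|"):
            # Column 13 (index 12) may be "taxon:A|taxon:B"; the first entry
            # is the taxon of the annotated gene product.
            counts[cols[12].split("|")[0]] += 1
    return counts


def count_in_file(path):
    """Hypothetical helper: run the count on a plain or gzipped GAF file."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return count_treegrafter_by_taxon(handle)
```

Calling `count_in_file("pombase-src.gaf.gz")` on a downloaded copy would return a `Counter` keyed by values such as `taxon:4896`.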
Though, once I merge https://github.com/geneontology/go-site/pull/2289 these annotations should be getting filtered out of the final GO product pombase.gaf because the taxon IDs (NCBITaxon:4896) will match go-reference-species.yaml. Note: the GORULE:0000064 code only looks at taxa listed in go-reference-species.yaml rather than the taxa in metadata group datasets YAML files.
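To make the matching behaviour concrete, here is a rough Python sketch of the check as described in this thread: a row is dropped when it is a TreeGrafter IEA and its taxon is in the reference-species list. This is not the actual pipeline code. The function names are hypothetical, the reference taxa are passed in as a plain set instead of being read from go-reference-species.yaml, and normalizing `NCBITaxon:` vs. `taxon:` prefixes is an assumption made for the example.

```python
# Reference ID used for TreeGrafter annotations.
TREEGRAFTER_REF = "GO_REF:0000118"


def normalize_taxon(taxon):
    """Reduce 'NCBITaxon:4896' or 'taxon:4896' to the bare ID '4896'."""
    return taxon.split(":", 1)[-1]


def is_filtered(gaf_line, reference_taxa):
    """Return True if this GAF row would be dropped under the sketched rule.

    reference_taxa stands in for the taxon_id values listed in
    go-reference-species.yaml (assumption: given here as a set of strings).
    """
    cols = gaf_line.rstrip("\n").split("\t")
    if len(cols) < 13:
        return False  # not a full annotation row; nothing to filter
    references = cols[5].split("|")  # column 6: DB:Reference
    evidence = cols[6]               # column 7: evidence code
    taxon = normalize_taxon(cols[12].split("|")[0])  # column 13: taxon
    wanted = {normalize_taxon(t) for t in reference_taxa}
    return (
        evidence == "IEA"
        and TREEGRAFTER_REF in references
        and taxon in wanted
    )
```

With only `NCBITaxon:284812` in the set, a TreeGrafter IEA row on `taxon:4896` passes through, which matches what was seen in pombase.gaf. Once `NCBITaxon:4896` is in the set, the same row is filtered. Taxa that appear only in the per-group metadata YAML files play no part in this check.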
Thanks @dustine32 !!! I think I was tired yesterday - this is what I had seen last week, and yesterday I was evidently looking in the wrong file...
> Though, once I merge https://github.com/geneontology/go-site/pull/2289 these annotations should be getting filtered out of the final GO product pombase.gaf because the taxon IDs (NCBITaxon:4896) will match go-reference-species.yaml.
Perfect!
@pgaudet Great! @kltm just merged https://github.com/geneontology/go-site/pull/2289 so the pipeline should now be properly filtering out TreeGrafter IEAs for S. pombe and S. japonicus. We'll test this in the next snapshot.
Hi @kltm
This version of the snapshot contains TreeGrafter data for 3 species:
However, given that these taxa are in the 'authoritative sources' metadata files, I would expect the IEAs not to be loaded in the first place.
Is this filter not working in this case?