geneontology / go-releases

Tasks and notes for monthly GO releases

2024-03-28 snapshot - Tree Grafter #77

Closed: pgaudet closed this issue 5 months ago

pgaudet commented 6 months ago

Hi @kltm

This version of the snapshot contains TreeGrafter data for 3 species:

However, given that these taxa are in the 'authoritative sources' metadata files, I would expect the IEAs not to be loaded in the first place.

Is this filter not working in this case?

dustine32 commented 6 months ago

'authoritative sources' metadata files

@pgaudet This refers specifically to the go-reference-species.yaml file in go-site, right?

pgaudet commented 6 months ago

Hi @dustine32, no, I was referring to the data in the metadata.yaml file for each contributing group, for example the pombe file and the xenopus file. I thought the taxa in these files were excluded from the goa-uniprot-all load.

However, I now see that this data is in the PomBase upstream file, and likewise for Xenbase, so it is the upstream groups that decided to load them.

So, this is not a GO Central issue. Sorry about the false alarm!

Thanks, Pascale

dustine32 commented 6 months ago

@pgaudet Oh, ok! So this does not have anything to do with the new GORULE:0000064?

pgaudet commented 6 months ago

So this does not have anything to do with the new GORULE:0000064?

No: GORULE:0000064 should exclude any annotations from the NCBITaxon:284812 reference species, which it does.

However, the file for pombe that we produce (http://release.geneontology.org/2024-03-28/annotations/pombase.gaf.gz) does have TreeGrafter annotations, while the source file doesn't (http://release.geneontology.org/2024-03-28/products/upstream_and_raw_data/paint_pombase-src.gaf.gz). I thought that the GO parser should exclude TreeGrafter annotations, since PomBase is listed as the authority on both NCBITaxon:284812 and NCBITaxon:4896 (see http://release.geneontology.org/2024-03-28/metadata/datasets/pombase.yaml), so all IEA annotations should be coming from the PomBase source file, i.e. http://release.geneontology.org/2024-03-28/products/upstream_and_raw_data/paint_pombase-src.gaf.gz

I don't see any TreeGrafter annotations in this file. How did we get them?

dustine32 commented 6 months ago

so all IEA annotations should be coming from the pombase source file, i.e. http://release.geneontology.org/2024-03-28/products/upstream_and_raw_data/paint_pombase-src.gaf.gz

@pgaudet Do you mean that pombase-src.gaf.gz, rather than paint_pombase-src.gaf.gz, is the PomBase source file? I checked the file coming from upstream PomBase (https://release.geneontology.org/2024-03-28/products/upstream_and_raw_data/pombase-src.gaf.gz) and found that it does contain those TreeGrafter IEA annotations:

$ grep GO_REF:0000118 pombase-src.gaf | wc -l
    1461
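The same check can be run on the gzipped file without unpacking it first. The following is a minimal sketch, not pipeline code: the function names `count_treegrafter` and `count_in_gzipped_gaf` are hypothetical, and it assumes the standard tab-separated GAF layout in which column 6 holds the pipe-separated references. Unlike the plain `grep`, it only counts lines where GO_REF:0000118 appears in the reference column, so the totals could differ if the string occurs elsewhere on a line.

```python
import gzip

# GO_REF used for TreeGrafter annotations, as in the grep above.
TREEGRAFTER_REF = "GO_REF:0000118"


def count_treegrafter(lines):
    """Count GAF annotation lines whose reference column cites TreeGrafter."""
    count = 0
    for line in lines:
        # Lines starting with "!" are GAF header/comment lines.
        if line.startswith("!"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 7:
            continue  # skip malformed or truncated lines
        # Column 6 (index 5) is DB:Reference, possibly pipe-separated.
        if TREEGRAFTER_REF in cols[5].split("|"):
            count += 1
    return count


def count_in_gzipped_gaf(path):
    """Hypothetical helper: run the count on a .gaf.gz file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return count_treegrafter(handle)
```

For example, `count_in_gzipped_gaf("pombase-src.gaf.gz")` would report the number of TreeGrafter-referenced lines in a locally downloaded copy of the upstream file.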

Though, once I merge https://github.com/geneontology/go-site/pull/2289 these annotations should be getting filtered out of the final GO product pombase.gaf because the taxon IDs (NCBITaxon:4896) will match go-reference-species.yaml. Note: the GORULE:0000064 code only looks at the taxa listed in go-reference-species.yaml, not at the taxa in the per-group dataset metadata YAML files.
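To make the intended behaviour concrete, here is a rough sketch of the kind of filter described above. It is not the actual GORULE:0000064 implementation: the names `taxon_number` and `filter_treegrafter` are hypothetical, the reference taxa are passed in as a plain set rather than read from go-reference-species.yaml, and only the first taxon in GAF column 13 is checked.

```python
TREEGRAFTER_REF = "GO_REF:0000118"


def taxon_number(value):
    """Reduce 'taxon:4896' or 'NCBITaxon:4896' to the bare ID '4896'."""
    return value.split(":")[-1].strip()


def filter_treegrafter(lines, reference_taxa):
    """Yield GAF lines, dropping TreeGrafter IEAs for reference taxa.

    reference_taxa is a set of bare taxon IDs (an assumption of this
    sketch; the real pipeline reads them from go-reference-species.yaml).
    """
    for line in lines:
        if line.startswith("!"):
            yield line  # keep header/comment lines untouched
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 13:
            yield line  # too short to evaluate; leave for other checks
            continue
        refs = cols[5].split("|")            # column 6: DB:Reference
        evidence = cols[6]                   # column 7: evidence code
        taxon = taxon_number(cols[12].split("|")[0])  # column 13: taxon
        is_treegrafter_iea = evidence == "IEA" and TREEGRAFTER_REF in refs
        if is_treegrafter_iea and taxon in reference_taxa:
            continue  # drop: TreeGrafter IEA for a reference species
        yield line
```

With `reference_taxa = {"284812", "4896"}`, a TreeGrafter IEA line for taxon:4896 would be dropped, while the same annotation for a taxon outside the set, or a non-IEA annotation for taxon:4896, would pass through.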

pgaudet commented 6 months ago

Thanks @dustine32 !!! I think I was tired yesterday - this is what I had seen last week, and yesterday I was evidently looking in the wrong file...

Though, once I merge https://github.com/geneontology/go-site/pull/2289 these annotations should be getting filtered out of the final GO product pombase.gaf because the taxon IDs (NCBITaxon:4896) will match go-reference-species.yaml.

Perfect!

dustine32 commented 5 months ago

@pgaudet Great! @kltm just merged https://github.com/geneontology/go-site/pull/2289 so the pipeline should now be properly filtering out TreeGrafter IEAs for S. pombe and S. japonicus. We'll test this in the next snapshot.