geneontology / pipeline

Declarative pipeline for the Gene Ontology.
https://build.geneontology.org/job/geneontology/job/pipeline/
BSD 3-Clause "New" or "Revised" License

Add stats for number of genes per MOD for which IBAs get dropped on the floor #300

Open cmungall opened 2 years ago

cmungall commented 2 years ago

Discussed on managers call https://github.com/geneontology/pipeline/issues/300

A very useful statistic for GO is the number of genes that are not mapped to the reference proteome, and for which we are therefore losing IBA annotations.

An ad-hoc way to get this:

✗ curl -L -s ftp://ftp.pantherdb.org/downloads/paint/presubmission/gene_association.paint_zfin.gaf.gz | gzip -dc | grep -v ^ZFIN | cut -f2 | sort -u | wc
    3005    3018   28923
✗ curl -L -s ftp://ftp.pantherdb.org/downloads/paint/presubmission/gene_association.paint_mgi.gaf.gz | gzip -dc | grep -v ^MGI | cut -f2 | sort -u | wc
     200     213    1680

[overcounts because I am lazily not filtering comments etc but you get the point]
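A minimal sketch of the same count with comment lines filtered out, to avoid the overcounting mentioned above. The function name `count_unmapped` is hypothetical. It assumes the input is an uncompressed GAF on stdin, that header/comment lines start with `!`, and that column 1 holds exactly the MOD prefix for mapped genes (the commands above match on the line prefix instead).

```shell
#!/usr/bin/env bash
# count_unmapped MOD_PREFIX  (hypothetical helper, not part of the pipeline)
# Reads an uncompressed GAF on stdin and prints the number of distinct
# gene IDs (column 2) whose DB field (column 1) is not MOD_PREFIX.
count_unmapped() {
  local mod_prefix="$1"
  awk -F'\t' -v mod="$mod_prefix" '
    /^!/   { next }                 # skip GAF header/comment lines
    NF < 2 { next }                 # skip blank or malformed lines
    $1 != mod { ids[$2] = 1 }       # record each non-MOD gene ID once
    END { n = 0; for (id in ids) n++; print n }
  '
}
```

Usage with the same download as above: `curl -L -s ftp://ftp.pantherdb.org/downloads/paint/presubmission/gene_association.paint_zfin.gaf.gz | gzip -dc | count_unmapped ZFIN`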

This is based on the assumption that the paint_MOD files default to UniProt IDs where there is no mapping to a MOD ID

This could be done more systematically as part of the pipeline, with stats files generated
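As a rough sketch of what a generated stats file could look like, the function below writes one TSV row per MOD. Everything here is an assumption for illustration: the function name `write_iba_unmapped_stats`, the output columns, the local directory of already-downloaded `gene_association.paint_<mod>.gaf.gz` files, and the idea that the MOD prefix is the upper-cased file suffix (true for `zfin` and `mgi` above, not necessarily for every MOD).

```shell
#!/usr/bin/env bash
# write_iba_unmapped_stats DIR OUT MOD...   (hypothetical helper)
# For each MOD, reads DIR/gene_association.paint_<mod>.gaf.gz and appends
# a row "<PREFIX>\t<count>" to OUT, where count is the number of distinct
# gene IDs (column 2) whose DB field (column 1) is not the MOD prefix.
write_iba_unmapped_stats() {
  local dir="$1" out="$2"
  shift 2
  printf 'mod\tunmapped_genes\n' > "$out"
  local mod prefix n
  for mod in "$@"; do
    # Assumption: prefix is the upper-cased suffix, e.g. zfin -> ZFIN
    prefix=$(printf '%s' "$mod" | tr '[:lower:]' '[:upper:]')
    n=$(gzip -dc "$dir/gene_association.paint_${mod}.gaf.gz" |
        awk -F'\t' -v p="$prefix" '
          !/^!/ && NF >= 2 && $1 != p { ids[$2] = 1 }   # skip comments, keep non-MOD IDs
          END { n = 0; for (i in ids) n++; print n }')
    printf '%s\t%s\n' "$prefix" "$n" >> "$out"
  done
}
```

For example, `write_iba_unmapped_stats ./paint iba-unmapped-genes.tsv zfin mgi` would write a header plus two rows; the output file name is only a placeholder.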

I also think it might be nice to consider this as a GO rule ("all genes with ancestral annotations should have unambiguous mappings to UniProt") so that these numbers could be shown in the general report dashboard, but this requires further discussion

dustine32 commented 2 years ago

@cmungall I think this report has something close: https://docs.google.com/spreadsheets/d/1hMGJ8MFu1ozO3pHt44G2PPN9taWRPqEvb-1MK6WqVOI/edit#gid=1446471698

The section separated by taxon ID counts annotations rather than distinct genes, but I added this section specifically to help spot large drops in IBAs from release to release.

If the annotation count isn't granular enough (i.e., we need the distinct gene count), can I just add this new stat to the same report?

cmungall commented 2 years ago

I think the point is that we want something that is executed as part of the pipeline, with output in a standard place

dustine32 commented 2 years ago

Ah, OK. There might be a good place in the go-stats code to add this. Like in the go-annotation-changes.tsv report.