Motivation
Gene lengths are currently calculated using the GTFtools "merged" isoform length algorithm. @jychien brought to our attention that these lengths may be unstable across GENCODE versions, and therefore across CELLxGENE schema versions.
For general information on the gene length calculations available in GTFtools, please see this slide deck.
We then performed a systematic analysis, which demonstrates that the "merged" isoform length is indeed less stable than the other calculations. Please see the full report here.
Definition of Done
Change the gene length calculation from "merged" to "median", using the GTFtools implementation.
The code changes themselves will be easy. However, there are currently no unit tests for this code; they should be written as part of this work, which will make it take longer. I estimate 2-3 days.
The current implementation in our code base is here:
https://github.com/chanzuckerberg/single-cell-curation/blob/0c77179d2e794846861f8109c037b723507959cb/cellxgene_schema_cli/scripts/gene_processing.py#L129
The new implementation should be taken from here:
https://github.com/RacconC/gtftools/blob/140fc21003a565a0f69b5176db734b9a04a004a4/gtftools/gtftools.py#L670-L688
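To make the difference between the two calculations concrete, here is a minimal sketch. It is not the GTFtools code linked above: the function names and the toy gene are hypothetical, and it assumes exons are given as 1-based inclusive (start, end) pairs, as in GTF files. "Merged" is taken as the length of the union of all exons across isoforms, and "median" as the median of the per-isoform exon length sums.

```python
from statistics import median


def isoform_length(exons):
    """Sum of exon lengths for one isoform (1-based inclusive coordinates)."""
    return sum(end - start + 1 for start, end in exons)


def merged_length(isoforms):
    """Length of the union of all exons across all isoforms of a gene."""
    intervals = sorted(exon for exons in isoforms.values() for exon in exons)
    total = 0
    current_start, current_end = intervals[0]
    for start, end in intervals[1:]:
        if start <= current_end:
            # Overlapping exon: extend the current merged interval.
            current_end = max(current_end, end)
        else:
            # Gap: close the current interval and start a new one.
            total += current_end - current_start + 1
            current_start, current_end = start, end
    total += current_end - current_start + 1
    return total


def median_length(isoforms):
    """Median of the per-isoform lengths."""
    return median(isoform_length(exons) for exons in isoforms.values())


# Hypothetical gene with three isoforms (illustrative coordinates only).
gene_v1 = {
    "tx1": [(1, 100), (201, 300)],
    "tx2": [(1, 100), (251, 400)],
    "tx3": [(51, 100)],
}

# Same gene in a hypothetical later annotation that adds one isoform.
gene_v2 = dict(gene_v1, tx4=[(501, 1000)])

for name, gene in (("v1", gene_v1), ("v2", gene_v2)):
    print(name, "merged:", merged_length(gene), "median:", median_length(gene))
```

In this toy case, adding a single isoform moves the merged length from 300 to 800, while the median moves from 200 to 225. This is only an illustration of why the two values can react differently to annotation changes; the systematic analysis referenced above is the actual evidence.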