chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License
35 stars 22 forks

Feature length should be calculated as the median of isoforms instead of the merged length of isoforms #960

Open pablo-gar opened 2 weeks ago

pablo-gar commented 2 weeks ago

Motivation

Currently, gene lengths are calculated using the GTFtools "merged" algorithm, i.e. the merged length of all isoforms. It was brought to our attention by @jychien that these lengths may be unstable across GENCODE versions and therefore across CELLxGENE schema versions.

For general information on the gene length calculations available in GTFtools, please see this slide deck.

We then performed a systematic analysis, which indeed demonstrates that the "merged" length of isoforms is less stable than the other calculations. Please see the full report here.

Definition of Done

Change the calculation of gene length from "merged" to "median", using the GTFtools implementation.

The current implementation in our code base is here:

https://github.com/chanzuckerberg/single-cell-curation/blob/0c77179d2e794846861f8109c037b723507959cb/cellxgene_schema_cli/scripts/gene_processing.py#L129

The new implementation should be taken from here:

https://github.com/RacconC/gtftools/blob/140fc21003a565a0f69b5176db734b9a04a004a4/gtftools/gtftools.py#L670-L688
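
To illustrate the difference between the two calculations, here is a minimal sketch. It is not the GTFtools code and not our current implementation; the function names, the 1-based inclusive exon coordinates, and the toy isoforms are all assumptions made for the example. "Merged" is taken to mean the length of the union of all exons across isoforms, and "median" the median of the per-isoform exon length sums.

```python
from statistics import median


def merged_length(isoforms):
    """Length of the union of all exon intervals across isoforms.

    `isoforms` is a list of isoforms; each isoform is a list of
    (start, end) exon tuples, 1-based and inclusive (assumed convention).
    """
    exons = sorted(exon for isoform in isoforms for exon in isoform)
    total = 0
    cur_start = cur_end = None
    for start, end in exons:
        if cur_end is None or start > cur_end + 1:
            # Gap before this exon: close the previous merged block.
            if cur_end is not None:
                total += cur_end - cur_start + 1
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent exon: extend the current block.
            cur_end = max(cur_end, end)
    if cur_end is not None:
        total += cur_end - cur_start + 1
    return total


def isoform_length(exons):
    """Sum of exon lengths for one isoform (exons assumed non-overlapping)."""
    return sum(end - start + 1 for start, end in exons)


def median_length(isoforms):
    """Median of the per-isoform lengths."""
    return median(isoform_length(isoform) for isoform in isoforms)


# Hypothetical gene with two isoforms in an older annotation release...
iso_a = [(1, 100), (201, 300)]
iso_b = [(1, 100), (251, 400)]
release_old = [iso_a, iso_b]

# ...and one extra isoform with a long new exon in a newer release.
iso_c = [(1, 100), (501, 1000)]
release_new = [iso_a, iso_b, iso_c]

print(merged_length(release_old), merged_length(release_new))  # 300 800
print(median_length(release_old), median_length(release_new))  # 225.0 250
```

In this toy case a single added isoform moves the merged length from 300 to 800, while the median only moves from 225 to 250, which is the kind of instability across annotation versions described in the motivation.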

Bento007 commented 1 week ago

The code changes themselves will be easy. However, there are currently no unit tests for this code. They should be written, which will make the work take longer. I estimate 2-3 days.