chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Update pinned GENCODE references #386

Status: Closed (brianraymor closed 9 months ago)

brianraymor commented 1 year ago

Design

| Source | Required version | Download |
|---|---|---|
| GENCODE (Human) | Human reference GRCh38.p14 (GENCODE v44/Ensembl 110) | gencode.v44.primary_assembly.annotation.gtf |
| GENCODE (Mouse) | Mouse reference GRCm39 (GENCODE vM33/Ensembl 110) | gencode.vM33.primary_assembly.annotation.gtf |

Lattice commentary

@jahilton wrote:

Goal: Identify the factors that will inform the timing of a gene annotation version bump in the CELLxGENE schema and the procedures to do so.

Context/Problem: Currently, the CELLxGENE schema only accepts feature identifiers that are present in GENCODE v38 (human, released 5/2021) or vM27 (mouse, released 5/2021), in addition to SARS-CoV-2 genes and spike-ins. See Reference.

Submitting data processed with reference versions more recent than v38/vM27 to CELLxGENE would require removing data from features that are present in the more recent annotation but weren’t in the 5/2021 releases. This means losing data on what are thought to be valid genomic features based on the current best knowledge of the genome.

To our knowledge, no data submitted to CELLxGENE to date have used reference versions more recent than v38/vM27. When that happens, it would be preferable to have the reference versions already bumped, or to be able to bump them quickly, rather than lose meaningful data.

A gene annotation version bump would require removing features from each Dataset in the corpus: those that were in the v38/vM27 annotations but are no longer present in the versions that the schema updates to. Currently, this sort of migration entails the download, update, resubmission, and re-Publish of each Dataset.
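To make the scope of such a migration concrete, here is a minimal sketch of how the set of features to remove could be computed from two GENCODE GTF files. It is not the project's actual migration code. The helper names (`gene_ids_from_gtf_lines`, `gtf_gene_row`) and the toy gene IDs are hypothetical, and stripping the version suffix from `gene_id` before comparing is an assumption, since the thread does not say whether versioned or unversioned IDs are compared.

```python
import re

# GENCODE GTF rows carry the identifier in the attributes column as: gene_id "ENSG....N";
GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')


def gene_ids_from_gtf_lines(lines):
    """Collect gene IDs from the 'gene' rows of a GTF, without version suffix (assumption)."""
    ids = set()
    for line in lines:
        if line.startswith("#"):
            continue  # header/comment lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue  # skip malformed rows and non-gene features
        match = GENE_ID_RE.search(fields[8])
        if match:
            ids.add(match.group(1).split(".")[0])
    return ids


def gtf_gene_row(gene_id):
    """Build a toy GTF 'gene' row; coordinates and source are placeholders."""
    attrs = f'gene_id "{gene_id}"; gene_type "protein_coding";'
    return "\t".join(["chr1", "HAVANA", "gene", "1", "100", ".", "+", ".", attrs])


# Toy stand-ins for the currently pinned and the candidate annotations (made-up IDs).
old_lines = [
    "##description: toy pinned annotation",
    gtf_gene_row("ENSG00000000001.1"),
    gtf_gene_row("ENSG00000000002.3"),
]
new_lines = [
    gtf_gene_row("ENSG00000000002.4"),
    gtf_gene_row("ENSG00000000003.1"),
]

old_ids = gene_ids_from_gtf_lines(old_lines)
new_ids = gene_ids_from_gtf_lines(new_lines)

dropped = sorted(old_ids - new_ids)  # would have to be removed from existing Datasets
added = sorted(new_ids - old_ids)    # newly acceptable after the bump
print("dropped:", dropped)
print("added:", added)
```

With real files, the same function could be fed an open file handle for each `.gtf` instead of the toy lists.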

Additional Information: CellRanger is a commonly-used alignment tool for data submitted to CELLxGENE, and it provides pre-built references that are often used. Since 7/2020, the default pre-built references provided are v32/vM23. See Reference.

As of November 4, 2022, the most recent GENCODE versions are v42 (human) and vM31 (mouse), each released 10/2022.

Per Angela Pisco, Tabula Sapiens has moved to GENCODE v41 & is planning a spring 2023 data release.


Alternatives for Automation

What is the best way to programmatically convert Ensembl ids from one release to other?

Note: This is not the preferred method for Lattice. See Jason's comment below.

jahilton commented 1 year ago

Is there a reason we wouldn't treat this like ontologies and get the most recent at the time of the schema bump? (Human currently at v42, mouse at M31)

What is the best way to programmatically convert Ensembl ids from one release to other?

Our curation practice has not been to "convert Ensembl ids", so I feel this would deviate from what we have done up until now. Instead, we just filter out features that aren't represented by an Ensembl id that is present in the pinned annotation.
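For illustration, the filtering approach described above might look roughly like the sketch below. This is not Lattice's actual curation code: `filter_to_pinned` is a hypothetical helper, the IDs and counts are made up, and the matrix is a plain list of rows (cells by features) to keep the example dependency-free.

```python
def filter_to_pinned(feature_ids, counts, pinned_ids):
    """Keep only the feature columns whose ID is in the pinned annotation.

    feature_ids: list of Ensembl IDs, one per matrix column
    counts: list of rows (one per cell), each a list aligned with feature_ids
    pinned_ids: set of IDs present in the pinned annotation
    """
    keep = [i for i, fid in enumerate(feature_ids) if fid in pinned_ids]
    kept_ids = [feature_ids[i] for i in keep]
    kept_counts = [[row[i] for i in keep] for row in counts]
    dropped_ids = [fid for fid in feature_ids if fid not in pinned_ids]
    return kept_ids, kept_counts, dropped_ids


# Toy data: three features, two cells; the middle feature is not in the pinned set.
feature_ids = ["ENSG00000000001", "ENSG00000000002", "ENSG00000000003"]
counts = [
    [5, 0, 2],
    [1, 3, 0],
]
pinned_ids = {"ENSG00000000001", "ENSG00000000003"}

kept_ids, kept_counts, dropped_ids = filter_to_pinned(feature_ids, counts, pinned_ids)
print("kept:", kept_ids)
print("dropped:", dropped_ids)
```

No identifier is rewritten or mapped here; features outside the pinned annotation are simply dropped, which is the distinction from "converting" IDs between releases. Returning the dropped IDs makes it easy to report how much data a given pin would discard.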