chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Store organism ontology term ID in census object #796

Closed atarashansky closed 6 months ago

atarashansky commented 1 year ago

Description

It would be convenient if Census contained the organism ontology term ID in the corresponding organism's census object.

Context

I am building the WMG snapshot from Census and need to maintain a mapping between Census organism keys and their ontology term IDs (the WMG snapshot requires the term IDs, not the labels): {'homo_sapiens': 'NCBITaxon:9606', 'mus_musculus': 'NCBITaxon:10090'}

Impact

It is inconvenient to need to maintain a separate mapping table, especially if we add support for more organisms.

pablo-gar commented 1 year ago

Let's add another mapping table to census["census_info"] with the following schema specification:

Census table of organisms – census_obj["census_info"]["organisms"] – SOMADataFrame

Information about organisms whose cells are included in the Census MUST be included in a table modeled as a SOMADataFrame. Each row MUST correspond to an individual organism with the following columns:

| Column | Encoding | Description |
| --- | --- | --- |
| organism_ontology_term_id | string | As defined in the CELLxGENE dataset schema. |
| organism_label | string | Human-readable label as given by the ontology. |
| organism | string | Machine-friendly label used to name the SOMA Experiments; see the Census Data section below. |
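
As a rough illustration of the requirements above, the following sketch checks a set of rows against the three columns. The function name `check_organisms_rows` is hypothetical, and two of its rules are assumptions rather than part of the proposed schema: that values must be non-empty, and that "each row corresponds to an individual organism" implies no duplicate `organism` keys.

```python
from typing import Iterable, Mapping

# Column names taken from the proposed schema table.
REQUIRED_COLUMNS = ("organism_ontology_term_id", "organism_label", "organism")


def check_organisms_rows(rows: Iterable[Mapping[str, object]]) -> list[str]:
    """Return a list of problems found; an empty list means the rows pass."""
    problems: list[str] = []
    seen: set[str] = set()
    for i, row in enumerate(rows):
        # Every column must be present and string-encoded.
        # Rejecting empty strings is an assumption, not stated in the schema.
        for col in REQUIRED_COLUMNS:
            value = row.get(col)
            if not isinstance(value, str) or not value:
                problems.append(f"row {i}: column {col!r} must be a non-empty string")
        # Assumed reading of "one row per organism": organism keys are unique.
        organism = row.get("organism")
        if isinstance(organism, str) and organism:
            if organism in seen:
                problems.append(f"row {i}: duplicate organism {organism!r}")
            seen.add(organism)
    return problems
```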

An example of this SOMADataFrame is shown below:

| organism_ontology_term_id | organism_label | organism |
| --- | --- | --- |
| NCBITaxon:9606 | Homo sapiens | homo_sapiens |
| NCBITaxon:10090 | Mus musculus | mus_musculus |
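
With a table like this, the mapping requested in the issue description could be derived rather than maintained by hand. The sketch below is one possible way to do that. The helper name `organism_term_ids` is hypothetical, and the rows are hard-coded from the example table so that the snippet runs without a Census connection.

```python
from typing import Iterable, Mapping


def organism_term_ids(rows: Iterable[Mapping[str, str]]) -> dict[str, str]:
    """Map each Census organism key to its ontology term ID."""
    mapping: dict[str, str] = {}
    for row in rows:
        key = row["organism"]
        # Fail loudly rather than silently overwrite an earlier entry.
        if key in mapping:
            raise ValueError(f"duplicate organism key: {key!r}")
        mapping[key] = row["organism_ontology_term_id"]
    return mapping


# Rows copied from the example table in the proposal.
example_rows = [
    {
        "organism_ontology_term_id": "NCBITaxon:9606",
        "organism_label": "Homo sapiens",
        "organism": "homo_sapiens",
    },
    {
        "organism_ontology_term_id": "NCBITaxon:10090",
        "organism_label": "Mus musculus",
        "organism": "mus_musculus",
    },
]

print(organism_term_ids(example_rows))
# {'homo_sapiens': 'NCBITaxon:9606', 'mus_musculus': 'NCBITaxon:10090'}
```

Assuming the table is published as proposed, the rows would presumably come from something like `census["census_info"]["organisms"].read().concat().to_pandas().to_dict("records")` on an opened Census object instead of the hard-coded list.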