chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

census schema should document the logic for the tissue_general mapping #1039

Open brianraymor opened 8 months ago

brianraymor commented 8 months ago

On cell-science-platform, @bkmartinjr wrote:

The issue is that any given Census requires the following to be built:

- a CxG schema (which implies ontology versions)
- a specific mapping for tissue_general

Currently, there is no versioning on the latter outside of the Census builder. We would (ideally) like to have a fully pinned specification for any given "schema" version, that includes both the ontologies, and how those derived IDs are generated (the mapping).

tissue_general is metadata that is specific to Census and its applications such as Gene Expression. Why not simply define the specific mapping in the version of the cell census schema which is updated when the dataset schema is updated?

Currently, the Census schema points to source code as documentation, which is not the best of practices from my perspective. (Also, @bkmartinjr @pablo-gar: shouldn't the reference now be cell-census code, per "WMG used to also infer tissue_general, but now reads it from the Census."?)

| Column | Encoding | Description |
| --- | --- | --- |
| tissue_general_ontology_term_id | string | High-level tissue UBERON ID as implemented here |
| tissue_general | string | High-level tissue label as implemented here |

For example, list the set of UBERON terms that are appropriate for census schema N.N and describe the logic for the mapping per:

 # List of high level tissues, ORDER MATTERS. If for a given tissue there are multiple high-level tissues associated
 # then `self.get_high_level_tissue()` returns the one that appears first in this list

Changes to the list will be reflected in new census schema versions.
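
To make the "order matters" rule concrete, here is a minimal sketch of the first-match logic described in the comment above. It is not the Census builder's actual code: the three-entry list is an illustrative placeholder, and the standalone function `get_high_level_tissue` and its `ancestors` parameter are assumptions made for this example (the real `self.get_high_level_tissue()` may take different inputs).

```python
from typing import Iterable, List, Optional

# Illustrative placeholder only: the real list is maintained in the builder code.
# ORDER MATTERS: when several entries match, the earliest one wins.
HIGH_LEVEL_TISSUES: List[str] = [
    "UBERON:0000178",  # blood
    "UBERON:0002048",  # lung
    "UBERON:0000955",  # brain
]


def get_high_level_tissue(
    ancestors: Iterable[str],
    ordered_tissues: List[str] = HIGH_LEVEL_TISSUES,
) -> Optional[str]:
    """Return the first high-level tissue found among `ancestors`.

    `ancestors` is assumed to hold the tissue term itself plus all of its
    UBERON ancestors. Returns None when no high-level tissue matches.
    """
    candidates = set(ancestors)
    for term in ordered_tissues:
        if term in candidates:
            return term
    return None


# Hypothetical input that matches two high-level tissues: lung is listed
# before brain, so lung is returned regardless of the input order.
result = get_high_level_tissue(["UBERON:0000955", "UBERON:0002048"])
```

Because the result depends on list position, reordering the list changes the mapping even when no term is added or removed, which is one reason to pin the full ordered list in each schema version.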

Bento007 commented 8 months ago

Under this proposal, a request for a specific cellxgene-schema version within the cellxgene-ontology-guide API will result in the same tissue_general data that is published in the schema.

At what point should tissue_general or any of the other hand-curated lists appear in the cellxgene-ontology-guide API? Should they be accessible in the API outside of a pinned cellxgene-schema version?