chanzuckerberg / cellxgene-census

CZ CELLxGENE Discover Census
https://chanzuckerberg.github.io/cellxgene-census/
MIT License

Draft Census schema for support of Visium and Spatial data #1092

Open pablo-gar opened 6 months ago

pablo-gar commented 6 months ago

LAST EDITED: Aug 29, 2024

See parent Epic for further information. https://github.com/chanzuckerberg/single-cell/issues/644

See current draft for spatial support in SOMA https://docs.google.com/document/d/1S48pD5XTzDcaLGlq6YVYCoUjptR93PHHHmG79TiJzsA/edit

TODOs

Schema changes


Version: 2.2.0

Last edited: April, 2024.


Data included

All datasets included in the Census MUST be of CELLxGENE dataset schema version 5.1.0. The following data constraints are imposed on top of the CELLxGENE dataset schema.

Editor's note: do this change in all other places where the CELLxGENE dataset schema version is mentioned. For simplicity all other changes are omitted here.


Assays

[...]

The Census MUST include all cells from the list of accepted assays.

These assays were selected with the following criteria:

Only children of "EFO:0002772" or "EFO:0010183" are shown, as this is a constraint imposed by the CELLxGENE dataset schema >3.0.0.

  • Must measure gene expression via RNA sequencing.
  • Can be done at the single-cell level.
  • May include nascent or elongating RNA data.
  • May be targeted to specific genes in an assay-specific manner.
  • May include spatial data only from Visium or Slide-seq.
  • Doesn’t measure spatial data from other assays.
  • Doesn't measure other non-RNA molecules concurrently.
  • Doesn’t require author metadata for correct interpretability (e.g. perturbation-based technologies).
  • Doesn’t intend to primarily measure RNA structure, RNA fusions, RNA modifications, or RNA interactions.
  • Doesn’t intend to primarily measure non-mRNA (e.g. tRNA, rRNA, small RNAs).
  • Doesn’t intend to primarily measure viral RNA.
  • Doesn’t intend to primarily measure introns.
  • Doesn’t do ribosome profiling.

Spatial Assays

Spatial observations MUST be included in the Census only if they come from Visium or Slide-seq assays, as indicated in the list of accepted assays. Per the CELLxGENE dataset schema, datasets with spatial observations can be identified by the presence of the slot uns["spatial"]. For these assays, observations MUST be included only from datasets that contain "one Space Ranger output for a single tissue section".

The full logic above can be asserted as follows:
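A minimal Python sketch of the inclusion logic described above, operating on plain dictionaries rather than AnnData objects. Several names are assumptions that do not appear in the text: the `uns["spatial"]["is_single"]` flag as the marker for "one Space Ranger output for a single tissue section", the Slide-seq term `EFO:0030062`, and the helper `include_observation` itself.

```python
VISIUM = "EFO:0010961"      # stated in this draft
SLIDE_SEQ = "EFO:0030062"   # assumption: Slide-seqV2 term, not stated in this draft
SPATIAL_ASSAYS = {VISIUM, SLIDE_SEQ}


def include_observation(uns: dict, assay_ontology_term_id: str, accepted_assays: set) -> bool:
    """Return True if an observation would qualify for the Census (hypothetical helper)."""
    # The assay must be on the list of accepted assays in every case.
    if assay_ontology_term_id not in accepted_assays:
        return False

    spatial = uns.get("spatial")
    if spatial is None:
        # No uns["spatial"] slot: the dataset has no spatial observations.
        return True

    # Spatial dataset: only Visium or Slide-seq observations qualify.
    if assay_ontology_term_id not in SPATIAL_ASSAYS:
        return False

    # Assumption: "is_single" marks one Space Ranger output for a single tissue section.
    return spatial.get("is_single") is True
```

For example, `include_observation({"spatial": {"is_single": False}}, VISIUM, {VISIUM})` returns `False`, because the dataset is spatial but does not represent a single tissue section.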


Census metadata – census_obj["census_info"]["summary"] – SOMADataFrame

[...]

  1. Total number of cells or spatial spots included in this Census build:
    1. label: "total_cell_count"
    2. value: Cell count
  2. Unique number of cells or spatial spots included in this Census build (is_primary_data == True)
    1. label: "unique_cell_count"
    2. value: Cell count
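As a rough illustration of the two summary entries above, the sketch below derives both counts from a list of `is_primary_data` flags, one flag per cell or spatial spot. The function name `summary_rows` and the use of plain integers for `value` are assumptions; the draft does not state the encoding of `value` here.

```python
def summary_rows(is_primary_data: list) -> list:
    """Build the two cell-count summary rows (hypothetical helper)."""
    # Every cell or spatial spot counts towards the total.
    total = len(is_primary_data)
    # Only observations with is_primary_data == True count as unique.
    unique = sum(1 for flag in is_primary_data if flag is True)
    return [
        {"label": "total_cell_count", "value": total},
        {"label": "unique_cell_count", "value": unique},
    ]
```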

Data encoding and organization

[...]

Census Non-Spatial Data – census_obj["census_data"][organism] – SOMAExperiment

Non-spatial data for Homo sapiens MUST be stored as a SOMAExperiment in census_obj["census_data"]["homo_sapiens"].

Non-spatial data for Mus musculus MUST be stored as a SOMAExperiment in census_obj["census_data"]["mus_musculus"].


Feature dataset presence matrix – census_obj["census_data"][organism].ms["RNA"]["feature_dataset_presence_matrix"] – SOMASparseNDArray

[...]

Census Spatial Sequencing Data – census_obj["census_spatial_sequencing"][organism] – SOMAExperiment

Only Visium and Slide-seq are supported for spatial data. See the "Spatial Assays" section above.

Spatial data for Homo sapiens MUST be stored as a SOMAExperiment in census_obj["census_spatial_sequencing"]["homo_sapiens"].

Spatial data for Mus musculus MUST be stored as a SOMAExperiment in census_obj["census_spatial_sequencing"]["mus_musculus"].

For each organism the SOMAExperiment MUST contain the following:

Matrix Data, count (raw) matrix – census_obj["census_spatial_sequencing"][organism].ms["RNA"].X["raw"] – SOMASparseNDArray

Same as non-spatial data. See the corresponding section here.

Feature metadata – census_obj["census_spatial_sequencing"][organism].ms["RNA"].var – SOMADataFrame

Same as non-spatial data. See the corresponding section here.

Feature dataset presence matrix – census_obj["census_spatial_sequencing"][organism].ms["RNA"]["feature_dataset_presence_matrix"] – SOMASparseNDArray

Same as non-spatial data. See the corresponding section here.

Cell metadata – census_obj["census_spatial_sequencing"][organism].obs – SOMADataFrame

Same as non-spatial data. See the corresponding section here.

Important note: In addition, the following spatial obs columns from the CELLxGENE dataset schema MUST be included in this SOMADataFrame:

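| Column | Encoding | Description |
|---|---|---|
| array_col | As defined in the CELLxGENE dataset schema | As defined in the CELLxGENE dataset schema |
| array_row | As defined in the CELLxGENE dataset schema | As defined in the CELLxGENE dataset schema |
| in_tissue | As defined in the CELLxGENE dataset schema | As defined in the CELLxGENE dataset schema |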

Obs to spatial mapping – census_obj["census_spatial_sequencing"][organism].obs_scene – SOMADataFrame

It indicates the link between an observation and a scene. Each row corresponds to an observation with the following columns:

| Column | Encoding | Description |
|---|---|---|
| obs_id | int | It MUST be a valid soma_joinid from census_obj["census_spatial_sequencing"][organism].obs. |
| scene_id | string | It MUST be a valid scene_id from census_obj["census_spatial_sequencing"][organism].spatial. |
| value | bool | It MUST be True if the scene contains spatial information about the observation, otherwise it MUST be False. |
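The three column constraints above could be checked along the following lines. This is only a sketch over plain Python rows; `validate_obs_scene` and its arguments are hypothetical names, and a real check would read the values from the SOMA objects instead.

```python
def validate_obs_scene(rows: list, obs_joinids: set, scene_ids: set) -> list:
    """Return a list of human-readable violations for obs_scene rows (hypothetical helper)."""
    errors = []
    for i, row in enumerate(rows):
        # obs_id must point at an existing soma_joinid in obs.
        if row["obs_id"] not in obs_joinids:
            errors.append(f"row {i}: obs_id {row['obs_id']} is not a soma_joinid in obs")
        # scene_id must name an existing scene in the spatial collection.
        if row["scene_id"] not in scene_ids:
            errors.append(f"row {i}: scene_id {row['scene_id']!r} is not a known scene")
        # value must be a real boolean, not merely truthy or falsy.
        if not isinstance(row["value"], bool):
            errors.append(f"row {i}: value must be a bool")
    return errors
```

An empty list means every row satisfies the constraints listed in the table.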

Positions array of a Scene – census_obj["census_spatial_sequencing"][organism].spatial[scene_id].obsl["loc"] – SOMAGeometryNDArray

scene_soma_joinid MUST correspond to the soma_joinid values in census_obj["census_spatial_sequencing"][organism].spatial.scenes.

For each observation in each Scene, spatial array positions, the geometry points associated to them, and additional positional metadata MUST be encoded as a SOMAGeometryNDArray. Each row corresponds to an observation with the following columns:

If the assay is Visium ("EFO:0010961"), the units for the spatial array positions are pixels from the high-resolution image (spatial[scene_soma_joinid].img["highres_image"]). Otherwise TBD.

| Column | Encoding | Description |
|---|---|---|
| X | float | It MUST be the corresponding value in the first column of obsm["spatial"], as defined in the CELLxGENE dataset schema. |
| Y | float | It MUST be the corresponding value in the second column of obsm["spatial"], as defined in the CELLxGENE dataset schema. |
| soma_geometry | float | Radius of points: diameter/2. If Visium ("EFO:0010961"), the diameter MUST be uns["spatial"][library_id]['spot_diameter_fullres'], as defined in the CELLxGENE dataset schema. Otherwise TBD-TODO (else for Slide-seq it should be 0.003% of the radius occupied by the full cloud of points). |
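To make the column definitions concrete, here is a small sketch that turns the two columns of obsm["spatial"] and a Visium spot diameter into rows with X, Y and soma_geometry. The function name `build_loc_rows` is hypothetical, the diameter is passed in directly rather than read from `uns`, and the Slide-seq case is left out because it is still marked TBD above.

```python
def build_loc_rows(spatial_coords: list, spot_diameter: float) -> list:
    """Build one location row per observation (hypothetical helper, Visium case only)."""
    # soma_geometry holds the radius of each point: diameter / 2.
    radius = spot_diameter / 2
    rows = []
    for x, y in spatial_coords:
        # X comes from the first column of obsm["spatial"], Y from the second.
        rows.append({"X": float(x), "Y": float(y), "soma_geometry": radius})
    return rows
```

For instance, `build_loc_rows([(10, 20)], 8.0)` yields a single row with `X=10.0`, `Y=20.0` and `soma_geometry=4.0`.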

Images of a Scene – census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid].img[library_id] – SOMAMultiscaleImage

Images of a Visium ("EFO:0010961") scene MUST adhere to the following specifications. Other assays MUST NOT have images, and MUST NOT include the img collection.

library_id MUST be the corresponding value in the source H5AD slot uns["spatial"][library_id], as defined in the CELLxGENE dataset schema.

Full resolution image of a Scene – census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid].img[library_id]["fullres_image"] – SOMAImageNDArray

The full resolution image of a Visium ("EFO:0010961") scene MAY be included and MUST be encoded as a SOMAImageNDArray.

Value: the image from uns["spatial"][library_id]['images']['fullres'] as defined in the CELLxGENE dataset schema.

High resolution image of a Scene – census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid].img[library_id]["highres_image"] – SOMAImageNDArray

The high resolution image of a Visium ("EFO:0010961") scene MUST be included and MUST be encoded as a SOMAImageNDArray.

Value: the image from uns["spatial"][library_id]['images']['hires'] as defined in the CELLxGENE dataset schema.
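The image rules for a scene (high resolution required for Visium, full resolution optional, no img collection for other assays) can be sketched as a simple check. The scene is modelled here as a nested dictionary, which is an assumption made for illustration only; `validate_scene_images` is a hypothetical helper, not part of any SOMA API.

```python
VISIUM = "EFO:0010961"


def validate_scene_images(assay_ontology_term_id: str, scene: dict) -> list:
    """Return a list of violations of the image rules for one scene (hypothetical helper)."""
    errors = []
    img = scene.get("img")

    if assay_ontology_term_id != VISIUM:
        # Other assays MUST NOT include the img collection at all.
        if img is not None:
            errors.append("non-Visium scene MUST NOT include the img collection")
        return errors

    if not img:
        errors.append("Visium scene MUST include a high resolution image")
        return errors

    for library_id, images in img.items():
        # "highres_image" is required; "fullres_image" MAY be present.
        if "highres_image" not in images:
            errors.append(f"library {library_id!r} is missing highres_image")
    return errors
```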

pablo-gar commented 6 months ago

First iteration, very likely to change

https://drive.google.com/file/d/1_A8YlZsVZrDrt_hhjHIYQ_jVw0M5b_eP/view?usp=sharing

pablo-gar commented 5 months ago

Second iteration

census_schema_spatial_v2.pdf

pablo-gar commented 5 months ago

Third iteration (changes reflected in text as of today).

census_schema_spatial_v3.pdf

pablo-gar commented 5 months ago

census_schema_spatial_v4.pdf

prathapsridharan commented 4 months ago

@pablo-gar - Some questions/comments here about the differences between the diagram in census_schema_spatial_v4.pdf and the descriptions of the data fields and types in the text above:

_Does soma_joinid in the scenes dataframe correspond to soma_joinid in the experiment.obs dataframe? That is, are the two references to soma_joinid actually talking about a particular observation? If so, should the scenes dataframe just contain a scene_id instead of soma_joinid? I say this because experiment.obs already has a scene_id that ties each observation to a Scene, and the scenes dataframe seems to hold metadata about each scene, so I don't see why it should contain anything about a particular observation like obs_joinid. From what I understand, a scene corresponds to multiple observations, so the scene metadata dataframe probably should not contain anything about obs_joinid other than perhaps num_observations or something like that?_

_soma_dim_0 and soma_dim_1 are defined as categoricals that contain the names of the "X" and "Y" spatial coordinates. If that is the case then soma_dim_0 and soma_dim_1 are weird names. Maybe something like spatial_X_coord_name and spatial_Y_coord_name?_

> Spatial Scenes with spatial data – census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid] – SOMAScene. There will be as many Spatial Scenes as spatial datasets

_Should .spatial[scene_soma_joinid] be replaced with .spatial[scene_id] where scene_id is specified in the experiment.obs dataframe (and possibly in the scenes dataframe)? Also, the text above describing the columns of the scenes dataframe doesn't quite match the columns listed in the v4 diagram, and one or the other needs updating._

> A data frame with raw spatial coordinates – census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid]["obs_locations"] – SOMADataFrame

_There is no obs_locations in the v4 diagram anymore. Should this be removed from the text description above?_

> MUST contain a positions array – census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid].obsl["positions"] – SOMASparseSpatialArray. This will contain the spatial array positions for each observation, the geometry points associated to them, and additional metadata.

_According to the v4 diagram, this is a SOMAGeometryNDArray. Should the text be modified? Also, even in the v4 boxed diagram, positions is also specified as a SparseSpatialArray, which is confusing. I also think the text description of the fields of positions should be updated since it doesn't match the v4 diagram description. For instance, the text contains a column called in_tissue that is not in the v4 diagram._

Full resolution image of a Scene and High resolution image of a Scene are specified as SOMAImageNDArray in the text above, but the v4 diagram calls them DenseSpatialArray. This needs updating.

brianraymor commented 4 months ago

@pablo-gar - I noticed a reference to fiducial_diameter_fullres in the PDF above. This is unsupported by the dataset schema, per earlier conversations. Please see #cell-sci-modalities.

pablo-gar commented 4 months ago

@prathapsridharan answering your questions:

> Does soma_joinid in scenes dataframe correspond to soma_joinid in experiment.obs dataframe? That is, the two references to soma_joinid are actually talking about a particular observation? If so, should scenes dataframe just contain a scene_id instead of soma_joinid? I say this because experiment.obs already has a scene_id that ties each observation to a Scene and the scenes dataframe seems to be about metadata about each scene and therefore I don't see why it should contain anything about a particular observation like obs_joinid. From what I understand, a scene corresponds to multiple observations so scene metadata dataframe probably should not contain anything about obs_joinid other than perhaps num_observations or something like that?

No, soma_joinid in the scenes dataframe DOES NOT correspond to soma_joinid in the experiment.obs dataframe. If that was understood from the schema text, I should improve it.

> soma_dim_0 and soma_dim_1 are defined as categoricals that contain the name of the "X" and "Y" spatial coordinate names. If that is the case then soma_dim_0 and soma_dim_1 are weird names. Maybe something like spatial_X_coord_name and spatial_Y_coord_name or something like that

I'll bring this proposal to Julia and Aaron. I don't have a strong opinion on it.

> Spatial Scenes with spatial data – census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid] – SOMAScene. There will be as many Spatial Scenes as spatial datasets
>
> _Should .spatial[scene_soma_joinid] be replaced with .spatial[scene_id] where scene_id is specified in the experiment.obs dataframe (and possibly in scenes dataframe)? Also the text above describing the columns of scenes dataframe doesn't quite match with the columns listed in the v4 diagram and one or the other needs updating_

I'm proposing to unify everything via the soma_joinid of the .spatial["scenes"] DataFrame. This effectively acts as a scene ID, so adding yet another scene_id field seems redundant to me. Do you think I'm missing something here?

> A data frame with raw spatial coordinates – census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid]["obs_locations"] – SOMADataFrame
>
> _There is no obs_locations in the v4 diagram anymore. Should this be removed from the text description above?_

Yes, thanks for the catch! I will remove it.

> MUST contain a positions array – census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid].obsl["positions"] – SOMASparseSpatialArray. This will contain the spatial array positions for each observation, the geometry points associated to them, and additional metadata.
>
> _According the v4 diagram, this is a SOMAGeometryNDArray. Should the text should be modified? Also even in the v4 boxed diagram, positions is also specified as a SparseSpatialArray which is confusing. I also think the text description about the fields of positions should be updated since it doesn't match with v4 diagram description. For instance the text contains a column called in_tissue that is not in the v4 diagram_
>
> Full resolution image of a Scene and High resolution image of a Scene are specified as SOMAImageNDArray in the text above but the v4 diagram calls them as DenseSpatialArray. This needs updating

Yes, thanks for catching all of these!

pablo-gar commented 4 months ago

@brianraymor Thanks for the catch, I've fixed it.

pablo-gar commented 4 months ago

Fifth iteration with fixes from the comments above. Text has also been updated in the top-level comment.

census_schema_spatial_v5.pdf

pablo-gar commented 2 months ago

Sixth iteration:

census_schema_spatial_v6.pdf

pablo-gar commented 1 month ago

Seventh iteration:

census_schema_spatial_v7.pdf

pablo-gar commented 1 month ago

Eighth iteration:

census_schema_spatial_v8.pdf