Open pablo-gar opened 6 months ago
First iteration, very likely to change
https://drive.google.com/file/d/1_A8YlZsVZrDrt_hhjHIYQ_jVw0M5b_eP/view?usp=sharing
Second iteration
Third iteration (changes reflected in text as of today).
@pablo-gar - Some questions/comments about the differences between the diagram in census_scheme_spatial_v4.pdf and the descriptions of the data fields and types in the text above:
_Does `soma_joinid` in the `scenes` dataframe correspond to `soma_joinid` in the `experiment.obs` dataframe? That is, do the two references to `soma_joinid` refer to a particular observation? If so, should the `scenes` dataframe just contain a `scene_id` instead of `soma_joinid`? I say this because `experiment.obs` already has a `scene_id` that ties each observation to a Scene, and the `scenes` dataframe seems to hold metadata about each scene, so I don't see why it should contain anything about a particular observation like `obs_joinid`. From what I understand, a scene corresponds to multiple observations, so the scene metadata dataframe probably should not contain anything about `obs_joinid` other than perhaps `num_observations` or something like that?_
_`soma_dim_0` and `soma_dim_1` are defined as categoricals that contain the names of the "X" and "Y" spatial coordinates. If that is the case, then `soma_dim_0` and `soma_dim_1` are weird names. Maybe something like `spatial_X_coord_name` and `spatial_Y_coord_name`?_
Spatial Scenes with spatial data – `census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid]` – `SOMAScene`. There will be as many Spatial Scenes as spatial datasets
_Should `.spatial[scene_soma_joinid]` be replaced with `.spatial[scene_id]`, where `scene_id` is specified in the `experiment.obs` dataframe (and possibly in the `scenes` dataframe)? Also, the text above describing the columns of the `scenes` dataframe doesn't quite match the columns listed in the `v4` diagram, and one or the other needs updating._
A data frame with raw spatial coordinates – `census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid]["obs_locations"]` – `SOMADataFrame`
_There is no `obs_locations` in the `v4` diagram anymore. Should this be removed from the text description above?_
MUST contain a positions array – `census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid].obsl["positions"]` – `SOMASparseSpatialArray`. This will contain the spatial array positions for each observation, the geometry points associated with them, and additional metadata.
_According to the `v4` diagram, this is a `SOMAGeometryNDArray`. Should the text be modified? Also, even in the `v4` boxed diagram, `positions` is also specified as a `SparseSpatialArray`, which is confusing. I also think the text description of the fields of `positions` should be updated, since it doesn't match the `v4` diagram description. For instance, the text contains a column called `in_tissue` that is not in the `v4` diagram._
Full resolution image of a Scene and High resolution image of a Scene are specified as `SOMAImageNDArray` in the text above, but the `v4` diagram calls them `DenseSpatialArray`. This needs updating.
@pablo-gar - I noticed a reference to `fiducial_diameter_fullres` in the PDF above. This is unsupported by the dataset schema, per earlier conversations. Please see #cell-sci-modalities.
@prathapsridharan answering your questions
> Does `soma_joinid` in the `scenes` dataframe correspond to `soma_joinid` in the `experiment.obs` dataframe? That is, do the two references to `soma_joinid` refer to a particular observation? If so, should the `scenes` dataframe just contain a `scene_id` instead of `soma_joinid`? I say this because `experiment.obs` already has a `scene_id` that ties each observation to a Scene, and the `scenes` dataframe seems to hold metadata about each scene, so I don't see why it should contain anything about a particular observation like `obs_joinid`. From what I understand, a scene corresponds to multiple observations, so the scene metadata dataframe probably should not contain anything about `obs_joinid` other than perhaps `num_observations` or something like that?
No, `soma_joinid` in the `scenes` dataframe DOES NOT correspond to `soma_joinid` in the `experiment.obs` dataframe. If that is how the schema text reads, I should improve it.
> `soma_dim_0` and `soma_dim_1` are defined as categoricals that contain the names of the "X" and "Y" spatial coordinates. If that is the case, then `soma_dim_0` and `soma_dim_1` are weird names. Maybe something like `spatial_X_coord_name` and `spatial_Y_coord_name`?
I'll bring this proposal to Julia and Aaron. I don't have a strong opinion on it.
> Spatial Scenes with spatial data – `census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid]` – `SOMAScene`. There will be as many Spatial Scenes as spatial datasets
> _Should `.spatial[scene_soma_joinid]` be replaced with `.spatial[scene_id]`, where `scene_id` is specified in the `experiment.obs` dataframe (and possibly in the `scenes` dataframe)? Also, the text above describing the columns of the `scenes` dataframe doesn't quite match the columns listed in the `v4` diagram, and one or the other needs updating._
I'm proposing to unify everything via the `soma_joinid` of the `.spatial["scenes"]` DataFrame. This effectively acts as a scene ID, so adding yet another `scene_id` field seems redundant to me. Do you think I'm missing something here?
> A data frame with raw spatial coordinates – `census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid]["obs_locations"]` – `SOMADataFrame`
> _There is no `obs_locations` in the `v4` diagram anymore. Should this be removed from the text description above?_
Yes, thanks for the catch! I will remove it
> MUST contain a positions array – `census_obj["census_spatial_data"][organism].spatial[scene_soma_joinid].obsl["positions"]` – `SOMASparseSpatialArray`. This will contain the spatial array positions for each observation, the geometry points associated with them, and additional metadata.
> _According to the `v4` diagram, this is a `SOMAGeometryNDArray`. Should the text be modified? Also, even in the `v4` boxed diagram, `positions` is also specified as a `SparseSpatialArray`, which is confusing. I also think the text description of the fields of `positions` should be updated, since it doesn't match the `v4` diagram description. For instance, the text contains a column called `in_tissue` that is not in the `v4` diagram._
> Full resolution image of a Scene and High resolution image of a Scene are specified as `SOMAImageNDArray` in the text above, but the `v4` diagram calls them `DenseSpatialArray`. This needs updating.
Yes, thanks for catching all of these!
@brianraymor Thanks for the catch, I've fixed it.
Fourth iteration with fixes from the comments above. Text has also been updated in the top-level comment.
Sixth iteration:
- `array_col`, `array_row`, `in_tissue` moved to `obs`
- `.spatial` to adhere to latest changes in TileDB-SOMA
- `var_scene` and `obs_scene`
Seventh iteration:
- `spot_diameter_fullres` from `census_obj["census_spatial_data"][organism].spatial[scene_id].obsl["loc"]`
- `census_obj["census_spatial_data"][organism].spatial.scenes` – `SOMADataFrame`
- `census_obj["census_spatial_data"][organism].var_scene` – `SOMADataFrame`
Eighth iteration:
- `img` collection to match the latest changes in TileDB-SOMA for MultiscaleImage
- Replaced `"census_spatial_data"` with `"census_spatial_sequencing"` in all occurrences; see this document for more details
LAST EDITED: Aug 29, 2024
See parent Epic for further information. https://github.com/chanzuckerberg/single-cell/issues/644
See current draft for spatial support in SOMA https://docs.google.com/document/d/1S48pD5XTzDcaLGlq6YVYCoUjptR93PHHHmG79TiJzsA/edit
TODOs
- `./census_accepted_assays.csv` to include:
  - `EFO:0010961` - Visium Spatial Gene Expression
  - `EFO:0030062` - Slide-seqV2
  - `EFO:0009920` - Slide-seq
- maybe? `spatial[scene_id].obsl["loc"]["soma_geometry"]`
Schema changes
Version: 2.2.0
Last edited: April, 2024.
Data included
All datasets included in the Census MUST be of CELLxGENE dataset schema version 5.1.0. The following data constraints are imposed on top of the CELLxGENE dataset schema.
Assays
[...]
The Census MUST include all cells from the list of accepted assays.
These assays were selected with the following criteria:
Spatial Assays
Only observations from Visium and Slide-seq assays MUST be included in Census, as indicated in the list of accepted assays. Per the CELLxGENE dataset schema, datasets with spatial observations can be identified by the presence of the slot `uns["spatial"]`. For these assays, only observations from datasets that contain "one Space Ranger output for a single tissue section" MUST be included in Census.
The full logic above can be asserted as follows:
- If `uns["spatial"]` is present and `uns["spatial"]["is_single"]` is `True`, then all observations MUST be included.
- If `uns["spatial"]` is present and `uns["spatial"]["is_single"]` is `False`, then all observations MUST be excluded.
Census metadata – `census_obj["census_info"]["summary"]` – `SOMADataFrame`
[...]
"total_cell_count"
"unique_cell_count"
Data encoding and organization
[...]
Census Non-Spatial Data – `census_obj["census_data"][organism]` – `SOMAExperiment`
Non-spatial data for Homo sapiens MUST be stored as a `SOMAExperiment` in `census_obj["census_data"]["homo_sapiens"]`.
Non-spatial data for Mus musculus MUST be stored as a `SOMAExperiment` in `census_obj["census_data"]["mus_musculus"]`.
Feature dataset presence matrix – `census_obj["census_data"][organism].ms["RNA"]["feature_dataset_presence_matrix"]` – `SOMASparseNDArray`
[...]
Census Spatial Sequencing Data – `census_obj["census_spatial_sequencing"][organism]` – `SOMAExperiment`
Only Visium and Slide-seq are supported for spatial data. See the "assays included" section above.
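
The assay restriction referenced here, together with the single-section rule from the "Spatial Assays" section, can be sketched as a small predicate. This is only an illustration under stated assumptions: the function name `include_spatial_dataset` and the `ACCEPTED_SPATIAL_ASSAYS` set are hypothetical, the term IDs are copied from the TODO list in this issue, and excluding datasets that lack `uns["spatial"]` is an assumption because the text only covers the case where the slot is present.

```python
from typing import Any, Mapping

# Assumption: accepted spatial assay term IDs, copied from the TODO list in this issue.
ACCEPTED_SPATIAL_ASSAYS = {
    "EFO:0010961",  # Visium Spatial Gene Expression
    "EFO:0030062",  # Slide-seqV2
    "EFO:0009920",  # Slide-seq
}


def include_spatial_dataset(assay_term_id: str, uns: Mapping[str, Any]) -> bool:
    """Return True if all observations of a dataset would go into the spatial Census.

    `uns` stands in for the `uns` slot of the source H5AD file.
    """
    if assay_term_id not in ACCEPTED_SPATIAL_ASSAYS:
        return False
    spatial = uns.get("spatial")
    if not spatial:
        # Assumption: a dataset without uns["spatial"] is treated as excluded.
        return False
    # bool() also handles NumPy booleans that may come out of an H5AD file.
    return bool(spatial.get("is_single", False))


if __name__ == "__main__":
    print(include_spatial_dataset("EFO:0010961", {"spatial": {"is_single": True}}))
    print(include_spatial_dataset("EFO:0010961", {"spatial": {"is_single": False}}))
```

A real builder would read `assay_ontology_term_id` per observation and `uns` per dataset; the sketch collapses that to one assay per dataset for brevity.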
Spatial data for Homo sapiens MUST be stored as a `SOMAExperiment` in `census_obj["census_spatial_sequencing"]["homo_sapiens"]`.
Spatial data for Mus musculus MUST be stored as a `SOMAExperiment` in `census_obj["census_spatial_sequencing"]["mus_musculus"]`.
For each organism the `SOMAExperiment` MUST contain the following:
- `census_obj["census_spatial_sequencing"][organism].obs` – `SOMADataFrame`
- `census_obj["census_spatial_sequencing"][organism].ms` – `SOMACollection`. This `SOMACollection` MUST only contain one `SOMAMeasurement` in `census_obj["census_spatial_sequencing"][organism].ms["RNA"]` with the following:
  - `census_obj["census_spatial_sequencing"][organism].ms["RNA"].X` – `SOMACollection`. It MUST contain exactly two layers:
    - `census_obj["census_spatial_sequencing"][organism].ms["RNA"].X["raw"]` – `SOMASparseNDArray`
  - `census_obj["census_spatial_sequencing"][organism].ms["RNA"].var` – `SOMAIndexedDataFrame`
  - `census_obj["census_spatial_sequencing"][organism].ms["RNA"]["feature_dataset_presence_matrix"]` – `SOMASparseNDArray`
- `census_obj["census_spatial_sequencing"][organism].obs_scene`. It indicates the link between an observation and a scene. It MUST have two columns: 1) `obs_id` corresponding to `soma_joinid` of `obs` and 2) `scene_id` corresponding to the associated scene.
- `census_obj["census_spatial_sequencing"][organism].spatial` – `SOMACollection`.
  - `census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid]` – `SOMAScene`. There will be as many Spatial Scenes as spatial datasets. Each `SOMAScene` MUST contain the following:
    - `census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid].obsl["loc"]` – `SOMAGeometryNDArray`. This will contain the spatial array positions for each observation, the geometry points associated with them, and additional metadata.
    - `census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid].img[library_id]["fullres_image"]` – `SOMAImageNDArray`.
    - `census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid].img[library_id]["highres_image"]` – `SOMAImageNDArray`.
Matrix Data, count (raw) matrix – `census_obj["census_spatial_sequencing"][organism].ms["RNA"].X["raw"]` – `SOMASparseNDArray`
Same as non-spatial data. See the corresponding section here.
Feature metadata – `census_obj["census_spatial_sequencing"][organism].ms["RNA"].var` – `SOMADataFrame`
Same as non-spatial data. See the corresponding section here.
Feature dataset presence matrix – `census_obj["census_spatial_sequencing"][organism].ms["RNA"]["feature_dataset_presence_matrix"]` – `SOMASparseNDArray`
Same as non-spatial data. See the corresponding section here.
Cell metadata – `census_obj["census_spatial_sequencing"][organism].obs` – `SOMADataFrame`
Same as non-spatial data. See the corresponding section here.
Important note: In addition, the following spatial `obs` columns from the CELLxGENE dataset schema MUST be included in this `SOMADataFrame`
Obs to spatial mapping – `census_obj["census_spatial_sequencing"][organism].obs_scene` – `SOMADataFrame`
It indicates the link between an observation and a scene. Each row corresponds to an observation with the following columns:
- `soma_joinid` from `census_obj["census_spatial_sequencing"][organism].obs`.
- `scene_id` from `census_obj["census_spatial_sequencing"][organism].spatial`.
- `True` if the scene contains spatial information about the observation, otherwise it MUST be `False`.
Positions array of a Scene – `census_obj["census_spatial_sequencing"][organism].spatial[scene_id].obsl["loc"]` – `SOMAGeometryNDArray`
`scene_soma_joinid` MUST correspond to the values of `soma_joinid` in `census_obj["census_spatial_sequencing"][organism].spatial.scenes`.
For each observation in each Scene, spatial array positions, the geometry points associated with them, and additional positional metadata MUST be encoded as a `SOMAGeometryNDArray`. Each row corresponds to an observation with the following columns:
If Visium ("EFO:0010961"), the units for the spatial array positions are pixels from the high-resolution image (`spatial[scene_soma_joinid].img[library_id]["highres_image"]`). Otherwise TBD.
- `obsm["spatial"]`. As defined in the CELLxGENE dataset schema.
- `obsm["spatial"]`. As defined in the CELLxGENE dataset schema.
- `diameter/2`. If Visium ("EFO:0010961"), `diameter` MUST be `uns["spatial"][library_id]['spot_diameter_fullres']`, as defined in the CELLxGENE dataset schema. Otherwise TBD-TODO (else for Slide-seq it should be 0.003% of the radius occupied by the full cloud of points).
Images of a Scene – `census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid].img[library_id]` – `SOMAMultiscaleImage`
Images of a Visium ("EFO:0010961") scene MUST adhere to the following specifications. Other assays MUST NOT have images, and MUST NOT include the `img` collection.
`library_id` MUST be the corresponding value in the source H5AD slot `uns["spatial"][library_id]`, as defined in the CELLxGENE dataset schema.
Full resolution image of a Scene – `census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid].img[library_id]["fullres_image"]` – `SOMAImageNDArray`
The full resolution image of a Visium ("EFO:0010961") scene MAY be included and MUST be encoded as a `SOMAImageNDArray`.
Value: the image from `uns["spatial"][library_id]['images']['fullres']` as defined in the CELLxGENE dataset schema.
High resolution image of a Scene – `census_obj["census_spatial_sequencing"][organism].spatial[scene_soma_joinid].img[library_id]["highres_image"]` – `SOMAImageNDArray`
The high resolution image of a Visium ("EFO:0010961") scene MUST be included and MUST be encoded as a `SOMAImageNDArray`.
Value: the image from `uns["spatial"][library_id]['images']['hires']` as defined in the CELLxGENE dataset schema.
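
The image rules in this section (Visium scenes carry an `img` collection keyed by `library_id`, the high resolution image is required, the full resolution image is optional, and other assays have no `img` collection) could be checked along the following lines. This is a minimal sketch, not the Census builder's validation: `check_scene_images` is a hypothetical helper, and plain nested dictionaries stand in for the `SOMAScene` and its image collection.

```python
from typing import Any, List, Mapping

VISIUM_TERM_ID = "EFO:0010961"


def check_scene_images(assay_term_id: str, scene: Mapping[str, Any]) -> List[str]:
    """Return the image-rule violations for one scene (an empty list means none).

    `scene` is a plain mapping standing in for a SOMAScene, for example
    {"img": {library_id: {"highres_image": ..., "fullres_image": ...}}}.
    """
    problems: List[str] = []
    img = scene.get("img")

    if assay_term_id != VISIUM_TERM_ID:
        # Non-Visium assays must not carry the img collection at all.
        if img is not None:
            problems.append("non-Visium scene MUST NOT include an img collection")
        return problems

    if not img:
        problems.append("Visium scene is missing the img collection")
        return problems

    for library_id, images in img.items():
        # highres_image is required; fullres_image is optional, so it is not checked.
        if "highres_image" not in images:
            problems.append(f"{library_id}: highres_image MUST be included")
    return problems


if __name__ == "__main__":
    scene = {"img": {"lib1": {"fullres_image": object()}}}
    for problem in check_scene_images(VISIUM_TERM_ID, scene):
        print(problem)
```

Returning a list of messages rather than raising on the first failure lets a builder report every violation for a scene in one pass.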