Closed brianraymor closed 6 months ago
For in_tissue:0 observations, I would expect a dependency with cell_type. The cell_type_ontology_term_id MUST be ____ (I suggest adding a new value like empty).
[Brian responds] - Another option is to filter in_tissue:0. This feature is supported in Seurat:
The Read10X_Image filter_matrix boolean parameter enables Filter spot/feature matrix to only include spots that have been determined to be over tissue. The default is TRUE.
[Jason responds to Brian responds] - Clarify "filter" for me. Who/what is filtering? Also, if we are accepting in_tissue:0, then those will need to be excused from the "Each cell MUST contain at least one non-zero value." rule.
Could enforce tissue_type is tissue or organoid if it's Visium.
There is nothing that addresses single section/library Datasets vs integrated Datasets. So currently, images & spatial embeddings are required for datasets where multiple slides have been integrated and those aren't as useful. Aggregated Datasets will be required to submit one (and only one) hires image? And are downstream features OK with consuming Datasets that will differ in this key aspect? (I would assume that they'd like to ignore the integrated cases and only consume the individual sections)
[Brian responds] RE "There is nothing ...", the requirement is There MUST be only one library_id which enforces one image at its different resolutions.
[Jason responds to Brian responds] - That doesn't do it for me. To me, that means that the contributor of this dataset will be forced to pick 1 of 3 library_id values (or come up with a new one that merges them) and 1 image.
[Brian responds] I will start a thread in #cell-science-modalities to review how to mitigate violations of the under current capabilities policy that never allowed this use case.
Or should the library_id be eliminated - it's used for aggregation in frameworks.
☝️ For a Dataset with a single section/library, this is unnecessary. Something that the downstream features can consider adding (in a standardized & globally unique manner) as they are aggregating Datasets for users.
What is the value prop for the lowres when the hires is required?
[Brian responds] I defer first to @pablo-gar since it is a census optional requirement and then to @sidneymbell. I have seen references to lowres images - https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM5808057, but I'm unaware of any relevant use case.
Is there value in storing the Capture Area size? Could it be calculated from the array_*?
[Brian responds] - That is why I'm suggesting capturing the slide serial number. See Space Ranger Slide Serial and Capture Area Parameters
[Jason responds to Brian responds] - I am not seeing the suggestion in this proposal.
uns.embeddings - 3 different identifiers seems excessive (for non-Visium datasets, this adds 2 fields that need to be curated) What is the value prop to offer a "title" that differs from the current display of the obsm key? Can we just require consistency between image key & obsm key?
[Brian responds] Please see #single-cell-modalities.
[Jason responds to Brian responds] - I am not seeing the ask or the use case for a title to be specified rather than just display the obsm key like we currently do.
CELLxGENE Explorer MUST automatically apply the corresponding scalefactor from uns['spatial'][library_id]['scalefactors'] to the embedding.
So the embeddings are not scaled to each image at submission? Wouldn't this mean that each of the uns.embeddings points to the same obsm key, and that will just get scaled differently?
[Brian responds] Yes. spatial can be reused for images of different resolutions. Please see #single-cell-modalities.
[Jason responds to Brian responds] - So for any dict in embeddings, if image is defined then embedding MUST be spatial?
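To make the exchange above concrete, here is a minimal sketch of how one full-resolution spatial array could serve several image resolutions when the scale factor is applied at display time. The function name and the image_key argument are hypothetical, and the scale factor values are made up for illustration; only the tissue_hires_scalef and tissue_lowres_scalef key names come from the proposal.

```python
import numpy as np


def scale_for_image(spatial_fullres, scalefactors, image_key):
    """Scale full-resolution pixel coordinates to match one underlay image.

    image_key is a hypothetical selector: "fullres", "hires" or "lowres".
    """
    if image_key == "fullres":
        # Coordinates are already expressed in full-resolution pixels.
        return spatial_fullres
    return spatial_fullres * scalefactors[f"tissue_{image_key}_scalef"]


# Illustrative values only, not taken from a real Space Ranger run.
scalefactors = {"tissue_hires_scalef": 0.5, "tissue_lowres_scalef": 0.1}
spatial = np.array([[1000.0, 2000.0], [3000.0, 4000.0]])

hires = scale_for_image(spatial, scalefactors, "hires")
lowres = scale_for_image(spatial, scalefactors, "lowres")
```

Under this reading, the submitter stores the coordinates once and every selector item is derived from the same obsm key.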
Have we contacted 10x to ensure there are no plans to rename hires? This proposal is putting a lot of stock on their naming convention (esp when it can be misleading to some people - "the hi res isn't the highest?")
[Brian responds] Not planning to. There's a dependency on their naming conventions throughout the ecosystem. And this is why we have schema versions.
[Jason responds to Brian responds] - While we have the capability of changing things, I believe we should aim to future-proof against wielding that power - our users will benefit from stability. I'd prefer we aim for a standard that isn't reliant on the ever-changing whims of a company, and encourage the ecosystem to follow suit.
Thanks for the thorough schema proposal, Brian! Some general thoughts:
Editor's Note: The Explorer selector names are simply placeholders. I defer the actual names to @niknak33.
Per @jahilton's proposal:
- default_embedding is resurrected.

obsm (Embeddings)

The size of the ndarray stored for a key in obsm MUST NOT be zero.

To display a dataset in CELLxGENE Explorer, Curators MUST annotate one or more embeddings of at least two-dimensions (e.g. tSNE, UMAP, PCA, spatial coordinates) as numpy.ndarrays in obsm.

Key | spatial. For each available underlay image in uns['spatial'][library_id]['images'], CELLxGENE Explorer MUST add a corresponding item to its Embedding Choice selector and appropriately scale the embedding: |
---|---|
Annotator | Curator MUST annotate if the assay_ontology_term_id is EFO:0010961 for Visium Spatial Gene Expression. |
Value | numpy.ndarray. The array MUST be constructed from the corresponding pxl_row_in_fullres and pxl_col_in_fullres fields from tissue_positions.csv. See Space Ranger Spatial Outputs. |

Key | X_{suffix} with the following requirements: {suffix} is presented as text to users in the Embedding Choice selector in CELLxGENE Explorer so it is STRONGLY RECOMMENDED that it be descriptive. See also default_embedding in uns. |
---|---|
Annotator | Curator MUST annotate. |
Value | numpy.ndarray with the following requirements |
Thanks Brian
I was thinking something along these lines:
HighRes_Map, FullRes_Map, and LowRes_Map for the names, I believe those follow the guidelines.
Thank you,
Nik.
On Mon, Jan 29, 2024, 4:13 PM Brian Raymor @.***> wrote:
Rewrite for Embeddings
Editor's Note: The Explorer selector names are simply placeholders. I defer the actual names to @niknak33 https://github.com/niknak33.
Per @jahilton https://github.com/jahilton's proposal:
- default_embedding is resurrected.
- embeddings are replaced by spatial and X_{suffix}
- {suffix} MUST NOT be "spatial"
obsm (Embeddings)
The size of the ndarray stored for a key in obsm MUST NOT be zero.
To display a dataset in CELLxGENE Explorer, Curators MUST annotate one or more embeddings of at least two-dimensions (e.g. tSNE, UMAP, PCA, spatial coordinates) as numpy.ndarrays in obsm.
spatial

Key: spatial. For each available underlay image in uns['spatial'][library_id]['images'], CELLxGENE Explorer MUST add a corresponding item to its Embedding Choice selector and appropriately scale the embedding:
- CELLxGENE Explore MUST add a selector item named "spatial (with high resolution)" and MUST scale the full resolution embedding by the value of uns['spatial'][library_id]['scalefactors']['tissue_hires_scalef'].
- If uns['spatial'][library_id]['images']['tissue_fullres_image'] is present, then CELLxGENE Explorer MUST add a selector item named "spatial (with full resolution)" and MUST NOT scale the full resolution embedding.
- If uns['spatial'][library_id]['images']['tissue_lowres_image'] is present, then CELLxGENE Explore MUST add a selector item named "spatial (with low resolution)" and MUST scale the full resolution embedding by the value of uns['spatial'][library_id]['scalefactors']['tissue_lowres_scalef'].

Annotator: Curator MUST annotate if the assay_ontology_term_id is EFO:0010961 https://www.ebi.ac.uk/ols4/ontologies/efo/classes?obo_id=EFO%3A0010961 for Visium Spatial Gene Expression.

Value: numpy.ndarray. The array MUST be constructed from the corresponding pxl_row_in_fullres and pxl_col_in_fullres fields from tissue_positions.csv. See Space Ranger Spatial Outputs https://www.10xgenomics.com/support/software/space-ranger/analysis/outputs/spatial-outputs .

X_{suffix}

Key: X_{suffix} with the following requirements:
- {suffix} MUST be at least one character in length.
- The first character of {suffix} MUST be a letter of the alphabet and the remaining characters MUST be alphanumeric characters. (This is equivalent to the regular expression pattern "^[a-zA-Z][a-zA-Z0-9]*$".)
- {suffix} MUST NOT be "spatial".

{suffix} is presented as text to users in the Embedding Choice selector in CELLxGENE Explorer so it is STRONGLY RECOMMENDED that it be descriptive.

See also default_embedding in uns.

Annotator: Curator MUST annotate.

Value: numpy.ndarray with the following requirements
- MUST have the same number of rows as X and MUST include at least two columns
- MUST be a numpy.dtype.kind https://numpy.org/doc/stable/reference/generated/numpy.dtype.kind.html of "f", "i", or "u"
- MUST NOT contain any positive infinity (numpy.inf) https://numpy.org/devdocs/reference/constants.html#numpy.inf or negative infinity (numpy.NINF) https://numpy.org/devdocs/reference/constants.html#numpy.NINF values
- MUST NOT contain all Not a Number (numpy.nan) https://numpy.org/devdocs/reference/constants.html#numpy.nan values
I performed the renames for HighRes_Map and FullRes_Map. We agreed to not support low resolution images.
I totally understand that; I just assumed you might want it in case you wanted to reference it at some point.
Thanks,
Nik
Assay | "raw" required? | "raw" location | "normalized" required? | "normalized" location |
---|---|---|---|---|
Visium Spatial Gene Expression | REQUIRED. The unfiltered feature-barcode matrix (raw_feature_bc_matrix) MUST be used. See Space Ranger Feature-Barcode Matrices. Values MUST be de-duplicated molecule counts. All non-zero values MUST be positive integers stored as numpy.float32. | AnnData.raw.X unless no "normalized" is provided, then AnnData.X | STRONGLY RECOMMENDED. This MUST be a scipy.sparse.csr_matrix. If the obs['in_tissue'] is 0 for an observation, then the values of its corresponding variable references MUST be implicit zero. | AnnData.X |
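The rules in this row could be checked roughly as follows. This is a sketch under assumptions, not the validator's implementation: both function names are hypothetical, and "implicit zero" is interpreted here as "the CSR row has no stored entries", so an explicitly stored 0 would also be flagged.

```python
import numpy as np
from scipy import sparse


def check_raw_counts(matrix):
    """Check the 'raw' rules: float32 storage, non-zero values are positive integers."""
    errors = []
    values = matrix.data if sparse.issparse(matrix) else np.asarray(matrix).ravel()
    if matrix.dtype != np.float32:
        errors.append("values MUST be stored as numpy.float32")
    nonzero = values[values != 0]
    if np.any(nonzero < 0) or np.any(nonzero != np.round(nonzero)):
        errors.append("non-zero values MUST be positive integers")
    return errors


def check_out_of_tissue_rows(matrix, in_tissue):
    """Check that observations with in_tissue == 0 hold only implicit zeros."""
    if not isinstance(matrix, sparse.csr_matrix):
        return ["matrix MUST be a scipy.sparse.csr_matrix"]
    # indptr differences give the number of stored entries per row.
    stored_per_row = np.diff(matrix.indptr)
    outside = np.asarray(in_tissue) == 0
    if np.any(stored_per_row[outside] > 0):
        return ["in_tissue == 0 observations MUST contain only implicit zeros"]
    return []


# Two spots: the first is outside tissue and carries no stored values.
counts = sparse.csr_matrix(np.array([[0, 0, 0], [3, 0, 1]], dtype=np.float32))
raw_errors = check_raw_counts(counts)
tissue_errors = check_out_of_tissue_rows(counts, in_tissue=[0, 1])
```

Both lists are empty for this toy matrix; swapping the in_tissue flags would report the second rule.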
Looks good to me!
I am expecting further requirements to enforce The unfiltered feature-barcode matrix (raw_feature_bc_matrix) MUST be used.
Do you have specific suggestions for the schema, @jahilton ? CC: @pablo-gar
First thought is require obs count to be in a specific range.
My question would be: if I were a Census user who did not want to reprocess each Visium dataset from scratch but wanted to integrate the data, is there a straightforward way to subset to only the spots analyzed by the authors? Would it be implied that I would need to use spots with cell_type not unknown and in_tissue:1?
From my conversations with SpatialData developers:
They are moving to a place where they provide defaults for most use cases but still enable flexibility; in this case the default is to load “in-tissue” data, but the flexibility exists
Census is likely to adopt such a paradigm as well
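The subset described in the question could look like the sketch below. Whether this filter really recovers "the spots analyzed by the authors" is exactly the open point in this thread, so treat the function (a hypothetical name) as one possible reading rather than an agreed rule.

```python
import pandas as pd


def author_analyzed_mask(obs):
    """Boolean mask: spots over tissue that also carry a known cell type."""
    return (obs["in_tissue"] == 1) & (obs["cell_type_ontology_term_id"] != "unknown")


# Toy obs table; the spot names are made up for illustration.
obs = pd.DataFrame(
    {
        "in_tissue": [1, 1, 0],
        "cell_type_ontology_term_id": ["CL:0000540", "unknown", "unknown"],
    },
    index=["spot1", "spot2", "spot3"],
)
kept = obs[author_analyzed_mask(obs)]
```

The same mask can index an AnnData object by rows, since obs shares its index with the matrix.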
LGTM! One typo I found
CELLxGENE Explore MUST add a selector item named "spatial_HighRes_Map" and MUST scale the full resolution embedding by the value of uns['spatial'][library_id]['scalefactors']['tissue_hires_scalef'].
Explore is missing the r
Context
This addresses the census requirements authored by @pablo-gar.
See Improve CELLxGENE’s value proposition for data submitters and consumers by supporting visium and slide-seq experiments and Data Platform changes required to support visium experiments
Design
Pending
[NTR] Version specific Visium assays
For easier review, these requirements are additive to the corresponding sections in the schema 4 draft.
General Requirements
...
Visium Spatial Gene Expression. It is STRONGLY RECOMMENDED that Visium Spatial Gene Expression datasets represent one Space Ranger output for a single tissue section. This representation is referenced throughout the schema as Visium Single.
Visium datasets that represent multiple Space Ranger outputs MAY be submitted. This representation is referenced throughout the schema as Visium Multiple which will have limited support in CELLxGENE experiences:
X (Matrix Layers)

...

The following table describes the matrix data and layers requirements that are assay-specific. If an entry in the table is empty, the schema does not have any other requirements on data in those layers beyond the ones listed above.

raw_feature_bc_matrix). See Space Ranger Feature-Barcode Matrices. Values MUST be de-duplicated molecule counts. Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32.

AnnData.raw.X unless no "normalized" is provided, then AnnData.X

AnnData.X
obs (Cell Metadata)

obs is a pandas.DataFrame.

...
Editor Note: See my comment related to filtering out some visium observations from Explorer.
Editor Note: For Visium Single datasets based on one tissue sample, all the following fields MUST have singleton values:
assay_ontology_term_id
If Visium Single, all observations MUST be the same value.
development_stage_ontology_term_id
If Visium Single, all observations MUST be the same value.
donor_id
If Visium Single, all observations MUST be the same value.
organism_ontology_term_id
If Visium Single, all observations MUST be the same value.
self_reported_ethnicity_ontology_term_id
If Visium Single, all observations MUST be the same value.
sex_ontology_term_id
If Visium Single, all observations MUST be the same value.
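A minimal sketch of the singleton rule for Visium Single datasets follows. The field list is copied from the entries above; the function name is hypothetical, and treating a missing value as a distinct value is an assumption.

```python
import pandas as pd

# Fields listed above that MUST hold one value across all observations.
VISIUM_SINGLE_FIELDS = [
    "assay_ontology_term_id",
    "development_stage_ontology_term_id",
    "donor_id",
    "organism_ontology_term_id",
    "self_reported_ethnicity_ontology_term_id",
    "sex_ontology_term_id",
]


def non_singleton_fields(obs, fields=VISIUM_SINGLE_FIELDS):
    """Return the fields whose values differ between observations."""
    return [f for f in fields if obs[f].nunique(dropna=False) != 1]


# Toy example: two spots that disagree on donor_id only.
obs = pd.DataFrame({f: ["a", "a"] for f in VISIUM_SINGLE_FIELDS})
obs["donor_id"] = ["donor1", "donor2"]
violations = non_singleton_fields(obs)
```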
array_col

int. This MUST be the value of the column coordinate for the corresponding spot from the array_col field in tissue_positions_list.csv or tissue_positions.csv. The value MUST be in the range between 0 and 127. See Space Ranger Spatial Outputs.

array_row

int. This MUST be the value of the row coordinate for the corresponding spot from the array_row field in tissue_positions_list.csv or tissue_positions.csv. The value MUST be in the range between 0 and 77. See Space Ranger Spatial Outputs.

cell_type_ontology_term_id

str categories. This MUST be a CL term or "unknown" if:
- in_tissue is 0

The following terms MUST NOT be used:
- "CL:0000255" for eukaryotic cell
- "CL:0000257" for Eumycetozoan cell
- "CL:0000548" for animal cell

in_tissue

Editor Note: This could be modeled as a boolean. Seurat models as an integer. Squidpy models as an int64. There was agreement to use an int for consistency.

int. This MUST be the value for the corresponding spot from the in_tissue field in tissue_positions_list.csv or tissue_positions.csv which is either 0 if the spot falls outside tissue or 1 if the spot falls inside tissue. See Space Ranger Spatial Outputs.
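The obs field constraints above could be checked along these lines. This is a sketch, not the validator: the function name is hypothetical, the ranges are treated as inclusive, and the rule that in_tissue 0 implies "unknown" is one reading of the cell_type_ontology_term_id wording.

```python
import pandas as pd

FORBIDDEN_CL_TERMS = ["CL:0000255", "CL:0000257", "CL:0000548"]


def check_visium_obs(obs):
    """Return a list of violations for the Visium-specific obs fields."""
    errors = []
    if not obs["array_col"].between(0, 127).all():
        errors.append("array_col MUST be between 0 and 127")
    if not obs["array_row"].between(0, 77).all():
        errors.append("array_row MUST be between 0 and 77")
    if not obs["in_tissue"].isin([0, 1]).all():
        errors.append("in_tissue MUST be 0 or 1")
    cell_type = obs["cell_type_ontology_term_id"]
    # Assumption: spots outside tissue are expected to be "unknown".
    if (cell_type[obs["in_tissue"] == 0] != "unknown").any():
        errors.append("cell_type_ontology_term_id MUST be 'unknown' when in_tissue is 0")
    if cell_type.isin(FORBIDDEN_CL_TERMS).any():
        errors.append("cell_type_ontology_term_id uses a forbidden CL term")
    return errors


# Toy example: the second spot is off-tissue but still has a cell type,
# and its column coordinate is out of range.
obs = pd.DataFrame(
    {
        "array_col": [10, 128],
        "array_row": [5, 77],
        "in_tissue": [1, 0],
        "cell_type_ontology_term_id": ["CL:0000540", "CL:0000540"],
    }
)
errors = check_visium_obs(obs)
```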
obsm (Embeddings)

The size of the ndarray stored for a key in obsm MUST NOT be zero.

To display a dataset in CELLxGENE Explorer, Curators MUST annotate one or more embeddings of at least two dimensions (e.g. tSNE, UMAP, PCA, spatial coordinates) as numpy.ndarrays in obsm.

Editor Note: Jason recommends that the spatial implementation requirements for Explorer (selector names, scaling) be documented elsewhere. Brian says "in for a penny in for a pound".

spatial

Key: spatial. For each available underlay image in uns['spatial'][library_id]['images'], CELLxGENE Explorer MUST add a corresponding item to its Embedding Choice selector and appropriately scale the embedding:
- CELLxGENE Explorer MUST add a selector item named "spatial_HighRes_Map" and MUST scale the full resolution embedding by the value of uns['spatial'][library_id]['scalefactors']['tissue_hires_scalef'].
- If uns['spatial'][library_id]['images']['fullres'] is present, then CELLxGENE Explorer MUST add a selector item named "spatial_FullRes_Map" and MUST NOT scale the full resolution embedding.
- If "spatial" is set as the default_embedding, then Explorer MUST present "spatial_HighRes_Map" as the default.

Value: numpy.ndarray. The array MUST be constructed from the corresponding pxl_row_in_fullres and pxl_col_in_fullres fields from tissue_positions_list.csv or tissue_positions.csv. See Space Ranger Spatial Outputs.

X_{suffix}

Key: X_{suffix} with the following requirements:
- {suffix} MUST be at least one character in length.
- The first character of {suffix} MUST be a letter of the alphabet and the remaining characters MUST be alphanumeric characters. (This is equivalent to the regular expression pattern "^[a-zA-Z][a-zA-Z0-9]*$".)
- {suffix} MUST NOT be "spatial".

{suffix} is presented as text to users in the Embedding Choice selector in CELLxGENE Explorer so it is STRONGLY RECOMMENDED that it be descriptive.

See also default_embedding in uns.

Value: numpy.ndarray with the following requirements:
- MUST have the same number of rows as X and MUST include at least two columns
- MUST be a numpy.dtype.kind of "f", "i", or "u"
- MUST NOT contain any positive infinity (numpy.inf) or negative infinity (numpy.NINF) values
- MUST NOT contain all Not a Number (numpy.nan) values
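The X_{suffix} key and value rules can be expressed compactly. The sketch below uses a hypothetical function name and applies the value checks to any embedding key, which is an assumption; np.isinf covers both positive and negative infinity, and fullmatch is used so a trailing newline in the suffix is rejected.

```python
import re

import numpy as np

SUFFIX_PATTERN = re.compile(r"[a-zA-Z][a-zA-Z0-9]*")


def check_embedding(key, array, n_obs):
    """Return a list of violations for one obsm entry."""
    errors = []
    if key != "spatial":
        suffix = key[2:] if key.startswith("X_") else ""
        if not SUFFIX_PATTERN.fullmatch(suffix) or suffix == "spatial":
            errors.append(f"invalid embedding key: {key!r}")
    if not isinstance(array, np.ndarray) or array.size == 0:
        return errors + ["value MUST be a non-empty numpy.ndarray"]
    if array.ndim != 2 or array.shape[0] != n_obs or array.shape[1] < 2:
        errors.append("array MUST have one row per observation and at least two columns")
    if array.dtype.kind not in ("f", "i", "u"):
        errors.append("dtype kind MUST be 'f', 'i', or 'u'")
    elif array.dtype.kind == "f":
        # Integer arrays cannot hold inf or nan, so only floats are inspected.
        if np.isinf(array).any():
            errors.append("array MUST NOT contain infinite values")
        if np.isnan(array).all():
            errors.append("array MUST NOT be all NaN")
    return errors


umap = np.zeros((3, 2), dtype=np.float32)
ok = check_embedding("X_umap", umap, n_obs=3)
bad_key = check_embedding("X_spatial", umap, n_obs=3)
```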
uns (Dataset Metadata)

...

default_embedding

str. The value MUST match a key to an embedding in obsm for the embedding to display by default in CELLxGENE Explorer.

spatial

Editor Note: Add a requirement that only the fields documented in the schema must be present under spatial.

dict. The key-value pairs are documented in the following sections:

spatial[library_id]

dict. There MUST be only one library_id.

spatial[library_id]['images']

dict

spatial[library_id]['images']['fullres']

ndarray. It is STRONGLY RECOMMENDED that the submitter include the full resolution image which MUST be converted to an array of shape (, , 3).

spatial[library_id]['images']['hires']

ndarray. tissue_hires_image.png MUST be converted to an array of shape (, , 3). Its largest dimension MUST be 2000 pixels. See Space Ranger Spatial Outputs.

Editor Note: Document that metadata is supported for scverse cases.

spatial[library_id]['metadata']

dict

spatial[library_id]['scalefactors']

dict

spatial[library_id]['scalefactors']['spot_diameter_fullres']

float. This must be the value of the spot_diameter_fullres field from scalefactors_json.json. See Space Ranger Spatial Outputs.

spatial[library_id]['scalefactors']['tissue_hires_scalef']

float. This must be the value of the tissue_hires_scalef field from scalefactors_json.json. See Space Ranger Spatial Outputs.

Editor Note: Removed slide_version in favor of @jychien's proposal for adding EFO terms:

Appendix A. Changelog

schema v4.1.0