chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets

MIT License

Add requirements for Visium Spatial Gene Expression assay #674

Closed brianraymor closed 6 months ago

brianraymor commented 11 months ago

Context

This addresses the census requirements authored by @pablo-gar.

See Improve CELLxGENE’s value proposition for data submitters and consumers by supporting visium and slide-seq experiments and Data Platform changes required to support visium experiments

Design

Pending

[NTR] Version specific Visium assays

For easier review, these requirements are additive to the corresponding sections in the schema 4 draft.

General Requirements

...

Visium Spatial Gene Expression. It is STRONGLY RECOMMENDED that Visium Spatial Gene Expression datasets represent one Space Ranger output for a single tissue section. This representation is referenced throughout the schema as Visium Single.

Visium datasets that represent multiple Space Ranger outputs MAY be submitted. This representation is referenced throughout the schema as Visium Multiple which will have limited support in CELLxGENE experiences:

There are no image underlays in CELLxGENE Explorer.
Such datasets are not included in CELLxGENE Discover Census.
Such datasets are not converted to Seurat for CELLxGENE Discover downloads.

`X` (Matrix Layers)

...

The following table describes the matrix data and layers requirements that are assay-specific. If an entry in the table is empty, the schema does not have any other requirements on data in those layers beyond the ones listed above.

Assay	"raw" required?	"raw" location	"normalized" required?	"normalized" location
Visium Spatial Gene Expression	REQUIRED. It is STRONGLY RECOMMENDED to use the unfiltered feature-barcode matrix (`raw_feature_bc_matrix`). See Space Ranger Feature-Barcode Matrices. Values MUST be de-duplicated molecule counts. ~~Each cell MUST contain at least one non-zero value.~~ All non-zero values MUST be positive integers stored as `numpy.float32`.	`AnnData.raw.X` unless no "normalized" is provided, then `AnnData.X`	STRONGLY RECOMMENDED	`AnnData.X`
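
As an illustration of the "raw" value rule in the table above, here is a minimal sketch of a check that all non-zero values are positive integers stored as `numpy.float32`. The function name `raw_values_are_valid` is hypothetical, and the sketch assumes a small dense array for readability; it is not the validator's implementation.

```python
import numpy as np


def raw_values_are_valid(matrix: np.ndarray) -> bool:
    """Return True if the matrix is float32 and every non-zero value is a positive integer."""
    if matrix.dtype != np.float32:
        return False
    nonzero = matrix[matrix != 0]
    # Positive, and equal to their own floor (i.e. whole numbers).
    return bool(np.all(nonzero > 0) and np.all(nonzero == np.floor(nonzero)))


# Illustrative inputs only.
good = np.array([[0, 3], [1, 0]], dtype=np.float32)
bad_fraction = np.array([[0.5, 3]], dtype=np.float32)
bad_dtype = np.array([[0, 3]], dtype=np.int64)
```

Note that an all-zero row passes this check, which matches the struck-through rule: observations no longer need a non-zero value.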

`obs` (Cell Metadata)

obs is a pandas.DataFrame.

...

Editor Note: See my comment related to filtering out some visium observations from Explorer.

Editor Note: For Visium Single datasets based on one tissue sample, all the following fields MUST have singleton values:

assay_ontology_term_id

If Visium Single, all observations MUST be the same value.

development_stage_ontology_term_id

If Visium Single, all observations MUST be the same value.

donor_id

If Visium Single, all observations MUST be the same value.

organism_ontology_term_id

If Visium Single, all observations MUST be the same value.

self_reported_ethnicity_ontology_term_id

If Visium Single, all observations MUST be the same value.

sex_ontology_term_id

If Visium Single, all observations MUST be the same value.
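
The singleton rule for these six fields could be checked along the following lines. This is a sketch only: `obs` is modeled as a list of plain dicts rather than a `pandas.DataFrame`, the helper name `non_singleton_fields` is hypothetical, and the sample values are made up for illustration.

```python
SINGLETON_FIELDS = (
    "assay_ontology_term_id",
    "development_stage_ontology_term_id",
    "donor_id",
    "organism_ontology_term_id",
    "self_reported_ethnicity_ontology_term_id",
    "sex_ontology_term_id",
)


def non_singleton_fields(obs_rows):
    """Return the fields that take more than one distinct value across observations."""
    return [
        field
        for field in SINGLETON_FIELDS
        if len({row[field] for row in obs_rows}) > 1
    ]


# Illustrative observation metadata (values are placeholders).
base = {
    "assay_ontology_term_id": "EFO:0010961",
    "development_stage_ontology_term_id": "unknown",
    "donor_id": "donor_1",
    "organism_ontology_term_id": "NCBITaxon:9606",
    "self_reported_ethnicity_ontology_term_id": "unknown",
    "sex_ontology_term_id": "unknown",
}
obs_ok = [dict(base), dict(base)]
obs_mixed = [dict(base), {**base, "donor_id": "donor_2"}]
```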

array_col

Key	array_col
Annotator	Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value	`int`. This MUST be the value of the column coordinate for the corresponding spot from the `array_col` field in `tissue_positions_list.csv` or `tissue_positions.csv`. The value MUST be in the range between `0` and `127`. See Space Ranger Spatial Outputs.

array_row

Key	array_row
Annotator	Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value	`int`. This MUST be the value of the row coordinate for the corresponding spot from the `array_row` field in `tissue_positions_list.csv` or `tissue_positions.csv`. The value MUST be in the range between `0` and `77`. See Space Ranger Spatial Outputs.
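
The range rules for `array_col` and `array_row` can be expressed as a short check. The helper name `spot_position_is_valid` is hypothetical; the bounds are the ones stated in the two tables above.

```python
ARRAY_COL_RANGE = range(0, 128)  # 0..127 inclusive
ARRAY_ROW_RANGE = range(0, 78)   # 0..77 inclusive


def spot_position_is_valid(array_row: int, array_col: int) -> bool:
    """Return True if both grid coordinates are ints within the documented ranges."""
    if not isinstance(array_row, int) or not isinstance(array_col, int):
        return False
    return array_row in ARRAY_ROW_RANGE and array_col in ARRAY_COL_RANGE
```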

cell_type_ontology_term_id

Key	cell_type_ontology_term_id
Annotator	Curator MUST annotate.
Value	categorical with `str` categories. This MUST be a CL term or `"unknown"` if: (a) no appropriate term can be found (e.g. the cell type is unknown), or (b) the dataset is Visium Single and the corresponding value of `in_tissue` is `0`. The following terms MUST NOT be used: `"CL:0000255"` for eukaryotic cell, `"CL:0000257"` for Eumycetozoan cell, and `"CL:0000548"` for animal cell.

in_tissue

Editor Note: This could be modeled as a boolean. Seurat models it as an integer. Squidpy models it as an int64. There was agreement to use an int for consistency.

Key	in_tissue
Annotator	Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value	`int`. This MUST be the value for the corresponding spot from the `in_tissue` field in `tissue_positions_list.csv` or `tissue_positions.csv` which is either `0` if the spot falls outside tissue or `1` if the spot falls inside tissue. See Space Ranger Spatial Outputs.
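
To make the mapping from the Space Ranger file to `obs` concrete, here is a sketch that reads the per-spot fields with the standard `csv` module. It assumes the `tissue_positions.csv` variant that carries a header row naming the fields referenced in this section; the barcodes and coordinates in `SAMPLE` are invented, and `read_tissue_positions` is a hypothetical helper.

```python
import csv
import io

# Invented sample rows; only the column names come from the fields named in this schema.
SAMPLE = """barcode,in_tissue,array_row,array_col,pxl_row_in_fullres,pxl_col_in_fullres
SPOT-A,1,0,0,1000,2000
SPOT-B,0,1,1,1100,2050
"""

INT_FIELDS = (
    "in_tissue",
    "array_row",
    "array_col",
    "pxl_row_in_fullres",
    "pxl_col_in_fullres",
)


def read_tissue_positions(handle):
    """Parse spot records, converting the numeric fields to int."""
    rows = []
    for record in csv.DictReader(handle):
        row = {"barcode": record["barcode"]}
        for field in INT_FIELDS:
            row[field] = int(record[field])
        if row["in_tissue"] not in (0, 1):
            raise ValueError(f"in_tissue must be 0 or 1 for {row['barcode']}")
        rows.append(row)
    return rows


spots = read_tissue_positions(io.StringIO(SAMPLE))
```

The older `tissue_positions_list.csv` would need the column names supplied explicitly if it has no header row.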

`obsm` (Embeddings)

The size of the ndarray stored for a key in obsm MUST NOT be zero.

To display a dataset in CELLxGENE Explorer, Curators MUST annotate one or more embeddings of at least two-dimensions (e.g. tSNE, UMAP, PCA, spatial coordinates) as numpy.ndarrays in obsm.

Editor Note: Jason recommends that the spatial implementation requirements for Explorer (selector names, scaling) be documented elsewhere. Brian says "in for a penny in for a pound".

spatial

Key	spatial. For each available underlay image in `uns['spatial'][library_id]['images']`, CELLxGENE Explorer MUST add a corresponding item to its Embedding Choice selector and appropriately scale the embedding: CELLxGENE Explore MUST add a selector item named `"spatial_HighRes_Map"` and MUST scale the full resolution embedding by the value of `uns['spatial'][library_id]['scalefactors']['tissue_hires_scalef']`. If `uns['spatial'][library_id]['images']['fullres']` is present, then CELLxGENE Explorer MUST add a selector item named `"spatial_FullRes_Map"` and MUST NOT scale the full resolution embedding. If `"spatial"` is set as the `default_embedding`, then Explorer MUST present `"spatial_HighRes_Map"` as the default.
Annotator	Curator MUST annotate if Visium Single.
Value	`numpy.ndarray`. The array MUST be constructed from the corresponding `pxl_row_in_fullres` and `pxl_col_in_fullres` fields from `tissue_positions_list.csv` or `tissue_positions.csv`. See Space Ranger Spatial Outputs.
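
The scaling behavior described for Explorer amounts to multiplying the full resolution coordinates by `tissue_hires_scalef` for the high resolution underlay and leaving them untouched for the full resolution underlay. A minimal sketch follows; the function name, the coordinate values, the scale factor, and the column order are all assumptions made for illustration.

```python
import numpy as np


def scale_spatial(fullres_coords: np.ndarray, scalef: float) -> np.ndarray:
    """Scale full resolution pixel coordinates to match a downsampled image."""
    return fullres_coords * scalef


# Invented full resolution coordinates, one row per spot.
spatial = np.array([[2000.0, 1000.0], [2050.0, 1100.0]])
tissue_hires_scalef = 0.1  # hypothetical value from scalefactors_json.json

hires_coords = scale_spatial(spatial, tissue_hires_scalef)  # for the high resolution underlay
fullres_coords = spatial                                     # full resolution underlay: no scaling
```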

X_{suffix}

Key	X_{suffix} with the following requirements: {suffix} MUST be at least one character in length. The first character of {suffix} MUST be a letter of the alphabet and the remaining characters MUST be alphanumeric characters. (This is equivalent to the regular expression pattern `"^[a-zA-Z][a-zA-Z0-9]*$"`.) {suffix} MUST NOT be `"spatial"`. {suffix} is presented as text to users in the Embedding Choice selector in CELLxGENE Explorer so it is STRONGLY RECOMMENDED that it be descriptive. See also `default_embedding` in `uns`.
Annotator	Curator MUST annotate if NOT Visium Single.
Value	`numpy.ndarray` with the following requirements: MUST have the same number of rows as `X` and MUST include at least two columns; MUST be a `numpy.dtype.kind` of `"f"`, `"i"`, or `"u"`; MUST NOT contain any positive infinity (`numpy.inf`) or negative infinity (`numpy.NINF`) values; MUST NOT contain all Not a Number (`numpy.nan`) values.
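
The key and value rules for `X_{suffix}` could be checked roughly as follows. Both helper names are hypothetical. The suffix check uses `re.fullmatch` with a letter followed by zero or more alphanumeric characters, and the infinity check uses `numpy.isinf` rather than the `numpy.NINF` constant so that it does not depend on a particular NumPy version.

```python
import re

import numpy as np


def embedding_key_is_valid(key: str) -> bool:
    """Check the X_{suffix} naming rules, including the ban on the suffix "spatial"."""
    if not key.startswith("X_"):
        return False
    suffix = key[2:]
    if suffix == "spatial":
        return False
    return re.fullmatch(r"[a-zA-Z][a-zA-Z0-9]*", suffix) is not None


def embedding_value_is_valid(value: np.ndarray, n_obs: int) -> bool:
    """Check shape, dtype kind, and the infinity / all-NaN rules."""
    if value.ndim != 2 or value.shape[0] != n_obs or value.shape[1] < 2:
        return False
    if value.dtype.kind not in ("f", "i", "u"):
        return False
    if value.dtype.kind == "f":
        # Integer arrays cannot hold inf or NaN, so only floats need these checks.
        if np.isinf(value).any() or np.isnan(value).all():
            return False
    return True
```

A partially NaN embedding passes, since the rule only forbids arrays in which every value is NaN.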

`uns` (Dataset Metadata)

...

default_embedding

Key	default_embedding
Annotator	Curator MAY annotate.
Value	`str`. The value MUST match a key to an embedding in `obsm` for the embedding to display by default in CELLxGENE Explorer.
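
A sketch of the `default_embedding` rule, with `uns` modeled as a plain dict and `obsm_keys` as any collection of key names. The function name is hypothetical.

```python
def default_embedding_is_valid(uns: dict, obsm_keys) -> bool:
    """Return True if default_embedding is absent or names an existing obsm key."""
    if "default_embedding" not in uns:
        return True  # the key is optional ("Curator MAY annotate")
    value = uns["default_embedding"]
    return isinstance(value, str) and value in obsm_keys
```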

spatial

Editor Note: Add a requirement that only the fields documented in the schema must be present under spatial.

Key	spatial
Annotator	Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value	`dict`. The key-value pairs are documented in the following sections: spatial[library_id], spatial[library_id]['images'], spatial[library_id]['images']['fullres'], spatial[library_id]['images']['hires'], spatial[library_id]['metadata'], spatial[library_id]['scalefactors'], spatial[library_id]['scalefactors']['spot_diameter_fullres'], spatial[library_id]['scalefactors']['tissue_hires_scalef']

spatial[library_id]

Key	Identifier for the Visium library
Annotation	Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value	`dict`. There MUST be only one `library_id`.

spatial[library_id]['images']

Key	images
Annotation	Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value	`dict`

spatial[library_id]['images']['fullres']

Key	fullres
Annotation	Curator MAY annotate if Visium Single; otherwise, this key MUST NOT be present.
Value	`ndarray`. It is STRONGLY RECOMMENDED that the submitter include the full resolution image which MUST be converted to an array of shape (, , 3).

spatial[library_id]['images']['hires']

Key	hires
Annotation	Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value	`ndarray`. `tissue_hires_image.png` MUST be converted to an array of shape (, , 3). Its largest dimension MUST be 2000 pixels. See Space Ranger Spatial Outputs.
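
The `hires` shape rule can be sketched as a check on the array's shape tuple. The function name is hypothetical, and the sketch assumes that "largest dimension" refers to the larger of the first two axes, with the third axis fixed at 3.

```python
def hires_shape_is_valid(shape) -> bool:
    """Check for a three-axis array with 3 channels whose larger image side is 2000 pixels."""
    if len(shape) != 3 or shape[2] != 3:
        return False
    return max(shape[0], shape[1]) == 2000
```

With a NumPy image array `img`, this would be called as `hires_shape_is_valid(img.shape)`.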

Editor Note: Document that metadata is supported for scverse cases.

spatial[library_id]['metadata']

Key	metadata
Annotation	Curator MAY annotate if Visium Single; otherwise, this key MUST NOT be present.
Value	`dict`

spatial[library_id]['scalefactors']

Key	scalefactors
Annotation	Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value	`dict`

spatial[library_id]['scalefactors']['spot_diameter_fullres']

Key	spot_diameter_fullres
Annotation	Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value	`float`. This MUST be the value of the `spot_diameter_fullres` field from `scalefactors_json.json`. See Space Ranger Spatial Outputs.

spatial[library_id]['scalefactors']['tissue_hires_scalef']

Key	tissue_hires_scalef
Annotation	Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value	`float`. This MUST be the value of the `tissue_hires_scalef` field from `scalefactors_json.json`. See Space Ranger Spatial Outputs.
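
Taken together, the `uns['spatial']` sections above describe a small nested structure: exactly one `library_id`, a required `hires` image, and two required scale factors, with `fullres` and `metadata` optional. A sketch of a structural check follows; the function name, the library identifier, the numeric values, and the string stand-in for the image array are all placeholders, and image contents are not inspected.

```python
REQUIRED_SCALEFACTORS = ("spot_diameter_fullres", "tissue_hires_scalef")


def spatial_uns_problems(spatial: dict) -> list:
    """Return human-readable problems; an empty list means none were found."""
    if len(spatial) != 1:
        return [f"expected exactly one library_id, found {len(spatial)}"]
    library = next(iter(spatial.values()))
    problems = []
    if "hires" not in library.get("images", {}):
        problems.append("images['hires'] is required")
    scalefactors = library.get("scalefactors", {})
    for name in REQUIRED_SCALEFACTORS:
        if not isinstance(scalefactors.get(name), float):
            problems.append(f"scalefactors['{name}'] must be a float")
    return problems


# Placeholder structure; "<image array>" stands in for a numpy.ndarray.
example_spatial = {
    "library_1": {
        "images": {"hires": "<image array>"},
        "scalefactors": {"spot_diameter_fullres": 89.5, "tissue_hires_scalef": 0.1},
    }
}
```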

Editor Note: Removed slide_version in favor of @jychien's proposal for adding EFO terms:

Visium Spatial Gene Expression (existing)
- Visium Spatial Gene Expression V1 (max X/Y is 128/78; total spots: 4992)
Visium CytAssist Spatial Gene Expression
- Visium CytAssist Spatial Gene Expression V4 (6.5 mm, max X/Y is 128/78; total spots: 4992)
- Visium CytAssist Spatial Gene Expression V5 (11 mm, max X/Y is 224/128; total spots: 14336)
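
The spot totals listed above are half of the column count times the row count (128 × 78 / 2 = 4992 and 224 × 128 / 2 = 14336). One way to see this, assuming the staggered layout in which even rows use even column indices and odd rows use odd column indices, is to count the grid positions whose row and column parities agree. The function name `count_spots` is hypothetical.

```python
def count_spots(n_cols: int, n_rows: int) -> int:
    """Count grid positions where row and column indices share the same parity."""
    return sum(
        1
        for row in range(n_rows)
        for col in range(n_cols)
        if (row + col) % 2 == 0
    )


spots_6_5_mm = count_spots(128, 78)
spots_11_mm = count_spots(224, 128)
```

Under the same assumption, a dataset built from the unfiltered matrix would be expected to have exactly this many observations per assay version.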

Appendix A. Changelog

schema v4.1.0

Introduced formal validation for Visium Spatial Gene Expression based on modeling in scanpy and squidpy.

jahilton commented 9 months ago

for in_tissue:0 observations, I would expect a dependency with cell_type. The cell_type_ontology_term_id MUST be ____ (I suggest to add a new value like empty).

[Brian responds] - Another option is to filter in_tissue:0. This feature is supported in Seurat:

The Read10X_Image filter_matrix boolean parameter enables Filter spot/feature matrix to only include spots that have been determined to be over tissue. The default is TRUE.

[Jason responds to Brian responds] - Clarify "filter" for me. Who/what is filtering?...also if we are accepting in_tissue:0, then those will need to be excused from the Each cell MUST contain at least one non-zero value. rule

Could enforce tissue_type is tissue or organoid if it's Visium.

There is nothing that addresses single section/library Datasets vs integrated Datasets. So currently, images & spatial embeddings are required for datasets where multiple slides have been integrated and those aren't as useful. Aggregated Datasets will be required to submit one (and only one) hires image? And are downstream features OK with consuming Datasets that will differ in this key aspect? (I would assume that they'd like to ignore the integrated cases and only consume the individual sections)

[Brian responds] RE "There is nothing ...", the requirement is There MUST be only one library_id which enforces one image at its different resolutions.

[Jason responds to Brian responds] - That doesn't do it for me. To me, that means that the contributor of this dataset will be forced to pick 1 of 3 library_id values (or come up with a new one that merges them) and 1 image.

[Brian responds] I will start a thread in #cell-science-modalities to review how to mitigate violations of the under current capabilities policy that never allowed this use case.

Or should the library_id be eliminated - it's used for aggregation in frameworks.

☝️ For a Dataset with a single section/library, this is unnecessary. Something that the downstream features can consider adding (in a standardized & globally unique manner) as they are aggregating Datasets for users.

What is the value prop for the lowres when the hires is required?

[Brian responds] I defer first to @pablo-gar since it is a census optional requirement and then to @sidneymbell. I have seen references to lowres images - https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM5808057, but I'm unaware of any relevant use case.

Is there value in storing the Capture Area size? Could it be calculated from the array_*?

[Brian responds] - That is why I'm suggesting capturing the slide serial number. See Space Ranger Slide Serial and Capture Area Parameters

[Jason responds to Brian responds] - I am not seeing the suggestion in this proposal.

uns.embeddings - 3 different identifiers seems excessive (for non-Visium datasets, this adds 2 fields that need to be curated) What is the value prop to offer a "title" that differs from the current display of the obsm key? Can we just require consistency between image key & obsm key?

[Brian responds] Please see #single-cell-modalities.

[Jason responds to Brian responds] - I am not seeing the ask or the use case for a title to be specified rather than just display the obsm key like we currently do.

CELLxGENE Explorer MUST automatically apply the corresponding scalefactor from uns['spatial'][library_id]['scalefactors'] to the embedding.

So the embeddings are not scaled to each image at submission? Wouldn't this mean that each of the uns.embeddings points to the same obsm key, and that will just get scaled differently?

[Brian responds] Yes. spatial can be reused for different resolutions images. Please see #single-cell-modalities.

[Jason responds to Brian responds] - So for any dict in embeddings, if image is defined then embedding MUST be spatial?

Have we contacted 10x to ensure there are no plans to rename hires? This proposal is putting a lot of stock on their naming convention (esp when it can be misleading to some people - "the hi res isn't the highest?")

[Brian responds] Not planning to. There's a dependency on their naming conventions throughout the ecosystem. And this is why we have schema versions.

[Jason responds to Brian responds] - While we have the capability of changing things, I believe we should aim to future-proof against wielding that power - our users will benefit from stability. I'd prefer we aim for a standard that isn't reliant on the ever-changing whims of a company, and encourage the ecosystem to follow suit.

jychien commented 9 months ago

Thanks for the thorough schema proposal, Brian! Some general thoughts:

Overall, the overfitting (such as array_row and array_col) lends itself to be in lock step with Space Ranger. Seurat and Scanpy/Squidpy have also gone this path, and there is overall confusion whenever old/new files are not in sync with the software version that users are running. If we are all aware and are fine with this type of data modeling, then that's fine, and we will migrate whenever Space Ranger has updates. Then it makes sense to match our CxG schema to Scanpy/Squidpy, so to ease wrangling efforts by curators or contributors.
The majority of the schema coincides with Scanpy/Squidpy. I would say that there could be more clarity and efficiency in the storing of X/Y coordinates. If the scalefactors are required to be named with the resolution of the image, does the image need to be matched up to the corresponding pre-scaled X/Y coordinates? I had initially mentioned that we wanted the post-scaling X/Y coordinates to help with QA, but if we are already requiring scaling factors, we can just use those and scale the embedding ourselves. I anticipate confusion as to which set of X/Y coordinates to put where, so something straightforward is best.

brianraymor commented 7 months ago

Rewrite for Embeddings

Editor's Note: The Explorer selector names are simply placeholders. I defer the actual names to @niknak33.

Per @jahilton's proposal:

default_embedding is resurrected.
embeddings are replaced by spatial and X_{suffix}
{suffix} MUST NOT be "spatial"

`obsm` (Embeddings)

The size of the ndarray stored for a key in obsm MUST NOT be zero.

To display a dataset in CELLxGENE Explorer, Curators MUST annotate one or more embeddings of at least two-dimensions (e.g. tSNE, UMAP, PCA, spatial coordinates) as numpy.ndarrays in obsm.

spatial

Key	spatial. For each available underlay image in `uns['spatial'][library_id]['images']`, CELLxGENE Explorer MUST add a corresponding item to its Embedding Choice selector and appropriately scale the embedding: CELLxGENE Explore MUST add a selector item named `"spatial (with high resolution)"` and MUST scale the full resolution embedding by the value of `uns['spatial'][library_id]['scalefactors']['tissue_hires_scalef']`. If `uns['spatial'][library_id]['images']['tissue_fullres_image']` is present, then CELLxGENE Explorer MUST add a selector item named `"spatial (with full resolution)"` and MUST NOT scale the full resolution embedding. If `uns['spatial'][library_id]['images']['tissue_lowres_image']` is present, then CELLxGENE Explore MUST add a selector item named `"spatial (with low resolution)"` and MUST scale the full resolution embedding by the value of `uns['spatial'][library_id]['scalefactors']['tissue_lowres_scalef']`.
Annotator	Curator MUST annotate if the `assay_ontology_term_id` is EFO:0010961 for Visium Spatial Gene Expression.
Value	`numpy.ndarray`. The array MUST be constructed from the corresponding `pxl_row_in_fullres` and `pxl_col_in_fullres` fields from `tissue_positions.csv`. See Space Ranger Spatial Outputs.

X_{suffix}

Key	X_{suffix} with the following requirements: {suffix} MUST be at least one character in length. The first character of {suffix} MUST be a letter of the alphabet and the remaining characters MUST be alphanumeric characters. (This is equivalent to the regular expression pattern `"^[a-zA-Z][a-zA-Z0-9]*$"`.) {suffix} MUST NOT be `"spatial"`. {suffix} is presented as text to users in the Embedding Choice selector in CELLxGENE Explorer so it is STRONGLY RECOMMENDED that it be descriptive. See also `default_embedding` in `uns`.
Annotator	Curator MUST annotate.
Value	`numpy.ndarray` with the following requirements: MUST have the same number of rows as `X` and MUST include at least two columns; MUST be a `numpy.dtype.kind` of `"f"`, `"i"`, or `"u"`; MUST NOT contain any positive infinity (`numpy.inf`) or negative infinity (`numpy.NINF`) values; MUST NOT contain all Not a Number (`numpy.nan`) values.

niknak33 commented 7 months ago

Thanks Brian

I was thinking something along these lines:

HighRes_Map, FullRes_Map, and LowRes_Map for the names, I believe those follow the guidelines.

Thank you,

Nik.

brianraymor commented 7 months ago

I performed the renames for HighRes_Map and FullRes_Map. We agreed to not support low resolution images.

niknak33 commented 7 months ago

I totally understand that; I just assumed you might want it in case you wanted to reference it at some point.

Thanks,

Nik

brianraymor commented 6 months ago

Assay	"raw" required?	"raw" location	"normalized" required?	"normalized" location
Visium Spatial Gene Expression	REQUIRED. The unfiltered feature-barcode matrix (`raw_feature_bc_matrix`) MUST be used. See Space Ranger Feature-Barcode Matrices. Values MUST be de-duplicated molecule counts. All non-zero values MUST be positive integers stored as `numpy.float32`.	`AnnData.raw.X` unless no "normalized" is provided, then `AnnData.X`	STRONGLY RECOMMENDED. This MUST be a `scipy.sparse.csr_matrix`. If the `obs['in_tissue']` is `0` for an observation, then the values of its corresponding variable references MUST be implicit zero.	`AnnData.X`
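
The "implicit zero" rule for off-tissue observations can be read as: in the CSR representation, a row with `in_tissue` equal to `0` stores no values at all. The sketch below works directly on the `indptr` array of a CSR matrix (for a `scipy.sparse.csr_matrix` named `m`, that would be `m.indptr`), so it needs no third-party imports. The function names and sample data are hypothetical.

```python
def stored_values_per_row(indptr):
    """Number of explicitly stored entries in each CSR row."""
    return [indptr[i + 1] - indptr[i] for i in range(len(indptr) - 1)]


def off_tissue_rows_are_empty(indptr, in_tissue) -> bool:
    """Return True if every row with in_tissue == 0 has no stored entries."""
    counts = stored_values_per_row(indptr)
    return all(count == 0 for count, flag in zip(counts, in_tissue) if flag == 0)


# Three observations: row 0 stores 2 values, row 1 stores none, row 2 stores 1.
example_indptr = [0, 2, 2, 3]
```

An explicitly stored `0.0` in an off-tissue row counts as a stored entry here, so it would fail the check even though the value is numerically zero.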

pablo-gar commented 6 months ago

Looks good to me!

jahilton commented 6 months ago

I am expecting further requirements to enforce The unfiltered feature-barcode matrix (raw_feature_bc_matrix) MUST be used.

brianraymor commented 6 months ago

I am expecting further requirements to enforce

Do you have specific suggestions for the schema, @jahilton ? CC: @pablo-gar

jahilton commented 6 months ago

First thought is require obs count to be in a specific range.

jychien commented 6 months ago

My question would be: if I were a Census user who did not want to reprocess each Visium dataset from scratch but wanted to integrate the data, is there a straightforward way to subset to only the spots analyzed by the authors? Would it be implied that I would need to use spots with cell_type not unknown and in_tissue:1?

pablo-gar commented 6 months ago

From my conversations with SpatialData developers:

They are moving to a place where they provide defaults for most use cases but still enable flexibility. In this case, the default is to load “in-tissue” data, but the flexibility exists.

Census is likely to adopt such paradigm as well

pablo-gar commented 6 months ago

LGTM! One typo I found

CELLxGENE Explore MUST add a selector item named "spatial_HighRes_Map" and MUST scale the full resolution embedding by the value of uns['spatial'][library_id]['scalefactors']['tissue_hires_scalef'].

Explore is missing the r