chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Add requirements for Visium Spatial Gene Expression assay #674

Closed brianraymor closed 6 months ago

brianraymor commented 11 months ago

Context

This addresses the census requirements authored by @pablo-gar.

See "Improve CELLxGENE’s value proposition for data submitters and consumers by supporting visium and slide-seq experiments" and "Data Platform changes required to support visium experiments".

Design

Pending

[NTR] Version specific Visium assays

For easier review, these requirements are additive to the corresponding sections in the schema 4 draft.

General Requirements

...

Visium Spatial Gene Expression. It is STRONGLY RECOMMENDED that Visium Spatial Gene Expression datasets represent one Space Ranger output for a single tissue section. This representation is referenced throughout the schema as Visium Single.

Visium datasets that represent multiple Space Ranger outputs MAY be submitted. This representation is referenced throughout the schema as Visium Multiple which will have limited support in CELLxGENE experiences:


X (Matrix Layers)

...

The following table describes the matrix data and layers requirements that are assay-specific. If an entry in the table is empty, the schema does not have any other requirements on data in those layers beyond the ones listed above.

Assay Visium Spatial Gene Expression
"raw" required? REQUIRED. It is STRONGLY RECOMMENDED to use the unfiltered feature-barcode matrix (raw_feature_bc_matrix). See Space Ranger Feature-Barcode Matrices. Values MUST be de-duplicated molecule counts. Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32.
"raw" location AnnData.raw.X unless no "normalized" is provided, then AnnData.X
"normalized" required? STRONGLY RECOMMENDED
"normalized" location AnnData.X
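
As a rough illustration of the value constraints in this row, the sketch below checks a dense counts array for the float32 dtype, positive-integer non-zero values, and at least one non-zero value per cell. The function name check_raw_counts is hypothetical and not part of any CELLxGENE tool, and a real validator would more likely work on a sparse matrix.

```python
import numpy as np

def check_raw_counts(raw: np.ndarray) -> list[str]:
    """Return human-readable violations of the raw-count rules (empty if none)."""
    errors = []
    # Counts are expected to be stored as numpy.float32.
    if raw.dtype != np.float32:
        errors.append(f"dtype is {raw.dtype}, expected float32")
    # Every non-zero value must be a positive whole number.
    nonzero = raw[raw != 0]
    if np.any(nonzero < 0) or np.any(nonzero != np.floor(nonzero)):
        errors.append("non-zero values must be positive integers")
    # Each cell (row) must hold at least one non-zero value.
    if np.any((raw != 0).sum(axis=1) == 0):
        errors.append("at least one cell has no non-zero value")
    return errors
```

Returning a list of messages rather than raising on the first problem lets a curator see every violation in one pass.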

obs (Cell Metadata)

obs is a pandas.DataFrame.

...



Editor Note: See my comment related to filtering out some visium observations from Explorer.

Editor Note: For Visium Single datasets based on one tissue sample, all the following fields MUST have singleton values:



assay_ontology_term_id

If Visium Single, all observations MUST be the same value.

development_stage_ontology_term_id

If Visium Single, all observations MUST be the same value.

donor_id

If Visium Single, all observations MUST be the same value.

organism_ontology_term_id

If Visium Single, all observations MUST be the same value.

self_reported_ethnicity_ontology_term_id

If Visium Single, all observations MUST be the same value.

sex_ontology_term_id

If Visium Single, all observations MUST be the same value.
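
The singleton rule for the fields above could be checked along these lines. This is a minimal sketch that assumes obs is a pandas.DataFrame; the helper name non_singleton_fields is hypothetical, and columns that are absent are skipped rather than reported.

```python
import pandas as pd

# Fields listed above that must hold a single value for Visium Single.
SINGLETON_FIELDS = [
    "assay_ontology_term_id",
    "development_stage_ontology_term_id",
    "donor_id",
    "organism_ontology_term_id",
    "self_reported_ethnicity_ontology_term_id",
    "sex_ontology_term_id",
]

def non_singleton_fields(obs: pd.DataFrame) -> list[str]:
    """Return the listed fields that are present and hold more than one distinct value."""
    return [
        field
        for field in SINGLETON_FIELDS
        if field in obs.columns and obs[field].nunique(dropna=False) > 1
    ]
```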


array_col

Key array_col
Annotator Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value int. This MUST be the value of the column coordinate for the corresponding spot from the array_col field in tissue_positions_list.csv or tissue_positions.csv. The value MUST be in the range between 0 and 127. See Space Ranger Spatial Outputs.


array_row

Key array_row
Annotator Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value int. This MUST be the value of the row coordinate for the corresponding spot from the array_row field in tissue_positions_list.csv or tissue_positions.csv. The value MUST be in the range between 0 and 77. See Space Ranger Spatial Outputs.
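
A small sketch of the two range checks for array_col and array_row follows. It assumes both bounds are inclusive, which the text above does not state explicitly, and the helper name spot_in_range is hypothetical.

```python
# Bounds taken from the array_col and array_row rows above.
# Treating them as inclusive is an assumption of this sketch.
ARRAY_COL_RANGE = (0, 127)
ARRAY_ROW_RANGE = (0, 77)

def spot_in_range(array_row: int, array_col: int) -> bool:
    """True if both coordinates fall inside the stated ranges."""
    return (
        ARRAY_ROW_RANGE[0] <= array_row <= ARRAY_ROW_RANGE[1]
        and ARRAY_COL_RANGE[0] <= array_col <= ARRAY_COL_RANGE[1]
    )
```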


cell_type_ontology_term_id

Key cell_type_ontology_term_id
Annotator Curator MUST annotate.
Value categorical with str categories. This MUST be a CL term or "unknown" if:
  • no appropriate term can be found (e.g. the cell type is unknown)
  • Visium Single and the corresponding value of in_tissue is 0

  • The following terms MUST NOT be used:


in_tissue



Editor Note: This could be modeled as a boolean. Seurat models it as an integer. Squidpy models it as an int64. There was agreement to use an int for consistency.



Key in_tissue
Annotator Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value int. This MUST be the value for the corresponding spot from the in_tissue field in tissue_positions_list.csv or tissue_positions.csv which is either 0 if the spot falls outside tissue or 1 if the spot falls inside tissue. See Space Ranger Spatial Outputs.
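
One way to pull in_tissue, array_row and array_col out of a positions file is sketched below using only the standard library. It assumes the column order barcode, in_tissue, array_row, array_col, pxl_row_in_fullres, pxl_col_in_fullres, and that tissue_positions.csv carries a header row while tissue_positions_list.csv does not; both points should be confirmed against the Space Ranger documentation. The function name read_positions is hypothetical.

```python
import csv
import io

# Assumed column order of the Space Ranger positions files.
COLUMNS = ["barcode", "in_tissue", "array_row", "array_col",
           "pxl_row_in_fullres", "pxl_col_in_fullres"]

def read_positions(text: str, has_header: bool) -> dict[str, dict[str, int]]:
    """Map each barcode to its integer in_tissue, array_row and array_col values."""
    rows = list(csv.reader(io.StringIO(text)))
    if has_header:
        rows = rows[1:]  # drop the header line
    spots = {}
    for row in rows:
        record = dict(zip(COLUMNS, row))
        in_tissue = int(record["in_tissue"])
        if in_tissue not in (0, 1):
            raise ValueError(f"in_tissue must be 0 or 1, got {in_tissue}")
        spots[record["barcode"]] = {
            "in_tissue": in_tissue,
            "array_row": int(record["array_row"]),
            "array_col": int(record["array_col"]),
        }
    return spots
```

The resulting mapping can then be aligned to the observation index before assigning the three obs columns.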


obsm (Embeddings)

The size of the ndarray stored for a key in obsm MUST NOT be zero.

To display a dataset in CELLxGENE Explorer, Curators MUST annotate one or more embeddings of at least two dimensions (e.g. tSNE, UMAP, PCA, spatial coordinates) as numpy.ndarrays in obsm.



Editor Note: Jason recommends that the spatial implementation requirements for Explorer (selector names, scaling) be documented elsewhere. Brian says "in for a penny in for a pound".



spatial

Key spatial. For each available underlay image in uns['spatial'][library_id]['images'], CELLxGENE Explorer MUST add a corresponding item to its Embedding Choice selector and appropriately scale the embedding:

  • CELLxGENE Explorer MUST add a selector item named "spatial_HighRes_Map" and MUST scale the full resolution embedding by the value of uns['spatial'][library_id]['scalefactors']['tissue_hires_scalef'].

  • If uns['spatial'][library_id]['images']['fullres'] is present, then CELLxGENE Explorer MUST add a selector item named "spatial_FullRes_Map" and MUST NOT scale the full resolution embedding.

If "spatial" is set as the default_embedding, then Explorer MUST present "spatial_HighRes_Map" as the default.
Annotator Curator MUST annotate if Visium Single.
Value numpy.ndarray. The array MUST be constructed from the corresponding pxl_row_in_fullres and pxl_col_in_fullres fields from tissue_positions_list.csv or tissue_positions.csv. See Space Ranger Spatial Outputs.
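
The construction and the hires scaling described above might look like the following sketch. The requirement does not say which coordinate comes first, so the (col, row) order used here, which corresponds to (x, y), is purely an assumption; build_spatial and scale_for_hires are hypothetical helper names.

```python
import numpy as np

def build_spatial(pxl_row, pxl_col) -> np.ndarray:
    """Stack full-resolution pixel coordinates into an (n_spots, 2) array.

    The (col, row) column order is an assumption of this sketch.
    """
    return np.column_stack([pxl_col, pxl_row]).astype(float)

def scale_for_hires(spatial: np.ndarray, tissue_hires_scalef: float) -> np.ndarray:
    """Scale full-resolution coordinates into hires image pixel space."""
    return spatial * tissue_hires_scalef
```

Keeping a single full-resolution array in obsm and scaling at display time means the same embedding serves both the hires and the fullres underlay.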



X_{suffix}

Key X_{suffix} with the following requirements:

  • {suffix} MUST be at least one character in length.
  • The first character of {suffix} MUST be a letter of the alphabet and the remaining characters MUST be alphanumeric characters. (This is equivalent to the regular expression pattern "^[a-zA-Z][a-zA-Z0-9]*$".)
  • {suffix} MUST NOT be "spatial".

{suffix} is presented as text to users in the Embedding Choice selector in CELLxGENE Explorer so it is STRONGLY RECOMMENDED that it be descriptive.

See also default_embedding in uns.
Annotator Curator MUST annotate if NOT Visium Single.
Value numpy.ndarray with the following requirements
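
The {suffix} rules above translate fairly directly into a check like this one. The regular expression is the one quoted in the requirement; the function name is_valid_embedding_key is hypothetical.

```python
import re

# Pattern quoted in the requirement above.
SUFFIX_PATTERN = re.compile(r"^[a-zA-Z][a-zA-Z0-9]*$")

def is_valid_embedding_key(key: str) -> bool:
    """Check an obsm key of the form X_{suffix} against the suffix rules."""
    if not key.startswith("X_"):
        return False
    suffix = key[2:]
    # fullmatch also rejects a trailing newline, which "$" alone would allow.
    return suffix != "spatial" and SUFFIX_PATTERN.fullmatch(suffix) is not None
```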


uns (Dataset Metadata)

...

default_embedding

Key default_embedding
Annotator Curator MAY annotate.
Value str. The value MUST match a key to an embedding in obsm for the embedding to display by default in CELLxGENE Explorer.
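
A minimal sketch of the default_embedding rule, assuming uns behaves like a dict and obsm_keys is any collection of obsm key names. The function name default_embedding_is_valid is hypothetical.

```python
def default_embedding_is_valid(uns: dict, obsm_keys) -> bool:
    """True if default_embedding is absent (it is optional) or names an obsm key."""
    if "default_embedding" not in uns:
        return True
    value = uns["default_embedding"]
    return isinstance(value, str) and value in obsm_keys
```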


spatial



Editor Note: Add a requirement that only the fields documented in the schema must be present under spatial.



Key spatial
Annotator Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value dict. The key-value pairs are documented in the following sections:
  • spatial[library_id]
  • spatial[library_id]['images']
  • spatial[library_id]['images']['fullres']
  • spatial[library_id]['images']['hires']
  • spatial[library_id]['metadata']
  • spatial[library_id]['scalefactors']
  • spatial[library_id]['scalefactors']['spot_diameter_fullres']
  • spatial[library_id]['scalefactors']['tissue_hires_scalef']


spatial[library_id]

Key Identifier for the Visium library
Annotator Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value dict. There MUST be only one library_id.


spatial[library_id]['images']

Key images
Annotator Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value dict


spatial[library_id]['images']['fullres']

Key fullres
Annotator Curator MAY annotate if Visium Single; otherwise, this key MUST NOT be present.
Value ndarray

It is STRONGLY RECOMMENDED that the submitter include the full resolution image which MUST be converted to an array of shape (, , 3).


spatial[library_id]['images']['hires']

Key hires
Annotator Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value ndarray

tissue_hires_image.png MUST be converted to an array of shape (, , 3). Its largest dimension MUST be 2000 pixels. See Space Ranger Spatial Outputs.
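
The hires image constraints could be checked roughly as follows. The first two entries of the shape are elided in the text above; this sketch assumes they are the image height and width and that "largest dimension" refers to the larger of those two. The function name check_hires_image is hypothetical.

```python
import numpy as np

def check_hires_image(image: np.ndarray) -> list[str]:
    """Return violations of the hires image rules (empty if none)."""
    errors = []
    # Expect three axes with exactly 3 channels on the last one.
    if image.ndim != 3 or image.shape[2] != 3:
        errors.append(f"expected a 3-D array with 3 channels, got shape {image.shape}")
    # Assumption: the larger of the first two axes must be 2000 pixels.
    elif max(image.shape[:2]) != 2000:
        errors.append(f"largest dimension is {max(image.shape[:2])}, expected 2000")
    return errors
```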




Editor Note: Document that metadata is supported for scverse cases.



spatial[library_id]['metadata']

Key metadata
Annotator Curator MAY annotate if Visium Single; otherwise, this key MUST NOT be present.
Value dict


spatial[library_id]['scalefactors']

Key scalefactors
Annotator Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value dict


spatial[library_id]['scalefactors']['spot_diameter_fullres']

Key spot_diameter_fullres
Annotator Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value float. This MUST be the value of the spot_diameter_fullres field from scalefactors_json.json. See Space Ranger Spatial Outputs.


spatial[library_id]['scalefactors']['tissue_hires_scalef']

Key tissue_hires_scalef
Annotator Curator MUST annotate if Visium Single; otherwise, this key MUST NOT be present.
Value float. This MUST be the value of the tissue_hires_scalef field from scalefactors_json.json. See Space Ranger Spatial Outputs.
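
Taken together, the uns['spatial'] requirements in this draft amount to a small structural check. The sketch below assumes spatial is a plain dict as described above and covers only the required keys; check_spatial_uns is a hypothetical name, not an existing validator function.

```python
def check_spatial_uns(spatial: dict) -> list[str]:
    """Return violations of the draft uns['spatial'] structure (empty if none)."""
    # There MUST be only one library_id.
    if len(spatial) != 1:
        return [f"expected exactly one library_id, found {len(spatial)}"]
    library = next(iter(spatial.values()))
    errors = []
    images = library.get("images", {})
    scalefactors = library.get("scalefactors", {})
    # hires is required; fullres is optional and therefore not checked here.
    if "hires" not in images:
        errors.append("images['hires'] is required")
    for key in ("spot_diameter_fullres", "tissue_hires_scalef"):
        if not isinstance(scalefactors.get(key), float):
            errors.append(f"scalefactors['{key}'] must be a float")
    return errors
```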




Editor Note: Removed slide_version in favor of @jychien's proposal for adding EFO terms:



Appendix A. Changelog

schema v4.1.0

jahilton commented 9 months ago

for in_tissue:0 observations, I would expect a dependency with cell_type. The cell_type_ontology_term_id MUST be ____ (I suggest to add a new value like empty).

[Brian responds] - Another option is to filter in_tissue:0. This feature is supported in Seurat:

The Read10X_Image filter_matrix boolean parameter enables "Filter spot/feature matrix to only include spots that have been determined to be over tissue." The default is TRUE.

[Jason responds to Brian responds] - Clarify "filter" for me. Who/what is filtering?...also if we are accepting in_tissue:0, then those will need to be excused from the "Each cell MUST contain at least one non-zero value." rule


Could enforce tissue_type is tissue or organoid if it's Visium.


There is nothing that addresses single section/library Datasets vs integrated Datasets. So currently, images & spatial embeddings are required for datasets where multiple slides have been integrated and those aren't as useful. Aggregated Datasets will be required to submit one (and only one) hires image? And are downstream features OK with consuming Datasets that will differ in this key aspect? (I would assume that they'd like to ignore the integrated cases and only consume the individual sections)

[Brian responds] RE "There is nothing ...", the requirement is "There MUST be only one library_id", which enforces one image at its different resolutions.

[Jason responds to Brian responds] - That doesn't do it for me. To me, that means that the contributor of this dataset will be forced to pick 1 of 3 library_id values (or come up with a new one that merges them) and 1 image.

[Brian responds] I will start a thread in #cell-science-modalities to review how to mitigate violations of the "under current capabilities" policy that never allowed this use case.


Or should the library_id be eliminated - it's used for aggregation in frameworks.

☝️ For a Dataset with a single section/library, this is unnecessary. Something that the downstream features can consider adding (in a standardized & globally unique manner) as they are aggregating Datasets for users.


What is the value prop for the lowres when the hires is required?

[Brian responds] I defer first to @pablo-gar since it is a census optional requirement and then to @sidneymbell. I have seen references to lowres images - https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM5808057, but I'm unaware of any relevant use case.


Is there value in storing the Capture Area size? Could it be calculated from the array_*?

[Brian responds] - That is why I'm suggesting capturing the slide serial number. See Space Ranger Slide Serial and Capture Area Parameters

[Jason responds to Brian responds] - I am not seeing the suggestion in this proposal.


uns.embeddings - 3 different identifiers seems excessive (for non-Visium datasets, this adds 2 fields that need to be curated) What is the value prop to offer a "title" that differs from the current display of the obsm key? Can we just require consistency between image key & obsm key?

[Brian responds] Please see #single-cell-modalities.

[Jason responds to Brian responds] - I am not seeing the ask or the use case for a title to be specified rather than just display the obsm key like we currently do.


CELLxGENE Explorer MUST automatically apply the corresponding scalefactor from uns['spatial'][library_id]['scalefactors'] to the embedding.

So the embeddings are not scaled to each image at submission? Wouldn't this mean that each of the uns.embeddings point to the same obsm key, and that will just get scaled differently?

[Brian responds] Yes. spatial can be reused for different resolutions images. Please see #single-cell-modalities.

[Jason responds to Brian responds] - So for any dict in embeddings, if image is defined then embedding MUST be spatial?


Have we contacted 10x to ensure there are no plans to rename hires? This proposal is putting a lot of stock on their naming convention (esp when it can be misleading to some people - "the hi res isn't the highest?")

[Brian responds] Not planning to. There's a dependency on their naming conventions throughout the ecosystem. And this is why we have schema versions.

[Jason responds to Brian responds] - While we have the capability of changing things, I believe we should aim to future-proof against wielding that power - our users will benefit from stability. I'd prefer we aim for a standard that isn't reliant on the ever-changing whims of a company, and encourage the ecosystem to follow suit.

jychien commented 9 months ago

Thanks for the thorough schema proposal, Brian! Some general thoughts:

brianraymor commented 7 months ago

Rewrite for Embeddings

Editor's Note: The Explorer selector names are simply placeholders. I defer the actual names to @niknak33.

Per @jahilton's proposal:

  1. default_embedding is resurrected.
  2. embeddings are replaced by spatial and X_{suffix}
  3. {suffix} MUST NOT be "spatial"

obsm (Embeddings)

The size of the ndarray stored for a key in obsm MUST NOT be zero.

To display a dataset in CELLxGENE Explorer, Curators MUST annotate one or more embeddings of at least two dimensions (e.g. tSNE, UMAP, PCA, spatial coordinates) as numpy.ndarrays in obsm.

spatial

Key spatial. For each available underlay image in uns['spatial'][library_id]['images'], CELLxGENE Explorer MUST add a corresponding item to its Embedding Choice selector and appropriately scale the embedding:

  • CELLxGENE Explorer MUST add a selector item named "spatial (with high resolution)" and MUST scale the full resolution embedding by the value of uns['spatial'][library_id]['scalefactors']['tissue_hires_scalef'].

  • If uns['spatial'][library_id]['images']['tissue_fullres_image'] is present, then CELLxGENE Explorer MUST add a selector item named "spatial (with full resolution)" and MUST NOT scale the full resolution embedding.

  • If uns['spatial'][library_id]['images']['tissue_lowres_image'] is present, then CELLxGENE Explorer MUST add a selector item named "spatial (with low resolution)" and MUST scale the full resolution embedding by the value of uns['spatial'][library_id]['scalefactors']['tissue_lowres_scalef'].
Annotator Curator MUST annotate if the assay_ontology_term_id is EFO:0010961 for Visium Spatial Gene Expression.
Value numpy.ndarray. The array MUST be constructed from the corresponding pxl_row_in_fullres and pxl_col_in_fullres fields from tissue_positions.csv. See Space Ranger Spatial Outputs.



X_{suffix}

Key X_{suffix} with the following requirements:

  • {suffix} MUST be at least one character in length.
  • The first character of {suffix} MUST be a letter of the alphabet and the remaining characters MUST be alphanumeric characters. (This is equivalent to the regular expression pattern "^[a-zA-Z][a-zA-Z0-9]*$".)
  • {suffix} MUST NOT be "spatial".

{suffix} is presented as text to users in the Embedding Choice selector in CELLxGENE Explorer so it is STRONGLY RECOMMENDED that it be descriptive.

See also default_embedding in uns.
Annotator Curator MUST annotate.
Value numpy.ndarray with the following requirements


niknak33 commented 7 months ago

Thanks Brian

I was thinking something along these lines:

HighRes_Map, FullRes_Map, and LowRes_Map for the names, I believe those follow the guidelines.

Thank you,

Nik.

On Mon, Jan 29, 2024, 4:13 PM Brian Raymor @.***> wrote:

Rewrite for Embeddings


brianraymor commented 7 months ago

I performed the renames for HighRes_Map and FullRes_Map. We agreed to not support low resolution images.

niknak33 commented 7 months ago

I totally understand that; I just assumed you might want it in case you wanted to reference it at some point.

Thanks,

Nik

On Fri, Feb 2, 2024 at 10:30 AM Brian Raymor @.***> wrote:

I performed the renames for HighRes_Map and FullRes_Map. We agreed to not support low resolution images.


brianraymor commented 6 months ago

Assay Visium Spatial Gene Expression
"raw" required? REQUIRED. The unfiltered feature-barcode matrix (raw_feature_bc_matrix) MUST be used. See Space Ranger Feature-Barcode Matrices. Values MUST be de-duplicated molecule counts. All non-zero values MUST be positive integers stored as numpy.float32.
"raw" location AnnData.raw.X unless no "normalized" is provided, then AnnData.X
"normalized" required? STRONGLY RECOMMENDED. This MUST be a scipy.sparse.csr_matrix. If the obs['in_tissue'] is 0 for an observation, then the values of its corresponding variable references MUST be implicit zero.
"normalized" location AnnData.X

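
To illustrate the implicit-zero rule, the sketch below counts the explicitly stored entries in each row of a scipy.sparse.csr_matrix and requires that count to be zero wherever in_tissue is 0. Reading "implicit zero" as "no stored entry at all, not even a stored 0.0" is an interpretation, and the function name in_tissue_zero_rows_are_empty is hypothetical.

```python
import numpy as np
from scipy import sparse

def in_tissue_zero_rows_are_empty(matrix: sparse.csr_matrix, in_tissue: np.ndarray) -> bool:
    """True if every observation with in_tissue == 0 stores no entries at all."""
    # indptr[i + 1] - indptr[i] is the number of explicitly stored entries in row i.
    stored_per_row = np.diff(matrix.indptr)
    return bool(np.all(stored_per_row[in_tissue == 0] == 0))
```
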
pablo-gar commented 6 months ago

Looks good to me!

jahilton commented 6 months ago

I am expecting further requirements to enforce "The unfiltered feature-barcode matrix (raw_feature_bc_matrix) MUST be used."

brianraymor commented 6 months ago

I am expecting further requirements to enforce

Do you have specific suggestions for the schema, @jahilton ? CC: @pablo-gar

jahilton commented 6 months ago

First thought is require obs count to be in a specific range.

jychien commented 6 months ago

My question would be: if I were a Census user who did not want to reprocess each Visium dataset from scratch and wanted to integrate the data, is there a straightforward way to subset to only the spots analyzed by the authors? Would it be implied that I would need to use spots with cell_type not unknown and in_tissue:1?

pablo-gar commented 6 months ago

From my conversations with SpatialData developers:

They are moving to a place where they provide defaults for most use cases but still enable flexibility. In this case, the default is to load “in-tissue” data, but the flexibility exists.

Census is likely to adopt such a paradigm as well.

pablo-gar commented 6 months ago

LGTM! One typo I found

CELLxGENE Explore MUST add a selector item named "spatial_HighRes_Map" and MUST scale the full resolution embedding by the value of uns['spatial'][library_id]['scalefactors']['tissue_hires_scalef'].

Explore is missing the r