brianraymor commented 5 months ago

Context

@jahilton reported the case in #cell-science-platform.

This issue depends on Add modality.

The requirements may be further refined. For example, it may be best to rewrite the table in X (Matrix Layers) to be much more specific about assays, similar to the new row for Visium Spatial Gene Expression.

@jahilton's recommendations:

I believe expanding to a per assay table is going to be massive & unnecessary. Happy to look at a draft if I’m imagining things wrong. So as an alternative, 3 or 4 rows total…

1 row - assay:Visium Spatial Gene Expression
1 or 2 rows - modality:transcriptomics (but not assay:Visium Spatial Gene Expression)
- can be combined to one and include the distinction within “Values MUST be de-duplicated molecule counts if UMI-based assay (e.g. 10x v3, Slide-seqV2), otherwise MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM).”
- or kept as two like it is now
1 row - modality:epigenomics (currently labeled as “Accessibility”)

Design

| assay_ontology_term_id or modality | "raw" required? | "raw" location | "normalized" required? | "normalized" location |
|---|---|---|---|---|
| `modality` is `"transcriptomics"` and `assay_ontology_term_id` is NOT `"EFO:0010961"` for Visium Spatial Gene Expression | REQUIRED. If UMI-based assay (e.g. 10x v3, Slide-seqV2), values MUST be de-duplicated molecule counts. If non-UMI-based assay (e.g. Smart-seq2), values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). Each observation MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as `numpy.float32`. | `AnnData.raw.X` unless no "normalized" is provided, then `AnnData.X` | STRONGLY RECOMMENDED | `AnnData.X` |
| `modality` is `"transcriptomics"` and `assay_ontology_term_id` is `"EFO:0010961"` for Visium Spatial Gene Expression | REQUIRED. Values MUST be de-duplicated molecule counts. All non-zero values MUST be positive integers stored as `numpy.float32`. If `uns['spatial']['is_single']` is `False` then each observation MUST contain at least one non-zero value. If `uns['spatial']['is_single']` is `True` then the unfiltered feature-barcode matrix (`raw_feature_bc_matrix`) MUST be used. See Space Ranger Feature-Barcode Matrices. This matrix MUST contain 4992 rows. If the `obs['in_tissue']` value is `1`, then the observation MUST contain at least one non-zero value. If any `obs['in_tissue']` values are `0`, then at least one observation corresponding to a `obs['in_tissue']` with a value of `0` MUST contain a non-zero value. | `AnnData.raw.X` unless no "normalized" is provided, then `AnnData.X` | STRONGLY RECOMMENDED | `AnnData.X` |
| `modality` is `"epigenomics"` | NOT REQUIRED | | REQUIRED | `AnnData.X` | STRONGLY RECOMMENDED |
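The Visium row packs several conditions together, so a rough sketch of the `is_single` is `True` case may help when reviewing it. This is only an illustration under stated assumptions: the function name `check_visium_single` and its parameters are hypothetical, the matrix is a plain list of rows rather than an `AnnData` object, and the `numpy.float32` storage requirement is not checked.

```python
EXPECTED_ROWS = 4992  # row count the table requires for the unfiltered matrix


def check_visium_single(rows, in_tissue, expected_rows=EXPECTED_ROWS):
    """Return a list of problems for a dense raw matrix (hypothetical helper).

    rows      -- list of observations, each a list of counts
    in_tissue -- list of 0/1 flags, one per observation (obs['in_tissue'])
    """
    problems = []
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")

    # True where the observation has at least one non-zero value.
    has_counts = [any(v != 0 for v in row) for row in rows]

    # Every in-tissue observation must contain a non-zero value.
    for i, (flag, ok) in enumerate(zip(in_tissue, has_counts)):
        if flag == 1 and not ok:
            problems.append(f"observation {i}: in_tissue is 1 but all values are zero")

    # If any observation is off-tissue, at least one of them must be non-zero.
    off_tissue = [ok for flag, ok in zip(in_tissue, has_counts) if flag == 0]
    if off_tissue and not any(off_tissue):
        problems.append("no observation with in_tissue == 0 has a non-zero value")
    return problems
```

The last check is a cheap signal that the unfiltered matrix was submitted: a filtered matrix padded with zeros would fail it.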

brianraymor commented 4 months ago

@jahilton @nayib-jose-gloria - would you review the X table re-design in the top-level summary comment?

I believe that it addresses both of your recommendations.

@nayib-jose-gloria - the related validation code can be simplified to depend on the modality (with the exception of visium).

10X multiome is implicitly supported because the requirements simply depend on its modality.

It is another case that depends on ordering of field validation.
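A minimal sketch of how the validation could branch on modality, with Visium as the single assay-level exception. The function name `select_x_rule` and the labels it returns are hypothetical and not taken from the existing validator; the sketch also assumes `modality` has already been validated, which is the ordering dependency mentioned above.

```python
VISIUM = "EFO:0010961"  # Visium Spatial Gene Expression, per the table


def select_x_rule(modality, assay_ontology_term_id):
    """Return a label naming which row of the X table applies (hypothetical)."""
    if modality == "transcriptomics":
        # Visium is the only case that still needs the assay term.
        if assay_ontology_term_id == VISIUM:
            return "visium"
        return "transcriptomics"
    if modality == "epigenomics":
        return "epigenomics"
    # Assumes modality was validated earlier; anything else is a caller error.
    raise ValueError(f"unsupported modality: {modality!r}")
```

Under this sketch, an observation is routed purely by its `modality` value unless it is Visium, so no assay-specific branch is needed for other assays.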

jahilton commented 4 months ago

I would order them in order of prevalence in the corpus - non-Visium transcriptomics, then Visium, then epigenomics

non-Visium transcriptomics
- observation is a more accurate term than cell
- scRNA-seq feels redundant as the row is only transcriptomics, and too broad (we are also discussing snRNA-seq), and reads a little funny. How about...

REQUIRED. If UMI-based assay (e.g. 10x v3, Slide-seqV2), values MUST be de-duplicated molecule counts. If non-UMI-based assay (e.g. Smart-seq2), values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). Each observation MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32.
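The two value-level rules in this wording (each observation has a non-zero value; non-zero values are positive integers) could be checked along these lines. This is a sketch only: `check_raw_values` is a hypothetical name, the matrix is a plain list of rows rather than a sparse `AnnData` matrix, and it does not verify the `numpy.float32` dtype or whether counts are UMI-deduplicated.

```python
def check_raw_values(rows):
    """Return a list of problems found in a dense raw matrix (hypothetical helper).

    rows -- list of observations, each a list of numbers
    """
    problems = []
    for i, row in enumerate(rows):
        nonzero = [v for v in row if v != 0]
        # Each observation MUST contain at least one non-zero value.
        if not nonzero:
            problems.append(f"observation {i}: no non-zero value")
        # All non-zero values MUST be positive integers (stored as floats).
        for v in nonzero:
            if v < 0 or not float(v).is_integer():
                problems.append(f"observation {i}: {v!r} is not a positive integer")
    return problems
```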

Visium
- EFO:0010961 does not have " around it
- cell should definitely be avoided here. spot works since we are narrowly discussing Visium, observation is a more general term that also works & would be symmetric with non-Visium transcriptomics

nayib-jose-gloria commented 4 months ago

Re-framing this around modality makes sense to me. Thanks!

brianraymor commented 4 months ago

> observation is a more accurate term than cell

Adopted.

> scRNA-seq feels redundant as the row is only transcriptomics

Apologies. Managed not to include your original editorial suggestions when I was merging the rows.

> EFO:0010961 does not have " around it

Updated.

> I would order them in order of prevalence in the corpus - non-Visium transcriptomics, then Visium, then epigenomics

It was intentional to match other table order, but updated.

jahilton commented 4 months ago

LGTM

brianraymor commented 1 month ago

Based on the renewed discovery for 10X multiome, I'm reverting this issue from schema 5.2.0 and re-opening.

chanzuckerberg / single-cell-curation

Redesign table in X (Matrix Layers) to incorporate modality #848
