chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Redesign table in X (Matrix Layers) to incorporate modality #848

Open brianraymor opened 5 months ago

brianraymor commented 5 months ago

Context

@jahilton reported the case in #cell-science-platform.

This issue depends on Add modality.

The requirements may be further refined. For example, it may be best to rewrite the table in X (Matrix Layers) to be much more specific about assays, similar to the new row for Visium Spatial Gene Expression.

@jahilton's recommendations:

I believe expanding to a per assay table is going to be massive & unnecessary. Happy to look at a draft if I’m imagining things wrong. So as an alternative, 3 or 4 rows total…

Design

| assay_ontology_term_id or modality | "raw" required? | "raw" location | "normalized" required? | "normalized" location |
| --- | --- | --- | --- | --- |
| modality is "transcriptomics" and assay_ontology_term_id is NOT "EFO:0010961" for Visium Spatial Gene Expression | REQUIRED. If UMI-based assay (e.g. 10x v3, Slide-seqV2), values MUST be de-duplicated molecule counts. If non-UMI-based assay (e.g. Smart-seq2), values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). Each observation MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32. | AnnData.raw.X unless no "normalized" is provided, then AnnData.X | STRONGLY RECOMMENDED | AnnData.X |
| modality is "transcriptomics" and assay_ontology_term_id is "EFO:0010961" for Visium Spatial Gene Expression | REQUIRED. Values MUST be de-duplicated molecule counts. All non-zero values MUST be positive integers stored as numpy.float32. If uns['spatial']['is_single'] is False then each observation MUST contain at least one non-zero value. If uns['spatial']['is_single'] is True then the unfiltered feature-barcode matrix (raw_feature_bc_matrix) MUST be used. See Space Ranger Feature-Barcode Matrices. This matrix MUST contain 4992 rows. If the obs['in_tissue'] value is 1, then the observation MUST contain at least one non-zero value. If any obs['in_tissue'] values are 0, then at least one observation corresponding to an obs['in_tissue'] with a value of 0 MUST contain a non-zero value. | AnnData.raw.X unless no "normalized" is provided, then AnnData.X | STRONGLY RECOMMENDED | AnnData.X |
| modality is "epigenomics" | NOT REQUIRED | | REQUIRED | AnnData.X STRONGLY RECOMMENDED |

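The Visium conditions in the table can be sketched as a small check. This is only an illustration under stated assumptions: the function name `check_visium_raw` is hypothetical, the matrix is modelled as plain Python lists rather than an AnnData sparse matrix, and the numpy.float32 dtype and positive-integer checks are left out.

```python
EXPECTED_ROWS = 4992  # row count the table requires when is_single is True


def check_visium_raw(rows, is_single, in_tissue=None):
    """Return a list of error strings for the Visium "raw" matrix rules.

    rows      -- one list of values per observation (stand-in for the raw matrix)
    is_single -- stand-in for uns['spatial']['is_single']
    in_tissue -- stand-in for obs['in_tissue'], only used when is_single is True
    """
    has_value = [any(v != 0 for v in row) for row in rows]
    errors = []

    if not is_single:
        # Every observation must contain at least one non-zero value.
        for i, ok in enumerate(has_value):
            if not ok:
                errors.append(f"observation {i} has no non-zero value")
        return errors

    if len(rows) != EXPECTED_ROWS:
        errors.append(f"expected {EXPECTED_ROWS} rows, found {len(rows)}")

    # Observations with in_tissue == 1 must each have a non-zero value.
    for i, (flag, ok) in enumerate(zip(in_tissue, has_value)):
        if flag == 1 and not ok:
            errors.append(f"in-tissue observation {i} has no non-zero value")

    # If any in_tissue == 0 observations exist, at least one needs a non-zero value.
    off_tissue = [ok for flag, ok in zip(in_tissue, has_value) if flag == 0]
    if off_tissue and not any(off_tissue):
        errors.append("no observation with in_tissue == 0 has a non-zero value")

    return errors
```

The off-tissue rule is checked once over the whole matrix because the table only asks for at least one such observation to be non-zero, not all of them.
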
brianraymor commented 4 months ago

@jahilton @nayib-jose-gloria - would you review the X table re-design in the top-level summary comment?

I believe that it addresses both of your recommendations.

@nayib-jose-gloria - the related validation code can be simplified to depend on the modality (with the exception of visium).
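
As a rough illustration of what "depend on the modality (with the exception of visium)" could look like, here is a minimal sketch. The function name `x_requirements` and the returned dictionary shape are hypothetical and are not taken from the actual validator; only the modality values and the Visium term ID come from the table.

```python
VISIUM_TERM_ID = "EFO:0010961"  # Visium Spatial Gene Expression, per the table


def x_requirements(modality, assay_ontology_term_id):
    """Pick the table row that applies to a dataset (hypothetical helper)."""
    if modality == "transcriptomics":
        # Visium is the one case that still needs the assay term.
        row = "visium" if assay_ontology_term_id == VISIUM_TERM_ID else "transcriptomics"
        return {"row": row, "raw": "REQUIRED", "normalized": "STRONGLY RECOMMENDED"}
    if modality == "epigenomics":
        return {"row": "epigenomics", "raw": "NOT REQUIRED", "normalized": "REQUIRED"}
    raise ValueError(f"no X requirements defined for modality {modality!r}")
```

For example, `x_requirements("transcriptomics", "EFO:0010961")["row"]` selects the Visium row, while any other transcriptomics term falls through to the general row.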

10X multiome is implicitly supported because the requirements simply depend on its modality.

It is another case that depends on ordering of field validation.

jahilton commented 4 months ago

I would order them in order of prevalence in the corpus - non-Visium transcriptomics, then Visium, then epigenomics

non-Visium transcriptomics

observation is a more accurate term than cell

scRNA-seq feels redundant as the row is only transcriptomics, and too broad (we are also discussing snRNA-seq), and reads a little funny. How about...

REQUIRED. If UMI-based assay (e.g. 10x v3, Slide-seqV2), values MUST be de-duplicated molecule counts. If non-UMI-based assay (e.g. Smart-seq2), values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). Each observation MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32.
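
To make the two value rules in that wording concrete, here is a minimal sketch. The function name `check_raw_counts` is hypothetical, observations are modelled as plain Python lists, and the numpy.float32 storage requirement is not checked here.

```python
def check_raw_counts(rows):
    """Return error strings for the non-Visium "raw" value rules.

    rows -- one list of numeric values per observation
    """
    errors = []
    for i, row in enumerate(rows):
        # Rule 1: each observation needs at least one non-zero value.
        if not any(v != 0 for v in row):
            errors.append(f"observation {i} has no non-zero value")
        # Rule 2: every non-zero value must be a positive integer
        # (it may still be stored as a float, e.g. 3.0).
        for v in row:
            if v != 0 and (v < 0 or not float(v).is_integer()):
                errors.append(f"observation {i} has an invalid value: {v}")
                break  # one message per observation is enough
    return errors
```

Whether a value is a de-duplicated molecule count, a read count or an estimated fragment cannot be told from the matrix alone, so that part of the requirement stays outside this check.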

Visium

EFO:0010961 does not have " around it

cell should definitely be avoided here. spot works since we are narrowly discussing Visium, observation is a more general term that also works & would be symmetric with non-Visium transcriptomics

nayib-jose-gloria commented 4 months ago

Re-framing this around modality makes sense to me. Thanks!

brianraymor commented 4 months ago

observation is a more accurate term than cell

Adopted.

scRNA-seq feels redundant as the row is only transcriptomics

Apologies. Managed not to include your original editorial suggestions when I was merging the rows.

EFO:0010961 does not have " around it

Updated.

I would order them in order of prevalence in the corpus - non-Visium transcriptomics, then Visium, then epigenomics

It was intentional to match other table order, but updated.

jahilton commented 4 months ago

LGTM

brianraymor commented 1 month ago

Based on the renewed discovery for 10X multiome, I'm reverting this issue from schema 5.2.0 and re-opening.