Open brianraymor opened 5 months ago
@jahilton @nayib-jose-gloria - would you review the X table re-design in the top-level summary comment?
I believe that it addresses both of your recommendations.
@nayib-jose-gloria - the related validation code can be simplified to depend on the modality (with the exception of Visium).
10X multiome is implicitly supported because the requirements simply depend on its modality.
It is another case that depends on ordering of field validation.
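A minimal sketch of how validation might pick the applicable X requirements from the modality, with Visium as the only assay-specific exception. The function name and the returned labels are hypothetical and are not taken from the validator. It assumes modality and assay_ontology_term_id have already been validated, which is why field ordering matters.

```python
VISIUM_TERM_ID = "EFO:0010961"  # Visium Spatial Gene Expression


def select_x_requirements(modality, assay_ontology_term_id):
    """Return a label (hypothetical) for the X table row that applies.

    Assumes both arguments were validated earlier in the field ordering.
    """
    if modality == "transcriptomics":
        # Visium is the only case that needs the assay term, not just the modality.
        if assay_ontology_term_id == VISIUM_TERM_ID:
            return "visium"
        return "non_visium_transcriptomics"
    if modality == "epigenomics":
        return "epigenomics"
    raise ValueError(f"unsupported modality: {modality!r}")


# "EFO:XXXXXXX" is a placeholder, not a real term id.
print(select_x_requirements("transcriptomics", "EFO:0010961"))
print(select_x_requirements("epigenomics", "EFO:XXXXXXX"))
```

Under this sketch, an assay is covered by whichever row matches its modality, without the assay itself being named in the table.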
I would order them in order of prevalence in the corpus - non-Visium transcriptomics, then Visium, then epigenomics
non-Visium transcriptomics
observation is a more accurate term than cell
scRNA-seq feels redundant as the row is only transcriptomics, and too broad (we are also discussing snRNA-seq), and reads a little funny. How about...
REQUIRED. If UMI-based assay (e.g. 10x v3, Slide-seqV2), values MUST be de-duplicated molecule counts. If non-UMI-based assay (e.g. Smart-seq2), values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). Each observation MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32.
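The last two sentences of that wording could be checked along these lines. This is only a sketch: the function name and messages are hypothetical, and it assumes a dense observations-by-features NumPy array, whereas real matrices are often sparse and would need a different code path.

```python
import numpy as np


def check_raw_counts(x):
    """Return a list of problems found in a dense raw count matrix.

    Rows are observations, columns are features (an assumption of this sketch).
    """
    problems = []
    x = np.asarray(x)
    if x.dtype != np.float32:
        problems.append("values must be stored as numpy.float32")
    nonzero = x[x != 0]
    # A non-zero value is a positive integer if it is > 0 and equals its rounded value.
    if np.any(nonzero < 0) or np.any(nonzero != np.rint(nonzero)):
        problems.append("non-zero values must be positive integers")
    # Flag any row that contains only zeros.
    if np.any(~np.any(x != 0, axis=1)):
        problems.append("each observation must contain at least one non-zero value")
    return problems


good = np.array([[1, 0], [0, 3]], dtype=np.float32)
bad = np.array([[1.5, 0], [0, 0]], dtype=np.float32)
print(check_raw_counts(good))
print(check_raw_counts(bad))
```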
Visium
EFO:0010961 does not have " around it
cell should definitely be avoided here. spot works since we are narrowly discussing Visium; observation is a more general term that also works and would be symmetric with non-Visium transcriptomics
Re-framing this around modality makes sense to me. Thanks!
observation is a more accurate term than cell
Adopted.
scRNA-seq feels redundant as the row is only transcriptomics
Apologies. Managed not to include your original editorial suggestions when I was merging the rows.
EFO:0010961 does not have " around it
Updated.
I would order them in order of prevalence in the corpus - non-Visium transcriptomics, then Visium, then epigenomics
It was intentional to match other table order, but updated.
LGTM
Based on the renewed discovery for 10X multiome, I'm reverting this issue from schema 5.2.0 and re-opening.
Context
@jahilton reported the case in #cell-science-platform.
This issue depends on Add modality.
The requirements may be further refined. For example, it may be best to rewrite the table in X (Matrix Layers) to be much more specific about assays, similar to the new row for Visium Spatial Gene Expression.
@jahilton's recommendations:
I believe expanding to a per assay table is going to be massive & unnecessary. Happy to look at a draft if I’m imagining things wrong. So as an alternative, 3 or 4 rows total…
Design
modality is "transcriptomics" and assay_ontology_term_id is NOT "EFO:0010961" for Visium Spatial Gene Expression
If non-UMI-based assay (e.g. Smart-seq2), values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). Each observation MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32.
AnnData.raw.X unless no "normalized" is provided, then AnnData.X
AnnData.X
modality is "transcriptomics" and assay_ontology_term_id is "EFO:0010961" for Visium Spatial Gene Expression
numpy.float32.
If uns['spatial']['is_single'] is False then each observation MUST contain at least one non-zero value.
If uns['spatial']['is_single'] is True then the unfiltered feature-barcode matrix (raw_feature_bc_matrix) MUST be used. See Space Ranger Feature-Barcode Matrices. This matrix MUST contain 4992 rows. If the obs['in_tissue'] value is 1, then the observation MUST contain at least one non-zero value. If any obs['in_tissue'] values are 0, then at least one observation corresponding to an obs['in_tissue'] with a value of 0 MUST contain a non-zero value.
AnnData.raw.X unless no "normalized" is provided, then AnnData.X
AnnData.X
modality is "epigenomics"
AnnData.X
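The Visium conditions in the design above could be checked roughly as follows. This is a sketch under stated assumptions, not the validator's implementation: the function name, messages, and the expected_rows parameter are hypothetical, x is taken to be a dense array with one row per observation, and in_tissue is taken to be aligned with those rows.

```python
import numpy as np


def check_visium_counts(x, in_tissue, is_single, expected_rows=4992):
    """Return a list of problems for a Visium raw matrix (dense, rows = observations)."""
    problems = []
    x = np.asarray(x)
    in_tissue = np.asarray(in_tissue)
    has_counts = np.any(x != 0, axis=1)  # True where a row has a non-zero value

    if not is_single:
        if not has_counts.all():
            problems.append("each observation must contain at least one non-zero value")
        return problems

    # is_single is True: the unfiltered matrix is expected.
    if x.shape[0] != expected_rows:
        problems.append(f"matrix must contain {expected_rows} rows")
    if not has_counts[in_tissue == 1].all():
        problems.append("an in_tissue == 1 observation has no non-zero value")
    off_tissue = in_tissue == 0
    if off_tissue.any() and not has_counts[off_tissue].any():
        problems.append("no in_tissue == 0 observation has a non-zero value")
    return problems


# Toy 3-row matrix; expected_rows is overridden only to keep the example small.
x = np.array([[1, 0], [0, 0], [0, 2]], dtype=np.float32)
print(check_visium_counts(x, [1, 0, 0], is_single=True, expected_rows=3))
```

The in_tissue == 0 condition is deliberately weaker than the in_tissue == 1 condition: only one off-tissue observation needs a non-zero value, which is what distinguishes the unfiltered matrix from a filtered one padded with empty rows.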