Closed: brianraymor closed this issue 9 months ago
@brianraymor clarifying some language here:
Each cell MUST contain...
Here "cell" is a biological cell, i.e., each row of data MUST contain... (as opposed to a spreadsheet cell or matrix cell).
Additionally, you have code-blocked the word float as float in the description, which might be intended to specify the Python primitive class float. Was that intentional?
Or is the requirement just that the values be some sort of (presumably relatively interoperable) float, including but not limited to float, numpy.float32, numpy.float64, etc.? I see that numpy.float32 has been specified for the two assays...
And does STRONGLY RECOMMENDED mean it is expected that we produce a warning if this condition is not met?
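To make the dtype question above concrete, here is a minimal sketch of how the Python float type and numpy.float32 differ once values land in a NumPy array. The helper name is_float32_matrix is hypothetical and is not taken from the validator.

```python
import numpy as np

def is_float32_matrix(matrix) -> bool:
    """Return True only when the matrix dtype is exactly numpy.float32."""
    return np.asarray(matrix).dtype == np.float32

# Asking NumPy for the Python primitive `float` yields float64, not float32.
counts_py = np.array([[1.0, 0.0], [0.0, 3.0]], dtype=float)
counts_32 = counts_py.astype(np.float32)

print(counts_py.dtype)               # float64
print(is_float32_matrix(counts_py))  # False
print(is_float32_matrix(counts_32))  # True
```

So a check written against numpy.float32 would reject a matrix built with the Python float type, which is why the wording of the requirement matters.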
@bkmartinjr proposed additional requirements for raw in https://github.com/chanzuckerberg/single-cell-curation/issues/612#issue-1872684692. This issue simply adds validation for:
Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32.
No further work is required.
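As a rough illustration of the two rules this issue validates, the sketch below checks a dense NumPy array. It assumes one row per cell and a matrix that fits in memory. The function names are hypothetical, and a sparse matrix would need different handling.

```python
import numpy as np

def has_all_zero_rows(matrix: np.ndarray) -> bool:
    """True if any row (cell) contains no non-zero value."""
    return bool((np.count_nonzero(matrix, axis=1) == 0).any())

def has_invalid_nonzero_values(matrix: np.ndarray) -> bool:
    """True if the dtype is not float32 or any non-zero value is not a positive integer."""
    if matrix.dtype != np.float32:
        return True
    nonzero = matrix[matrix != 0]
    # A value is invalid if it is negative or has a fractional part.
    return bool(((nonzero < 0) | (nonzero % 1 != 0)).any())

raw = np.array([[3, 0, 1], [0, 0, 0], [2, 0.5, 0]], dtype=np.float32)
print(has_all_zero_rows(raw))           # True: the second row is all zeros
print(has_invalid_nonzero_values(raw))  # True: 0.5 is not an integer
```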
Comparing 3.1:
Assay | "raw" required? | "raw" location | "normalized" required? | "normalized" location | |
---|---|---|---|---|---|
scRNA-seq (UMI, e.g. 10x v3) | REQUIRED. Values MUST be de-duplicated molecule counts. | AnnData.raw.X unless no "normalized" is provided, then AnnData.X | STRONGLY RECOMMENDED | AnnData.X | |
scRNA-seq (non-UMI, e.g. SS2) | REQUIRED. Values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). | AnnData.raw.X unless no "normalized" is provided, then AnnData.X | STRONGLY RECOMMENDED | AnnData.X | |
Accessibility (e.g. ATAC-seq, mC-seq) | NOT REQUIRED | | REQUIRED | AnnData.X | STRONGLY RECOMMENDED |
with 4.0:
Assay | "raw" required? | "raw" location | "normalized" required? | "normalized" location | |
---|---|---|---|---|---|
scRNA-seq (UMI, e.g. 10x v3) | REQUIRED. Values MUST be de-duplicated molecule counts. Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32. | AnnData.raw.X unless no "normalized" is provided, then AnnData.X | STRONGLY RECOMMENDED | AnnData.X | |
scRNA-seq (non-UMI, e.g. SS2) | REQUIRED. Values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32. | AnnData.raw.X unless no "normalized" is provided, then AnnData.X | STRONGLY RECOMMENDED | AnnData.X | |
Accessibility (e.g. ATAC-seq, mC-seq) | NOT REQUIRED | | REQUIRED | AnnData.X | STRONGLY RECOMMENDED |
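The "raw" location rule in the tables above (AnnData.raw.X unless no "normalized" is provided, then AnnData.X) could be expressed roughly as follows. This is only a sketch: get_raw_matrix is a hypothetical helper, and it relies on the assumption that an AnnData-like object exposes raw as None when no separate raw layer was stored. Stand-in objects are used so the example runs without anndata installed.

```python
from types import SimpleNamespace

def get_raw_matrix(adata):
    """Return adata.raw.X when a raw layer exists, otherwise fall back to adata.X."""
    raw = getattr(adata, "raw", None)
    return raw.X if raw is not None else adata.X

# Stand-ins for AnnData objects (assumption: only the .X and .raw attributes matter here).
with_normalized = SimpleNamespace(X="normalized matrix", raw=SimpleNamespace(X="raw matrix"))
raw_only = SimpleNamespace(X="raw matrix", raw=None)

print(get_raw_matrix(with_normalized))  # raw matrix
print(get_raw_matrix(raw_only))         # raw matrix
```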
CC: @nayib-jose-gloria
@jahilton quick question--for this issue, how helpful is it to report all the matrix rows and indices that fail the constraints? Should we prioritize that, or fail fast (i.e. just report that an error exists at the first sign of a constraint failing, then stop checking)?
Fail fast. The curator should have a good idea of how to inspect each dataset.
@jahilton one more question on this--if a dataset has violations for both
each cell MUST contain at least one non-zero value.
and
All non-zero values MUST be positive integers stored as numpy.float32.
should we fail as soon as we encounter either, with an error reporting that we found at least 1 violation, or validate both separately and report both schema errors?
Would be helpful to identify/report both separately.
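One way to combine the two answers (fail fast, but report each rule separately) is to scan the matrix in row chunks, stop looking for a rule once its first violation is seen, and stop scanning entirely once both have been seen. The sketch below assumes a dense NumPy array. The function name, chunk size, and error wording are illustrative and not the validator's actual implementation.

```python
import numpy as np

def validate_raw_rows(matrix: np.ndarray, chunk_size: int = 1000) -> list:
    """Return at most one error message per rule, scanning rows in chunks."""
    found_empty_row = False
    found_bad_value = False
    for start in range(0, matrix.shape[0], chunk_size):
        chunk = matrix[start:start + chunk_size]
        if not found_empty_row and (np.count_nonzero(chunk, axis=1) == 0).any():
            found_empty_row = True
        if not found_bad_value:
            nonzero = chunk[chunk != 0]
            if ((nonzero < 0) | (nonzero % 1 != 0)).any():
                found_bad_value = True
        if found_empty_row and found_bad_value:
            break  # both rules already violated, no need to keep scanning
    errors = []
    if found_empty_row:
        errors.append("Each cell MUST contain at least one non-zero value.")
    if found_bad_value:
        errors.append("All non-zero values MUST be positive integers.")
    return errors

matrix = np.array([[0, 0], [1.5, 0], [2, 1]], dtype=np.float32)
for message in validate_raw_rows(matrix, chunk_size=2):
    print(message)
```

The curator sees which rules failed without the validator enumerating every offending row.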
@jahilton ready for QA!
With the exception of Accessibility assays, the requirements for raw matrices have been updated to include:
Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers converted to float.
X (Matrix Layers)
The following table describes the matrix data and layers requirements that are assay-specific. If an entry in the table is empty, the schema does not have any other requirements on data in those layers beyond the ones listed above.
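Assay | "raw" required? | "raw" location | "normalized" required? | "normalized" location | |
---|---|---|---|---|---|
scRNA-seq (UMI, e.g. 10x v3) | REQUIRED. Values MUST be de-duplicated molecule counts. Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32. | AnnData.raw.X unless no "normalized" is provided, then AnnData.X | STRONGLY RECOMMENDED | AnnData.X | |
scRNA-seq (non-UMI, e.g. SS2) | REQUIRED. Values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32. | AnnData.raw.X unless no "normalized" is provided, then AnnData.X | STRONGLY RECOMMENDED | AnnData.X | |
Accessibility (e.g. ATAC-seq, mC-seq) | NOT REQUIRED | | REQUIRED | AnnData.X | STRONGLY RECOMMENDED |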