chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

cellxgene-schema CLI must validate raw matrices #614

Closed brianraymor closed 9 months ago

brianraymor commented 11 months ago

With the exception of Accessibility assays, the requirements for raw matrices have been updated to include:

Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers converted to float.
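As a rough illustration of the two requirements above, here is a minimal sketch using a dense NumPy array in which each row is one biological cell. The function names are hypothetical and are not taken from the cellxgene-schema CLI.

```python
import numpy as np


def rows_all_have_nonzero(matrix: np.ndarray) -> bool:
    """Return True if every row (one biological cell) has at least one non-zero value."""
    return bool(np.all(np.count_nonzero(matrix, axis=1) > 0))


def nonzero_values_are_positive_integers(matrix: np.ndarray) -> bool:
    """Return True if every non-zero value is a positive whole number."""
    nonzero = matrix[matrix != 0]
    # NaN fails the "> 0" test, so it is rejected as well.
    return bool(np.all(nonzero > 0) and np.all(nonzero % 1 == 0))


# Hypothetical example data: `raw` satisfies both requirements, `bad` violates both.
raw = np.array([[0, 2, 1], [3, 0, 0]], dtype=np.float32)
bad = np.array([[0, 0, 0], [1.5, -1, 0]], dtype=np.float32)
```

Real raw matrices are often sparse, so an actual validator would likely need a sparse-aware variant of these checks.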

X (Matrix Layers)

The following table describes the matrix data and layers requirements that are assay-specific. If an entry in the table is empty, the schema does not have any other requirements on data in those layers beyond the ones listed above.

Assay "raw" required? "raw" location "normalized" required? "normalized" location
scRNA-seq (UMI, e.g. 10x v3) REQUIRED. Values MUST be de-duplicated molecule counts. Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32. AnnData.raw.X unless no "normalized" is provided, then AnnData.X STRONGLY RECOMMENDED AnnData.X
scRNA-seq (non-UMI, e.g. SS2) REQUIRED. Values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32. AnnData.raw.X unless no "normalized" is provided, then AnnData.X STRONGLY RECOMMENDED AnnData.X
Accessibility (e.g. ATAC-seq, mC-seq) NOT REQUIRED REQUIRED AnnData.X STRONGLY RECOMMENDED

danieljhegeman commented 10 months ago

@brianraymor clarifying some language here

Each cell MUST contain...

"cell" is a biological cell, i.e., each row of data MUST contain... (as opposed to a spreadsheet cell or matrix cell)

danieljhegeman commented 10 months ago

Additionally, you have code-blocked the word float as `float` in the description, which might be intended to specify the Python primitive class `float`...? Was that intentional?

Or is the requirement just that the values be some sort of (presumably relatively inter-operable) float, including but not limited to float, numpy.float32, numpy.float64, etc...? I see that numpy.float32 has been specified for the two assays...
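To make the distinction in this question concrete, the sketch below contrasts a strict numpy.float32 check with a looser "any floating-point dtype" check. The helper names are hypothetical; only the NumPy calls are real.

```python
import numpy as np


def is_float32(matrix: np.ndarray) -> bool:
    """Strict check: the array dtype is exactly numpy.float32."""
    return bool(matrix.dtype == np.float32)


def is_any_float(matrix: np.ndarray) -> bool:
    """Loose check: the array dtype is any NumPy floating-point type."""
    return bool(np.issubdtype(matrix.dtype, np.floating))


counts32 = np.array([[1, 0], [0, 4]], dtype=np.float32)
counts64 = counts32.astype(np.float64)
counts_int = counts32.astype(np.int32)

# An array built from Python floats defaults to numpy.float64, not numpy.float32.
from_python_floats = np.array([1.0, 2.0])
```

Under the strict reading, `counts64` and `from_python_floats` would be rejected even though their values are whole numbers stored as floats.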

danieljhegeman commented 10 months ago

and does STRONGLY RECOMMENDED mean it is expected that we produce a warning if this condition is not met?

brianraymor commented 10 months ago

@bkmartinjr proposed additional requirements for raw in https://github.com/chanzuckerberg/single-cell-curation/issues/612#issue-1872684692. This issue simply adds validation for:

Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32.

No further work is required.

Comparing 3.1:

Assay "raw" required? "raw" location "normalized" required? "normalized" location
scRNA-seq (UMI, e.g. 10x v3) REQUIRED. Values MUST be de-duplicated molecule counts. AnnData.raw.X unless no "normalized" is provided, then AnnData.X STRONGLY RECOMMENDED AnnData.X
scRNA-seq (non-UMI, e.g. SS2) REQUIRED. Values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). AnnData.raw.X unless no "normalized" is provided, then AnnData.X STRONGLY RECOMMENDED AnnData.X
Accessibility (e.g. ATAC-seq, mC-seq) NOT REQUIRED REQUIRED AnnData.X STRONGLY RECOMMENDED

with 4.0:

Assay "raw" required? "raw" location "normalized" required? "normalized" location
scRNA-seq (UMI, e.g. 10x v3) REQUIRED. Values MUST be de-duplicated molecule counts. Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32. AnnData.raw.X unless no "normalized" is provided, then AnnData.X STRONGLY RECOMMENDED AnnData.X
scRNA-seq (non-UMI, e.g. SS2) REQUIRED. Values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as numpy.float32. AnnData.raw.X unless no "normalized" is provided, then AnnData.X STRONGLY RECOMMENDED AnnData.X
Accessibility (e.g. ATAC-seq, mC-seq) NOT REQUIRED REQUIRED AnnData.X STRONGLY RECOMMENDED

CC: @nayib-jose-gloria
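The "raw" location column in the tables above can be read as a simple selection rule. The sketch below assumes an AnnData-like object whose `raw` attribute is None when no separate raw layer exists, and assumes that in that case `X` holds the raw counts. The function name and the stand-in objects are hypothetical.

```python
from types import SimpleNamespace


def get_raw_matrix(adata):
    """Return the matrix expected to hold raw counts.

    Assumption: if `adata.raw` is present, raw counts live in `adata.raw.X`
    and `adata.X` holds the normalized matrix; otherwise `adata.X` is raw.
    """
    if adata.raw is not None:
        return adata.raw.X
    return adata.X


# Stand-ins for AnnData objects, used only to illustrate the selection rule.
with_raw = SimpleNamespace(X="normalized", raw=SimpleNamespace(X="raw counts"))
without_raw = SimpleNamespace(X="raw counts in X", raw=None)
```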

nayib-jose-gloria commented 10 months ago

@jahilton quick question: for this issue, how helpful is it to report all the matrix rows and indices that fail the constraints? Should we prioritize that, or failing fast (i.e. just report that an error exists at the first sign of a constraint failing, then stop checking)?

jahilton commented 10 months ago

Fail fast. The curator should have a good sense of how to inspect each dataset.
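A fail-fast check along these lines might scan the matrix in row chunks and stop at the first violation. This is only a sketch for a dense NumPy array; the function name and the chunk size are assumptions, not the CLI's actual implementation.

```python
import numpy as np


def has_all_zero_row(matrix: np.ndarray, chunk_size: int = 1000) -> bool:
    """Return True as soon as any row with no non-zero values is found."""
    for start in range(0, matrix.shape[0], chunk_size):
        chunk = matrix[start:start + chunk_size]
        if np.any(np.count_nonzero(chunk, axis=1) == 0):
            # Fail fast: report that a violation exists and skip the remaining chunks.
            return True
    return False


# Hypothetical data: the second row of `failing` is all zeros.
passing = np.array([[1, 0], [0, 2], [3, 3]], dtype=np.float32)
failing = np.array([[1, 0], [0, 0], [3, 3]], dtype=np.float32)
```

The return value only says whether a violation exists; it does not list the offending rows, which matches the preference stated above.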

nayib-jose-gloria commented 9 months ago

@jahilton one more question on this: if a dataset has violations for both "Each cell MUST contain at least one non-zero value." and "All non-zero values MUST be positive integers stored as numpy.float32.", should we fail as soon as we encounter either, with an error reporting that we found at least 1 violation, or validate both separately and report both schema errors?

jahilton commented 9 months ago

Would be helpful to identify/report both separately.
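One way to combine the two answers is to fail fast per constraint while still reporting each constraint separately. The sketch below assumes a dense NumPy array; the function name and the error messages are hypothetical and do not reproduce the CLI's real output.

```python
import numpy as np


def validate_raw(matrix: np.ndarray, chunk_size: int = 1000) -> list:
    """Return at most one error message per violated constraint."""
    errors = []
    found_zero_row = False
    found_bad_value = False

    # The dtype is checked once up front as part of the second constraint.
    if matrix.dtype != np.float32:
        found_bad_value = True
        errors.append("Raw values must be stored as numpy.float32.")

    for start in range(0, matrix.shape[0], chunk_size):
        chunk = matrix[start:start + chunk_size]
        if not found_zero_row and np.any(np.count_nonzero(chunk, axis=1) == 0):
            found_zero_row = True
            errors.append("Each cell must have at least one non-zero value.")
        if not found_bad_value:
            nonzero = chunk[chunk != 0]
            if np.any(~(nonzero > 0) | (nonzero % 1 != 0)):
                found_bad_value = True
                errors.append("All non-zero values must be positive integers.")
        if found_zero_row and found_bad_value:
            break  # both constraints already reported; stop scanning
    return errors


# Hypothetical data covering the three cases of interest.
valid = np.array([[1, 0], [0, 2]], dtype=np.float32)
zero_row_only = np.array([[1, 0], [0, 0]], dtype=np.float32)
both_bad = np.array([[0, 0], [1.5, 2]], dtype=np.float32)
```

Each constraint stops being checked once it has failed, so the scan stays cheap while the curator still sees both kinds of error.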

nayib-jose-gloria commented 9 months ago

@jahilton ready for QA!

jahilton commented 9 months ago

LGTM QA ntbk