cellxgene-schema CLI must validate raw matrices

brianraymor commented 11 months ago

With the exception of Accessibility assays, the requirements for raw matrices have been updated to include:

Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers converted to `float`.

`X` (Matrix Layers)

The following table describes the matrix data and layers requirements that are assay-specific. If an entry in the table is empty, the schema does not have any other requirements on data in those layers beyond the ones listed above.

| Assay | "raw" required? | "raw" location | "normalized" required? | "normalized" location |
|---|---|---|---|---|
| scRNA-seq (UMI, e.g. 10x v3) | REQUIRED. Values MUST be de-duplicated molecule counts. Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as `numpy.float32`. | `AnnData.raw.X` unless no "normalized" is provided, then `AnnData.X` | STRONGLY RECOMMENDED | `AnnData.X` |
| scRNA-seq (non-UMI, e.g. SS2) | REQUIRED. Values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as `numpy.float32`. | `AnnData.raw.X` unless no "normalized" is provided, then `AnnData.X` | STRONGLY RECOMMENDED | `AnnData.X` |
| Accessibility (e.g. ATAC-seq, mC-seq) | NOT REQUIRED | | REQUIRED | `AnnData.X` | STRONGLY RECOMMENDED |

danieljhegeman commented 10 months ago

@brianraymor clarifying some language here

Each cell MUST contain...

"cell" is a biological cell, i.e., each row of data MUST contain... (as opposed to a spreadsheet cell or matrix cell)

danieljhegeman commented 10 months ago

Additionally, you have code-blocked the word float as `float` in the description, which might be intended to specify the Python primitive class `float`...? Was that intentional?

Or is the requirement just that the values be some sort of (presumably relatively inter-operable) float, including but not limited to `float`, `numpy.float32`, `numpy.float64`, etc...? I see that `numpy.float32` has been specified for the two assays...
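To make the two readings concrete, here is a minimal sketch contrasting a strict "exactly `numpy.float32`" check with a looser "any floating-point dtype" check. The helper names `is_float32` and `is_any_float` are hypothetical and are not part of cellxgene-schema; the sketch assumes a dense NumPy array.

```python
import numpy as np

def is_float32(matrix: np.ndarray) -> bool:
    """Strict reading: the storage dtype is exactly numpy.float32."""
    return matrix.dtype == np.float32

def is_any_float(matrix: np.ndarray) -> bool:
    """Loose reading: any NumPy floating-point dtype is acceptable."""
    return bool(np.issubdtype(matrix.dtype, np.floating))

counts32 = np.array([1, 0, 4], dtype=np.float32)
counts64 = np.array([1.0, 0.0, 4.0])         # NumPy defaults to float64 here
counts_int = np.array([1, 0, 4], dtype=np.int32)

for name, arr in [("float32", counts32), ("float64", counts64), ("int32", counts_int)]:
    print(name, is_float32(arr), is_any_float(arr))
```

Under the strict reading only the first array passes; under the loose reading the first two do.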

danieljhegeman commented 10 months ago

and does STRONGLY RECOMMENDED mean it is expected that we produce a warning if this condition is not met?

brianraymor commented 10 months ago

@bkmartinjr proposed additional requirements for raw in https://github.com/chanzuckerberg/single-cell-curation/issues/612#issue-1872684692. This issue simply adds validation for:

Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as `numpy.float32`.

No further work is required.
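As a rough illustration of the first check, reading "cell" as a biological cell (a row of the raw matrix, as suggested earlier in the thread), a minimal sketch could look like the following. It assumes a dense NumPy array with cells as rows; the function name `rows_without_counts` is hypothetical and not part of cellxgene-schema.

```python
import numpy as np

def rows_without_counts(matrix: np.ndarray) -> np.ndarray:
    """Return indices of rows (biological cells) whose values are all zero."""
    nonzero_per_row = np.count_nonzero(matrix, axis=1)
    return np.flatnonzero(nonzero_per_row == 0)

raw = np.array(
    [[0, 3, 0],
     [0, 0, 0],   # a cell with no counts at all: violates the requirement
     [1, 0, 2]],
    dtype=np.float32,
)
print(rows_without_counts(raw).tolist())  # [1]
```

A sparse matrix would need a different counting call; that case is left out of this sketch.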

Comparing 3.1:

| Assay | "raw" required? | "raw" location | "normalized" required? | "normalized" location |
|---|---|---|---|---|
| scRNA-seq (UMI, e.g. 10x v3) | REQUIRED. Values MUST be de-duplicated molecule counts. | `AnnData.raw.X` unless no "normalized" is provided, then `AnnData.X` | STRONGLY RECOMMENDED | `AnnData.X` |
| scRNA-seq (non-UMI, e.g. SS2) | REQUIRED. Values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). | `AnnData.raw.X` unless no "normalized" is provided, then `AnnData.X` | STRONGLY RECOMMENDED | `AnnData.X` |
| Accessibility (e.g. ATAC-seq, mC-seq) | NOT REQUIRED | | REQUIRED | `AnnData.X` | STRONGLY RECOMMENDED |

with 4.0:

| Assay | "raw" required? | "raw" location | "normalized" required? | "normalized" location |
|---|---|---|---|---|
| scRNA-seq (UMI, e.g. 10x v3) | REQUIRED. Values MUST be de-duplicated molecule counts. Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as `numpy.float32`. | `AnnData.raw.X` unless no "normalized" is provided, then `AnnData.X` | STRONGLY RECOMMENDED | `AnnData.X` |
| scRNA-seq (non-UMI, e.g. SS2) | REQUIRED. Values MUST be one of read counts (e.g. FeatureCounts) or estimated fragments (e.g. output of RSEM). Each cell MUST contain at least one non-zero value. All non-zero values MUST be positive integers stored as `numpy.float32`. | `AnnData.raw.X` unless no "normalized" is provided, then `AnnData.X` | STRONGLY RECOMMENDED | `AnnData.X` |
| Accessibility (e.g. ATAC-seq, mC-seq) | NOT REQUIRED | | REQUIRED | `AnnData.X` | STRONGLY RECOMMENDED |
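The "raw" location column implies a small lookup before any validation can run. The sketch below shows one way that lookup might work; `locate_raw_matrix` is a hypothetical helper, and it assumes that a populated `raw` attribute means a "normalized" matrix was provided in `X`. Plain `SimpleNamespace` objects stand in for AnnData so the example runs without extra dependencies.

```python
from types import SimpleNamespace

def locate_raw_matrix(adata):
    """Return (location, matrix) for the raw counts under the table's rule."""
    # Assumption: `adata.raw` is None when only one matrix was supplied.
    if adata.raw is not None:
        return "raw.X", adata.raw.X
    return "X", adata.X

# Stand-ins for AnnData objects; only `.X` and `.raw` are modelled here.
with_normalized = SimpleNamespace(X="normalized values", raw=SimpleNamespace(X="raw counts"))
raw_only = SimpleNamespace(X="raw counts", raw=None)

print(locate_raw_matrix(with_normalized)[0])
print(locate_raw_matrix(raw_only)[0])
```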

CC: @nayib-jose-gloria

nayib-jose-gloria commented 10 months ago

@jahilton quick question: for this issue, how helpful is it to report all the matrix rows and indices that fail the constraints? Should we prioritize that, or failing fast (i.e. just report that an error exists at the first sign of a constraint failing, then stop checking)?

jahilton commented 10 months ago

Fail fast. The curator should have a good sense of how to inspect each dataset.

nayib-jose-gloria commented 9 months ago

@jahilton one more question on this: if a dataset has violations of both "Each cell MUST contain at least one non-zero value." and "All non-zero values MUST be positive integers stored as `numpy.float32`.",

should we fail as soon as we encounter either, with an error reporting that we found at least one violation, or validate both separately and report both schema errors?

jahilton commented 9 months ago

Would be helpful to identify/report both separately.
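Putting the two answers together (fail fast, but report each requirement separately), one possible shape for the check is sketched below. This is not the cellxgene-schema implementation: the function name `validate_raw_matrix`, the `chunk_rows` parameter, and the error messages are all hypothetical, and the sketch assumes a dense NumPy array with cells as rows.

```python
import numpy as np

def validate_raw_matrix(matrix: np.ndarray, chunk_rows: int = 1000) -> list:
    """Return at most one error per requirement, stopping each check at its first hit."""
    found_empty_row = False
    # A dtype mismatch already violates the second requirement, so no value scan is needed.
    found_bad_value = matrix.dtype != np.float32

    for start in range(0, matrix.shape[0], chunk_rows):
        chunk = matrix[start:start + chunk_rows]
        if not found_empty_row:
            found_empty_row = bool((np.count_nonzero(chunk, axis=1) == 0).any())
        if not found_bad_value:
            nonzero = chunk[chunk != 0]
            # Non-zero values must be positive and have no fractional part.
            found_bad_value = bool(((nonzero < 0) | (nonzero % 1 != 0)).any())
        if found_empty_row and found_bad_value:
            break  # both violations seen; nothing more to learn

    errors = []
    if found_empty_row:
        errors.append("Each cell must have at least one non-zero value in its row of the raw matrix.")
    if found_bad_value:
        errors.append("All non-zero values in the raw matrix must be positive integers of type numpy.float32.")
    return errors

bad = np.array([[0, 0], [1.5, 2], [3, -1]], dtype=np.float32)
good = np.array([[1, 0], [0, 2]], dtype=np.float32)
print(validate_raw_matrix(bad))
print(validate_raw_matrix(good))
```

Scanning in row chunks keeps memory bounded on large matrices while still letting each check stop at its first violation.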

nayib-jose-gloria commented 9 months ago

@jahilton ready for QA!

jahilton commented 9 months ago

LGTM QA ntbk
