chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

cellxgene-schema must deprecate X_normalization #250

Closed brianraymor closed 2 years ago

brianraymor commented 2 years ago

See Deprecate X_normalization for background.

Note: Pending requirements related to the schema policy for PII and deprecated fields must be clarified and enforced.

ebezzi commented 2 years ago

Note: we'll just remove X_normalization from the code as part of this ticket. Enforcing the non-existence of this field will be done as part of https://app.zenhub.com/workspaces/single-cell-5e2a191dad828d52cc78b028/issues/chanzuckerberg/single-cell-curation/264

jychien commented 2 years ago

Associated GitHub ticket (the link above points to the ethnicity ticket): https://github.com/chanzuckerberg/single-cell-curation/issues/250

The overall expected outcome is that the validator will fail if 'X_normalization' is present in uns. The requirements for the presence of raw or normalized matrices remain unchanged from schema 2.0.0 and are as listed in the 'X (layers)' table in the schema 3.0.0 documentation.
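For illustration, the expected outcome described above could be sketched as a check on the keys of uns. This is a minimal sketch and not the cellxgene-schema implementation: the function name `find_deprecated_uns_keys`, the constant `DEPRECATED_UNS_KEYS`, and the wording of the error message are all assumptions, and a plain dict stands in for `adata.uns`.

```python
# Hypothetical sketch; names and message wording are assumptions,
# not taken from cellxgene-schema.
DEPRECATED_UNS_KEYS = {"X_normalization"}


def find_deprecated_uns_keys(uns):
    """Return one error message per deprecated key found in `uns`."""
    return [
        f"ERROR: The field '{key}' is present in 'uns', but it is deprecated."
        for key in sorted(DEPRECATED_UNS_KEYS)
        if key in uns
    ]


# A plain dict stands in for adata.uns here.
errors_present = find_deprecated_uns_keys({"X_normalization": "CPM", "title": "demo"})
errors_absent = find_deprecated_uns_keys({"title": "demo"})

print(errors_present)  # one error: validation should fail
print(errors_absent)   # empty list: this check passes
```

A dataset would fail validation whenever the returned list is non-empty, regardless of which matrices it contains.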

| Test Case | Expected Result | Obtained Result | Test File | Needs attention |
| --- | --- | --- | --- | --- |
| 'X_normalization' is present in uns for scRNA-seq dataset with normalized & raw matrices | Fail validation | Passed validation | adata_norm_raw.h5ad | X |
| 'X_normalization' is absent in uns for scRNA-seq dataset with normalized & raw matrices | Pass validation | Passed validation | adata_norm_raw_noX.h5ad | |
| 'X_normalization' is present in uns for Slide-seq dataset with only raw matrix | Fail validation | Failed validation, but the file failed for the incorrect reason. Current logging: ERROR: Only raw data was found, i.e. there is no 'raw.X'. | adata_only_raw.h5ad | X |
| 'X_normalization' is absent in uns for Slide-seq dataset with only raw matrix | Pass validation | Failed validation. Current logging: ERROR: Only raw data was found, i.e. there is no 'raw.X'. Previous logging: WARNING: Only raw data was found, i.e. there is no 'raw.X' and 'uns['X_normalization']' is 'none'. It is STRONGLY RECOMMENDED that 'final' (normalized) data is provided. | adata_only_raw_noX.h5ad | X |
| 'X_normalization' is present in uns for ATAC dataset with only normalized matrix | Fail validation | Passed validation | adata_atac_norm_only.h5ad | X |
| 'X_normalization' is absent in uns for ATAC dataset with only normalized matrix | Pass validation | Passed validation | adata_atac_norm_only_noX.h5ad | |

h5ad files can be found in: https://drive.google.com/drive/folders/1hNK5E2f9BBFky16nga486Mh0mHhTkgCR?usp=sharing

ebezzi commented 2 years ago

@jychien as stated in my previous comment, enforcing the absence of X_normalization will be done as part of another ticket, so I believe that will cover cases 1, 3, and 5. Case 3 is interesting because it shows that checking for deprecated fields should happen at an early stage of validation, so that the reported errors are more precise. I will make sure the ticket is ready before you re-validate.

I'm gonna look at case 4 and submit another PR. Thanks!
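To illustrate the point about case 3, here is a rough sketch of how the order of checks changes which error is reported. Everything in it is hypothetical: the `validate` runner, the two check functions, the dict-based dataset, and the deprecated-field message are assumptions for illustration only. The matrix message is copied from the logging quoted above.

```python
from typing import Callable, Dict, List

# Hypothetical sketch; not the cellxgene-schema validator.
Check = Callable[[Dict], List[str]]


def check_deprecated_fields(dataset: Dict) -> List[str]:
    """Fail if the deprecated 'X_normalization' key is present in uns."""
    if "X_normalization" in dataset.get("uns", {}):
        return ["ERROR: 'X_normalization' in 'uns' is deprecated."]
    return []


def check_matrices(dataset: Dict) -> List[str]:
    """Stand-in for the matrix checks; message copied from the thread."""
    if not dataset.get("has_raw_x", False):
        return ["ERROR: Only raw data was found, i.e. there is no 'raw.X'."]
    return []


def validate(dataset: Dict, checks: List[Check]) -> List[str]:
    """Run checks in order and stop at the first one that reports errors."""
    for check in checks:
        errors = check(dataset)
        if errors:
            return errors
    return []


# Case 3: deprecated key present, only a raw matrix.
case_3 = {"uns": {"X_normalization": "none"}, "has_raw_x": False}

early = validate(case_3, [check_deprecated_fields, check_matrices])
late = validate(case_3, [check_matrices, check_deprecated_fields])

print(early)  # reports the deprecated field
print(late)   # reports the matrix problem instead
```

With the deprecated-field check first, case 3 fails for the expected reason. With the checks in the other order, it fails on the matrix check and the deprecated field goes unreported.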

jychien commented 2 years ago

Thanks for the clarification, @ebezzi. Here are the test cases that the validator warns and errors on. It looks good to me! For any additional edge cases that may arise, curators need to be on the lookout and troubleshoot.

| Test Case | Expected Result | Obtained Result | Test File | Needs attention |
| --- | --- | --- | --- | --- |
| scRNA-seq dataset with normalized & raw matrices | Pass validation | Passed validation | adata_norm_raw_updated.h5ad | |
| Slide-seq dataset with only raw matrix | Pass validation | Passed validation. Logging: WARNING: Only raw data was found, i.e. there is no 'raw.X'. It is STRONGLY RECOMMENDED that 'final' (normalized) data is provided. | adata_only_raw_noX_updated.h5ad | |
| scRNA-seq dataset with only normalized matrix | Fail validation | Failed validation. Logging: ERROR: Raw data is missing: there is only a normalized matrix in X and no raw.X | adata_only_norm_updated.h5ad | |
| scRNA-seq dataset with raw in .X and a normalized matrix in raw.X | Fail validation | Failed validation. Logging: ERROR: Raw data may be missing: data in 'raw.X' contains non-integer values. | adata_matrix_flipped_updated.h5ad | |
| ATAC dataset with only normalized matrix | Pass validation | Passed validation | adata_atac_norm_only_updated.h5ad | |
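The non-integer check behind the "matrix flipped" result above could look roughly like the sketch below. It is an assumption-laden illustration, not the validator's code: the function names are hypothetical, and nested Python lists stand in for the matrices that a real h5ad file would hold. Only the error message is taken from the logging quoted above.

```python
# Hypothetical sketch; nested lists stand in for the real matrices.
def has_non_integer_values(matrix):
    """Return True if any entry of the matrix is not a whole number."""
    return any(not float(value).is_integer() for row in matrix for value in row)


def check_raw_matrix(raw_x):
    """Report an error when the matrix stored as raw.X does not look like counts."""
    if has_non_integer_values(raw_x):
        return [
            "ERROR: Raw data may be missing: data in 'raw.X' contains non-integer values."
        ]
    return []


counts = [[0, 3, 1], [2, 0, 5]]                     # looks like raw counts
normalized = [[0.0, 1.25, 0.69], [1.1, 0.0, 1.79]]  # looks like normalized values

print(check_raw_matrix(counts))      # no errors
print(check_raw_matrix(normalized))  # one error, as in the flipped-matrix case
```

Note that whole-number floats such as 3.0 pass this check, so it can flag a matrix that is clearly not counts but cannot prove that a matrix is raw.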