Closed brianraymor closed 2 years ago
Note: we'll just remove X_normalization from the code as part of this ticket. Enforcing the non-existence of this field will be done as part of https://app.zenhub.com/workspaces/single-cell-5e2a191dad828d52cc78b028/issues/chanzuckerberg/single-cell-curation/264
Associated GitHub ticket (the link above points to the ethnicity ticket): https://github.com/chanzuckerberg/single-cell-curation/issues/250 The overall expected outcome is that the validator will fail if 'X_normalization' is present in uns. The requirements for the presence of raw or normalized matrices remain unchanged from schema 2.0.0, and are as listed in the 'X (layers)' table in the schema 3.0.0 documentation.
Test Case | Expected Result | Obtained Result | Test File | Needs attention |
---|---|---|---|---|
'X_normalization' is present in uns for scRNA-seq dataset with normalized & raw matrices | Fail validation | Passed validation | adata_norm_raw.h5ad | X |
'X_normalization' is absent in uns for scRNA-seq dataset with normalized & raw matrices | Pass validation | Passed validation | adata_norm_raw_noX.h5ad | |
'X_normalization' is present in uns for Slide-seq dataset with only raw matrix | Fail validation | Failed validation, but the file failed for the incorrect reason. Current logging: ERROR: Only raw data was found, i.e. there is no 'raw.X'. | adata_only_raw.h5ad | X |
'X_normalization' is absent in uns for Slide-seq dataset with only raw matrix | Pass validation | Failed validation. Current logging: ERROR: Only raw data was found, i.e. there is no 'raw.X'. Previous logging: WARNING: Only raw data was found, i.e. there is no 'raw.X' and 'uns['X_normalization']' is 'none'. It is STRONGLY RECOMMENDED that 'final' (normalized) data is provided. | adata_only_raw_noX.h5ad | X |
'X_normalization' is present in uns for ATAC dataset with only normalized matrix | Fail validation | Passed validation | adata_atac_norm_only.h5ad | X |
'X_normalization' is absent in uns for ATAC dataset with only normalized matrix | Pass validation | Passed validation | adata_atac_norm_only_noX.h5ad | |
h5ad files can be found in: https://drive.google.com/drive/folders/1hNK5E2f9BBFky16nga486Mh0mHhTkgCR?usp=sharing
@jychien as stated in my previous comment, enforcing the absence of X_normalization will be done as part of another ticket, so I believe that will cover cases 1, 3, and 5. Case 3 is interesting, as it shows that checking for deprecated fields should be done at an early stage of the validation to make errors more precise. I will make sure the ticket is ready before you re-validate.
I'm gonna look at case 4 and submit another PR. Thanks!
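The point about checking deprecated fields early could be sketched roughly as below. This is only an illustration, not the validator's actual code: `DEPRECATED_UNS_FIELDS`, `check_deprecated_uns_fields`, `validate`, the plain-dict stand-in for `uns`, and the wording of the deprecated-field message are all assumptions made for the example.

```python
# Hypothetical sketch; names and message wording are illustrative only.
DEPRECATED_UNS_FIELDS = ("X_normalization",)  # assumed list of deprecated keys


def check_deprecated_uns_fields(uns):
    """Return one error message per deprecated key found in uns."""
    return [
        f"ERROR: The field '{key}' is present in 'uns', but it is deprecated."
        for key in DEPRECATED_UNS_FIELDS
        if key in uns
    ]


def validate(uns, later_checks=()):
    """Run the deprecated-field check first, then the remaining checks.

    Returning early means a file that carries a deprecated field is
    reported for that reason, not for an unrelated downstream failure.
    """
    errors = check_deprecated_uns_fields(uns)
    if errors:
        return errors
    for check in later_checks:
        errors.extend(check(uns))
    return errors
```

With this ordering, a file like the one in case 3 would be reported for the deprecated field before any matrix check runs, which is the more precise error described above.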
Thanks for the clarification, @ebezzi. Here are the test cases that the validator warns and errors on. It looks good to me! For any additional edge cases that may arise, curators need to be on the lookout and troubleshoot.
Test Case | Expected Result | Obtained Result | Test File | Needs attention |
---|---|---|---|---|
scRNA-seq dataset with normalized & raw matrices | Pass validation | Passed validation | adata_norm_raw_updated.h5ad | |
Slide-seq dataset with only raw matrix | Pass validation | Passed validation. Logging: WARNING: Only raw data was found, i.e. there is no 'raw.X'. It is STRONGLY RECOMMENDED that 'final' (normalized) data is provided. | adata_only_raw_noX_updated.h5ad | |
scRNA-seq dataset with only normalized matrix | Fail validation | Failed validation. Logging: ERROR: Raw data is missing: there is only a normalized matrix in X and no raw.X | adata_only_norm_updated.h5ad | |
scRNA-seq dataset with raw in .X and a normalized matrix in raw.X | Fail validation | Failed validation. Logging: ERROR: Raw data may be missing: data in 'raw.X' contains non-integer values. | adata_matrix_flipped_updated.h5ad | |
ATAC dataset with only normalized matrix | Pass validation | Passed validation | adata_atac_norm_only_updated.h5ad | |
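The "non-integer values" error in the flipped-matrix case suggests a check along the following lines. This is a minimal sketch under stated assumptions: the matrix is represented as rows of plain Python numbers rather than an AnnData matrix, and `contains_non_integer_values` is a hypothetical helper name; the validator's real implementation may differ.

```python
def contains_non_integer_values(rows):
    """Return True if any value in the matrix has a fractional part.

    Raw count data is expected to hold whole numbers, so a value such
    as 0.5 hints that a normalized matrix was stored in its place.
    """
    return any(
        not float(value).is_integer()
        for row in rows
        for value in row
    )
```

For example, `contains_non_integer_values([[1, 0], [2.0, 3]])` is false, while a matrix containing `0.5` would be flagged.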
See Deprecate X_normalization for background.
Note: Pending requirements related to the schema policy for PII and deprecated fields must be clarified and enforced.