chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

cellxgene-schema CLI must add validation for obs['genetic_ancestry_*'] #1114

Open brianraymor opened 1 week ago

brianraymor commented 1 week ago

Changelog

Design

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then for each observation, the values of the following fields MUST either all be float("nan") or sum to 1.0:

genetic_ancestry_African

Key: genetic_ancestry_African
Annotator: Curator MUST annotate.
Value: str or float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT "NCBITaxon:9606" for Homo sapiens, then the value MUST be "na".

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0010" for African, expressed as a float greater than or equal to 0.0 and less than or equal to 1.0.
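The per-value rule that each genetic_ancestry_* field shares could be checked along these lines. This is only a minimal sketch, not the cellxgene-schema implementation: the function name `check_ancestry_value` and the error strings are hypothetical, and it assumes values arrive as plain Python objects.

```python
import math

HUMAN = "NCBITaxon:9606"  # Homo sapiens


def check_ancestry_value(organism_term_id, value):
    """Return an error message for one genetic_ancestry_* value, or None if valid.

    Hypothetical helper for illustration; not part of cellxgene-schema.
    """
    if organism_term_id != HUMAN:
        # Non-human observations must carry the literal string "na".
        return None if value == "na" else 'non-human value must be "na"'
    if not isinstance(value, float):
        return "human value must be a float"
    if math.isnan(value):
        # NaN marks an unavailable ancestry value.
        return None
    if 0.0 <= value <= 1.0:
        return None
    return "human value must be within [0.0, 1.0]"
```

For example, `check_ancestry_value("NCBITaxon:9606", 1.5)` returns an error string, while `check_ancestry_value("NCBITaxon:10090", "na")` returns `None`.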


genetic_ancestry_East_Asian

Key: genetic_ancestry_East_Asian
Annotator: Curator MUST annotate.
Value: str or float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT "NCBITaxon:9606" for Homo sapiens, then the value MUST be "na".

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0009" for East Asian, expressed as a float greater than or equal to 0.0 and less than or equal to 1.0.
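Every genetic_ancestry_* field also requires that all observations with the same donor_id contain the same value. One possible way to check this for a single field is sketched below. The helper names are hypothetical, and the sketch assumes that two float("nan") values count as "the same value" even though NaN never compares equal to itself.

```python
import math


def _same(a, b):
    # NaN != NaN in Python, so two NaNs are treated as matching explicitly.
    if isinstance(a, float) and isinstance(b, float):
        if math.isnan(a) and math.isnan(b):
            return True
    return a == b


def inconsistent_donors(donor_ids, values):
    """Return the sorted donor_ids whose observations do not all share one value.

    Hypothetical helper: donor_ids and values are parallel sequences,
    one entry per observation, for a single genetic_ancestry_* field.
    """
    first_seen = {}
    bad = set()
    for donor, value in zip(donor_ids, values):
        if donor not in first_seen:
            first_seen[donor] = value
        elif not _same(first_seen[donor], value):
            bad.add(donor)
    return sorted(bad)
```

A naive `len(set(values)) == 1` test would wrongly flag donors whose values are all NaN when the NaNs are distinct objects, which is why the comparison is spelled out here.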


genetic_ancestry_European

Key: genetic_ancestry_European
Annotator: Curator MUST annotate.
Value: str or float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT "NCBITaxon:9606" for Homo sapiens, then the value MUST be "na".

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0005" for European, expressed as a float greater than or equal to 0.0 and less than or equal to 1.0.


genetic_ancestry_Indigenous_American

Key: genetic_ancestry_Indigenous_American
Annotator: Curator MUST annotate.
Value: str or float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT "NCBITaxon:9606" for Homo sapiens, then the value MUST be "na".

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0013" for Indigenous American, expressed as a float greater than or equal to 0.0 and less than or equal to 1.0.


genetic_ancestry_Oceanian

Key: genetic_ancestry_Oceanian
Annotator: Curator MUST annotate.
Value: str or float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT "NCBITaxon:9606" for Homo sapiens, then the value MUST be "na".

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0017" for Oceanian, expressed as a float greater than or equal to 0.0 and less than or equal to 1.0.


genetic_ancestry_South_Asian

Key: genetic_ancestry_South_Asian
Annotator: Curator MUST annotate.
Value: str or float. All observations with the same donor_id MUST contain the same value.

If organism_ontology_term_id is NOT "NCBITaxon:9606" for Homo sapiens, then the value MUST be "na".

If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, then the value MUST be float("nan") if unavailable; otherwise, the value MUST be the genetic ancestry percentage of "HANCESTRO:0006" for South Asian, expressed as a float greater than or equal to 0.0 and less than or equal to 1.0.
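The cross-field rule for human observations (the six values are either all float("nan") or sum to 1.0) might be checked roughly as follows. This is a sketch under stated assumptions: `check_ancestry_sum` and the `tol` parameter are hypothetical, the row is assumed to be a mapping of field name to float, and an absolute tolerance is assumed because sums of floats such as 0.1 + 0.2 + 0.7 are rarely exactly 1.0. The issue itself does not specify a tolerance.

```python
import math

ANCESTRY_FIELDS = (
    "genetic_ancestry_African",
    "genetic_ancestry_East_Asian",
    "genetic_ancestry_European",
    "genetic_ancestry_Indigenous_American",
    "genetic_ancestry_Oceanian",
    "genetic_ancestry_South_Asian",
)


def check_ancestry_sum(row, tol=1e-9):
    """Return True if a human observation satisfies the all-NaN-or-sum-to-1.0 rule.

    Hypothetical helper; `tol` is an assumed absolute tolerance.
    """
    values = [row[field] for field in ANCESTRY_FIELDS]
    nan_flags = [math.isnan(v) for v in values]
    if all(nan_flags):
        return True
    if any(nan_flags):
        # A partial set of NaNs is neither "all NaN" nor a sum of 1.0,
        # because NaN propagates through addition.
        return False
    return math.isclose(sum(values), 1.0, rel_tol=0.0, abs_tol=tol)
```

Under this reading, an observation with five valid fractions and one NaN fails validation even if the five fractions sum to 1.0.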


joyceyan commented 4 days ago

@brianraymor AnnData doesn't seem to support multiple data types in a single column. What do you think of changing the schema so that when the organism is not Homo sapiens, we require the value to be float('nan') instead of the string "na"?