Requirements from Feb 13 2024 call with @chouellaine and @jahilton documented by @brianraymor:
For the backfill, modeling as "unknown", "na", or a JSON object encoded as a string, à la:
import json
# Fractions need a leading zero: ".75" is not valid JSON and json.loads rejects it.
sample = '{"HANCESTRO:1": 0.75, "HANCESTRO:2": 0.25}'
# dict: {'HANCESTRO:1': 0.75, 'HANCESTRO:2': 0.25}
ancestry = json.loads(sample)
# Do the fractions add up to 1?
sum(ancestry.values())
Labels. The set of continental keys and their mappings to HANCESTRO terms must be finalized. Assigned to @99norbs and @chouellaine. The current set is:
Values. All ancestry observations for a donor_id MUST be the same value. When ancestry values are calculated, the values are floats that sum to 1.00. For assays not included in the set above, the ancestry value must be "unknown". When ancestry values are unavailable, then "unknown" must be set.
Reference. There needs to be documentation for how values are calculated. Assigned to @chouellaine.
Until the issue "Add parent classes for ethnicity and ancestry terms" is either mitigated or addressed, manual review of HANCESTRO terms is required to determine which are appropriate for ancestry rather than ethnicity.
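The Values requirement above could be checked along the following lines. This is only a sketch: the function name `parse_ancestry`, the summation tolerance of 1e-6, and the check that each fraction lies between 0 and 1 are assumptions for illustration and are not part of the requirements.

```python
import json
import math

# The two sentinel strings named in the requirements.
SENTINELS = {"unknown", "na"}


def parse_ancestry(value: str):
    """Return a sentinel unchanged, or a dict mapping term -> fraction.

    Hypothetical helper; raises ValueError when the value is malformed.
    """
    if value in SENTINELS:
        return value
    # json.JSONDecodeError is a subclass of ValueError.
    fractions = json.loads(value)
    if not isinstance(fractions, dict) or not fractions:
        raise ValueError("expected a non-empty JSON object")
    for term, fraction in fractions.items():
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(fraction, bool) or not isinstance(fraction, (int, float)):
            raise ValueError(f"{term}: fraction must be a number")
        # Assumption: each fraction lies in [0, 1].
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"{term}: fraction {fraction} is outside [0, 1]")
    total = sum(fractions.values())
    # Assumption: a small absolute tolerance absorbs float rounding.
    if not math.isclose(total, 1.0, abs_tol=1e-6):
        raise ValueError(f"fractions sum to {total}, expected 1.00")
    return fractions
```

A string such as '{"HANCESTRO:1": 0.75, "HANCESTRO:2": 0.25}' parses to a dict, while a value whose fractions do not sum to 1.00 raises ValueError.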
genetic_ancestry_ontology_term_id
Key: genetic_ancestry_ontology_term_id
Annotator: Curator MUST annotate.
Value: categorical with str categories. If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, the value MUST be ... or "unknown" if unavailable. The following terms MUST NOT be used:
note: will use this template when terms are identified
Otherwise, for all other organisms the str value MUST be "na".
genetic_ancestry
Key: genetic_ancestry
Annotator: CELLxGENE Discover MUST annotate.
Value: categorical with str categories. This MUST be "na" if the value of genetic_ancestry_ontology_term_id is "na". This MUST be "unknown" if the value of genetic_ancestry_ontology_term_id is "unknown". Otherwise, this MUST be ....
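The two pass-through cases for genetic_ancestry can be sketched as below. The rule for every other value is still elided in the text above, so this sketch does not guess at it: it delegates to a caller-supplied `resolve` function. Both `derive_genetic_ancestry` and `resolve` are hypothetical names introduced for illustration.

```python
from typing import Callable

# Values copied straight through from genetic_ancestry_ontology_term_id.
PASSTHROUGH = ("na", "unknown")


def derive_genetic_ancestry(term_id: str, resolve: Callable[[str], str]) -> str:
    """Derive genetic_ancestry from genetic_ancestry_ontology_term_id.

    `resolve` is a hypothetical callback that stands in for the rule
    not yet specified for values other than "na" and "unknown".
    """
    if term_id in PASSTHROUGH:
        return term_id
    return resolve(term_id)
```

Keeping the unspecified branch behind a callback means the two settled rules can be tested now, independently of whatever the final rule turns out to be.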
Context
Genetic ancestry metadata are being generated and are starting to be included with data submitted to CELLxGENE. A metadata field is needed to store these data and to let users use, filter, and visualize genetic ancestry data across all CELLxGENE tools (Discover, Explorer, Gene Expression, and Census).
Design (@brianraymor)
Requirements from Feb 13 2024 call with @chouellaine and @jahilton documented by @brianraymor:
For the backfill, modeling as "unknown", "na", or a JSON object encoded as a string, à la:
Labels. The set of continental keys and their mappings to HANCESTRO terms must be finalized. Assigned to @99norbs and @chouellaine. The current set is:
but also see the May 8 2024 meeting notes.
Assays. The current set of assays for the backfill from the May 8 2024 meeting notes:
Values. All ancestry observations for a donor_id MUST be the same value. When ancestry values are calculated, the values are floats that sum to 1.00. For assays not included in the set above, the ancestry value must be "unknown". When ancestry values are unavailable, then "unknown" must be set.
Reference. There needs to be documentation for how values are calculated. Assigned to @chouellaine.
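The rule that every observation for a donor_id carries the same ancestry value could be checked roughly as follows. This is a sketch only: `inconsistent_donors` is a hypothetical helper, and it takes plain (donor_id, value) pairs rather than any particular AnnData layout.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple


def inconsistent_donors(
    observations: Iterable[Tuple[str, str]]
) -> Dict[str, Set[str]]:
    """Return donors whose observations carry more than one ancestry value.

    `observations` yields (donor_id, ancestry_value) pairs, where the
    value is the raw string ("unknown", "na", or the encoded JSON object).
    """
    values_by_donor: Dict[str, Set[str]] = defaultdict(set)
    for donor_id, value in observations:
        values_by_donor[donor_id].add(value)
    # Keep only donors with conflicting values.
    return {d: v for d, v in values_by_donor.items() if len(v) > 1}
```

Because the comparison is on raw strings, two JSON encodings of the same fractions with different key order or number formatting would be reported as a conflict; comparing parsed objects instead is one way around that.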
Until the issue "Add parent classes for ethnicity and ancestry terms" is either mitigated or addressed, manual review of HANCESTRO terms is required to determine which are appropriate for ancestry rather than ethnicity.
genetic_ancestry_ontology_term_id
categorical with str categories. If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, the value MUST be ... or "unknown" if unavailable. The following terms MUST NOT be used:
"HANCESTRO:0002" for regions and its children
Otherwise, for all other organisms the str value MUST be "na".
genetic_ancestry
categorical with str categories. This MUST be "na" if the value of genetic_ancestry_ontology_term_id is "na". This MUST be "unknown" if the value of genetic_ancestry_ontology_term_id is "unknown". Otherwise, this MUST be ....
Context
Genetic ancestry metadata are being generated and are starting to be included with data submitted to CELLxGENE. A metadata field is needed to store these data and to let users use, filter, and visualize genetic ancestry data across all CELLxGENE tools (Discover, Explorer, Gene Expression, and Census).
See and mirror parts of issue "Update self_reported_ethnicity" #460
Related conversations (added by @brianraymor)
AnnData Modeling: @bkmartinjr's feedback from #single-cell-ancestry-inference