Design (@brianraymor)

Requirements from the Feb 13 2024 call with @chouellaine and @jahilton, documented by @brianraymor:

For the backfill, model the value as "unknown", "na", or a JSON object encoded as a string, for example:

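import json
import math

# JSON requires a leading zero on fractional numbers (0.75, not .75).
sample = '{"HANCESTRO:1": 0.75, "HANCESTRO:2": 0.25}'

# dict: {'HANCESTRO:1': 0.75, 'HANCESTRO:2': 0.25}
ancestry = json.loads(sample)

# Do the proportions add up to 1? Compare with a tolerance, since these are floats.
math.isclose(sum(ancestry.values()), 1.0)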

Labels. The set of continental keys and their mappings to HANCESTRO terms must be finalized. Assigned to @99norbs and @chouellaine. The current set is:

african
america
central_south_asia
east asian
european
greater middle eastern
oceanian

but also see the May 8 2024 meeting notes.

Assays. The current set of assays for the backfill from the May 8 2024 meeting notes:

assay	assay_ontology_term_id
10x 3' transcription profiling	EFO:0030003
10x 5' transcription profiling	EFO:0030004
10x 3' v1	EFO:0009901
10x 3' v2	EFO:0009899
10x 3' v3	EFO:0009922
10x 5' v1	EFO:0011025
10x 5' v2	EFO:0009900
10x scATAC-seq	EFO:0030007
Drop-seq	EFO:0008722
Smart-seq2	EFO:0008931
~Fluidigm C1-based library preparation~	~EFO:0010058~
STRT-seq	EFO:0008953
Visium Spatial Gene Expression	EFO:0010961
BD Rhapsody Targeted mRNA	EFO:0700004
BD Rhapsody Whole Transcriptome Analysis	EFO:0700003
DroNc-seq	EFO:0008720
CEL-seq2	EFO:0010010
Seq-Well	EFO:0008919
inDrop	EFO:0008780

Values. All ancestry observations for a donor_id MUST be the same value. When ancestry values are calculated, the values are floats that sum to 1.00. For assays not included in the set above, the ancestry value must be "unknown". When ancestry values are unavailable, "unknown" must be set.
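
These rules could be checked mechanically during the backfill. The following is a minimal sketch under stated assumptions, not the CELLxGENE validator: each observation is taken to be a plain dict with the hypothetical keys `donor_id`, `assay_ontology_term_id`, and `genetic_ancestry`; the value is assumed to use the JSON-string encoding from the requirements; "the same value" is interpreted as string equality; and the tolerance of 1e-6 on the sum is an arbitrary choice. `BACKFILL_ASSAYS` is truncated for illustration; the full list is the assay table above.

```python
import json
import math
from collections import defaultdict

# Truncated for illustration; the full backfill list is the assay table above.
BACKFILL_ASSAYS = {"EFO:0009899", "EFO:0009922", "EFO:0008931"}
SENTINELS = {"unknown", "na"}


def check_ancestry(observations, tol=1e-6):
    """Return a list of human-readable rule violations (empty if none)."""
    errors = []
    values_by_donor = defaultdict(set)

    for i, obs in enumerate(observations):
        value = obs["genetic_ancestry"]
        values_by_donor[obs["donor_id"]].add(value)

        # Assays outside the backfill set must carry "unknown".
        if obs["assay_ontology_term_id"] not in BACKFILL_ASSAYS:
            if value != "unknown":
                errors.append(f"row {i}: assay not in backfill set, expected 'unknown'")
            continue

        if value in SENTINELS:
            continue

        # Calculated values: a JSON object of numbers that sum to 1.
        try:
            proportions = json.loads(value)
        except json.JSONDecodeError:
            errors.append(f"row {i}: value is not valid JSON")
            continue
        if not isinstance(proportions, dict) or not all(
            isinstance(v, (int, float)) for v in proportions.values()
        ):
            errors.append(f"row {i}: expected a JSON object of numbers")
            continue
        if not math.isclose(sum(proportions.values()), 1.0, abs_tol=tol):
            errors.append(f"row {i}: proportions do not sum to 1")

    # All observations for a donor must share one value (compared as strings,
    # so two encodings that differ only in key order count as different).
    for donor, values in values_by_donor.items():
        if len(values) > 1:
            errors.append(f"donor {donor}: {len(values)} distinct ancestry values")

    return errors
```

Returning a list of violations rather than raising on the first one lets a curator see every problem in a dataset in a single pass.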

Reference. There needs to be documentation for how values are calculated. Assigned to @chouellaine.

Until the issue "Add parent classes for ethnicity and ancestry terms" is either mitigated or addressed, manual review of HANCESTRO terms is required to determine which are appropriate for ancestry rather than ethnicity.

genetic_ancestry_ontology_term_id

Key	genetic_ancestry_ontology_term_id
Annotator	Curator MUST annotate.
Value	categorical with `str` categories. If `organism_ontology_term_id` is `"NCBITaxon:9606"` for Homo sapiens, the value MUST be ... or `"unknown"` if unavailable. The following terms MUST NOT be used (note: this template will be used once the terms are identified): `"HANCESTRO:0002"` for regions and its children. Otherwise, for all other organisms, the `str` value MUST be `"na"`.
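
The organism branch of this rule can be sketched as a small check. This is an illustration under stated assumptions, not the schema validator: `is_valid_term_id`, `allowed_terms`, and `forbidden_terms` are hypothetical names; the permitted human terms are still unspecified ("...") in the rule, so the caller supplies them; and expanding `"HANCESTRO:0002"` to its children would need an ontology lookup that is not shown here.

```python
HUMAN = "NCBITaxon:9606"


def is_valid_term_id(organism_term_id, value, allowed_terms, forbidden_terms):
    """Check a genetic_ancestry_ontology_term_id value against the organism rule.

    allowed_terms and forbidden_terms are caller-supplied sets (hypothetical);
    the schema has not yet enumerated either.
    """
    if organism_term_id != HUMAN:
        # All other organisms: the value must be "na".
        return value == "na"
    if value == "unknown":
        return True
    # Human: must be a permitted term and not an excluded one.
    return value in allowed_terms and value not in forbidden_terms
```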

genetic_ancestry

Key	genetic_ancestry
Annotator	CELLxGENE Discover MUST annotate.
Value	categorical with `str` categories. This MUST be `"na"` if the value of `genetic_ancestry_ontology_term_id` is `"na"`. This MUST be `"unknown"` if the value of `genetic_ancestry_ontology_term_id` is `"unknown"`. Otherwise, this MUST be ....
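
Since CELLxGENE Discover derives this field, the pass-through part of the rule can be sketched as follows. The function name and the `lookup_label` parameter are hypothetical: the rule for values other than `"na"` and `"unknown"` is still open ("...") above, so the sketch delegates that case to a caller-supplied function rather than guessing at it.

```python
def derive_genetic_ancestry(term_id, lookup_label):
    """Derive genetic_ancestry from genetic_ancestry_ontology_term_id.

    lookup_label is a hypothetical caller-supplied function; the rule for
    values other than "na" and "unknown" is left open in the spec.
    """
    if term_id in ("na", "unknown"):
        # Sentinel values are copied through unchanged.
        return term_id
    return lookup_label(term_id)
```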

Context

Genetic ancestry metadata are being generated and are starting to be included with data submitted to CELLxGENE. A metadata field is needed to store these data so that users can use, filter, and visualize genetic ancestry across all CELLxGENE tools (Discover, Explorer, Gene Expression, and Census).

See, and mirror parts of, the issue "Update self_reported_ethnicity" (#460).

Related conversations (added by @brianraymor)

AnnData Modeling: @bkmartinjr's feedback from #single-cell-ancestry-inference

chanzuckerberg / single-cell-curation

Add genetic_ancestry #689
