chanzuckerberg / single-cell-curation

Code and documentation for the curation of cellxgene datasets
MIT License

Add genetic_ancestry #689

Open 99norbs opened 7 months ago

99norbs commented 7 months ago

Design (@brianraymor)

Requirements from Feb 13 2024 call with @chouellaine and @jahilton documented by @brianraymor :

For the backfill, modeling as "unknown", "na", or a JSON object encoded as a string, à la:

import json

# JSON requires a leading zero on fractional numbers (0.75, not .75).
sample = '{"HANCESTRO:1": 0.75, "HANCESTRO:2": 0.25}'

# dict: {'HANCESTRO:1': 0.75, 'HANCESTRO:2': 0.25}
ancestry = json.loads(sample)

# Do the fractions add up to 1?
sum(ancestry.values())

Labels. The set of continental keys and their mappings to HANCESTRO terms must be finalized. Assigned to @99norbs and @chouellaine. The current set is:

but also see the May 8 2024 meeting notes.

Assays. The current set of assays for the backfill from the May 8 2024 meeting notes:

| assay | assay_ontology_term_id |
| --- | --- |
| 10x 3' transcription profiling | EFO:0030003 |
| 10x 5' transcription profiling | EFO:0030004 |
| 10x 3' v1 | EFO:0009901 |
| 10x 3' v2 | EFO:0009899 |
| 10x 3' v3 | EFO:0009922 |
| 10x 5' v1 | EFO:0011025 |
| 10x 5' v2 | EFO:0009900 |
| 10x scATAC-seq | EFO:0030007 |
| Drop-seq | EFO:0008722 |
| Smart-seq2 | EFO:0008931 |
| ~Fluidigm C1-based library preparation~ | ~EFO:0010058~ |
| STRT-seq | EFO:0008953 |
| Visium Spatial Gene Expression | EFO:0010961 |
| BD Rhapsody Targeted mRNA | EFO:0700004 |
| BD Rhapsody Whole Transcriptome Analysis | EFO:0700003 |
| DroNc-seq | EFO:0008720 |
| CEL-seq2 | EFO:0010010 |
| Seq-Well | EFO:0008919 |
| inDrop | EFO:0008780 |

Values. All ancestry observations for a donor_id MUST be the same value. When ancestry values are calculated, the values are floats that sum to 1.00. For assays not included in the set above, the ancestry value must be "unknown". When ancestry values are unavailable, then "unknown" must be set.
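A minimal sketch of how the Values rules could be checked, not an agreed implementation. The names `BACKFILL_ASSAYS`, `parse_ancestry`, and `validate_observations` are hypothetical, the assay set is only a subset of the table above, and the HANCESTRO keys are the placeholders from the earlier example. It reads the assay rule as "calculated fractions are only allowed for assays in the backfill set", and it compares parsed values so that key order in the JSON string does not matter.

```python
import json
import math

# Illustrative subset of the backfill assay terms (assumption: not the full table).
BACKFILL_ASSAYS = {"EFO:0009922", "EFO:0009899", "EFO:0008931"}


def parse_ancestry(value):
    """Return None for "unknown"/"na", otherwise a dict of term -> fraction."""
    if value in ("unknown", "na"):
        return None
    fractions = json.loads(value)  # raises ValueError on malformed JSON
    if not isinstance(fractions, dict) or not fractions:
        raise ValueError("expected a non-empty JSON object")
    for term, fraction in fractions.items():
        if not isinstance(fraction, (int, float)) or not 0 <= fraction <= 1:
            raise ValueError(f"{term}: fraction must be a number in [0, 1]")
    # Tolerate floating-point rounding when checking the sum.
    if not math.isclose(sum(fractions.values()), 1.0, abs_tol=1e-6):
        raise ValueError("fractions must sum to 1")
    return fractions


def validate_observations(rows):
    """rows: (donor_id, assay_ontology_term_id, ancestry_value) tuples.

    Returns a list of human-readable error strings (empty if all rules hold).
    """
    errors = []
    first_seen = {}  # donor_id -> parsed value of that donor's first observation
    for donor_id, assay, value in rows:
        try:
            parsed = parse_ancestry(value)
        except ValueError as exc:
            errors.append(f"{donor_id}: invalid value ({exc})")
            continue
        if parsed is not None and assay not in BACKFILL_ASSAYS:
            errors.append(f"{donor_id}: {assay} is not a backfill assay")
        if first_seen.setdefault(donor_id, parsed) != parsed:
            errors.append(f"{donor_id}: values differ across observations")
    return errors
```

Returning a list of errors rather than raising on the first failure lets a curator see every problem in a dataset in one pass.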

Reference. There needs to be documentation for how values are calculated. Assigned to @chouellaine.


Until the issue "Add parent classes for ethnicity and ancestry terms" is either mitigated or addressed, manual review of HANCESTRO terms is required to determine which are appropriate for ancestry rather than ethnicity.

genetic_ancestry_ontology_term_id

Key genetic_ancestry_ontology_term_id
Annotator Curator MUST annotate.
Value categorical with str categories. If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, the value MUST be ... or "unknown" if unavailable. The following terms MUST NOT be used:

  • note: will use this template when terms are identified
  • "HANCESTRO:0002" for regions and its children

Otherwise, for all other organisms the str value MUST be "na".



genetic_ancestry

Key genetic_ancestry
Annotator CELLxGENE Discover MUST annotate.
Value categorical with str categories. This MUST be "na" if the value of genetic_ancestry_ontology_term_id is "na". This MUST be "unknown" if the value of genetic_ancestry_ontology_term_id is "unknown". Otherwise, this MUST be ....


Context

Genetic ancestry metadata are being generated and are starting to be included with data submitted to CELLxGENE. A metadata field is needed to store these data and to let users use, filter, and visualize genetic ancestry data across all CELLxGENE tools (Discover, Explorer, Gene Expression, and Census).

See, and mirror parts of, issue "Update self_reported_ethnicity" #460.

Related conversations (added by @brianraymor)

AnnData Modeling @bkmartinjr's feedback from #single-cell-ancestry-inference

brianraymor commented 2 months ago

Consensus to reschedule for later in 2024.