Note: This is a placeholder epic for Dev to add child issues following an assessment of required changes.

See #single-cell-data-wrangling for discussion with @norbid about his preferences for how multiple self-reported ethnicity values may be surfaced in the CELLxGENE Discover UX.

The changes to the cellxgene-schema CLI are tracked separately in "cellxgene-schema must validate self_reported_ethnicity".

Design

See schema 4

self_reported_ethnicity_ontology_term_id

Key	self_reported_ethnicity_ontology_term_id
Annotator	Curator
Value	categorical with `str` categories. If `organism_ontology_term_id` is `"NCBITaxon:9606"` for Homo sapiens, the value MUST be formatted as one or more comma-separated (with no leading or trailing spaces) HANCESTRO terms in ascending lexical order or `"unknown"` if unavailable. For example, if the terms are `"HANCESTRO:0014"` and `"HANCESTRO:0005"` then the value of `self_reported_ethnicity_ontology_term_id` MUST be `"HANCESTRO:0005,HANCESTRO:0014"`. The following terms MUST NOT be used: `"HANCESTRO:0002"` for regions and its children; `"HANCESTRO:0003"` for country; `"HANCESTRO:0004"` for ancestry category; `"HANCESTRO:0018"` for uncategorised population; `"HANCESTRO:0290"` for genetically isolated population; `"HANCESTRO:0304"` for ancestry status and its children; `"HANCESTRO:0323"` for Finnish founder; `"HANCESTRO:0324"` for Dutch founder; `"HANCESTRO:0551"` for genetically homogenous Irish; `"HANCESTRO:0554"` for Silk Road founder; `"HANCESTRO:0555"` for Arab Israeli founder; `"HANCESTRO:0557"` for Costa Rican founder; `"HANCESTRO:0558"` for French Canadian founder; `"HANCESTRO:0559"` for Italian founder; `"HANCESTRO:0560"` for Northern Finnish founder; `"HANCESTRO:0561"` for Romanian founder; `"HANCESTRO:0564"` for Vis founder; `"HANCESTRO:0565"` for Split founder; `"HANCESTRO:0566"` for undefined ancestry population; and the imported GEO term `"GEO:000000374"` for continent and its children: `"HANCESTRO:0029"` for Africa, `"HANCESTRO:0030"` for Asia, `"HANCESTRO:0031"` for Europe, `"HANCESTRO:0032"` for Oceania, `"HANCESTRO:0033"` for Latin America and the Caribbean, `"HANCESTRO:0034"` for Northern America. Otherwise, for all other organisms the `str` value MUST be `"na"`.
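
The rules above could be checked with something like the following. This is only a rough sketch, not the cellxgene-schema implementation: the function name `validate_ethnicity_term_id`, the `HANCESTRO:<digits>` ID pattern, and the error messages are assumptions made for illustration. It checks the explicitly listed terms only; enforcing "and its children" would need the ontology itself.

```python
import re

# Terms explicitly listed above as forbidden. Descendant checks
# ("and its children") would require the ontology and are not attempted here.
FORBIDDEN_TERMS = {
    "HANCESTRO:0002", "HANCESTRO:0003", "HANCESTRO:0004", "HANCESTRO:0018",
    "HANCESTRO:0290", "HANCESTRO:0304", "HANCESTRO:0323", "HANCESTRO:0324",
    "HANCESTRO:0551", "HANCESTRO:0554", "HANCESTRO:0555", "HANCESTRO:0557",
    "HANCESTRO:0558", "HANCESTRO:0559", "HANCESTRO:0560", "HANCESTRO:0561",
    "HANCESTRO:0564", "HANCESTRO:0565", "HANCESTRO:0566", "GEO:000000374",
    "HANCESTRO:0029", "HANCESTRO:0030", "HANCESTRO:0031", "HANCESTRO:0032",
    "HANCESTRO:0033", "HANCESTRO:0034",
}

# Assumed shape of a HANCESTRO term ID (not taken from the schema text).
TERM_PATTERN = re.compile(r"HANCESTRO:\d+")


def validate_ethnicity_term_id(value: str, organism: str) -> list[str]:
    """Return a list of error messages; an empty list means the value passes."""
    if organism != "NCBITaxon:9606":
        return [] if value == "na" else ['value must be "na" for non-human organisms']
    if value == "unknown":
        return []

    errors = []
    terms = value.split(",")
    for term in terms:
        if term in FORBIDDEN_TERMS:
            errors.append(f"forbidden term: {term}")
        elif not TERM_PATTERN.fullmatch(term):
            # Also catches leading or trailing spaces around a term.
            errors.append(f"not a HANCESTRO term: {term!r}")
    if terms != sorted(terms):
        errors.append("terms are not in ascending lexical order")
    return errors
```

With this sketch, `validate_ethnicity_term_id("HANCESTRO:0005,HANCESTRO:0014", "NCBITaxon:9606")` returns an empty list, while the reversed order reports an ordering error.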

self_reported_ethnicity

Key	self_reported_ethnicity
Annotator	CELLxGENE Discover
Value	categorical with `str` categories. This MUST be `"na"` or `"unknown"` if set in `self_reported_ethnicity_ontology_term_id`; otherwise, this MUST be one or more comma-separated (with no leading or trailing spaces) human-readable names for the terms in `self_reported_ethnicity_ontology_term_id` in the same order. For example, if the value of `self_reported_ethnicity_ontology_term_id` is `"HANCESTRO:0005,HANCESTRO:0014"` then the value of `self_reported_ethnicity` is `"European,Hispanic or Latin American"`.
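
As a rough illustration of the label rule above, the sketch below derives `self_reported_ethnicity` from a term ID value. The `LABELS` dict and the `ethnicity_label` function are hypothetical stand-ins; a real implementation would look labels up in the ontology rather than in a hard-coded mapping.

```python
# Hypothetical lookup for illustration only; the two entries come from the
# example above. Real labels would be read from the HANCESTRO ontology.
LABELS = {
    "HANCESTRO:0005": "European",
    "HANCESTRO:0014": "Hispanic or Latin American",
}


def ethnicity_label(term_id_value: str, labels: dict[str, str] = LABELS) -> str:
    """Map a comma-separated term ID value to its comma-separated labels."""
    # "na" and "unknown" pass through unchanged.
    if term_id_value in ("na", "unknown"):
        return term_id_value
    # Labels are joined in the same order as the term IDs, with no spaces.
    return ",".join(labels[term] for term in term_id_value.split(","))
```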

Samples

Sample normalized terms

adata.obs['self_reported_ethnicity_ontology_term_id'][0] = "HANCESTRO:0008"
adata.obs['self_reported_ethnicity_ontology_term_id'][1] = "HANCESTRO:0008,HANCESTRO:0320,HANCESTRO:0405"
adata.obs['self_reported_ethnicity_ontology_term_id'][2] = "HANCESTRO:0405"
adata.obs['self_reported_ethnicity_ontology_term_id'][3] = "HANCESTRO:0005"
adata.obs['self_reported_ethnicity_ontology_term_id'][4] = "HANCESTRO:0008"
adata.obs['self_reported_ethnicity_ontology_term_id'][5] = "HANCESTRO:0320"
adata.obs['self_reported_ethnicity_ontology_term_id'][6] = "HANCESTRO:0008,HANCESTRO:0320,HANCESTRO:0405"
adata.obs['self_reported_ethnicity_ontology_term_id'][7] = "unknown"

Sample labels for terms

adata.obs['self_reported_ethnicity'][0] = "Asian"
adata.obs['self_reported_ethnicity'][1] = "Asian,Dutch,Cuban"
adata.obs['self_reported_ethnicity'][2] = "Cuban"
adata.obs['self_reported_ethnicity'][3] = "European"
adata.obs['self_reported_ethnicity'][4] = "Asian"
adata.obs['self_reported_ethnicity'][5] = "Dutch"
adata.obs['self_reported_ethnicity'][6] = "Asian,Dutch,Cuban"
adata.obs['self_reported_ethnicity'][7] = "unknown"

Data Platform

Data Portal API changes

There should be no required changes to the current implementation. Based on the dataset samples above, the response value for self_reported_ethnicity would be:

'self_reported_ethnicity': [{'label': 'Asian',
                             'ontology_term_id': 'HANCESTRO:0008'},
                            {'label': 'Asian,Dutch,Cuban',
                             'ontology_term_id': 'HANCESTRO:0008,HANCESTRO:0320,HANCESTRO:0405'},
                            {'label': 'Cuban',
                             'ontology_term_id': 'HANCESTRO:0405'},
                            {'label': 'Dutch',
                             'ontology_term_id': 'HANCESTRO:0320'},
                            {'label': 'European',
                             'ontology_term_id': 'HANCESTRO:0005'},
                            {'label': 'unknown',
                             'ontology_term_id': 'unknown'}]
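
For reference, the per-cell sample values could collapse into a response of this shape roughly as follows. This is a sketch under assumptions, not the Data Portal code: `summarize_ethnicity` is a hypothetical helper, and sorting the unique pairs by label is assumed from the ordering of the response.

```python
def summarize_ethnicity(term_ids, labels):
    """Collapse per-cell values into unique label/term pairs, sorted by label."""
    # zip pairs each cell's label with its term ID value; set() drops repeats.
    pairs = sorted(set(zip(labels, term_ids)))
    return [{"label": label, "ontology_term_id": term_id} for label, term_id in pairs]


# Per-cell values taken from the dataset samples above.
term_ids = [
    "HANCESTRO:0008",
    "HANCESTRO:0008,HANCESTRO:0320,HANCESTRO:0405",
    "HANCESTRO:0405",
    "HANCESTRO:0005",
    "HANCESTRO:0008",
    "HANCESTRO:0320",
    "HANCESTRO:0008,HANCESTRO:0320,HANCESTRO:0405",
    "unknown",
]
labels = [
    "Asian",
    "Asian,Dutch,Cuban",
    "Cuban",
    "European",
    "Asian",
    "Dutch",
    "Asian,Dutch,Cuban",
    "unknown",
]

response = {"self_reported_ethnicity": summarize_ethnicity(term_ids, labels)}
```

A multi-term value is treated as one opaque string here, which is why no change to the existing aggregation would be needed.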

Discover API changes

There should be no required changes to the current implementation. Based on the dataset samples above, the response value for self_reported_ethnicity would be:

'self_reported_ethnicity': [{'label': 'Asian',
                             'ontology_term_id': 'HANCESTRO:0008'},
                            {'label': 'Asian,Dutch,Cuban',
                             'ontology_term_id': 'HANCESTRO:0008,HANCESTRO:0320,HANCESTRO:0405'},
                            {'label': 'Cuban',
                             'ontology_term_id': 'HANCESTRO:0405'},
                            {'label': 'Dutch',
                             'ontology_term_id': 'HANCESTRO:0320'},
                            {'label': 'European',
                             'ontology_term_id': 'HANCESTRO:0005'},
                            {'label': 'unknown',
                             'ontology_term_id': 'unknown'}]

Discover UX filter changes

Refined into "Self-Reported Ethnicity filter in Collections and Datasets must be updated".

Data Viz

This function in the WMG processing pipeline must be updated to appropriately handle ethnicity term strings with comma-separated values.
The WMG frontend filter and the compare feature should both be updated to exclude multiethnic terms.
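
One possible way to handle comma-separated values and exclude multiethnic terms from filter options is sketched below. The helper names (`split_ethnicity_terms`, `is_multiethnic`, `filter_options`) are hypothetical and are not taken from the WMG pipeline or frontend code.

```python
def split_ethnicity_terms(value: str) -> list[str]:
    """Split a term ID value into individual HANCESTRO terms."""
    # "na" and "unknown" carry no ontology terms.
    if value in ("na", "unknown"):
        return []
    return value.split(",")


def is_multiethnic(value: str) -> bool:
    """True when the value holds more than one comma-separated term."""
    return len(split_ethnicity_terms(value)) > 1


def filter_options(values):
    """Keep only values that a single-ethnicity filter should offer."""
    return [value for value in values if not is_multiethnic(value)]
```

For example, `filter_options(["HANCESTRO:0008", "HANCESTRO:0008,HANCESTRO:0320,HANCESTRO:0405", "unknown"])` keeps `"HANCESTRO:0008"` and `"unknown"` and drops the multi-term value.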

Census

Depending on the final modeling of the new values, Census schema changes and Census Builder changes may be required. This may also introduce a breaking change in the data (i.e., existing external pipelines that use the Census could break).

chanzuckerberg / single-cell

Update self_reported_ethnicity #460