chanzuckerberg / single-cell

A collection of documents that reflect various design decisions that have been made for the cellxgene project.
MIT License

Update self_reported_ethnicity #460

Closed brianraymor closed 9 months ago

brianraymor commented 1 year ago

Note: This is a placeholder epic for Dev to add child issues following an assessment of required changes.

See #single-cell-data-wrangling for discussion with @norbid about his preferences for how multiple self reported ethnicity values may be surfaced in the CELLxGENE Discover UX.

The changes to the cellxgene-schema CLI are tracked separately in cellxgene-schema must validate self_reported_ethnicity

Design

See schema 4

self_reported_ethnicity_ontology_term_id

Key self_reported_ethnicity_ontology_term_id
Annotator Curator
Value categorical with str categories. If organism_ontology_term_id is "NCBITaxon:9606" for Homo sapiens, the value MUST be formatted as one or more comma-separated (with no leading or trailing spaces) HANCESTRO terms in ascending lexical order or "unknown" if unavailable.

For example, if the terms are "HANCESTRO:0014" and "HANCESTRO:0005", then the value of self_reported_ethnicity_ontology_term_id MUST be "HANCESTRO:0005,HANCESTRO:0014".

The following terms MUST NOT be used:


Otherwise, for all other organisms the str value MUST be "na".
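
The formatting rule above can be sketched as a small check. This is only an illustration of the wording in this issue, not the cellxgene-schema implementation. The function name is hypothetical, the "HANCESTRO:" plus four digits pattern is an assumption, rejecting duplicate terms is an assumption, and the list of terms that MUST NOT be used is not checked because it is not reproduced here.

```python
import re

# Assumption: a HANCESTRO term ID is "HANCESTRO:" followed by four digits.
_HANCESTRO_ID = re.compile(r"HANCESTRO:[0-9]{4}")


def is_valid_ethnicity_term_id(value: str, organism_term_id: str) -> bool:
    """Hypothetical check of a self_reported_ethnicity_ontology_term_id value."""
    # Non-human organisms: the only allowed value is "na".
    if organism_term_id != "NCBITaxon:9606":
        return value == "na"
    # Human: "unknown" is allowed on its own.
    if value == "unknown":
        return True
    terms = value.split(",")
    # Every piece must be a bare term ID (this also rejects stray spaces,
    # empty pieces, and "unknown" or "na" mixed in with terms).
    if not all(_HANCESTRO_ID.fullmatch(term) for term in terms):
        return False
    # Strictly ascending lexical order (assumption: duplicates are invalid).
    return all(a < b for a, b in zip(terms, terms[1:]))
```

Under these assumptions, "HANCESTRO:0005,HANCESTRO:0014" passes for a human dataset, while the reversed order or a value containing a space after the comma does not.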



self_reported_ethnicity

Key self_reported_ethnicity
Annotator CELLxGENE Discover
Value categorical with str categories. This MUST be "na" or "unknown" if set in self_reported_ethnicity_ontology_term_id; otherwise, this MUST be one or more comma-separated (with no leading or trailing spaces) human-readable names for the terms in self_reported_ethnicity_ontology_term_id in the same order.

For example, if the value of self_reported_ethnicity_ontology_term_id is "HANCESTRO:0005,HANCESTRO:0014" then the value of self_reported_ethnicity is "European,Hispanic or Latin American".
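
A minimal sketch of how the label could be derived from the term ID value, assuming a lookup table from term ID to human-readable name is available. The function name and the `LABELS` table are hypothetical; the two entries are taken from the example above, and in practice the names would come from the HANCESTRO ontology.

```python
from typing import Mapping

# Hypothetical lookup table, populated only with the example terms above.
LABELS = {
    "HANCESTRO:0005": "European",
    "HANCESTRO:0014": "Hispanic or Latin American",
}


def ethnicity_label(term_id_value: str, labels: Mapping[str, str]) -> str:
    """Build self_reported_ethnicity from self_reported_ethnicity_ontology_term_id."""
    # "na" and "unknown" are carried over unchanged.
    if term_id_value in ("na", "unknown"):
        return term_id_value
    # Otherwise map each term to its name, keeping the same order,
    # and join with commas (no spaces).
    return ",".join(labels[term] for term in term_id_value.split(","))


print(ethnicity_label("HANCESTRO:0005,HANCESTRO:0014", LABELS))
```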


Samples

Sample normalized terms

adata.obs['self_reported_ethnicity_ontology_term_id'][0] = "HANCESTRO:0008"
adata.obs['self_reported_ethnicity_ontology_term_id'][1] = "HANCESTRO:0008,HANCESTRO:0320,HANCESTRO:0405"
adata.obs['self_reported_ethnicity_ontology_term_id'][2] = "HANCESTRO:0405"
adata.obs['self_reported_ethnicity_ontology_term_id'][3] = "HANCESTRO:0005"
adata.obs['self_reported_ethnicity_ontology_term_id'][4] = "HANCESTRO:0008"
adata.obs['self_reported_ethnicity_ontology_term_id'][5] = "HANCESTRO:0320"
adata.obs['self_reported_ethnicity_ontology_term_id'][6] = "HANCESTRO:0008,HANCESTRO:0320,HANCESTRO:0405"
adata.obs['self_reported_ethnicity_ontology_term_id'][7] = "unknown"

Sample labels for terms

adata.obs['self_reported_ethnicity'][0] = "Asian"
adata.obs['self_reported_ethnicity'][1] = "Asian,Dutch,Cuban"
adata.obs['self_reported_ethnicity'][2] = "Cuban"
adata.obs['self_reported_ethnicity'][3] = "European"
adata.obs['self_reported_ethnicity'][4] = "Asian"
adata.obs['self_reported_ethnicity'][5] = "Dutch"
adata.obs['self_reported_ethnicity'][6] = "Asian,Dutch,Cuban"
adata.obs['self_reported_ethnicity'][7] = "unknown"

Data Platform

Data Portal API changes

There should be no required changes to the current implementation. Based on the dataset samples above, the response value for self_reported_ethnicity would be:

'self_reported_ethnicity': [{'label': 'Asian',
                             'ontology_term_id': 'HANCESTRO:0008'},
                            {'label': 'Asian,Dutch,Cuban',
                             'ontology_term_id': 'HANCESTRO:0008,HANCESTRO:0320,HANCESTRO:0405'},
                            {'label': 'Cuban',
                             'ontology_term_id': 'HANCESTRO:0405'},
                            {'label': 'Dutch',
                             'ontology_term_id': 'HANCESTRO:0320'},
                            {'label': 'European',
                             'ontology_term_id': 'HANCESTRO:0005'},
                            {'label': 'unknown',
                             'ontology_term_id': 'unknown'}]
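
As a rough sketch of why no API change should be needed: the response is just the set of distinct (label, term ID) pairs in the dataset, so a comma-separated value is carried through as one opaque string. The code below is not the Data Portal implementation. The function name is hypothetical, the inputs are the eight sample rows from the Samples section, and sorting by label is an assumption.

```python
from typing import Dict, List, Sequence

# The eight sample rows from the Samples section.
term_ids = [
    "HANCESTRO:0008",
    "HANCESTRO:0008,HANCESTRO:0320,HANCESTRO:0405",
    "HANCESTRO:0405",
    "HANCESTRO:0005",
    "HANCESTRO:0008",
    "HANCESTRO:0320",
    "HANCESTRO:0008,HANCESTRO:0320,HANCESTRO:0405",
    "unknown",
]
labels = [
    "Asian", "Asian,Dutch,Cuban", "Cuban", "European",
    "Asian", "Dutch", "Asian,Dutch,Cuban", "unknown",
]


def ethnicity_response(
    term_ids: Sequence[str], labels: Sequence[str]
) -> List[Dict[str, str]]:
    """Hypothetical helper: distinct (label, term ID) pairs, sorted by label."""
    pairs = sorted(set(zip(labels, term_ids)))
    return [{"label": label, "ontology_term_id": term} for label, term in pairs]


response = ethnicity_response(term_ids, labels)
```

For the eight sample rows this yields six entries, with the multi-valued combination kept as a single entry rather than split into its component terms.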

Discover API changes

There should be no required changes to the current implementation. Based on the dataset samples above, the response value for self_reported_ethnicity would be:

'self_reported_ethnicity': [{'label': 'Asian',
                             'ontology_term_id': 'HANCESTRO:0008'},
                            {'label': 'Asian,Dutch,Cuban',
                             'ontology_term_id': 'HANCESTRO:0008,HANCESTRO:0320,HANCESTRO:0405'},
                            {'label': 'Cuban',
                             'ontology_term_id': 'HANCESTRO:0405'},
                            {'label': 'Dutch',
                             'ontology_term_id': 'HANCESTRO:0320'},
                            {'label': 'European',
                             'ontology_term_id': 'HANCESTRO:0005'},
                            {'label': 'unknown',
                             'ontology_term_id': 'unknown'}]

Discover UX filter changes

Refined into Self-Reported Ethnicity filter in Collections and Datasets must be updated

Data Viz

  1. This function in the WMG processing pipeline must be updated to appropriately handle ethnicity term strings with comma-separated values.
  2. The WMG frontend filter should be updated to exclude multiethnic terms; the same applies to the compare feature.
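
One way the two points above might look in code, as a sketch only. These helpers are hypothetical and are not the WMG pipeline or frontend code. They assume that a value is "multiethnic" exactly when it contains a comma, which follows from the comma-separated format described in the Design section.

```python
from typing import Iterable, List


def split_ethnicity_terms(value: str) -> List[str]:
    """Split a comma-separated term string into its individual terms."""
    return value.split(",")


def is_multiethnic(value: str) -> bool:
    """Assumption: a value with more than one term contains a comma."""
    return "," in value


def filterable_terms(values: Iterable[str]) -> List[str]:
    """Distinct single-term values in first-seen order, for a filter or compare list."""
    return list(dict.fromkeys(v for v in values if not is_multiethnic(v)))


options = filterable_terms([
    "HANCESTRO:0008",
    "HANCESTRO:0008,HANCESTRO:0320,HANCESTRO:0405",
    "HANCESTRO:0405",
    "unknown",
    "HANCESTRO:0008",
])
```

Whether "unknown" should remain in the filter is a separate product decision; this sketch keeps it because it is a single value.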

Census

  1. Depending on the final modeling of the new values, this may require Census schema changes and Census Builder changes. It may also introduce a breaking change in the data (i.e., external pipelines that currently use the Census could break).

prathapsridharan commented 11 months ago

@brianraymor (cc: @atarashansky @dsadgat @joyceyan) - Will the schema CLI fail validation if it encounters values like the following? The list is not exhaustive; these are just examples of values that don't make sense:

  1. unknown,HANCESTRO:0008
  2. HANCESTRO:0008,unknown,HANCESTRO:0320
  3. unknown,na

I ask because it affects the implementation in the gene expression application.