Genetic Diversity TaskForce - Tier 2

arschat commented 6 months ago

original metadata document: document
metadata spreadsheet: spreadsheet
correspondence document: document

arschat commented 6 months ago

We received the following email, which we should review and provide feedback on.

As you may know, the HCA Equity Working Group has a task force called the HCA Diversity Task Force, which has been tasked to develop the following recommendations:

appropriate metrics and resolutions for tracking genetic and geographical diversity in HCA;

HCA goals for genetic and geographical diversity, as well as a timeline for achieving the goals;

ethically- and scientifically-appropriate processes for engaging underrepresented populations in sampling and monitoring efforts; and

the use of diversity-related metadata across all HCA data platforms, including appropriate metadata fields, definitions, and/or ontologies.

We want to share the metadata recommendations from the task force with you and welcome any feedback you may have. It would be helpful to receive input by Friday, March 22.

converted to excel

arschat commented 6 months ago

Spreadsheet format here

arschat commented 6 months ago

Comments:

sex_genetic & ancestry_genetic: these require processing of fastq files. Are there legal issues on our side with that? We could run it automatically in ingest, or manually.
age: do we allow decimals?
all ethnicity_* and geography_*_country_state fields: can these be ontologised?
geography_*_duration: specify years in the guidance.
smoking_tobacco_cigarette: HLCA v2 also proposed smoking_pack_years. Would that be of interest here?
alcohol_consumption: do we want more options than yes/no?
ethnicity_parents_selfreported_freetext & ethnicity_grandparents_selfreported_freetext: how do we fill these in if the parents or grandparents have different ethnicities? Should we allow multiple answers separated by ';' or '|'?
medical_history_*: would we like to record the specific diagnosis, e.g. with the MONDO disease ontology?
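
To make the separator question for the parents/grandparents free-text fields concrete, here is a minimal sketch of how a multi-valued cell could be split. It assumes either ';' or '|' is accepted, which is still an open question in this thread; the helper name `split_multi_value` and the sample values are hypothetical and not part of any HCA tooling.

```python
import re
from typing import List


def split_multi_value(cell: str) -> List[str]:
    """Split a free-text cell on ';' or '|' and trim surrounding whitespace.

    Hypothetical helper: the accepted separator has not been decided.
    """
    parts = re.split(r"[;|]", cell)
    # Drop empty fragments produced by trailing or doubled separators.
    return [part.strip() for part in parts if part.strip()]


# Made-up example values, for illustration only.
print(split_multi_value("Greek; Italian"))
print(split_multi_value("Greek|Italian|"))
```

Accepting both separators on input keeps older spreadsheets readable, but the guidance would still need to name one of them as the recommended form.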

tburdett commented 6 months ago

Hi, a while back I did some analysis for a DCP roadmap discussion. The slides are here: https://docs.google.com/presentation/d/1_qrZ1Rnax5FgymtOu1wL9frdW40GMG7WLTNUNgC0bbs/edit#slide=id.g21f702e4abe_21_0

I can also reference a paper I was co-author on: https://doi.org/10.1186/s13059-018-1396-2

Happy to discuss this further if it's useful.

arschat commented 6 months ago

There is an ontology for countries in NCIT:C25464.

arschat commented 6 months ago

Action points for modelling & feedback:

[ ] Reach out to legal to investigate whether fastq processing is safe
[x] Ontologies for countries: NCIT:C25464.
[ ] Ontologies for cities & towns
[ ] Ontologies for languages: (they suggest https://www.ethnologue.com/, but that is just a list, not an ontology)
[x] Modeling for "Unknown" & "Not available" options in enum or ontologies
[ ] Provide feedback by Monday, following up the email Gabby sends today

arschat commented 6 months ago

Comments for the fields here

idazucchi commented 6 months ago

comments sent

arschat commented 2 months ago

We received an updated version of their original recommendations, which includes some of our suggestions: document, flattened spreadsheet

After a meeting between Ida, Gabby and me, we agreed on the following feedback:

pipeline fields: no capacity from wranglers. who is expected to run these?
non-required geography fields & parents' ethnicity: do not include, since contributors are very unlikely to have this info and it would overwhelm them
diet_meat_consumption: here is the gut option for this (enum) would you like to
restrict NA options to one option (i.e. "unknown") or 2 options (i.e. "unknown", "not_sharable") compared to 3 options ("unknown", "not_collected", "not_available")

We also decided not to change the requirement status of those fields (like age) in the HCA schema, but only to add the option(s) for unknown/not available/not collected. We expect to wrangle projects that do not follow the Tier 1/ Tier 2/ Genetic Diversity schema.
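
As a rough illustration of what "keep the field required but allow an unknown-style option" could mean for a field like age, here is a minimal sketch. The set of accepted placeholder strings is taken from the three options listed above and is not final; allowing decimal ages is also an assumption, since that question is still open. The name `parse_age` is hypothetical.

```python
import math
from typing import Union

# Placeholder options as discussed above; which of them are kept is undecided.
NA_OPTIONS = {"unknown", "not_collected", "not_available"}


def parse_age(value: str) -> Union[float, str]:
    """Return the age as a number, or the placeholder string unchanged.

    Hypothetical helper. Decimal ages are accepted here as an assumption.
    """
    cleaned = value.strip()
    if cleaned in NA_OPTIONS:
        return cleaned
    age = float(cleaned)  # raises ValueError for non-numeric text
    if not math.isfinite(age) or age < 0:
        raise ValueError(f"invalid age: {value!r}")
    return age


print(parse_age("34.5"))
print(parse_age("unknown"))
```

The field stays mandatory in this sketch: an empty cell still fails, and only an explicit placeholder is accepted in place of a number.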

arschat commented 1 month ago

We replied with the following email, along with the hlca template. Lucia replied that she is in favour of the two-tab template and initially suggested specifying recommended fields, but in another meeting we agreed, at least for lung, not to mandate any fields and to leave everything optional.

Waiting for their reply.

arschat commented 1 month ago

For now we remove pipeline fields (sex_genetic, ancestry_genetic, ancestry_genetic_pipeline) from our templates.
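
A minimal sketch of that removal, assuming a template is represented simply as a list of column names (the real templates are spreadsheets, so this is only an illustration). The function name `drop_pipeline_fields` is hypothetical.

```python
from typing import List

# The three pipeline fields named above.
PIPELINE_FIELDS = {"sex_genetic", "ancestry_genetic", "ancestry_genetic_pipeline"}


def drop_pipeline_fields(columns: List[str]) -> List[str]:
    """Return the template columns without the pipeline fields, preserving order."""
    return [column for column in columns if column not in PIPELINE_FIELDS]


print(drop_pipeline_fields(["age", "sex_genetic", "alcohol_consumption"]))
```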

arschat commented 1 month ago

response from Kock Kian Hong

> Thank you all for your efforts and work on the metadata. The HCA Diversity Task Force co-chairs are meeting later this week to discuss the feedback; I wanted to follow-up with a few initial comments:
>
> 1. Echoing Shyam and Lucia in agreeing with how the high(er) priority human diversity metadata is incorporated into the first tab of the spreadsheet (hlca_Tier2_GDT_flat_template.xlsx). Along the lines of Lucia's comment on priority nomenclature - 'mandatory' / 'optional' - would it be helpful to name the spreadsheet tabs by priority? From the feedback we've gathered on metadata priority nomenclature, "prioritised" and "recommended" might work.
> 2. We had some initial feedback from the HCA wrangling team on "GEOGRAPHY COLLECTION SITE LATITUDE LONGITUDE" being less feasible to include / collect - would it still be challenging to include metadata for the GPS coordinates of collection site in the first tab?
> 3. A note on the current fields for ethnicity in the first tab - there are two (perhaps three?) separate fields: "ETHNICITY" (ethnicity_1 & ethnicity_2) and "ETHNICITY FREE TEXT" (ethnicity_free_text). The description for the "ethnicity_1", "ethnicity_2" fields brings in "Ancestry" and also tries to classify ethnicities at a "continental" / regional-level ("African Ancestry", "American Ancestry", "Central Asia and Siberia", etc.). During our HCA Diversity Task Force discussions, we debated such a formulation ("continental" reporting on ethnicity), and the feedback we got was that it was probably better to leave this as a free-text field, with terms curated by data contributors - in part due to ethnicity encompassing factors beyond geography and genetic ancestry. The examples in the spreadsheet also appear to draw an equivalence between ethnicity and ancestry.
> 4. On "pipelines for genetic ancestry/sex" - I think fine to leave out from the spreadsheet.
> 5. On "Parent & grandparent fields", leaving them in the secondary tab would work for us (they are placed under a lower priority tier in our current human diversity metadata specifications document due to concerns regarding metadata collection feasibility).
> 6. On "unknown" - I think the single "unknown" option is fine for now, though looking at the first tab, it appears that other bionetworks have related thoughts on granularity of the "unknown" response. For example, under "SAMPLE COLLECTION TIME POINT" for the first tab of the lung bionetwork spreadsheet, "not_shareable" is listed as a possible response. This is similar to our rationale for "not_available" (information collected, but cannot be shared via the HCA managed access framework - leaving the option for researchers to contact data contributors separately if interested).
>
> We will get back to you on other points as soon as possible.

There are some actions and points that need to be addressed:

[x] move collection site to first tab
[x] remove pipeline fields from spreadsheet
[x] leave parents/ grandparents fields in 2nd tab
[ ] decide which option will be used across all fields for not_available/not_sharable/unknown
"prioritised" and "recommended": We would be happy to add the "Tier 2 - prioritised" and "Genetic Diversity - recommended". However, most of the bionetworks have decided to have all or most of Tier 2 fields optional/ recommended. Include this information in description?
about the Lung ethnicity fields:
- Lung had ethnicity_1, ethnicity_2 and ethnicity_free_text originally. They would like to have an enum of specific options to allow calculation of predicted values for pulmonary function parameters
- we suggested having an array field for ethnicity instead of ethnicity_1 and ethnicity_2; Malte agreed
- GDT requested ethnicity_selfreported_freetext and ethnicity_question_text
- I asked Malte if they are ok with merging their ethnicity_free_text with the GDT ethnicity_selfreported_freetext, and they agreed
- Now, GDT asks
  - why we ask the same question twice
    - What is the ethnicity of the donor?
      1. select from enum
      2. free_text but not as a supplement to enum's answer
  - why the enum options of ethnicity mention Ancestry (e.g. African Ancestry) although ethnicity is not ancestry.
- [ ] we need to ask Lung to discuss with GDT directly which fields should be used and which options should be included
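
For reference, the array-field suggestion for ethnicity could look roughly like the sketch below. It assumes a row is a plain dict keyed by the original Lung column names; the function name `merge_ethnicity` is hypothetical, and the enum values used in the example are the ones quoted in the reply above, not an agreed list.

```python
from typing import Dict, List, Optional


def merge_ethnicity(row: Dict[str, Optional[str]]) -> List[str]:
    """Collect ethnicity_1 and ethnicity_2 into one list, skipping blanks and duplicates.

    Hypothetical helper illustrating the array-field proposal.
    """
    values: List[str] = []
    for key in ("ethnicity_1", "ethnicity_2"):
        value = (row.get(key) or "").strip()
        if value and value not in values:
            values.append(value)
    return values


print(merge_ethnicity({"ethnicity_1": "African Ancestry", "ethnicity_2": "American Ancestry"}))
```

This only covers the enum columns. How the array relates to ethnicity_selfreported_freetext is exactly the point still to be settled between Lung and GDT.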

ebi-ait / hca-ebi-wrangler-central

Genetic Diversity TaskForce - Tier 2 #1245
