ebi-ait / hca-ebi-wrangler-central

This repo is for tracking work related to wrangling datasets for the HCA, associated tasks and for maintaining related documentation.
https://ebi-ait.github.io/hca-ebi-wrangler-central/
Apache License 2.0

Genetic Diversity TaskForce - Tier 2 #1245

Open arschat opened 6 months ago

arschat commented 6 months ago

original metadata document: document
metadata spreadsheet: spreadsheet
correspondence document: document

arschat commented 6 months ago

We received the following email, which we should review and provide feedback on.

As you may know, the HCA Equity Working Group has a task force called the HCA Diversity Task Force, which has been tasked to develop the following recommendations:

  1. appropriate metrics and resolutions for tracking genetic and geographical diversity in HCA;
  2. HCA goals for genetic and geographical diversity, as well as a timeline for achieving the goals;
  3. ethically- and scientifically-appropriate processes for engaging underrepresented populations in sampling and monitoring efforts; and
  4. the use of diversity-related metadata across all HCA data platforms, including appropriate metadata fields, definitions, and/or ontologies.

We want to share the metadata recommendations from the task force with you and welcome any feedback you may have. It would be helpful to receive input by Friday, March 22.

converted to excel

arschat commented 6 months ago

Spreadsheet format here

arschat commented 6 months ago

Comments:

tburdett commented 6 months ago

Hi, a while back I did some analysis for a DCP roadmap discussion. The slides are here: https://docs.google.com/presentation/d/1_qrZ1Rnax5FgymtOu1wL9frdW40GMG7WLTNUNgC0bbs/edit#slide=id.g21f702e4abe_21_0

I can also reference a paper I was co-author on: https://doi.org/10.1186/s13059-018-1396-2

Happy to discuss this further if it's useful.

arschat commented 6 months ago

There is an ontology for countries in NCIT:C25464.

arschat commented 6 months ago

Action points for modelling & feedback:

arschat commented 6 months ago

Comments for the fields here

idazucchi commented 6 months ago

comments sent

arschat commented 2 months ago

We received an updated version of their original recommendations, which includes some of our suggestions: document, flattened spreadsheet.

After a meeting between me, Ida & Gabby, we agreed on the following feedback:

We also decided not to change the requirement in the HCA schema for those fields (such as age), but only to add the option(s) for unknown/not available/not collected. We expect to wrangle projects that do not follow the Tier 1/Tier 2/Genetic Diversity schema.

arschat commented 1 month ago

We replied with the following email along with the hlca template. Lucia replied that she is in favour of the two-tab template and initially suggested specifying recommended fields, but in another meeting we agreed, at least for lung, to drop the mandate on fields and leave everything optional.

Waiting for their reply.

arschat commented 1 month ago

For now, we are removing the pipeline fields (sex_genetic, ancestry_genetic, ancestry_genetic_pipeline) from our templates.

arschat commented 1 month ago

response from Kock Kian Hong

> Thank you all for your efforts and work on the metadata. The HCA Diversity Task Force co-chairs are meeting later this week to discuss the feedback; I wanted to follow-up with a few initial comments:
>
> 1. Echoing Shyam and Lucia in agreeing with how the high(er) priority human diversity metadata is incorporated into the first tab of the spreadsheet (hlca_Tier2_GDT_flat_template.xlsx). Along the lines of Lucia's comment on priority nomenclature - 'mandatory' / 'optional' - would it be helpful to name the spreadsheet tabs by priority? From the feedback we've gathered on metadata priority nomenclature, "prioritised" and "recommended" might work.
> 2. We had some initial feedback from the HCA wrangling team on "GEOGRAPHY COLLECTION SITE LATITUDE LONGITUDE" being less feasible to include / collect - would it still be challenging to include metadata for the GPS coordinates of collection site in the first tab?
> 3. A note on the current fields for ethnicity in the first tab - there are two (perhaps three?) separate fields: "ETHNICITY" (ethnicity_1 & ethnicity_2) and "ETHNICITY FREE TEXT" (ethnicity_free_text). The description for the "ethnicity_1", "ethnicity_2" fields brings in "Ancestry" and also tries to classify ethnicities at a "continental" / regional-level ("African Ancestry", "American Ancestry", "Central Asia and Siberia", etc.). During our HCA Diversity Task Force discussions, we debated such a formulation ("continental" reporting on ethnicity), and the feedback we got was that it was probably better to leave this as a free-text field, with terms curated by data contributors - in part due to ethnicity encompassing factors beyond geography and genetic ancestry. The examples in the spreadsheet also appear to draw an equivalence between ethnicity and ancestry.
> 4. On "pipelines for genetic ancestry/sex" - I think fine to leave out from the spreadsheet.
> 5. On "Parent & grandparent fields", leaving them in the secondary tab would work for us (they are placed under a lower priority tier in our current human diversity metadata specifications document due to concerns regarding metadata collection feasibility).
> 6. On "unknown" - I think the single "unknown" option is fine for now, though looking at the first tab, it appears that other bionetworks have related thoughts on granularity of the "unknown" response. For example, under "SAMPLE COLLECTION TIME POINT" for the first tab of the lung bionetwork spreadsheet, "not_shareable" is listed as a possible response. This is similar to our rationale for "not_available" (information collected, but cannot be shared via the HCA managed access framework - leaving the option for researchers to contact data contributors separately if interested).
>
> We will get back to you on other points as soon as possible.


There are some actions and points that need to be addressed: