clingen-data-model / clingen-interpretation

Allele (variant) interpretation model and API for ClinGen

Implement terms for Population Group value set in NCIt #236

Closed mbrush closed 3 years ago

mbrush commented 6 years ago

As part of our efforts to evolve SEPIO to support ClinGen variant interpretation data, we need classes for 'Population' types described in the gnomAD/Broad, IGSR, and ESP exome sequencing databases.

We would like to use the NCIt as the home for these classes, as its 'Population Group' hierarchy seems broad enough to cover all population types defined in these systems based on race, ethnicity, and geography. Many terms will map to existing classes, but many new terms would have to be created in the NCIt.

Lyuba Remennik from the NCIt's EVS has indicated that they are interested in expanding the representation of population groups in NCIt to support epidemiologic research as well as the looming Data Commons. So we can work with the NCIt and other efforts needing Population/Group value sets in this work.

  1. African/African American
  2. Latino
  3. Ashkenazi Jewish
  4. East Asian
  5. Finnish
  6. Non-Finnish European
  7. South Asian
  8. European-American
  9. African-American
  10. African Caribbeans in Barbados
  11. African
  12. Admixed American
  13. Americans of African Ancestry in SW USA
  14. Bengali from Bangladesh
  15. Chinese Dai in Xishuangbanna, China
  16. Utah Residents (CEPH) with Northern and Western European Ancestry
  17. Han Chinese in Beijing, China
  18. Southern Han Chinese
  19. Colombians from Medellin, Colombia
  20. East Asian
  21. Esan in Nigeria
  22. European
  23. Finnish in Finland
  24. British in England and Scotland
  25. Gujarati Indian from Houston, Texas
  26. Gambian in Western Divisions in the Gambia
  27. Iberian Population in Spain
  28. Indian Telugu from the UK
  29. Japanese in Tokyo, Japan
  30. Kinh in Ho Chi Minh City, Vietnam
  31. Luhya in Webuye, Kenya
  32. Mende in Sierra Leone
  33. Mexican Ancestry from Los Angeles USA
  34. Peruvians from Lima, Peru
  35. Punjabi from Lahore, Pakistan
  36. Puerto Ricans from Puerto Rico
  37. South Asian
  38. Sri Lankan Tamil from the UK
  39. Toscani in Italia
  40. Yoruba in Ibadan, Nigeria
  41. Combined Population
  42. Overall Population
  43. Other

mbrush commented 6 years ago

Next Steps:

  1. Validate ClinGen model use case: confirm with ClinGen that they still need all terms in their Population value set.
  2. Identify other partners/use cases: reach out to other efforts (in ClinGen and beyond) with similar use cases for Population/Group value sets, and collect requirements/terms from them.
  3. Evaluate NCIt: review the current Population Group hierarchy and determine how well it supports these use cases, and what changes would be needed.
  4. Improve NCIt: work with NCIt to extend/refine population group representation.

larrybabb commented 6 years ago

Matt - ClinGen needs these population types so that it can follow the ACMG guidelines when applying specific population allele frequency data as evidence for the criterion that requires it. The ACMG guidelines specifically reference ExAC, 1000 Genomes, and ESP. If we do not have codes for the population types these sources may reference, we would not be able to represent these 3 repos of population allele frequency data in our variant assessment. So #1 is not just for ClinGen; it is for any group that wants to apply the ACMG guidelines while being able to represent the data in these 3 repos. Is that reasonable enough to cover #1 and #2 above, or would you need to hear this from an ACMG representative? I have heard that ESP data is not necessarily as useful, I think because of the emergence of gnomAD and other sources of population allele frequency data. But there may still be those who want to follow the ACMG recommendation by the book, so even if that is so, it makes sense to provide supporting codes for the population types so that this data can be applied computationally.

mbrush commented 6 years ago

Thanks Larry - I was mainly recording this ticket for the benefit of Alice Popejoy and Lyuba Remennik as context for my upcoming calls with them, where we will work to get these terms into the NCIt. In the meantime I will put placeholders in SEPIO, and keep you posted when/if we get NCIt replacements.
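
For anyone following along, here is a minimal sketch of how the placeholder approach could work in practice. This is illustrative only, not the SEPIO implementation. The `SEPIO-PLACEHOLDER` prefix, the `PopulationTerm` class, both helper functions, and the `NCIT:EXAMPLE-0001` code are all hypothetical names invented for this example; no real NCIt or SEPIO identifiers are used. The sketch gives each distinct label a temporary identifier (the list above repeats "East Asian" and "South Asian") and prefers an NCIt code once one is supplied.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class PopulationTerm:
    label: str
    placeholder_id: str              # hypothetical temporary identifier
    ncit_code: Optional[str] = None  # set only once an NCIt term exists

    @property
    def preferred_id(self) -> str:
        # Use the NCIt code when available, otherwise the placeholder.
        return self.ncit_code if self.ncit_code is not None else self.placeholder_id


def build_value_set(labels: Iterable[str],
                    prefix: str = "SEPIO-PLACEHOLDER") -> Dict[str, PopulationTerm]:
    """Assign one placeholder per distinct label (case-insensitive)."""
    value_set: Dict[str, PopulationTerm] = {}
    for label in labels:
        key = label.strip().casefold()
        if key in value_set:
            continue  # skip repeated labels such as a second "East Asian"
        placeholder = f"{prefix}:{len(value_set) + 1:04d}"
        value_set[key] = PopulationTerm(label.strip(), placeholder)
    return value_set


def apply_ncit_mapping(value_set: Dict[str, PopulationTerm],
                       mapping: Dict[str, str]) -> Dict[str, PopulationTerm]:
    """Return a copy of the value set with NCIt codes attached where known."""
    updated = {}
    for key, term in value_set.items():
        code = mapping.get(key)
        updated[key] = replace(term, ncit_code=code) if code else term
    return updated


labels = ["East Asian", "Finnish", "South Asian", "East Asian"]
value_set = build_value_set(labels)
# "NCIT:EXAMPLE-0001" is a made-up code used purely for illustration.
value_set = apply_ncit_mapping(value_set, {"finnish": "NCIT:EXAMPLE-0001"})
for term in value_set.values():
    print(term.label, term.preferred_id)
```

Keeping the placeholder alongside the optional NCIt code means data recorded before the NCIt terms exist can still be traced to the final term later.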