brentp / somalier

fast sample-swap and relatedness checks on BAMs/CRAMs/VCFs/GVCFs... "like damn that is one smart wine guy"
MIT License

Ancestry estimate using gnomad v3 data? #67

Open theodorc opened 3 years ago

theodorc commented 3 years ago

Hi, I'm wondering whether gnomAD could be used instead of 1000 Genomes for the ancestry estimate. The somalier files for 1000 Genomes are restricted to only 5 superpopulations, whereas gnomAD covers a wider range of ancestries. It would be great to have somalier files for gnomAD, if possible.

brentp commented 3 years ago

Someone from gnomAD would have to do that, as the genotypes are not public. The NYGC now has 3202 samples and covers more ancestries. I'd accept contributions for either of these. Also note that users who have their own labeled samples can just use somalier ancestry as-is with their internal samples.
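
For anyone following the labeled-samples route mentioned above, here is a minimal sketch of writing a labels table for in-house samples. The two-column, tab-separated layout (sample ID, ancestry label), the header names, and the sample IDs are all assumptions for illustration; check the layout somalier actually expects against its documentation and the labels file shipped for 1000 Genomes.

```python
import csv
import io


def write_labels(sample_to_label, handle):
    """Write one 'sample<TAB>label' row per sample, sorted by sample ID.

    The header names below are assumed, not taken from somalier itself.
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["#sample_id", "ancestry"])
    for sample_id in sorted(sample_to_label):
        writer.writerow([sample_id, sample_to_label[sample_id]])


# Hypothetical in-house samples with self-assigned ancestry labels.
labels = {"sampleB": "EUR", "sampleA": "AFR"}

buf = io.StringIO()
write_labels(labels, buf)
labels_tsv = buf.getvalue()
print(labels_tsv, end="")
```

In practice the same function can write to a file opened with `open("labels.tsv", "w", newline="")` instead of an in-memory buffer.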

theodorc commented 3 years ago

The genotypes for the HGDP samples (n=780) are public, according to their blog, but some work would be needed to parse them into VCF format for somalier to ingest. I will share if I get around to it. Thanks.

"...The samples included in this subset are drawn from the 1000 Genomes Project (n=2,435) and the Human Genome Diversity Project (n=780), which contain some of the most genetically diverse populations present in gnomAD. Collectively they represent human genetic diversity sampled across >60 distinct populations from Africa, Europe, the Middle East, South and Central Asia, East Asia, Oceania, and the [Americas.]..."

https://gnomad.broadinstitute.org/downloads#v3-hgdp-1kg
https://gnomad.broadinstitute.org/blog/2020-10-gnomad-v3-1-new-content-methods-annotations-and-data-availability/#the-gnomad-hgdp-and-1000-genomes-callset
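
A rough sketch of how the workflow discussed in this thread might be scripted once the HGDP + 1000 Genomes genotypes are available as a VCF. It only assembles the command lines and does not run them. The `extract` and `ancestry` invocations follow the usage shown in the somalier README as best I recall it (`-d`, `--sites`, `-f`, `--labels`, and the `++` separator between labeled and query samples), so verify them against `somalier extract --help` and `somalier ancestry --help`. Every file name below is hypothetical.

```python
import shlex


def extract_cmd(vcf, sites, fasta, out_dir):
    """Command to extract somalier sketches from a (joint) VCF."""
    return ["somalier", "extract", "-d", out_dir,
            "--sites", sites, "-f", fasta, vcf]


def ancestry_cmd(labels_tsv, labeled_files, query_files):
    """Command to predict ancestry for query samples from labeled ones."""
    return (["somalier", "ancestry", "--labels", labels_tsv]
            + list(labeled_files) + ["++"] + list(query_files))


# Hypothetical inputs: a public HGDP + 1KG VCF subset and in-house queries.
extract = extract_cmd("hgdp_1kg.subset.vcf.gz", "sites.hg38.vcf.gz",
                      "GRCh38.fa", "hgdp-1kg-somalier/")
ancestry = ancestry_cmd("hgdp-1kg-labels.tsv",
                        ["hgdp-1kg-somalier/HGDP00001.somalier"],
                        ["query-somalier/sampleA.somalier"])

# shlex.join quotes arguments safely for copy-pasting into a shell.
print(shlex.join(extract))
print(shlex.join(ancestry))
```

To execute the commands rather than print them, the argument lists can be passed directly to `subprocess.run(cmd, check=True)`. Passing explicit file lists avoids relying on shell glob expansion.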