atgu / hgdp_tgp

How to Download Phased Data #4

Closed kulmsc closed 1 year ago

kulmsc commented 1 year ago

Thank you for creating an excellent reference dataset and releasing clear tutorials. While the tutorials are great to follow, I think many people will simply want the actual data downloaded onto their own system. As far as I can tell, the tutorials do not explain how to download this reference dataset. After struggling to set up a Google Cloud cluster and use Hail, I realized (thanks to a question on Twitter) that the phased data can be downloaded with the following command:

gsutil cp gs://gcp-public-data--gnomad/resources/hgdp_1kg/phased_haplotypes/hgdp.tgp.gwaspy.merged.chr[1-22].merged.bcf .

Adding this information to the tutorials could save future users a lot of time. (It would also be good to note that the data used in the tutorials is not phased.)

janxkoci commented 1 year ago

Thanks so much for this! I have access to several HPC clusters and zero reasons to pay for Google Cloud. One mamba install gsutil later, I am happily downloading the data to one of my clusters :relaxed:

I just had to tweak your command a bit, as it only downloaded chromosomes 1 & 2, so I used bash brace expansion instead:

gsutil cp gs://gcp-public-data--gnomad/resources/hgdp_1kg/phased_haplotypes/hgdp.tgp.gwaspy.merged.chr{1..22}.merged.bcf .
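The likely reason the original command fetched only chromosomes 1 and 2 is that [1-22] is a single-character class (the range 1-2 plus a literal 2), not a numeric range. A minimal sketch using the shell's own case pattern matching, which follows the same bracket rules; the helper name matches_class is hypothetical, and it is an assumption that gsutil's wildcard handling treats brackets the same way:

```shell
# Hypothetical helper: report whether "chrN." matches the pattern chr[1-22].
# [1-22] means ONE character from the set {1, 2}, so only the
# single digits 1 and 2 can sit between "chr" and the following dot.
matches_class() {
  case "chr$1." in
    chr[1-22].) echo "chr$1: matched" ;;
    *)          echo "chr$1: not matched" ;;
  esac
}

# Try a few chromosome numbers, including two-digit ones.
for n in 1 2 3 10 22; do
  matches_class "$n"
done
```

Only chr1 and chr2 are reported as matched, which is consistent with the behaviour described above.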

Edit: after a few chromosomes, the tool suggested using gsutil -m cp for parallel downloads, very cool!

Edit 2: I see the genotypes in the BCF files as phased (I checked a few sites at the top, for two chromosomes).
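One way to spot-check phasing, sketched under assumptions: in VCF/BCF genotype strings a | separator marks a phased call and / an unphased one. The helper count_phased is hypothetical, and feeding it real data assumes bcftools is installed, which this thread does not mention:

```shell
# Hypothetical helper: read one genotype string per line on stdin and
# count how many are phased ("|" separator) versus unphased ("/" separator).
# Genotypes with neither separator (e.g. haploid calls) are not counted.
count_phased() {
  phased=0
  unphased=0
  while IFS= read -r gt; do
    case "$gt" in
      *"|"*) phased=$((phased + 1)) ;;
      */*)   unphased=$((unphased + 1)) ;;
    esac
  done
  echo "phased=$phased unphased=$unphased"
}

# Small inline sample so the sketch runs without any data files.
printf '0|1\n1|1\n0/1\n' | count_phased
```

With bcftools available, something like bcftools query -f '[%GT\n]' hgdp.tgp.gwaspy.merged.chr22.merged.bcf | head -n 1000 | count_phased would tally the first thousand genotypes of one chromosome.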

z-koenig commented 1 year ago

Hello all and thank you for your questions/suggestions! As you found, the phased haplotypes are currently publicly available here: gs://gcp-public-data--gnomad/resources/hgdp_1kg/phased_haplotypes and can be downloaded using:

gsutil -m cp gs://gcp-public-data--gnomad/resources/hgdp_1kg/phased_haplotypes/hgdp.tgp.gwaspy.merged.chr{1..22}.merged.bcf .
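Brace expansion such as {1..22} is a bash feature, so in a shell without it the pattern may reach gsutil unexpanded. A minimal sketch that builds the same 22 URLs with a plain loop; the bucket path and file naming are taken from the command above, while the helper name list_phased_urls is hypothetical:

```shell
# Bucket path copied from the command above.
BASE="gs://gcp-public-data--gnomad/resources/hgdp_1kg/phased_haplotypes"

# Hypothetical helper: print one object URL per autosome (1 through 22),
# using only POSIX shell features (no brace expansion needed).
list_phased_urls() {
  chr=1
  while [ "$chr" -le 22 ]; do
    echo "${BASE}/hgdp.tgp.gwaspy.merged.chr${chr}.merged.bcf"
    chr=$((chr + 1))
  done
}

# Print the list; nothing is downloaded at this point.
list_phased_urls
```

The list can then be piped to gsutil -m cp -I . where the -I option tells gsutil cp to read the URLs to copy from standard input.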

Additional data and information on our resource can be found in this gnomAD release.

Our manuscript is also being updated with a data availability section, and we will be adding a data downloads section to the readme for the tutorials to show what data is available and where and how to download it. Currently only the phased haplotypes are available, but the rest of the datasets used in the tutorials will be available soon. Our tutorials are not at their final version yet, so I appreciate your patience as we tie up these loose ends.

As @kulmsc had suggested, we will also add to the readme that the tutorials do not use the phased haplotypes.

Our intention with these tutorials is to increase accessibility of the dataset by demonstrating common QC and analyses, so I hope adding in the data downloads section will help make that component more accessible as well.

We appreciate your feedback and let us know if you have any additional questions!

janxkoci commented 1 year ago

Very cool @z-koenig, many thanks for this work :relaxed:

BTW, any reason you haven't used the high-coverage resequencing of phase 3 data from 1000gp? It was published last year. The data even uses hg38 so you don't need liftover anymore :wink:

z-koenig commented 1 year ago

Thank you for your interest!

As can be seen in the manuscript, we do use the high-coverage NYGC version of the 1kGP dataset!

Our resource is made up of versions of both HGDP (Bergstrom et al.) and 1kGP (Byrska-Bishop et al.) that were recently resequenced to high coverage.

More information on the datasets we used can be found in our preprint on bioRxiv.

janxkoci commented 1 year ago

Huh, interesting - I read the preprint methods again and noticed you do mention the resequenced data there (you call them "NYGC"). But you still also say you used the original 1000gp data with only 4-8X coverage, so probably I just got it wrong / missed the NYGC data.

So, great, very nice job :relaxed:

z-koenig commented 1 year ago

No worries! The NYGC is the group of researchers who did the resequencing of the dataset, which is why we use that name to refer to it. ☺️ As for us saying we used the original phase 3 1kGP dataset, is that in the comparison section of the preprint? Our dataset only contains the newly resequenced versions of both datasets, but we did do comparisons between our harmonized resource and phase 3 1kGP.

janxkoci commented 1 year ago

Yes, I've read it in that section.