broadinstitute / gnomad_local_ancestry

Hail batch pipeline and scripts for local ancestry inference
MIT License

Convert simulated admixed data from hap files to VCF #78

Closed mike-w-wilson closed 3 years ago

mike-w-wilson commented 3 years ago

Marcos and Jessica use the default SHAPEIT output (haps/sample files), but the batch LAI pipeline currently accepts only VCFs. Use SHAPEIT to convert their files to VCF.
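For reference, here is a minimal Python sketch of what the haps/sample to VCF mapping amounts to. It is not the pipeline's code and not a replacement for SHAPEIT's own converter. It assumes the usual haps layout (chromosome, variant ID, position, first allele, second allele, then two 0/1 haplotype columns per sample) and a sample file with two header rows. Treating the first allele as REF and taking the sample name from the second column are assumptions, and the function names are hypothetical.

```python
def read_sample_ids(sample_lines):
    """Return sample IDs from .sample lines, skipping the two header rows."""
    rows = [line.split() for line in sample_lines if line.strip()]
    # Assumption: the second column (ID_2) holds the individual ID.
    return [row[1] for row in rows[2:]]


def haps_line_to_vcf(line):
    """Convert one .haps record into one tab-separated VCF data line."""
    fields = line.split()
    chrom, snp_id, pos, first_allele, second_allele = fields[:5]
    alleles = fields[5:]
    if len(alleles) % 2 != 0:
        raise ValueError("expected two haplotype columns per sample")
    # Pair consecutive haplotype columns into phased genotypes such as 0|1.
    genotypes = [f"{alleles[i]}|{alleles[i + 1]}" for i in range(0, len(alleles), 2)]
    # Assumption: the first allele is written as REF, the second as ALT.
    return "\t".join(
        [chrom, pos, snp_id, first_allele, second_allele, ".", "PASS", ".", "GT", *genotypes]
    )


def haps_to_vcf(haps_lines, sample_lines):
    """Return the lines of a minimal phased VCF built from haps/sample content."""
    sample_ids = read_sample_ids(sample_lines)
    header = [
        "##fileformat=VCFv4.2",
        '##FORMAT=<ID=GT,Number=1,Type=String,Description="Phased genotype">',
        "\t".join(
            ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO", "FORMAT", *sample_ids]
        ),
    ]
    records = [haps_line_to_vcf(line) for line in haps_lines if line.strip()]
    return header + records
```

Each sample contributes two haplotype columns in the haps file and one phased genotype column in the VCF, so a haps record for N samples has 5 + 2N fields.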

mike-w-wilson commented 3 years ago

The haps/sample data provided by Marcos and Jessica is located here: gs://gnomad-batch/mwilson/simulated_data/

Jessica provided this description of the files:

Files named "Mix" refer to 33% EUR / 33% AFR / 33% NAT simulations, and files named "Brasa" refer to 60% EUR / 25% AFR / 15% NAT ones. Both were simulated assuming a single pulse of admixture 17 generations ago. We have simulated only chr1 for now, but can simulate and send files for the other chromosomes if you need them.

Here is a brief description of the files:

Files for simulating admixed individuals. .phgeno contains the genetic information; .sample are the 1kg (AFR and EUR) and HGDP (NAT) IDs of the reference samples for simulations; .dat contains proportions for simulating each admixture pattern.

AFR1.phgeno EUR1.phgeno NAT1.phgeno
hgdp_EUR_chr1_AdmixSimu.sample hgdp_AFR_chr1_AdmixSimu.sample hgdp_NAT_chr1_AdmixSimu.sample
Mix.dat BRASA.dat

Simulated haplotypes (admix-simu output):

Brasa1.hanc2 Brasa.chr1.haps Brasa1.hanc Brasa1.phgeno Brasa1.bp Brasa1.log Brasa.chr1.sample

Mix1.hanc2 Mix.chr1.haps Mix1.hanc Mix1.phgeno Mix1.bp Mix1.log Mix.chr1.sample

The Brasa.chr1 and Mix.chr1 haps/sample files were converted to the VCF format using shapeit2 on the gnomad_lai virtual machine and copied to the same bucket.
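The exact commands that were run are not recorded in this thread. As a hedged sketch, SHAPEIT2's documented conversion mode takes a haps/sample prefix via `--input-haps` and writes a VCF via `--output-vcf`. The helper name, the `shapeit` binary name on the PATH, and the output file names below are assumptions.

```python
import shlex


def shapeit_convert_cmd(prefix, out_vcf, binary="shapeit"):
    """Build the argument list for a SHAPEIT2 haps/sample to VCF conversion.

    `prefix` is expected to resolve to <prefix>.haps and <prefix>.sample.
    """
    return [binary, "-convert", "--input-haps", prefix, "--output-vcf", out_vcf]


# Hypothetical output names; the real file names in the bucket may differ.
commands = [
    shapeit_convert_cmd(prefix, f"{prefix}.vcf")
    for prefix in ("Brasa.chr1", "Mix.chr1")
]

for command in commands:
    # Print the commands for review; pass each list to subprocess.run to execute.
    print(shlex.join(command))
```

Building the command as a list avoids shell quoting issues if it is later passed to `subprocess.run(command, check=True)`.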