broadinstitute / gnomad_local_ancestry

Hail batch pipeline and scripts for local ancestry inference
MIT License

Add HGDP, TGP, and pop option to subset #135

Closed mike-w-wilson closed 10 months ago

mike-w-wilson commented 11 months ago

This updates the pre-phasing subset code to allow selecting a genetic ancestry group and including HGDP and TGP samples. The script now uses the gnomAD v3 VDS and takes the AF, info, and filter field annotations from the v4 release.

The purpose of the script is to subset gnomAD to an admixed group, e.g. 'amr' or 'afr', and filter to high-callrate (callrate > 0.9), common (genetic ancestry group AF > 0.1%), bi-allelic SNPs. Phasing needs high-callrate SNPs, and local ancestry inference cannot be done with confidence on rare sites. The script then exports a single VCF per chromosome, as the phasing tool requires.
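For reviewers, the site filter described above can be sketched in plain Python. This is only an illustration of the criteria, not the Hail code in the script: the `Site` record, its field names, and the `keep_site` helper are hypothetical, and the thresholds are the ones stated in this description.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Site:
    """Hypothetical per-variant record; field names are illustrative only."""
    ref: str
    alts: List[str]
    callrate: float  # fraction of samples with a called genotype
    pop_af: float    # allele frequency in the selected genetic ancestry group


def is_snp(ref: str, alt: str) -> bool:
    """True when both alleles are single, distinct nucleotides."""
    return (
        len(ref) == 1
        and len(alt) == 1
        and ref in "ACGT"
        and alt in "ACGT"
        and ref != alt
    )


def keep_site(site: Site, min_callrate: float = 0.9, min_af: float = 0.001) -> bool:
    """Keep bi-allelic SNPs with callrate > 0.9 and group AF > 0.1%."""
    return (
        len(site.alts) == 1                  # bi-allelic: exactly one alt allele
        and is_snp(site.ref, site.alts[0])   # SNPs only, no indels
        and site.callrate > min_callrate     # high callrate, needed for phasing
        and site.pop_af > min_af             # common in the selected group
    )


sites = [
    Site("A", ["G"], 0.95, 0.05),       # passes all filters
    Site("A", ["G", "T"], 0.95, 0.05),  # multi-allelic
    Site("A", ["AG"], 0.95, 0.05),      # insertion, not a SNP
    Site("C", ["T"], 0.80, 0.05),       # low callrate
    Site("C", ["T"], 0.95, 0.0005),     # too rare
]
kept = [s for s in sites if keep_site(s)]
```

Both thresholds are treated as strict inequalities here, matching the ">" used in the description; whether the script itself uses ">" or ">=" is not something this sketch confirms.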

I've added the option to include HGDP and TGP samples because the VCF we will deliver to Elizabeth Atkinson's team needs them: her team will further subset the VCF and build a reference panel from HGDP and TGP for local ancestry inference. I've also included site statistics such as QD, FS, and MQ. The gnomAD production team uses these when running QC, and I wanted Elizabeth's team to have the option to do further QC on the exported VCF.

To test this, I ran:

hailctl dataproc submit mw subset_vcf_for_phase.py --test --pop afr --contigs chr1 --hgdp --tgp --output-path gs://gnomad-tmp-4day/mwilson/lai/afr --overwrite

@KoalaQin, please let me know if you have any questions about the project that would help with your review. Thank you!

KoalaQin commented 11 months ago

The main purpose of this code is to subset to a given pop, optionally adding HGDP/TGP samples and a list of samples (regardless of which pop they belong to), while using the AF of that pop to filter the variants, right? If you also subset based on a TSV file, do you need an AF reference for those samples?

mike-w-wilson commented 11 months ago

So pop should be required; I can update that. We don't need an AF reference for the input TSV samples. This script is very specific to the LAI pipeline: it subsets all samples that will be analyzed alongside the reference samples, which will be composed of a number of different genetic ancestry groups.
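A minimal sketch of what making pop required could look like, assuming the script parses its options with `argparse`. The flag names are taken from the test command in the description; the `build_parser` helper, argument types, and help strings are guesses for illustration, not the script's actual code.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the flags in the test command."""
    parser = argparse.ArgumentParser(
        description="Subset gnomAD to one genetic ancestry group before phasing."
    )
    # Required: the group AF used for variant filtering depends on it.
    parser.add_argument(
        "--pop", required=True, help="Genetic ancestry group, e.g. 'amr' or 'afr'."
    )
    parser.add_argument("--contigs", nargs="+", help="Contigs to export, e.g. chr1.")
    parser.add_argument("--hgdp", action="store_true", help="Include HGDP samples.")
    parser.add_argument("--tgp", action="store_true", help="Include TGP samples.")
    parser.add_argument("--test", action="store_true", help="Run in test mode.")
    parser.add_argument("--output-path", help="Destination for the exported VCFs.")
    parser.add_argument("--overwrite", action="store_true", help="Overwrite output.")
    return parser


args = build_parser().parse_args(
    ["--test", "--pop", "afr", "--contigs", "chr1", "--hgdp", "--tgp",
     "--output-path", "gs://gnomad-tmp-4day/mwilson/lai/afr", "--overwrite"]
)
```

With `required=True`, omitting `--pop` makes `argparse` exit with a usage error before any data is read.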