dongwonlee-lab / tsim

TSIM: Two-Stage Imputation Method

NOTE (22 May 2024): When using TOPMed Imputation Server v.2.0.0 with reference panel version r3, an imputation quality of 0.98 should be used for identifying high-quality SNPs.

NOTE (24 April 2024): The TOPMed Imputation Server v1.6.6 (Minimac v4-1.0.2 for imputation, Eagle v2.4 for phasing, r2 for reference panel) was used for testing TSIM and generating all results in the paper (see below). Since completing our analyses, the TOPMed Imputation Server has been updated to v2.0.0 (Minimac v4.1.6 for imputation, Eagle v2.4 for phasing, r3 for reference panel). We've found that the r2 filter for high-quality SNPs needs to be recalibrated for this updated reference panel and are currently working on that analysis.



Create conda virtual environment

conda env create -f environment.yml
conda activate tsim

If you get an error building wheel for cyvcf2, you can install the package manually with pip after activating the conda environment.

conda activate tsim
pip install cyvcf2


tsim has 4 subcommands. You can check the options of each with the -h flag:

rsq -h       # recalculates Rsq based on selected samples
qc -h        # applies Rsq, ER2, MAF, and HWE filters to imputed variants
overlap -h   # finds the intersection of 2 variant lists
merge -h     # merges 2 VCFs based on a variant list

NOTE: tsim was developed using output from the TOPMed Imputation Server v1.6.6 (Minimac4 for imputation, Eagle v2.4 for phasing, r2 for reference panel). We are aware that there were some recent changes in output format and are working on updating these scripts accordingly.


Before running, QC and impute your cohorts separately.

1. (optional) Recalculate Rsq based on a subset of samples

Rsq (or R2) does not generally need to be recalculated. However, it is a sample-based calculation: its value depends on which samples were included in the imputation run. If you are working with only a subset of the imputed samples, recalculating Rsq on that subset gives a more accurate measure for identifying high-quality SNPs.

python rsq -v a.vcf.gz -o a.recalc_rsq.tsv -s a.samples.txt
python rsq -v b.vcf.gz -o b.recalc_rsq.tsv -s b.samples.txt
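
To illustrate why Rsq changes with the sample set, here is a minimal sketch that assumes the commonly described Minimac-style estimate: the observed variance of the per-haplotype dosages divided by p(1 - p), where p is the mean dosage. The function name and the dosage values are hypothetical, and the rsq subcommand may compute the value differently.

```python
from typing import Optional, Sequence


def minimac_style_rsq(hap_dosages: Sequence[float]) -> Optional[float]:
    """Estimate Rsq for one variant from per-haplotype alternate-allele dosages.

    Assumption: Rsq = observed dosage variance / (p * (1 - p)), the
    commonly described Minimac-style estimate. This is not taken from
    the tsim source code.
    """
    n = len(hap_dosages)
    if n == 0:
        return None
    p = sum(hap_dosages) / n  # estimated alternate-allele frequency
    expected_var = p * (1.0 - p)  # variance if every allele were known exactly
    if expected_var <= 0.0:
        return None  # monomorphic in this sample set; Rsq is undefined
    observed_var = sum((d - p) ** 2 for d in hap_dosages) / n
    return observed_var / expected_var


# Hypothetical haplotype dosages for three samples (two haplotypes each).
all_samples = [0.9, 0.1, 0.8, 0.0, 1.0, 0.2]
subset = all_samples[:4]  # keep only the first two samples

print(minimac_style_rsq(all_samples))
print(minimac_style_rsq(subset))
```

The two printed values differ because both the allele frequency and the dosage variance are re-estimated from the retained haplotypes only.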

2. QC imputed variants

Recommended HWE command:
plink --vcf <vcf> --allow-no-sex --hardy --mpheno 4 --out <output> --pheno <fam_file> --update-sex <fam_file> 3

Default options:

If you are working with a control-only cohort and want to filter on HWE, use the --nocases flag.

python qc -r a.recalc_rsq.tsv -m a.recalc_rsq.tsv -o a.variant_qc.txt --chrom 22 --hwe a.hardy.hwe
python qc -r b.recalc_rsq.tsv -m b.recalc_rsq.tsv -o b.variant_qc.txt -c 22 --hwe b.hardy.hwe
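
The filtering logic of this step can be pictured as a set of per-variant threshold checks. The sketch below is illustrative only: the function name, the record layout, and the MAF and HWE cutoffs are hypothetical placeholders rather than tsim's defaults, and the ER2 filter is omitted for brevity. The Rsq cutoff of 0.98 is the value given in the note above for reference panel r3.

```python
from typing import Optional


def passes_qc(rsq: float, maf: float, hwe_p: Optional[float],
              min_rsq: float, min_maf: float, min_hwe_p: float) -> bool:
    """Return True if a variant clears every threshold.

    hwe_p may be None when no HWE test result is available for the
    variant; in that case the HWE check is skipped (an assumption).
    """
    if rsq < min_rsq:
        return False
    if maf < min_maf:
        return False
    if hwe_p is not None and hwe_p < min_hwe_p:
        return False
    return True


# Hypothetical variants: (id, Rsq, MAF, HWE p-value).
variants = [
    ("chr22:100:A:G", 0.995, 0.12, 0.40),
    ("chr22:200:C:T", 0.900, 0.25, 0.80),   # fails the Rsq cutoff
    ("chr22:300:G:A", 0.990, 0.001, 0.50),  # fails the MAF cutoff
]

# min_maf and min_hwe_p are placeholder values, not tsim defaults.
kept = [v[0] for v in variants
        if passes_qc(v[1], v[2], v[3], min_rsq=0.98, min_maf=0.01, min_hwe_p=1e-6)]
print(kept)
```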

3. Find overlapping high-quality variants

This command assumes that variants follow a consistent naming scheme across all cohorts.

### to create input file
ls *.variant_qc.txt > l.filelist.txt
python overlap -l l.filelist.txt -o l.overlap.txt -c 22
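
Conceptually, this step is a set intersection over variant IDs, which is why consistent naming matters. The sketch below uses a hypothetical helper and made-up IDs; it is not the overlap subcommand's implementation.

```python
from typing import Iterable, List


def intersect_variant_lists(variant_lists: Iterable[Iterable[str]]) -> List[str]:
    """Return the sorted variant IDs present in every input list."""
    sets = [set(ids) for ids in variant_lists]
    if not sets:
        return []
    return sorted(set.intersection(*sets))


# Hypothetical high-quality variant IDs from two cohorts.
cohort_a = ["chr22:100:A:G", "chr22:200:C:T", "chr22:300:G:A"]
cohort_b = ["chr22:200:C:T", "chr22:300:G:A", "chr22:400:T:C"]
print(intersect_variant_lists([cohort_a, cohort_b]))

# IDs are compared as exact strings, so a different naming scheme
# (here, a missing "chr" prefix) yields no overlap at all.
cohort_c = ["22:200:C:T", "22:300:G:A"]
print(intersect_variant_lists([cohort_a, cohort_c]))
```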

4. Merge VCFs

### to create input file
echo "a.vcf.gz,l.overlap.txt,a.samples.txt" > l.mergelist.txt
echo "b.vcf.gz,l.overlap.txt,b.samples.txt" >> l.mergelist.txt
python merge -l l.mergelist.txt -o merged.vcf.gz -c 22 --snpsonly
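
With more than two cohorts, the merge list can be generated rather than written by hand. This sketch assumes every cohort follows the file naming used in the example above (<prefix>.vcf.gz and <prefix>.samples.txt); the helper name is hypothetical.

```python
from typing import List, Sequence


def mergelist_lines(cohorts: Sequence[str], overlap_file: str) -> List[str]:
    """Build one 'vcf,variant_list,samples' line per cohort prefix."""
    return [f"{c}.vcf.gz,{overlap_file},{c}.samples.txt" for c in cohorts]


lines = mergelist_lines(["a", "b"], "l.overlap.txt")
print("\n".join(lines))  # redirect or write these lines to l.mergelist.txt
```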

5. Impute the merged VCFs.

The rsq and qc functions may also be used after the second stage of imputation.

Please cite the paper below:

Anya Greenberg, Kaylia Reynolds, Michelle T McNulty, Matthew G. Sampson, Hyun Min Kang, Dongwon Lee. "Accurate cross-platform GWAS analysis via two-stage imputation."