Judong Shen & Andrew Slater
This workflow uses HIBAG (1) to impute 4-digit classical HLA alleles from SNV genotypes in the xMHC region and converts the resulting probabilities to a binary-expanded set of doses in minimac format. HIBAG was developed in a collaboration between GSK and the University of Washington, which maintains the R package and hosts a series of pre-fit classification models on its website.
Currently, the pre-fit models are all trained from a single reference dataset of individuals with both classical HLA genotypes (determined by direct assay) and SNV genotypes from arrays of the Illumina 1M class (the specific version or versions are unclear). To train each model, the SNV genotypes in the reference dataset were subset to the variants that are on the array of interest and polymorphic in the ancestry group of interest. For example, the Asian model for the Affymetrix Genome-Wide Human SNP Array 5.0 was trained on the Asian individuals from the reference dataset, after removing SNVs that are monomorphic in this subset of individuals and SNVs not assayed by the Affymetrix Genome-Wide Human SNP Array 5.0.
1) Zheng X, Shen J, Cox C, Wakefield J, Ehm M, Nelson M, Weir BS. HIBAG – HLA Genotype Imputation with Attribute Bagging. Pharmacogenomics Journal (2013). doi: 10.1038/tpj.2013.18.
This workflow consists of a csh driver script which calls R scripts to perform the following steps:
An ancestry map file is included
If ethnicity is Hispanic, ancestry is Hispanic. Otherwise, use this table to map race to ancestry:
Race | Ancestry |
---|---|
African American/African Heritage | African |
American Indian or Alaskan Native | Broad |
Asian - Central/South Asian Heritage | Broad |
Asian - East Asian Heritage | Asian |
Asian - Japanese Heritage | Asian |
Asian - South East Asian Heritage | Asian |
Native Hawaiian or Other Pacific Islander | Broad |
White - Arabic/North African Heritage | European |
White - White/Caucasian/European Heritage | European |
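The mapping rule above can be sketched in Python as follows. This is only an illustration, not the workflow's own code: the function name `assign_ancestry` is hypothetical, and it assumes ethnicity is recorded as the literal string "Hispanic" and that a race missing from the table should raise an error, since the source does not define a fallback.

```python
# Race-to-ancestry lookup copied from the table above.
RACE_TO_ANCESTRY = {
    "African American/African Heritage": "African",
    "American Indian or Alaskan Native": "Broad",
    "Asian - Central/South Asian Heritage": "Broad",
    "Asian - East Asian Heritage": "Asian",
    "Asian - Japanese Heritage": "Asian",
    "Asian - South East Asian Heritage": "Asian",
    "Native Hawaiian or Other Pacific Islander": "Broad",
    "White - Arabic/North African Heritage": "European",
    "White - White/Caucasian/European Heritage": "European",
}


def assign_ancestry(race, ethnicity):
    """Return the ancestry group for one subject (hypothetical helper).

    Hispanic ethnicity takes precedence over race. A race that is not in
    the table raises KeyError (an assumption; no fallback is documented).
    """
    # Assumption: ethnicity is coded as the literal string "Hispanic".
    if ethnicity.strip().lower() == "hispanic":
        return "Hispanic"
    return RACE_TO_ANCESTRY[race]


if __name__ == "__main__":
    print(assign_ancestry("Asian - Japanese Heritage", "Not Hispanic"))
    print(assign_ancestry("African American/African Heritage", "Hispanic"))
```

Checking ethnicity first mirrors the order of the rule: the table is consulted only for subjects who are not Hispanic.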
Call the driver script. Below is an example command for a plink dataset named PGxNNN.bed and an ancestry map file named ancestry.txt, both in the current directory. nohup is recommended because the run takes several hours (~3 hours for 150 subjects). In this example, stdout and stderr are redirected to files in the current directory (re-running will overwrite these files).
nohup /GWD/appbase/projects/statgen/GXapp/HIBAGImputation/RUN_HIBAG_HLA_IMPUTATION.sh PGxNNN ancestry.txt >myrun.out 2>myrun.err
Alternatively, to submit to SGE with e-mail notification:
qsub -N PGxNNN -q dl580 -b y -l mt=5G -m e -cwd \
-e myrun.err \
-o myrun.out \
/GWD/appbase/projects/statgen/GXapp/HIBAGImputation/RUN_HIBAG_HLA_IMPUTATION.sh \
PGxNNN ancestry.txt
The workflow proceeds sequentially (cost/benefit of adding parallel computing support is unclear) as follows:
For each ancestry group present in the data according to the ancestry map file, iterate over the relevant pre-fit models in /GWD/appbase/projects/RD-MDD-GX_PUBLIC/HIBAG_Classifiers (downloaded July 2, 2015), that is, the models whose file names contain the name of the ancestry group. For each model and locus:
Determine the coordinate and alleles of the SNVs and their contribution to the model.
Remove any ambiguous 'A/T' and 'C/G' SNVs.
Remove any SNVs whose alleles do not match those observed in the data for the same coordinate.
Using the remaining SNVs, summarize the overlap with the data and append the summary to Results_CheckSNPOverlap/comparison.txt:
Metric | Description |
---|---|
num.classifier | Number of classifiers in model |
num.model.snp | Number of SNVs in model |
mean.snp.in.classif | Average number of SNVs in a classifier |
mean.haplo.in.classif | Average number of haplotypes in a classifier |
mean.accuracy | Average accuracy of a classifier |
num.model.in.data | Number of SNVs in model and data (excludes ambiguous & mis-matching alleles) |
pct.model.in.data | num.model.in.data / num.model.snp |
sum.miss.pctl | The number of classifiers each SNV contributes to is converted to percentile as a measure of importance (higher = more important). This is the sum of the percentiles of SNVs not counted in num.model.in.data. |
Append the names of the overlapping SNVs in the data to Results_CheckSNPOverlap/[Model File Name].extract.IDs.txt.
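The SNV filtering and the overlap metrics described above can be illustrated with a minimal Python sketch. It is not the workflow's R code. The function names and the input dictionaries (coordinate mapped to an allele pair, and coordinate mapped to the number of classifiers using that SNV) are hypothetical. Two further assumptions are made: alleles are compared as unordered pairs with no strand flipping, and the percentile is the share of model SNVs with an equal or lower classifier count. The source does not specify the exact percentile definition.

```python
import bisect

# Strand-ambiguous allele pairs that are removed from the model SNVs.
AMBIGUOUS = {frozenset("AT"), frozenset("CG")}


def overlapping_snvs(model_snps, data_snps):
    """Return sorted coordinates of model SNVs that are usable in the data."""
    keep = []
    for pos, alleles in model_snps.items():
        pair = frozenset(alleles)
        if pair in AMBIGUOUS:
            continue  # ambiguous A/T or C/G SNV
        if pos not in data_snps:
            continue  # coordinate not present in the data
        if frozenset(data_snps[pos]) != pair:
            continue  # alleles do not match at the same coordinate
        keep.append(pos)
    return sorted(keep)


def percentile_ranks(classifier_counts):
    """Percentile (0-100] of each SNV by number of classifiers using it.

    Assumption: percentile = share of SNVs with an equal or lower count.
    """
    values = sorted(classifier_counts.values())
    n = len(values)
    return {pos: 100.0 * bisect.bisect_right(values, count) / n
            for pos, count in classifier_counts.items()}


def overlap_summary(model_snps, data_snps, classifier_counts):
    """Compute a subset of the metrics written to comparison.txt."""
    kept = set(overlapping_snvs(model_snps, data_snps))
    pct = percentile_ranks(classifier_counts)
    missing = [pos for pos in model_snps if pos not in kept]
    return {
        "num.model.snp": len(model_snps),
        "num.model.in.data": len(kept),
        "pct.model.in.data": len(kept) / len(model_snps),
        "sum.miss.pctl": sum(pct[pos] for pos in missing),
    }


if __name__ == "__main__":
    # Toy inputs, invented for illustration only.
    model = {100: ("A", "G"), 200: ("A", "T"), 300: ("C", "T"), 400: ("G", "T")}
    data = {100: ("G", "A"), 200: ("A", "T"), 300: ("C", "A")}
    counts = {100: 10, 200: 2, 300: 5, 400: 1}
    print(overlapping_snvs(model, data))
    print(overlap_summary(model, data, counts))
```

In the toy inputs, only the SNV at coordinate 100 survives: 200 is ambiguous, 300 has mismatching alleles, and 400 is absent from the data. A large sum.miss.pctl therefore signals that the SNVs lost from a model were among those used by many classifiers.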