CompoundHetVIP is designed to be used with gVCF or VCF files. Please see CompoundHetVIP_example.pdf for example code and a description of each step of the pipeline. Here is a brief overview of each step:
Keep variant-only sites of VCF or gVCF files
Combine each trio into a single file
Liftover trio files and individual files from GRCh38 to GRCh37
Remove unplaced sites, multiallelic sites, and duplicate sites from lifted files
Separate the VCF file into per-chromosome files, then generate PLINK files for each chromosome file
Phase each of the trios with a haplotype reference panel using SHAPEIT2, Beagle, or Eagle2
Revert REF/ALT to be congruent with reference panel
Concatenate and merge phased trio chromosome files into one VCF file
Trim and normalize VCF file
Annotate with snpEff
Load VCF as GEMINI database
Query for compound heterozygous (CH) variants (also supports de novo and homozygous alternate variant identification)
Add Gene Damage Index scores and gene lengths to files
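To illustrate the first step in the overview above (keeping variant-only sites), here is a minimal Python sketch. It is not the pipeline's own script. The function name `keep_variant_sites` is hypothetical, and the sketch assumes a GATK-style gVCF in which non-variant sites carry only `<NON_REF>` (or `.`) in the ALT column.

```python
def keep_variant_sites(vcf_lines):
    """Yield header lines and records that have at least one real ALT allele.

    Assumption: non-variant gVCF sites list only "<NON_REF>" or "." as ALT.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # keep all header lines unchanged
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 5 (index 4) of a VCF record is ALT, comma-separated
        real_alts = [a for a in fields[4].split(",") if a not in (".", "<NON_REF>")]
        if real_alts:
            yield line


# Example usage with made-up records
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t100\t.\tA\t<NON_REF>\t.\t.\tEND=150\n",
    "1\t200\t.\tC\tT,<NON_REF>\t50\t.\t.\n",
]
for kept_line in keep_variant_sites(example):
    print(kept_line, end="")
```

The example operates on plain text lines so that it has no dependencies; for real files, a dedicated tool such as BCFtools is likely the better choice.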
The Docker image, compound-het-vip, contains all the tools needed to identify compound heterozygous variants using VCF or gVCF files. Tools available and used in the container include: Plink2 (1, 2), Picard (3), GATK4 (4), SAMtools (5), BCFtools (5), SHAPEIT2 (6), Beagle (7), Eagle2 (8), vt (9), SnpEff (10), GEMINI (11), Gene Damage Index (12), and any necessary dependencies.
A compound heterozygous variant occurs when a person inherits a variant from one parent within a specific gene and also inherits another variant from the other parent at a different position within the same gene (13). The result is two recessive alleles, one on each copy of the gene, that together may cause disease. To detect these variants, it is necessary to differentiate between paternally and maternally derived nucleotides. If sequencing has already taken place, computational algorithms can help determine which nucleotides were inherited from each parent through a process termed “phasing” (14).
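The definition above can be sketched in a few lines of Python. This is an illustration only, not the query that GEMINI runs. The function name `compound_het_genes` and the input layout (gene, position, phased genotype) are hypothetical, and the sketch assumes biallelic sites with phased genotypes written as `1|0` or `0|1`, where the side of the `|` identifies the haplotype that carries the alternate allele.

```python
from collections import defaultdict


def compound_het_genes(variants):
    """Return genes with heterozygous ALT alleles on both haplotypes.

    variants: iterable of (gene, position, phased_genotype) tuples.
    Unphased ("0/1") and homozygous genotypes are ignored.
    """
    first_hap = defaultdict(set)   # positions with ALT on the first haplotype
    second_hap = defaultdict(set)  # positions with ALT on the second haplotype
    for gene, pos, gt in variants:
        if gt == "1|0":
            first_hap[gene].add(pos)
        elif gt == "0|1":
            second_hap[gene].add(pos)
    # A candidate gene needs at least one variant on each haplotype
    return {
        gene: (sorted(first_hap[gene]), sorted(second_hap[gene]))
        for gene in first_hap
        if gene in second_hap
    }


# Example usage with made-up gene names and positions
calls = [
    ("GENE_A", 100, "1|0"),
    ("GENE_A", 250, "0|1"),  # opposite haplotype: GENE_A is a candidate
    ("GENE_B", 500, "0|1"),
    ("GENE_B", 620, "0|1"),  # same haplotype: not compound heterozygous
]
print(compound_het_genes(calls))
```

A real analysis would also consider the parents' genotypes and the predicted impact of each variant, which this sketch leaves out.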
Phasing requires specific file types, which may vary depending on the phasing software. Many phasing programs require that input files be aligned to a specific reference genome, contain no multiallelic positions, contain no duplicate positions, and be split so that each chromosome is phased separately. Preparing files for phasing can be challenging because passing files from program to program may expose unforeseen incompatibilities. Installing the programs can also be challenging because many of them require various dependencies.
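As a rough illustration of these input requirements, the Python sketch below filters a list of sites and groups the survivors by chromosome. It is not the pipeline's own code. The names `clean_sites` and `split_by_chromosome` are hypothetical. The sketch assumes GRCh37-style contig names without a `chr` prefix, and it assumes that every copy of a duplicated position should be dropped rather than keeping the first one.

```python
from collections import Counter, defaultdict

# Assumption: placed contigs are named 1-22, X, and Y (no "chr" prefix)
PLACED_CONTIGS = {str(n) for n in range(1, 23)} | {"X", "Y"}


def clean_sites(records):
    """Drop unplaced, multiallelic, and duplicated sites.

    records: iterable of (chrom, pos, ref, alt) tuples, where alt is the
    raw ALT column (comma-separated when a site is multiallelic).
    """
    records = list(records)
    position_counts = Counter((chrom, pos) for chrom, pos, _ref, _alt in records)
    kept = []
    for chrom, pos, ref, alt in records:
        if chrom not in PLACED_CONTIGS:
            continue  # unplaced or alternate contig
        if "," in alt:
            continue  # multiallelic site
        if position_counts[(chrom, pos)] > 1:
            continue  # position appears more than once
        kept.append((chrom, pos, ref, alt))
    return kept


def split_by_chromosome(records):
    """Group records by chromosome so each group can be phased separately."""
    groups = defaultdict(list)
    for record in records:
        groups[record[0]].append(record)
    return dict(groups)


# Example usage with made-up sites
sites = [
    ("1", 100, "A", "G"),
    ("1", 200, "C", "T,G"),        # multiallelic
    ("2", 300, "G", "A"),
    ("2", 300, "G", "C"),          # duplicate position
    ("GL000207.1", 50, "T", "C"),  # unplaced contig
    ("X", 400, "T", "A"),
]
print(split_by_chromosome(clean_sites(sites)))
```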
We have designed our Compound Heterozygous Variant Identification Pipeline (CompoundHetVIP) to overcome many of the time-consuming challenges that researchers may face when trying to identify compound heterozygous variants. By encapsulating existing tools in reproducible scripts and executing these scripts within a single Docker image, we enable other researchers to examine our methodology in detail and apply it to their data. Using a Docker image helps control which software versions and system libraries are used and creates a cohesive computational environment.
If you encounter an issue, please report it on the issues page.