avallonking / ForestQC

Quality control on genetic variants from next-generation sequencing data using random forest
MIT License
21 stars 7 forks source link

ForestQC

ForestQC uses a random forest classifier to identify high-quality and low-quality variants from genetic variants detected from next-generation sequencing data. Citation: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007556

Installation

To install the software. This software is compatible with OSX, 64-bit Linux and 64-bit Windows systems.

$ conda install forestqc -c avallonking

To update the software.

$ conda update forestqc -c avallonking

After installation, you can check the usage

$ ForestQC -h
# and check the usage for each command, for example, stat
$ ForestQC stat -h

We provide a usage example. Note that all example files are only for testing. Do not use them to analyze real data.

$ git clone git@github.com:avallonking/ForestQC.git
$ cd ForestQC/examples/example_script_and_output/
$ cat example.script.sh
$ sh example.script.sh

Workflow

example/example_output/example.script.sh provides step-by-step tutorial about how to run this software.

Before doing the classification, we should set the thresholds for Outlier_GQ and Outlier_DP. It would print out Outlier_DP threshold and Outlier_GQ threshold on the screen, which would be used as inputs in the next step. Idealy, all vcf files in this analysis should be included, but we suggest to use Chromosome 1 only to speed up the computation if the dataset is very large. Please make sure the memory size (set with -m argument) smaller than the total size of input files and the memory limit of your system

$ ForestQC set_outlier -m 1G -i [vcf_file1],[vcf_file2],[...]

If you use other reference genome other than hg19, please use our script to generate a GC content table for your reference genome. Otherwise, we have prepared hg19 GC content table in examples/gc_content_hg19.tsv and you can directly use it.

# generate GC content table for your reference genome
$ ForestQC compute_gc -i [ref_genome.fasta] -o [output_file]

First, we need to calculate ths statistics from vcf file. It will output a file containing all statistics information for each variant. Note:

$ ForestQC stat -i [input_vcf] -o [output_filename] -c [gc_content_file] -g [gender_file(optional)] -p [ped_file (optional)] -d [discordant_genotype_file (optional)] -w [hwe_file(optional)] --gq [Outlier_GQ] --dp [Outlier_DP] -as [user_defined_statistics_file (optional)]

Second, we need to divide the dataset into high-quality, low-quality and "undertermined" variants. The output files would be three: good.xx (high-quality variants), bad.xx (low-quality variants) and grey.xx. (variants with undetermined quality) The output filename is only the suffix, that is, the latter part of the file name. All output files would be in the same folder with input files. Note that you don't have to merge the input file together. Also, the model used in classification and splitting must be the same. Or you would get ValueError

$ ForestQC split -i [input_file] -o [output_filename_suffix (optional)] -t [user_defined_threshold_file (optional)] -as [user_defined_statistics_names (if user-defined statistics added in last step, this is required)]

Third, we can train our random forest model and apply it on the classification of gray variants. The output files would be predicted_good_xx and predictred_bad_xx. The output filename is only the suffix, that is, the latter part of the file name. All output files would be in the same folder with input files. Note that you need to merge all high-quality variants into one file, all low-quality variants into one file and all "undetermined" variants into one file. Also, the model used in classification and splitting must be the same. Otherwise, you would get ValueError

$ ForestQC classify -g [good_variants] -b [bad_variants] -y [grey_variants] -o [output_filename_suffix (optional)] -t [probability_threshold (optional)] -f [selected_features_names (optional)]

Fourth, to get the final sets of good variants and bad variants, we need to combine the variants predicted by the random forest model (step 3) and the variants separated by the filters (step 2).

$ cat [good_variants(from step 2)] [predicted_good_variants(from step 3)] > [total_good_variants]
$ cat [bad_variants(from step 2)] [predicted_bad_variants(from step 3)] > [total_bad_variants]

Fifth, to get a VCF file containing the variants passing ForestQC, you can get the positions of bad variants and remove them from the original VCF file.

# if chromosome names in the original VCF file are like "chr1", "chr2", etc.
$ awk -F "\t" 'NR>1{print $2"\t"$3}' [total_bad_variants] > [bad_variants_positions]
# if chromosome names in the original VCF file are like "1", "2", etc.
$ awk -F "\t" 'NR>1{print $2"\t"$3}' [total_bad_variants] | sed 's/chr//' > [bad_variants_positions]
# remove bad variants from the original VCF file
$ vcftools --gzvcf [original_gziped_vcf_file] --exclude-positions [bad_variants_positions] --recode --recode-INFO-all -c | gzip -c > [output_vcf]

And VCFtools would need to be installed.

File format

Note that all files are tab-separated. If you came across any IndexError, please check your input files and make sure they are all tab-separated.

Output file

Input file