A tool to create a draft genome file from a GATK VCF file.
Download via GitHub Releases or via Bioconda.
Authors: Alexander Herbig (herbig@shh.mpg.de) (v0.84), Alexander Peltzer (v0.90+).
For questions regarding the tool, contact Alexander Peltzer (peltzer@shh.mpg.de) or open a ticket on GitHub.
Run the tool with -h to display the following help message:
Option "-draft" is required
-draft VAL : draft contains Ns where no call can be made. RefMod contains reference calls instead at
these positions.
-draftname DRAFT_SEQ_NAME : Name of the draft sequence.
-h : Display this help information and exit. (default: true)
-in VAL : input VCF file
-minc MIN_COVERAGE_FOR_SNP : Minimum coverage / reads confirming the call.
-minfreq MIN_SNP_FREQUENCY : Minimum fraction of reads supporting the called nucleotide.
-minq MIN_QUAL_SCORE : Minimum quality score. For UG: Phred scaled quality score. For HC genome quality score.
-ref VAL : reference genome in FastA format
-refMod VAL : More precise uncertainty encoding. N: Not covered or ambiguous. R: Low coverage but looks
like Ref. a,c,t,g (lower case): Low coverage but looks like SNP.
-uncertain VAL : Special 1234 encoded FastA output.
Example: java -jar VCF2Genome.jar -draft VAL -draftname DRAFT_SEQ_NAME -in VAL -minc MIN_COVERAGE_FOR_SNP -minfreq MIN_SNP_FREQUENCY -minq MIN_QUAL_SCORE -ref VAL -refMod VAL -uncertain VAL
java -jar VCF2Genome.jar -draft my_output_genome.fasta -draftname "My_Fancy_Genome_Name" -in my_input.vcf -minc 5 -minfreq 0.8 -minq 30 -ref myreference_genome.fasta -refMod output.refMod -uncertain 1234_output.fasta
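When the tool is run for many samples, it can help to assemble the command line programmatically. The following is a minimal sketch, not part of VCF2Genome itself: the function name `build_vcf2genome_command` and its defaults are hypothetical, and only the flags listed in the help message above are used. It assumes that all output options are supplied, as in the example invocation.

```python
def build_vcf2genome_command(vcf, reference, draft, draftname, refmod, uncertain,
                             minc=5, minq=30, minfreq=0.8, jar="VCF2Genome.jar"):
    """Return the VCF2Genome invocation as an argument list.

    Hypothetical helper: it only mirrors the flags shown in the README
    example and does not validate the files it is given.
    """
    return [
        "java", "-jar", jar,
        "-draft", draft,
        "-draftname", draftname,
        "-in", vcf,
        "-minc", str(minc),
        "-minfreq", str(minfreq),
        "-minq", str(minq),
        "-ref", reference,
        "-refMod", refmod,
        "-uncertain", uncertain,
    ]


command = build_vcf2genome_command(
    vcf="my_input.vcf",
    reference="myreference_genome.fasta",
    draft="my_output_genome.fasta",
    draftname="My_Fancy_Genome_Name",
    refmod="output.refMod",
    uncertain="1234_output.fasta",
)
# To actually run the tool (requires Java and the jar file):
# import subprocess; subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems with file names or sequence names that contain spaces.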
-draft
Name of the output file to which the FastA genome sequence should be written. Contains Ns where no call can be made.
-draftname
Name of the draft sequence inside the FastA file (header of the FastA entry that is created).
-in
Name of the input VCF file in VCF4.0/4.1 format.
-minc
Minimum coverage, i.e. the number of reads confirming the call, required for a call to be made.
-minq
Minimum quality threshold used for filtering the calls.
-minfreq
Minimum fraction of reads supporting the called nucleotide.
-ref
Reference genome used in FastA format.
-refMod
Path to the refMod output file. It uses a more detailed encoding than simply placing N at unclear positions, which is useful, for example, when investigating individual sites further.
N: Not covered or ambiguous. R: Low coverage but looks like a reference call. a,c,t,g (lower case): Low coverage but looks like a SNP.
-uncertain
Path to the uncertainty-encoded output file, written in a special 1234 format used by some downstream tools.
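To illustrate how -minc, -minq and -minfreq might interact with the refMod encoding, here is a minimal sketch of one possible reading of the descriptions above. It is not the tool's actual implementation: the function names, the per-site inputs and the interpretation of -minc as a count of supporting reads are all assumptions made for illustration.

```python
def passes_thresholds(supporting_reads, total_reads, quality,
                      minc=5, minq=30.0, minfreq=0.8):
    """Return True if a call satisfies all three thresholds.

    Assumption: -minc is compared against the number of reads that
    support the called nucleotide, and -minfreq against their fraction
    of all reads at the site.
    """
    if total_reads <= 0:
        return False
    frequency = supporting_reads / total_reads
    return (supporting_reads >= minc
            and quality >= minq
            and frequency >= minfreq)


def refmod_symbol(ref_base, observed_base, supporting_reads, total_reads,
                  quality, minc=5, minq=30.0, minfreq=0.8):
    """Pick a refMod-style symbol for one site (illustrative only).

    observed_base is None when the site is not covered or ambiguous.
    """
    if observed_base is None or total_reads <= 0:
        return "N"                      # not covered or ambiguous
    if passes_thresholds(supporting_reads, total_reads, quality,
                         minc, minq, minfreq):
        return observed_base.upper()    # confident call
    if observed_base.upper() == ref_base.upper():
        return "R"                      # weak support, looks like reference
    return observed_base.lower()        # weak support, looks like a SNP
```

With the default thresholds, `refmod_symbol("A", "G", 9, 10, 45.0)` yields a confident `G`, while the same base seen in only two reads yields a lower-case `g`.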
Note that this tool was written a couple of years ago for reconstructing genomes from GATK UnifiedGenotyper VCF output files. It may work with other genotypers that produce the same kind of VCF4.0/VCF4.1 output, but it might not work well with data originating, for example, from GATK HaplotypeCaller. The tool requires an EMIT_ALL_SITES-compatible VCF input file.
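Before running the tool, it may be worth checking that the input VCF really contains all sites. The sketch below assumes that an EMIT_ALL_SITES-style VCF has one record per reference position and reports positions without a record. The function name `missing_positions` is hypothetical, and the reference length must be supplied by the caller.

```python
def missing_positions(vcf_lines, chrom, ref_length):
    """Return 1-based positions on `chrom` that have no VCF record.

    Assumption: an all-sites VCF contains a record for every position
    from 1 to ref_length. Header lines (starting with '#') are skipped.
    """
    seen = set()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] == chrom:
            seen.add(int(fields[1]))    # column 2 is POS
    return [pos for pos in range(1, ref_length + 1) if pos not in seen]


example_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t1\t.\tA\t.\t40\t.\t.\n",
    "chr1\t2\t.\tC\tT\t55\t.\t.\n",
    "chr1\t4\t.\tG\t.\t38\t.\t.\n",
]
gaps = missing_positions(example_vcf, "chr1", 5)
```

In this small example, `gaps` lists positions 3 and 5, which suggests the file was not produced with all sites emitted. For a real file, pass an open file handle as `vcf_lines`.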
This tool is currently unable to handle indels properly due to the index handling procedure in the software itself. SNPs are fine.
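Because indels are not handled properly, it can be useful to find such records in the input beforehand. The following sketch flags VCF data lines whose REF or ALT alleles are longer than one base. The helper name `is_indel_record` is hypothetical, and the check is a simple heuristic that ignores the missing-value ALT (`.`) and symbolic alleles in angle brackets.

```python
def is_indel_record(vcf_line):
    """Return True if a VCF data line describes an insertion or deletion.

    Expects a tab-separated data line (not a header line). A record is
    treated as an indel when REF, or any concrete ALT allele, is not
    exactly one base long.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3], fields[4]     # columns 4 and 5 are REF and ALT
    if len(ref) != 1:
        return True
    for allele in alt.split(","):
        if allele == "." or allele.startswith("<"):
            continue                    # no alternative or symbolic allele
        if len(allele) != 1:
            return True
    return False


records = [
    "chr1\t5\t.\tA\tG\t50\t.\t.\n",     # SNP
    "chr1\t6\t.\tAT\tA\t50\t.\t.\n",    # deletion
    "chr1\t7\t.\tA\tATG\t50\t.\t.\n",   # insertion
    "chr1\t8\t.\tC\t.\t50\t.\t.\n",     # reference site
]
indel_flags = [is_indel_record(r) for r in records]
```

Records flagged this way could be removed or reviewed manually before the VCF is passed to the tool.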