Open parlar opened 8 years ago
Hi Pär,
Thanks Adrian!
Tried vt view -s in.vcf >out.vcf
but it just produces a vcf without a header... ? Using that as input makes the genotype call complain: [bcf_ordered_reader.cpp:49 BCFOrderedReader] Not a VCF/BCF file: vt.in.vcf
.
I have used bcbio-nextgen's ensemble method to call somatic mutations in a bunch of samples. However, due to inconsistent variant records I'd like to streamline them using this method. The reason for using the ensemble approach is that significant discrepencies exist in output from different callers (well aware that I also get more false positives...). For vt to be useful, however, it must work also for larger normalized indels. Do you think that this is realistic?
But, in a broader perspective, it might be a useful thing to be able to test for a specific variant in a sample (patient) using other methods than those provided by the general callers. Principally, sensitivity could increase significantly, at least for some variants, this way. Which might be a useful thing in a clinical setting where new studies indicate that also very low-frequency clones can be of clinical relevance. Borderline calls could also be verified downstream by other means. At least, that is what I'm thinking ... what do you think?
cheers,
Pär
sorry, got it to work now using vt view -hs in.vcf >out.vcf
. Reading the manual is sometimes a good thing ...
Try: vt view -s in.vcf -o out.vcf or vt view -h -s in.vcf > out.vcf
Can you give an example of a large normalized Indel? vt should work for any indel that is explicitly defined and is smaller than the length of a read. What do you mean by inconsistent variant records, as in the variant sets are dissimilar?
The general callers that bcb-bio includes callers that are assembly based (I think the recommended ones are freebayes and haplotypecaller). If you want to re-assay a variant in-silico in a non biased fashion, it is not a straight forward thing to do. My understanding is that freebayes and haplotypecaller are classified under what "local assemblers", they take in an aligned bam and reassembles the reads locally. Effectively, this gives an additional layer of filtering that accounts for how well each read fits with one another. In Haplotypecaller, you can observe that by the large number of reads they discard in low complexity regions. In addition to this, there are also some bonafide large indels that assembly methods can recover, but that can be done better with a global assembly based caller.
To really get a non biased assay of a variant, you need to realign the reads using different alignment methods like novoalign, bwa-mem, tophat. In addition to that, use a global assembler for alternate alignments like Cortex. Once you have those bams, you can apply the variant callers and see if you get a consensus answer while keeping in mind that some large variants can only be present in assembly based bams. So to do this downstream, it is probably not feasible.
Another issue is the case of tandem repeats that are inexact, none of the standard callers take that into account. You can observe this by the high number of overlap of indel variants for each short tandem repeat discovered by tandem repeat finder. So this is a case of representation inconsistency when imperfect repeats are encountered.
Sensitivity can only be improved by adding variants, so a further step in testing and filtering variants will probably just decrease it. However, we are always interested in the balancing sensitivity and specificity.
I think haplotypecaller has low sensitivity with respect to low frequency variants unless the sequencing depth is really high, so it is important to add alignment based methods in your ensemble calling too.
It's a complex issue. I think you can get some mileage out of using variant graphs too. There's a project by @ekg with vg that might help resolve some of these issues. https://github.com/vgteam/vg
vt can help double check the variant, but it does not reassemble the reads, so if a variant is discovered in such a fashion, you will get a null result when trying to validate it.
Thanks for the elaborate reply!
I do realignment (using gatk), but what I mainly was concerned about was if there are any inherent limits to indel size in vt and, as you're hinting about, that normalized indels might be shifted relative to its position in the bam.
I stumbled upon an issue, that looks like a bug to me .. ? The output only contains indels. No snps there... although vt says SNPs are genotyped.
Also: I'm also confused about the AD field in the genotype. No description seems to be added to the header about which value it is that corresponds to the alt allele depth. I see no documentation regarding this on the web page.
example:
chr2 48033890 . CT C . . . GT:PL:DP:AD:GQ 0/1:358,0,691:72:37,23,12:.
The output from vt genotype:
Options: Input VCF File vt.in.vcf
Input BAM File 12191-96-ready.bam
Output VCF File 12191-96-ensemble.vcf.gz.norm.decomp.vcf.gz.vt.vcf
Sample ID 12191-96
Stats: SNPs genotyped 148
Indels genotyped 10
cheers
Pär
Hi Pär,
The realignment command in GATK realigns parts of the genome to help improve the calling of indels in reducing the false positive rates. I think that command is obviated since haplotypecaller reassembles all the aligned reads locally.
The header should be there, else the record cannot be written, can you check again? AD field is the allele depth, this is described in VCF format v4.3 now too. https://samtools.github.io/hts-specs/VCFv4.3.pdf
Let me check on the bug, I've made some serious overhaul and have quite a few changes not committed to github.
Thanks Adrian,
As exemplified above, the AD field contains three values (37,23,12) and not a single value. This was true for all records from vt, which only included indels, no snps were in the output).
The vcf header is identical in the input to vt and the output from vt with regard to ##FORMAT=
. The only explaination for AD was the one already present in the in.vcf header. Unfortunately, this header does not provide an explaination three values in AD.
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
Im using bcbio-nextgen, and the setting realign: gatk
in the config file. Perhaps bcbio relies on an old version of GATK for realignment..?
ok, that code is not updated to the newest version. I'll have that fixed by the end of the week.
that realignment portion is just obsolete for haplotypecaller I believe.
Thanks!
Yes, haplotype callers have no need for realignment. We do it anyway for since we inspect the alignment for clinically important variants to make sure everything looks ok, to spot potential problems.
Best,
Pär
I'm getting this error when trying to run vt genotype
Not setting
-s
also produces a complaint:What I want to do is to extract allelic frequencies and possibly other params in the bam alignment for pre-specified variants (in a vcf). I'm not even sure that vt will do the job ... ? Any suggestions on how to do this in the best way is also greatly appreciated. Have tried Mutect2, which takes ages, freebayes but which does not call all input variants.
Kind regards,
Pär Larsson