duplicated sample name issue

parlar commented 8 years ago

I'm getting this error when trying to run vt genotype

[parlar@mps 12191-96]$ vt genotype -r /home/bcbio_root/share/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa -s 12191-96 -b 12191-96-ready.bam -o 12191-96-ensemble.vcf.gz.norm.decomp.vcf.gz.vt.vcf 12191-96-ensemble.vcf.gz.norm.decomp.vcf
[E::bcf_hdr_add_sample] Duplicated sample name '12191-96'
Aborted (core dumped)

Not setting -s also produces a complaint:

vt genotype -r /home/bcbio_root/share/bcbio/genomes/Hsapiens/hg19/seq/hg19.fa -b 12191-96-ready.bam -o 12191-96-ensemble.vcf.gz.norm.decomp.vcf.gz.vt.vcf 12191-96-ensemble.vcf.gz.norm.decomp.vcf

  undefined -- Required argument missing: s

genotype v0.5

description : Genotypes variants for each sample.

usage : vt genotype [options] <in.vcf>

options : -d  debug alignments
          -r  reference FASTA file
          -s  sample ID
          -o  output VCF file
          -b  input BAM file
          -I  file containing list of intervals []
          -i  intervals []
          -?  displays help

What I want to do is to extract allelic frequencies and possibly other params in the bam alignment for pre-specified variants (in a vcf). I'm not even sure that vt will do the job ... ? Any suggestions on how to do this in the best way is also greatly appreciated. Have tried Mutect2, which takes ages, freebayes but which does not call all input variants.

Kind regards,

Pär Larsson

atks commented 8 years ago

Hi Pär,

It seems like there are duplicated sample names in your VCF file. This is rejected by htslib which vt uses, so you should strip the sample information first by using vt view -s.
vt will actually allow you to extract allele frequencies from variants but there are quite a few more steps that are required. You mentioned other parameters - can you give some examples. what vt genotype does is to collect allele information, cycle, strand and compute allelic likelihoods for a single sample. That's what the -s option is for, to denote the sample name. Once all the samples are processed, other vt commands are used to combine that information from each of the output VCF file to get a single VCF file with all the genotypes for all the samples and aggregated statistics for each variant such as allele frequences assuming HWE or not, inbreeding coefficients, allele balance etc etc.

parlar commented 8 years ago

Thanks Adrian!

Tried vt view -s in.vcf >out.vcf but it just produces a vcf without a header... ? Using that as input makes the genotype call complain: [bcf_ordered_reader.cpp:49 BCFOrderedReader] Not a VCF/BCF file: vt.in.vcf .

I have used bcbio-nextgen's ensemble method to call somatic mutations in a bunch of samples. However, due to inconsistent variant records I'd like to streamline them using this method. The reason for using the ensemble approach is that significant discrepencies exist in output from different callers (well aware that I also get more false positives...). For vt to be useful, however, it must work also for larger normalized indels. Do you think that this is realistic?

But, in a broader perspective, it might be a useful thing to be able to test for a specific variant in a sample (patient) using other methods than those provided by the general callers. Principally, sensitivity could increase significantly, at least for some variants, this way. Which might be a useful thing in a clinical setting where new studies indicate that also very low-frequency clones can be of clinical relevance. Borderline calls could also be verified downstream by other means. At least, that is what I'm thinking ... what do you think?

cheers,

Pär

parlar commented 8 years ago

sorry, got it to work now using vt view -hs in.vcf >out.vcf. Reading the manual is sometimes a good thing ...

atks commented 8 years ago

Try: vt view -s in.vcf -o out.vcf or vt view -h -s in.vcf > out.vcf

Can you give an example of a large normalized Indel? vt should work for any indel that is explicitly defined and is smaller than the length of a read. What do you mean by inconsistent variant records, as in the variant sets are dissimilar?

The general callers that bcb-bio includes callers that are assembly based (I think the recommended ones are freebayes and haplotypecaller). If you want to re-assay a variant in-silico in a non biased fashion, it is not a straight forward thing to do. My understanding is that freebayes and haplotypecaller are classified under what "local assemblers", they take in an aligned bam and reassembles the reads locally. Effectively, this gives an additional layer of filtering that accounts for how well each read fits with one another. In Haplotypecaller, you can observe that by the large number of reads they discard in low complexity regions. In addition to this, there are also some bonafide large indels that assembly methods can recover, but that can be done better with a global assembly based caller.

To really get a non biased assay of a variant, you need to realign the reads using different alignment methods like novoalign, bwa-mem, tophat. In addition to that, use a global assembler for alternate alignments like Cortex. Once you have those bams, you can apply the variant callers and see if you get a consensus answer while keeping in mind that some large variants can only be present in assembly based bams. So to do this downstream, it is probably not feasible.

Another issue is the case of tandem repeats that are inexact, none of the standard callers take that into account. You can observe this by the high number of overlap of indel variants for each short tandem repeat discovered by tandem repeat finder. So this is a case of representation inconsistency when imperfect repeats are encountered.

Sensitivity can only be improved by adding variants, so a further step in testing and filtering variants will probably just decrease it. However, we are always interested in the balancing sensitivity and specificity.

I think haplotypecaller has low sensitivity with respect to low frequency variants unless the sequencing depth is really high, so it is important to add alignment based methods in your ensemble calling too.

It's a complex issue. I think you can get some mileage out of using variant graphs too. There's a project by @ekg with vg that might help resolve some of these issues. https://github.com/vgteam/vg

vt can help double check the variant, but it does not reassemble the reads, so if a variant is discovered in such a fashion, you will get a null result when trying to validate it.

parlar commented 8 years ago

Thanks for the elaborate reply!

I do realignment (using gatk), but what I mainly was concerned about was if there are any inherent limits to indel size in vt and, as you're hinting about, that normalized indels might be shifted relative to its position in the bam.

I stumbled upon an issue, that looks like a bug to me .. ? The output only contains indels. No snps there... although vt says SNPs are genotyped.

Also: I'm also confused about the AD field in the genotype. No description seems to be added to the header about which value it is that corresponds to the alt allele depth. I see no documentation regarding this on the web page.

example:

chr2    48033890    .   CT  C   .   .   .   GT:PL:DP:AD:GQ  0/1:358,0,691:72:37,23,12:.

The output from vt genotype:

Options: Input VCF File   vt.in.vcf
         Input BAM File   12191-96-ready.bam
         Output VCF File  12191-96-ensemble.vcf.gz.norm.decomp.vcf.gz.vt.vcf
         Sample ID        12191-96

Stats: SNPs genotyped     148
       Indels genotyped   10

cheers

Pär

atks commented 8 years ago

Hi Pär,

The realignment command in GATK realigns parts of the genome to help improve the calling of indels in reducing the false positive rates. I think that command is obviated since haplotypecaller reassembles all the aligned reads locally.

The header should be there, else the record cannot be written, can you check again? AD field is the allele depth, this is described in VCF format v4.3 now too. https://samtools.github.io/hts-specs/VCFv4.3.pdf

Let me check on the bug, I've made some serious overhaul and have quite a few changes not committed to github.

parlar commented 8 years ago

Thanks Adrian,

As exemplified above, the AD field contains three values (37,23,12) and not a single value. This was true for all records from vt, which only included indels, no snps were in the output).

The vcf header is identical in the input to vt and the output from vt with regard to ##FORMAT= . The only explaination for AD was the one already present in the in.vcf header. Unfortunately, this header does not provide an explaination three values in AD.

##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">

Im using bcbio-nextgen, and the setting realign: gatk in the config file. Perhaps bcbio relies on an old version of GATK for realignment..?

atks commented 8 years ago

ok, that code is not updated to the newest version. I'll have that fixed by the end of the week.

that realignment portion is just obsolete for haplotypecaller I believe.

parlar commented 8 years ago

Thanks!

Yes, haplotype callers have no need for realignment. We do it anyway for since we inspect the alignment for clinically important variants to make sure everything looks ok, to spot potential problems.

Best,

Pär

atks / vt

duplicated sample name issue #57