arq5x / bedtools2

bedtools - the swiss army knife for genome arithmetic
MIT License
937 stars 287 forks source link

invalid records when processing vcf file #832

Open ghost opened 4 years ago

ghost commented 4 years ago

Hello, I have vcf files generated from bcftools convert --gvcf2vcf and this raises an issue

bedtools coverage -a A_vaga.NDPD.bed -b ARC_ancestor.expanded.g.vcf.gz -counts                      
Error: Invalid record in file ARC_ancestor.expanded.g.vcf.gz. Record is 
Chrom_3 1   .   A   <*> 0   .   .   GT:GQ:MIN_DP:PL 0/0:1:0:0,0,0

Here is the header of the generated vcf

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FILTER=<ID=RefCall,Description="Genotyping model thinks this site is reference.">
##FILTER=<ID=LowQual,Description="Confidence in this variant being real is below calling threshold.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position (for use with symbolic alleles)">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Conditional genotype quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read depth">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block.">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Read depth for each allele">
##FORMAT=<ID=VAF,Number=A,Type=Float,Description="Variant allele fractions.">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Phred-scaled genotype likelihoods rounded to the closest integer">
##contig=<ID=Chrom_3,length=20354777>
##contig=<ID=Chrom_1,length=18146847>
##contig=<ID=Chrom_5,length=16930519>
##contig=<ID=Chrom_2,length=16274841>
##contig=<ID=Chrom_4,length=15224634>
##contig=<ID=Chrom_6,length=13893210>
##bcftools_convertVersion=1.10.2-27-g9d66868+htslib-1.10.2-32-g84f0a86
##bcftools_convertCommand=convert -f ../A_vaga.NDPD.fasta --gvcf2vcf ARC_ancestor.g.vcf.gz; Date=Tue Mar 17 12:02:46 2020
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ARC_ancestor

I suppose Bedtools doesn't like the empty FILTER and INFO columns. As far as I can tell the vcf file seems to be a valid vcf.

arq5x commented 4 years ago

What version of bedtools is this?

DaGaMs commented 4 years ago

I'm seemingly getting the same issue with the latest git master of bedtools and a vcf file generated with bcftools mpileup 1.9. I've attached the file here. The error I get is:

Error: Invalid record in file WTCHG_230414_256227_ERCC.vcf.gz. Record is 
ERCC-00002      1       .       T       <*>     0       .       DP=144;AD=106,0;I16=106,0,0,0,5217,305911,0,0,6360,381600,0,0,1173,15191,0,0;QS=1,0;MQ0F=0      PL      0,255,255

WTCHG_230414_256227_ERCC.vcf.gz

DaGaMs commented 4 years ago

Ah! It's the weird way in which bcftools encodes the symbolic allele. The <*> causes the error message. Maybe something to fix though, since most raw bcftools output includes these "alleles".

38 commented 4 years ago

It seems bedtools expecting either SVLEN or END, but the VCF doesn't have neither of them. I suspect the VCF is used to represent per-base records, thus we could potentially make bedtools understand this convention.

My question is is <*> means the record is per-base ?

DaGaMs commented 4 years ago

From a discussion at https://www.biostars.org/p/279971, I see that there is a reference to the newer VCF spec, which states:

.5 Representing unspecified alleles and REF-only blocks (gVCF)

In order to report sequencing data evidence for both variant and non-variant positions in the genome, the VCF specification allows to represent blocks of reference-only calls in a single record using the END INFO tag, an idea originally introduced by the gVCF file format† . The convention adopted here is to represent reference evidence as likelihoods against an unknown alternate allele. Think of this as the likelihood for reference as compared to any other possible alternate allele (both SNP, indel, or otherwise). A symbolic alternate allele <*> is used to represent this unspecified alternate allele

Basically, newer versions of bcftools and samtools report the potential for an alternate allele at every position with the <*> placeholder. After bcftools call, they disappear. But a lot of people (like me) just want to use mpileup to get a quick list of all variant positions, and so we have to deal with the '<*>' manually. It'd be nice to not have to do some sed magic every time, so if this could be added as a valid allele symbol (which it is) in bedtools, then that would be awesome.

arq5x commented 4 years ago

So, I read this as interpret <> as length of 1 nucleotide if there is no END tag in the INFO field. Otherwise, if END is present in the INFO field and <> is present, then the length of the interval for the VCF record is (END-POS)+1. Agree?