Open ghost opened 4 years ago
What version of bedtools is this?
I'm seemingly getting the same issue with the latest git master of bedtools and a vcf file generated with bcftools mpileup 1.9. I've attached the file here. The error I get is:
Error: Invalid record in file WTCHG_230414_256227_ERCC.vcf.gz. Record is
ERCC-00002 1 . T <*> 0 . DP=144;AD=106,0;I16=106,0,0,0,5217,305911,0,0,6360,381600,0,0,1173,15191,0,0;QS=1,0;MQ0F=0 PL 0,255,255
Ah! It's the weird way in which bcftools encodes the symbolic allele. The <*> causes the error message. Maybe something to fix though, since most raw bcftools output includes these "alleles".
It seems bedtools is expecting either SVLEN or END, but the VCF has neither of them. I suspect the VCF is being used to represent per-base records, so we could potentially make bedtools understand this convention.
My question is: does <*> mean the record is per-base?
From a discussion at https://www.biostars.org/p/279971, I see that there is a reference to the newer VCF spec, which states:
.5 Representing unspecified alleles and REF-only blocks (gVCF)
In order to report sequencing data evidence for both variant and non-variant positions in the genome, the VCF specification allows to represent blocks of reference-only calls in a single record using the END INFO tag, an idea originally introduced by the gVCF file format† . The convention adopted here is to represent reference evidence as likelihoods against an unknown alternate allele. Think of this as the likelihood for reference as compared to any other possible alternate allele (both SNP, indel, or otherwise). A symbolic alternate allele <*> is used to represent this unspecified alternate allele
Basically, newer versions of bcftools and samtools report the potential for an alternate allele at every position with the <*> placeholder. After bcftools call, they disappear. But a lot of people (like me) just want to use mpileup to get a quick list of all variant positions, and so we have to deal with the '<*>' manually. It'd be nice not to have to do some sed magic every time, so if this could be added as a valid allele symbol (which it is) in bedtools, then that would be awesome.
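For anyone stuck on the manual workaround in the meantime, here is a rough Python sketch of one way to do it. It only drops records whose ALT column contains nothing but <*> (or "."), and leaves records such as "A,<*>" untouched, so it is not a full fix. The function names are made up for illustration and are not part of bedtools or bcftools.

```python
def has_specified_alt(line):
    """Return True if a VCF data line has at least one ALT allele
    other than the unspecified-allele placeholder <*> (or '.')."""
    cols = line.rstrip("\n").split("\t")
    alts = cols[4].split(",")  # ALT is the 5th VCF column
    return any(alt not in ("<*>", ".") for alt in alts)


def drop_reference_only(lines):
    """Yield header lines unchanged and only those data lines
    that carry a real (specified) alternate allele."""
    for line in lines:
        if line.startswith("#") or has_specified_alt(line):
            yield line
```

This assumes tab-separated, already decompressed VCF text. Records that mix a real allele with <*> are passed through as-is, because removing the placeholder from ALT would also require adjusting the per-allele fields (AD, PL, and so on).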
So, I read this as: interpret <*> as a length of 1 nucleotide if there is no END tag in the INFO field. Otherwise, if END is present in the INFO field and <*> is present, then the length of the interval for the VCF record is (END-POS)+1. Agree?
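To make the proposed rule concrete, here is a minimal Python sketch of how I read it. This is not how bedtools implements anything; the function names are hypothetical, and the conversion to a 0-based, half-open interval is my assumption about what a BED-style consumer would want.

```python
def parse_info(info):
    """Parse a VCF INFO string into a dict; flag entries map to True."""
    fields = {}
    if info in ("", "."):
        return fields
    for entry in info.split(";"):
        key, sep, value = entry.partition("=")
        fields[key] = value if sep else True
    return fields


def unspecified_allele_interval(line):
    """Return (chrom, start0, end) for a record whose ALT contains <*>,
    or None if the record has no <*> allele.

    Length is 1 when INFO has no END tag, otherwise (END - POS) + 1.
    """
    cols = line.rstrip("\n").split("\t")
    chrom, pos, alt, info = cols[0], int(cols[1]), cols[4], cols[7]
    if "<*>" not in alt.split(","):
        return None
    end = parse_info(info).get("END")
    length = (int(end) - pos + 1) if end is not None else 1
    start0 = pos - 1  # VCF POS is 1-based; BED-style start is 0-based
    return chrom, start0, start0 + length
```

For the record in the error message above (POS 1, no END), this gives a one-base interval starting at offset 0. A record with POS=101 and END=150 would cover 50 bases.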
Hello, I have vcf files generated from bcftools convert --gvcf2vcf and this raises an issue
Here is the header of the generated vcf
I suppose Bedtools doesn't like the empty FILTER and INFO columns. As far as I can tell the vcf file seems to be a valid vcf.