Closed lshep closed 5 years ago
Hi,
sessionInfo()
and reproducible example please?
Problem is that this file has no header so is not considered valid VCF per bcftools
latest standards. Old versions of bcftools
were more forgiving and were somehow able to read the header, even if there isn't any, by returning some default header. For example, with samtools
0.1.19:
hpages@spectre:~$ samtools-0.1.19/bcftools/bcftools view inputVCF.vcf
##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=IS,Number=2,Type=Float,Description="Maximum number of reads supporting an indel and fraction of indel reads">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##INFO=<ID=QBD,Number=1,Type=Float,Description="Quality by Depth: QUAL/#reads">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Read Position Bias">
##INFO=<ID=MDV,Number=1,Type=Integer,Description="Maximum number of high-quality nonRef reads in samples">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data. Note: this version may be broken.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality non-reference bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
[bcf_sync] incorrect number of fields (0 != 5) at 0:0
But more recent versions of bcftools
are stricter and choke on it:
hpages@spectre:~$ bcftools-1.7/bcftools view inputVCF.vcf
Failed to open inputVCF.vcf: unknown file type
You should be able to fix the file with something like:
hpages@spectre:~$ samtools-0.1.19/bcftools/bcftools view inputVCF.vcf | head -n -1 > repaired_inputVCF.vcf
hpages@spectre:~$ cat inputVCF.vcf >> repaired_inputVCF.vcf
Note that you need a version of samtools
that is "old enough" for this.
Then:
library(Rsamtools)
h <- scanBcfHeader("repaired_inputVCF.vcf")
h
# $repaired_inputVCF.vcf
# List of length 3
# names(3): Reference Sample Header
Maybe there is a better/easier way to update the VCF file to bcftools
latest standards, I don't know. That's something you would need to check with the bcftools
people. Rsamtools is basically a wrapper around bcftools.
Bottom line: your problem is because the bcftools
people have made non-backward compatible changes to the VCF parsing code.
Hope this helps, H.
> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS
Matrix products: default
BLAS: /home/hpages/R/R-3.6.0/lib/libRblas.so
LAPACK: /home/hpages/R/R-3.6.0/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] Rsamtools_2.0.0 Biostrings_2.52.0 XVector_0.24.0
[4] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0 IRanges_2.18.2
[7] S4Vectors_0.22.0 BiocGenerics_0.30.0
loaded via a namespace (and not attached):
[1] zlibbioc_1.30.0 compiler_3.6.0 tools_3.6.0
[4] GenomeInfoDbData_1.2.1 RCurl_1.95-4.12 BiocParallel_1.18.1
[7] bitops_1.0-6
@hpages I don't have more info than that. I'll reply to the original email on the mailing list to respond here and check out your response
@lshep I normally receive emails sent to maintainer@bioconductor.org but I don't think I received this one. Do you know if there was any change made to the configuration of the maintainer@bioconductor.org list? (I'm not even sure this is a mailing list, maybe it's just an email alias). Anyway please make sure you do Reply All when you answer to Christelle. This should send a copy of your answer to maintainer@bioconductor.org. We'll see if I receive it. Thanks!
The email came while you were away. It might have just got lost in the shuffle. I'll be responding momentarily with a link to this issue.
I found the email (subject line: "Problem Variantannotation" from TESSON Christelle). I couldn't find it earlier because I was searching for an email with "scanBcfHeader" in the subject line.
Also got your answer to it.
Thanks!
Initially reported to maintainer@bioconductor.org:
See attached file. (note I couldn't directly attach a .vcf for I added .txt extension to I could post the file included in the email) inputVCF.vcf.txt