Bioconductor / Rsamtools

Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import
https://bioconductor.org/packages/Rsamtools
Other
26 stars 27 forks source link

ERROR in scanBcfHeader [Reported to maintainer@bioconductor.org] #10

Closed lshep closed 5 years ago

lshep commented 5 years ago

Initially reported to maintainer@bioconductor.org:

Dear maintener,

I am sorry to bother you, but I have some difficulty with Variantannotation package.
With the new version (corresponding to R6.0 and later) I have this message
Error in scanBcfHeader(bf) : no 'header' line "#CHROM POS ID..."?
with the function  readVcf(), but the same VCF did work with R5.1

I don't know if you can help me, but I will be very gratefull if you can.
Thanks a lot.

Christelle

See attached file. (note I couldn't directly attach a .vcf for I added .txt extension to I could post the file included in the email) inputVCF.vcf.txt

hpages commented 5 years ago

Hi,

sessionInfo() and reproducible example please?

Problem is that this file has no header so is not considered valid VCF per bcftools latest standards. Old versions of bcftools were more forgiving and were somehow able to read the header, even if there isn't any, by returning some default header. For example, with samtools 0.1.19:

hpages@spectre:~$ samtools-0.1.19/bcftools/bcftools view inputVCF.vcf 
##fileformat=VCFv4.1
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=DP4,Number=4,Type=Integer,Description="# high-quality ref-forward bases, ref-reverse, alt-forward and alt-reverse bases">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square mapping quality of covering reads">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele count (no HWE assumption)">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=IS,Number=2,Type=Float,Description="Maximum number of reads supporting an indel and fraction of indel reads">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes for each ALT allele, in the same order as listed">
##INFO=<ID=G3,Number=3,Type=Float,Description="ML estimate of genotype frequencies">
##INFO=<ID=HWE,Number=1,Type=Float,Description="Chi^2 based HWE test P-value based on G3">
##INFO=<ID=CLR,Number=1,Type=Integer,Description="Log ratio of genotype likelihoods with and without the constraint">
##INFO=<ID=UGT,Number=1,Type=String,Description="The most probable unconstrained genotype configuration in the trio">
##INFO=<ID=CGT,Number=1,Type=String,Description="The most probable constrained genotype configuration in the trio">
##INFO=<ID=PV4,Number=4,Type=Float,Description="P-values for strand bias, baseQ bias, mapQ bias and tail distance bias">
##INFO=<ID=INDEL,Number=0,Type=Flag,Description="Indicates that the variant is an INDEL.">
##INFO=<ID=PC2,Number=2,Type=Integer,Description="Phred probability of the nonRef allele frequency in group1 samples being larger (,smaller) than in group2.">
##INFO=<ID=PCHI2,Number=1,Type=Float,Description="Posterior weighted chi^2 P-value for testing the association between group1 and group2 samples.">
##INFO=<ID=QCHI2,Number=1,Type=Integer,Description="Phred scaled PCHI2.">
##INFO=<ID=PR,Number=1,Type=Integer,Description="# permutations yielding a smaller PCHI2.">
##INFO=<ID=QBD,Number=1,Type=Float,Description="Quality by Depth: QUAL/#reads">
##INFO=<ID=RPB,Number=1,Type=Float,Description="Read Position Bias">
##INFO=<ID=MDV,Number=1,Type=Integer,Description="Maximum number of high-quality nonRef reads in samples">
##INFO=<ID=VDB,Number=1,Type=Float,Description="Variant Distance Bias (v2) for filtering splice-site artefacts in RNA-seq data. Note: this version may be broken.">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GL,Number=3,Type=Float,Description="Likelihoods for RR,RA,AA genotypes (R=ref,A=alt)">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="# high-quality bases">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="# high-quality non-reference bases">
##FORMAT=<ID=SP,Number=1,Type=Integer,Description="Phred-scaled strand bias P-value">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
[bcf_sync] incorrect number of fields (0 != 5) at 0:0

But more recent versions of bcftools are stricter and choke on it:

hpages@spectre:~$ bcftools-1.7/bcftools view inputVCF.vcf 
Failed to open inputVCF.vcf: unknown file type

You should be able to fix the file with something like:

hpages@spectre:~$ samtools-0.1.19/bcftools/bcftools view inputVCF.vcf | head -n -1 > repaired_inputVCF.vcf
hpages@spectre:~$ cat inputVCF.vcf >> repaired_inputVCF.vcf

Note that you need a version of samtools that is "old enough" for this.

Then:

library(Rsamtools)
h <- scanBcfHeader("repaired_inputVCF.vcf")
h
# $repaired_inputVCF.vcf
# List of length 3
# names(3): Reference Sample Header

Maybe there is a better/easier way to update the VCF file to bcftools latest standards, I don't know. That's something you would need to check with the bcftools people. Rsamtools is basically a wrapper around bcftools.

Bottom line: your problem is because the bcftools people have made non-backward compatible changes to the VCF parsing code.

Hope this helps, H.

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS:   /home/hpages/R/R-3.6.0/lib/libRblas.so
LAPACK: /home/hpages/R/R-3.6.0/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
[1] Rsamtools_2.0.0      Biostrings_2.52.0    XVector_0.24.0      
[4] GenomicRanges_1.36.0 GenomeInfoDb_1.20.0  IRanges_2.18.2      
[7] S4Vectors_0.22.0     BiocGenerics_0.30.0 

loaded via a namespace (and not attached):
[1] zlibbioc_1.30.0        compiler_3.6.0         tools_3.6.0           
[4] GenomeInfoDbData_1.2.1 RCurl_1.95-4.12        BiocParallel_1.18.1   
[7] bitops_1.0-6          
lshep commented 5 years ago

@hpages I don't have more info than that. I'll reply to the original email on the mailing list to respond here and check out your response

hpages commented 5 years ago

@lshep I normally receive emails sent to maintainer@bioconductor.org but I don't think I received this one. Do you know if there was any change made to the configuration of the maintainer@bioconductor.org list? (I'm not even sure this is a mailing list, maybe it's just an email alias). Anyway please make sure you do Reply All when you answer to Christelle. This should send a copy of your answer to maintainer@bioconductor.org. We'll see if I receive it. Thanks!

lshep commented 5 years ago

The email came while you were away. It might have just got lost in the shuffle. I'll be responding momentarily with a link to this issue.

hpages commented 5 years ago

I found the email (subject line: "Problem Variantannotation" from TESSON Christelle). I couldn't find it earlier because I was searching for an email with "scanBcfHeader" in the subject line.

Also got your answer to it.

Thanks!