Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
23 stars 20 forks source link

Error in VCF parsing with VariantAnnotation #25

Closed octopusCat88 closed 5 years ago

octopusCat88 commented 5 years ago

Hi,

I have been annotating VCF files with VEP.

utils::download.file("https://i12g-gagneurweb.in.tum.de/public/bugreports/bioc_variantAnnotation/example_no_anno.vcf.gz", "example_no_anno.vcf.gz")
utils::download.file("https://i12g-gagneurweb.in.tum.de/public/bugreports/bioc_variantAnnotation/example_vep_anno.vcf.gz", "example_vep.vcf.gz")

VEP command on the command line

vep -i example_no_anno.vcf.gz --vcf TRUE --output_file example_vep.vcf.gz --compress_output bgzip --minimal TRUE --allele_number TRUE --everything TRUE --assembly GRCh37 --db_version 94 --merged TRUE --user anonymous --port 3337 --host ensembldb.ensembl.org --cache TRUE --dir dir_cache/ensembl-vep/94/cachedir --sift s --polyphen s --total_length TRUE --numbers TRUE --symbol TRUE --hgvs TRUE --ccds TRUE --uniprot TRUE --xref_refseq TRUE --af TRUE --max_af TRUE --af_exac TRUE --af_gnomad TRUE --pubmed TRUE --canonical TRUE --biotype TRUE

However after reading the annotated VCF file, some lines seem to be randomly split and parsed as a new line. In a minimal example with 1 variant, I end up with 2 entries in R, where the second one has half of the info column as chromosome names. Could this be a bug?

library(VariantAnnotation)

# plain vcf file
vcf <- readVcf("example_no_anno.vcf.gz")
colData(vcf)
dim(vcf)
str(seqlevels(vcf))

# annotated with VEP 
# contains very long line but no errors in the format
vcf <- readVcf("example_vep.vcf.gz")
colData(vcf)
dim(vcf)
str(seqlevels(vcf)[1])
str(seqlevels(vcf)[2])
str(seqlevels(vcf)[3])

sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Scientific Linux 7.6 (Nitrogen)

Matrix products: default
BLAS: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRblas.so
LAPACK: /data/nasif12/modules_if12/SL7/i12g/R/3.5.1-Bioc3.8/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] plotly_4.8.0                ggpubr_0.2                  scales_1.0.0                cowplot_0.9.4               tidyr_0.8.2                 ggplot2_3.1.0              
 [7] magrittr_1.5                ensemblVEP_1.24.0           AnnotationHub_2.14.2        dplyr_0.7.8                 data.table_1.12.0           VariantAnnotation_1.28.7   
[13] Rsamtools_1.34.0            Biostrings_2.50.2           XVector_0.22.0              SummarizedExperiment_1.12.0 DelayedArray_0.8.0          BiocParallel_1.16.5        
[19] matrixStats_0.54.0          Biobase_2.42.0              GenomicRanges_1.34.0        GenomeInfoDb_1.18.1         IRanges_2.16.0              S4Vectors_0.20.1           
[25] BiocGenerics_0.28.0        

loaded via a namespace (and not attached):
 [1] httr_1.4.0                    viridisLite_0.3.0             jsonlite_1.6                  bit64_0.9-7                   shiny_1.2.0                   assertthat_0.2.0             
 [7] interactiveDisplayBase_1.20.0 BiocManager_1.30.4            blob_1.1.1                    BSgenome_1.50.0               GenomeInfoDbData_1.2.0        yaml_2.2.0                   
[13] progress_1.2.0                pillar_1.3.1                  RSQLite_2.1.1                 lattice_0.20-38               glue_1.3.0                    digest_0.6.18                
[19] promises_1.0.1                colorspace_1.3-2              htmltools_0.3.6               httpuv_1.4.5.1                Matrix_1.2-15                 plyr_1.8.4                   
[25] XML_3.98-1.16                 pkgconfig_2.0.2               biomaRt_2.38.0                zlibbioc_1.28.0               purrr_0.2.5                   xtable_1.8-3                 
[31] later_0.7.5                   tibble_2.0.0                  DT_0.5                        withr_2.1.2                   GenomicFeatures_1.34.1        lazyeval_0.2.1               
[37] crayon_1.3.4                  mime_0.6                      memoise_1.1.0                 tools_3.5.1                   prettyunits_1.0.2             hms_0.4.2                    
[43] stringr_1.3.1                 munsell_0.5.0                 AnnotationDbi_1.44.0          bindrcpp_0.2.2                compiler_3.5.1                rlang_0.3.1                  
[49] grid_3.5.1                    RCurl_1.95-4.11               rstudioapi_0.9.0              htmlwidgets_1.3               labeling_0.3                  bitops_1.0-6                 
[55] gtable_0.2.0                  DBI_1.0.0                     R6_2.3.0                      GenomicAlignments_1.18.1      rtracklayer_1.42.1            bit_1.1-14                   
[61] bindr_0.1.1                   stringi_1.2.4                 Rcpp_1.0.0                    tidyselect_0.2.5

Please let me know, if you need more input to replicate this error.

Best, Michaela Müller

vobencha commented 5 years ago

Hi Michaela (@octopusCat88 ),

I'm not able to reproduce this problem with VariantAnnotation 1.28.11 which is the most current version in release.

There was a bug in parsing long records which was fixed on Jan 18th:

commit 90b9deae85acddcf8eb8a0c0c2041b51ae7cf1f1
Author: vobencha <vobencha@gmail.com>
Date:   Fri Jan 18 12:56:54 2019 -0800

     Fix bug in buffer reallocation when a record fills the buffer exactly

     See https://github.com/Bioconductor/VariantAnnotation/issues/19 for
     details

I see you're using version 1.28.7 which is a version before the fix was applied so it's possible you're hitting this bug. Please update your version of VariantAnnotation and try again.

Thanks. Valerie

Testing with VariantAnnotation 1.28.11:

> vcf_no_anno <- readVcf("example_no_anno.vcf.gz")
> colData(vcf_no_anno)
DataFrame with 1 row and 1 column
            Samples
          <integer>
123Sample         1
> dim(vcf_no_anno)
[1] 1 1
> str(seqlevels(vcf_no_anno))
 chr "chr1"

> vcf_vep_anno <- readVcf("example_vep_anno.vcf.gz")
> colData(vcf_vep_anno)
DataFrame with 1 row and 1 column
            Samples
          <integer>
123Sample         1
> dim(vcf_vep_anno)
[1] 1 1
> str(seqlevels(vcf_vep_anno))
 chr "chr1"

> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
...
other attached packages:
 [1] VariantAnnotation_1.28.11   Rsamtools_1.34.1
 [3] Biostrings_2.50.2           XVector_0.22.0
 [5] SummarizedExperiment_1.12.0 DelayedArray_0.8.0
 [7] BiocParallel_1.16.6         matrixStats_0.54.0
 [9] Biobase_2.42.0              GenomicRanges_1.34.0
[11] GenomeInfoDb_1.18.2         IRanges_2.16.0
[13] S4Vectors_0.20.1            BiocGenerics_0.28.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0               compiler_3.5.1           prettyunits_1.0.2
 [4] GenomicFeatures_1.34.3   bitops_1.0-6             tools_3.5.1
 [7] zlibbioc_1.28.0          progress_1.2.0           biomaRt_2.38.0
[10] digest_0.6.18            bit_1.1-14               BSgenome_1.50.0
[13] RSQLite_2.1.1            memoise_1.1.0            lattice_0.20-38
[16] pkgconfig_2.0.2          rlang_0.3.1              Matrix_1.2-14
[19] DBI_1.0.0                GenomeInfoDbData_1.2.0   rtracklayer_1.42.2
[22] httr_1.4.0               stringr_1.4.0            hms_0.4.2
[25] bit64_0.9-7              grid_3.5.1               R6_2.4.0
[28] AnnotationDbi_1.44.0     XML_3.98-1.18            magrittr_1.5
[31] blob_1.1.1               GenomicAlignments_1.18.1 assertthat_0.2.0
[34] stringi_1.3.1            RCurl_1.95-4.12          crayon_1.3.4
c-mertes commented 5 years ago

Dear @octopusCat88 and @vobencha,

thanks for checking this and pointing us to the updated version. And yes after updating the package the problem is solved. So I guess this bug is a duplicate of #19

Since this was solved by updating the package we can close this issue.

Thanks again and sorry for not checking before the new version.

Best, Christian

vobencha commented 5 years ago

No problem. I'm glad the new version worked for you. Valerie