Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation
26 stars 20 forks source link

scanVcf not working with a connection and streaming #29

Closed SebastianHollizeck closed 5 years ago

SebastianHollizeck commented 5 years ago

Hi,

I was just trying to stream through a very big VCF because it is too big to store in memory and saw that scanVcf has that capability theoretically. However I get an arrow instead.

Here the reproducible example

fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
vcf_file <- file(fl)
vcf_line <- scanVcf(file=vcf_file,n=1)

which just says "Error: $ operator is invalid for atomic vectors"

but I CAN use readLine(vcf_file,n=1)

also the session info

> sessionInfo()
R version 3.6.0 (2019-04-26)
Platform: x86_64-apple-darwin17.7.0 (64-bit)
Running under: macOS High Sierra 10.13.6

Matrix products: default
BLAS:   /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
LAPACK: /usr/local/Cellar/openblas/0.3.6/lib/libopenblasp-r0.3.6.dylib

Random number generation:
 RNG:     Mersenne-Twister 
 Normal:  Inversion 
 Sample:  Rounding 

locale:
[1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8

attached base packages:
 [1] grid      stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] VariantAnnotation_1.30.0                R6_2.4.0                                org.Hs.eg.db_3.8.2                      TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2
 [5] GenomicFeatures_1.36.0                  AnnotationDbi_1.46.0                    rtracklayer_1.44.0                      stringdist_0.9.5.1                     
 [9] ape_5.3                                 grImport_0.9-2                          XML_3.98-1.19                           Rsamtools_2.0.0                        
[13] Biostrings_2.52.0                       XVector_0.24.0                          SummarizedExperiment_1.14.0             DelayedArray_0.10.0                    
[17] BiocParallel_1.18.0                     matrixStats_0.54.0                      Biobase_2.44.0                          GenomicRanges_1.36.0                   
[21] GenomeInfoDb_1.20.0                     IRanges_2.18.0                          S4Vectors_0.22.0                        BiocGenerics_0.30.0                    
[25] dplyr_0.8.1                             data.table_1.12.2                       RColorBrewer_1.1-2                      kableExtra_1.1.0                       
[29] knitr_1.22                             

loaded via a namespace (and not attached):
 [1] colorspace_1.4-1                  htmlTable_1.13.1                  base64enc_0.1-3                   rstudioapi_0.10                   bit64_0.9-7                      
 [6] xml2_1.2.0                        splines_3.6.0                     geneplotter_1.62.0                Formula_1.2-3                     annotate_1.62.0                  
[11] cluster_2.0.9                     BiocManager_1.30.4                readr_1.3.1                       compiler_3.6.0                    httr_1.4.0                       
[16] backports_1.1.4                   assertthat_0.2.1                  Matrix_1.2-17                     lazyeval_0.2.2                    acepack_1.4.1                    
[21] htmltools_0.3.6                   prettyunits_1.0.2                 tools_3.6.0                       igraph_1.2.4.1                    gtable_0.3.0                     
[26] glue_1.3.1                        GenomeInfoDbData_1.2.1            reshape2_1.4.3                    fastmatch_1.1-0                   Rcpp_1.0.1                       
[31] limSolve_1.5.5.3                  nlme_3.1-140                      xfun_0.7                          stringr_1.4.0                     rvest_0.3.3                      
[36] lpSolve_5.6.13                    MASS_7.3-51.4                     zlibbioc_1.30.0                   scales_1.0.0                      BSgenome_1.52.0                  
[41] hms_0.4.2                         yaml_2.2.0                        memoise_1.1.0                     gridExtra_2.3                     ggplot2_3.1.1                    
[46] biomaRt_2.40.0                    rpart_4.1-15                      latticeExtra_0.6-28               stringi_1.4.3                     RSQLite_2.1.1                    
[51] highr_0.8                         genefilter_1.66.0                 checkmate_1.9.3                   rlang_0.3.4                       pkgconfig_2.0.2                  
[56] bitops_1.0-6                      evaluate_0.13                     lattice_0.20-38                   purrr_0.3.2                       labeling_0.3                     
[61] GenomicAlignments_1.20.0          htmlwidgets_1.3                   cowplot_0.9.4                     bit_1.1-14                        tidyselect_0.2.5                 
[66] BSgenome.Hsapiens.UCSC.hg19_1.4.0 plyr_1.8.4                        magrittr_1.5                      DESeq2_1.24.0                     Hmisc_4.2-0                      
[71] DBI_1.0.0                         withr_2.1.2                       pillar_1.4.0                      foreign_0.8-71                    survival_2.44-1.1                
[76] RCurl_1.95-4.12                   nnet_7.3-12                       tibble_2.1.1                      crayon_1.3.4                      rmarkdown_1.12                   
[81] progress_1.2.1                    locfit_1.5-9.1                    blob_1.1.1                        digest_0.6.18                     webshot_0.5.1                    
[86] xtable_1.8-4                      munsell_0.5.0                     viridisLite_0.3.0                 quadprog_1.5-7                   
mtmorgan commented 5 years ago

Please ask questions about package use on the support site https://support.bioconductor.org .

Probably what you want to do is to create a VcfFile() representing the vcf file, and specify a 'yield size' (typically, 10,000 - 100,000 records at a time) representing the number of records to read

fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- VcfFile(fl, yieldSize = 5000)

Then, open the file and iterate through it

open(vcf)
repeat {
    result <- readVcf(vcf)
    if (length(result) == 0)
        break
    ## work on this chunk of the file
    message(nrow(result))
}
close(vcf)

There is an example on the help page ?readVcf

       ## ---------------------------------------------------------------------
       ## Iterate through VCF with 'yieldSize'
       ## ---------------------------------------------------------------------
       fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
       param <- ScanVcfParam(fixed="ALT", geno=c("GT", "GL"), info=c("LDAF"))
       tab <- TabixFile(fl, yieldSize=4000)
       open(tab)
       while (nrow(vcf_yield <- readVcf(tab, "hg19", param=param)))
           cat("vcf dim:", dim(vcf_yield), "\n")
       close(tab)

The GenomicFiles package reduceByYield() function might also be relevant.

SebastianHollizeck commented 5 years ago

Sorry if my message didnt come across, I consider this a bug or "not working as documented" I am quite capable of using other methods to read that vcf, but the documentation of the scanVcf methods says "scanVcf with param="missing" and file="character" or file="connection" scan the entire file. With file="connection", an argument n indicates the number of lines of the VCF file to input; a connection open at the beginning of the call is open and incremented by n lines at the end of the call, providing a convenient way to stream through large VCF files."

If that is not the case, you might want to either remove that line from the documentation of scanVcf or fix the bug.

Cheers, Sebastian

mtmorgan commented 5 years ago

Thanks I did not understand from your original comment that the documentation was not accurate. I'll update the documentation.

mtmorgan commented 5 years ago

Hmm, looking closer I see that it is documented and implemented, but with several bugs... I'll provide a patch

mtmorgan commented 5 years ago

This should be fixed when RELEASE_3_9 builds tonight at version 1.30.1, or the devel version builds tonight at 1.31.3.

SebastianHollizeck commented 5 years ago

Hey, sorry but this did not fix the issue.

I have VariantAnnotation 1.30.1 attached but I still get the same error. And from what I can see, the error does not come from the .vcf_scan_connection function that you patched, because it does not have the "scanVcf: " prefix But it comes from somewhere else.

> line <- scanVcf(file=vcfFile,n=1)
Error: $ operator is invalid for atomic vectors

The only thing I can think of is the result <- .Call(.scan_vcf_connection, txt, maps$samples, maps$fmap, maps$imap, maps$gmap, row.names) Also you are passing the already read in lines to the function, which by its name is made for a connection and not for plain text. But I dont know enough about the structure of VariantAnnotation to be able to help.

However the error still persists in the last version

mtmorgan commented 5 years ago

I have

> packageVersion("VariantAnnotation")
[1] '1.30.1'
> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> vcf_file <- file(fl)
> vcf_line <- scanVcf(file=vcf_file,n=1)
> names(vcf_line[[1]])
[1] "rowRanges" "REF"       "ALT"       "QUAL"      "FILTER"    "INFO"
[7] "GENO"
> vcf_line[[1]]$REF
  A DNAStringSet instance of length 1
    width seq
[1]     1 G> packageVersion("VariantAnnotation")
[1] '1.30.1'
> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> vcf_file <- file(fl)
> vcf_line <- scanVcf(file=vcf_file,n=1)
> names(vcf_line[[1]])
[1] "rowRanges" "REF"       "ALT"       "QUAL"      "FILTER"    "INFO"
[7] "GENO"
> vcf_line[[1]]$REF
  A DNAStringSet instance of length 1
    width seq
[1]     1 G

Do you have a complete example that fails?

.scan_vcf_connection is called internally; probably it's visible in traceback() after the error occurs, or going forward

> selectMethod("scanVcf", c("connection", "missing"))
Method Definition:

function (file, ..., param)
{
    .vcf_scan_connection(file, ...)
}
<bytecode: 0x7fd4c876b600>
<environment: namespace:VariantAnnotation>

Signatures:
        file         param
target  "connection" "missing"
defined "connection" "missing"
SebastianHollizeck commented 5 years ago

Oh I am very sorry, my R must have had a hiccup.

It is working fine now!

Thank you so much for the fix and sorry for the false alarm

mtmorgan commented 5 years ago

No problem, thanks for the bug report.