Bioconductor / Rsamtools

Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import
https://bioconductor.org/packages/Rsamtools
Other
27 stars 27 forks source link

`Error: scanTabix: '4' not present in tabix index` #33

Closed bschilder closed 2 years ago

bschilder commented 2 years ago

Hello,

So I seem to be having some issues with querying remote tabix files (e.g from ENCODE). Though I'm not sure if this is strictly related to the file being remote, or some other difference in how the file is formatted.

Reprex

Main example

 #### Setup ####

    #### Get example data (Parkinson's Disease GWAS) ####
    if(!require("echodata")) remotes::install_github("RajLabMSSM/echodata") 
    #### Construct query as granges ####
    dat <- echodata::BST1
    gr <- GenomicRanges::GRanges(
        seqnames = as.integer(dat$CHR),
        ranges = IRanges::IRanges(
            start = as.integer(dat$POS),
            end = as.integer(dat$POS)
        )
    )
    #### For reconstructing the data as a data.table with colnames ####
    scanTabix_to_dt <- function(bgz, query){
        ### Add missing header back in 
        header <- Rsamtools::headerTabix(bgz) 
        query_dt <- data.table::fread(paste(c(header$header, query),
                                            collapse = "\n"), fill=TRUE)
        return(query_dt)
    } 

    #### ------- Example 1: Remote file ------- ####
    ### Currently produces error

    ## This is an tabix-indexed and bgzip-compressed ENCODE file.
    bgz1 <- file.path(
        "https://egg2.wustl.edu/roadmap/data/byFileType",
        "chromhmmSegmentations/ChmmModels/coreMarks/jointModel/final",
        "E099_15_coreMarks_dense.bed.bgz"
    ) 
    #### Query ####
    query1 <- Rsamtools::scanTabix(file = bgz1, param = gr) # <-- error occurs here
    #  "Error: scanTabix: '4' not present in tabix index"

    #### Running without the gr query returns the while file ####
    ## However, because it's the whole file, this takes a while.
    query1 <- Rsamtools::scanTabix(file = bgz1)
    ## This part tends to crash R (memory overload?)
    # query_dt1 <- scanTabix_to_dt(bgz1, query1) 

Extended examples

```R #### ------- Example 2: Local file ------- #### ### Currently working fullSS_path <- echodata::example_fullSS() bgz2 <- Rsamtools::bgzip(file = fullSS_path, dest = paste0(fullSS_path,".bgz"), overwrite = TRUE) tbi2 <- Rsamtools::indexTabix(file = bgz2, seq = 2, start = 3, end = 3, comment = "SNP") #### Query #### query2 <- Rsamtools::scanTabix(file = bgz2, param = gr) query_dt2 <- scanTabix_to_dt(bgz2, query2) #### ------- Example 3: Local file prepared without Rsamtools ------- #### ### Currently working ## Tho not when bgzipping with CLI. may be a version conflict issue ## (unrelated to Rsamtools per se). fullSS_path <- echodata::example_fullSS() bgz3 <- paste0(fullSS_path,".bgz") #### Compress the file via CLI #### bgzip_method <- "rsamtools" if(bgzip_method == "cli"){ system(paste("bgzip -f", fullSS_path)) ## However, this causes an error with tabix: ## "[tabix] the compression of '..._subset.tsv.bgz' is not BGZF". weird! #### Check version of bgzip #### help <- system("bgzip -h", intern = TRUE) cat(help, sep = "\n") } else { ## Compressing with Rsamtools seems to work fine with tabix CLI bgz3 <- Rsamtools::bgzip(file = fullSS_path, dest = bgz3, overwrite = TRUE) } #### Tabix-index #### system(paste("tabix", "-f", "-h", "-s",2, "-b",3, "-e",3, "-c","SNP", bgz3 )) query3 <- Rsamtools::scanTabix(file = bgz3, param = gr) query_dt3 <- scanTabix_to_dt(bgz3, query3) ```

Session info

``` R version 4.1.0 (2021-05-18) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Big Sur 11.4 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] bitops_1.0-7 matrixStats_0.61.0 fs_1.5.2 [4] lubridate_1.8.0 bit64_4.0.5 filelock_1.0.2 [7] progress_1.2.2 httr_1.4.2 GenomeInfoDb_1.30.1 [10] gh_1.3.0 tools_4.1.0 utf8_1.2.2 [13] R6_2.5.1 DT_0.21 DBI_1.1.2 [16] BiocGenerics_0.40.0 tidyselect_1.1.2 prettyunits_1.1.1 [19] bit_4.0.4 curl_4.3.2 compiler_4.1.0 [22] cli_3.2.0 Biobase_2.54.0 xml2_1.3.3 [25] DelayedArray_0.20.0 rtracklayer_1.54.0 rappdirs_0.3.3 [28] stringr_1.4.0 digest_0.6.29 Rsamtools_2.10.0 [31] piggyback_0.1.1 R.utils_2.11.0 XVector_0.34.0 [34] pkgconfig_2.0.3 htmltools_0.5.2 echodata_0.99.7 [37] MatrixGenerics_1.6.0 BSgenome_1.62.0 dbplyr_2.1.1 [40] fastmap_1.1.0 htmlwidgets_1.5.4 rlang_1.0.2 [43] rstudioapi_0.13 RSQLite_2.2.10 BiocIO_1.4.0 [46] generics_0.1.2 jsonlite_1.8.0 echoconda_0.99.5 [49] BiocParallel_1.28.3 dplyr_1.0.8 zip_2.2.0 [52] R.oo_1.24.0 VariantAnnotation_1.40.0 RCurl_1.98-1.6 [55] magrittr_2.0.2 GenomeInfoDbData_1.2.7 Matrix_1.4-0 [58] Rcpp_1.0.8.3 S4Vectors_0.32.3 fansi_1.0.2 [61] reticulate_1.24-9000 lifecycle_1.0.1 R.methodsS3_1.8.1 [64] yaml_2.3.5 stringi_1.7.6 brio_1.1.3 [67] SummarizedExperiment_1.24.0 zlibbioc_1.40.0 BiocFileCache_2.2.1 [70] grid_4.1.0 blob_1.2.2 parallel_4.1.0 [73] crayon_1.5.0 lattice_0.20-45 Biostrings_2.62.0 [76] GenomicFeatures_1.46.5 hms_1.1.1 KEGGREST_1.34.0 [79] pillar_1.7.0 GenomicRanges_1.46.1 rjson_0.2.21 [82] biomaRt_2.50.3 clisymbols_1.2.0 stats4_4.1.0 [85] XML_3.99-0.9 glue_1.6.2 data.table_1.14.2 [88] png_0.1-7 vctrs_0.3.8 testthat_3.1.2 [91] purrr_0.3.4 tidyr_1.2.0 assertthat_0.2.1 [94] cachem_1.0.6 openxlsx_4.2.5 echotabix_0.99.3 [97] restfulr_0.0.13 gitcreds_0.1.1 tibble_3.1.6 [100] GenomicAlignments_1.30.0 AnnotationDbi_1.56.2 memoise_2.0.1 [103] IRanges_2.28.0 ellipsis_0.3.2 ```
bschilder commented 2 years ago

i think there might actually be a couple things going on here:

Screenshot 2022-03-19 at 18 05 56

vjcitn commented 2 years ago

I find that rtracklayer::import can import your data. As you note, the seqlevelsStyle is UCSC. A GRanges without this style can never succeed. After seqlevelsStyle(gr) = "UCSC",

> head(gr)
GRanges object with 6 ranges and 0 metadata columns:
      seqnames    ranges strand
         <Rle> <IRanges>  <Rle>
  [1]     chr4  15712787      *
  [2]     chr4  15730146      *
  [3]     chr4  15730398      *
  [4]     chr4  15710330      *
  [5]     chr4  15706790      *
  [6]     chr4  15737348      *
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> rtracklayer::import("E099_15_coreMarks_dense.bed.bgz", which=gr)
GRanges object with 163 ranges and 4 metadata columns:
        seqnames            ranges strand |        name     score     itemRgb
           <Rle>         <IRanges>  <Rle> | <character> <numeric> <character>
    [1]     chr4 14108401-14843200      * |    15_Quies         0     #FFFFFF
    [2]     chr4 14843201-14843600      * |       7_Enh         0     #FFFF00
    [3]     chr4 14843601-14844400      * |      5_TxWk         0     #006400
    [4]     chr4 14844401-14844600      * |       7_Enh         0     #FFFF00
    [5]     chr4 14844601-14847000      * |      5_TxWk         0     #006400
    ...      ...               ...    ... .         ...       ...         ...
  [159]     chr4 16723401-16724400      * |      5_TxWk         0     #006400
  [160]     chr4 16724401-16724600      * |       7_Enh         0     #FFFF00
  [161]     chr4 16724601-16725000      * |      5_TxWk         0     #006400
  [162]     chr4 16725001-16725600      * |       7_Enh         0     #FFFF00
  [163]     chr4 16725601-16805200      * |    15_Quies         0     #FFFFFF
                    thick
                <IRanges>
    [1] 14108401-14843200
    [2] 14843201-14843600
    [3] 14843601-14844400
    [4] 14844401-14844600
    [5] 14844601-14847000
    ...               ...
  [159] 16723401-16724400
  [160] 16724401-16724600
  [161] 16724601-16725000
  [162] 16725001-16725600
  [163] 16725601-16805200
  -------
  seqinfo: 25 sequences from an unspecified genome; no seqlengths
vjcitn commented 2 years ago

For scanTabix, with a correct seqlevelsStyle in the query,

> scanTabix(file=TabixFile(bgz1), param=gr) -> ii
[E::bgzf_read] Read block operation failed with error -1 after 0 of 8 bytes

I don't understand that.

mtmorgan commented 2 years ago

This is from samtools / htslib with an update to the Bioconductor version required; see https://github.com/Bioconductor/Rsamtools/issues/32#issuecomment-1073116872

hpages commented 2 years ago

A long due update! See https://github.com/Bioconductor/Rhtslib/issues/4 and https://github.com/Bioconductor/Rsamtools/issues/8#issuecomment-1076024301. Should we make this a priority for BioC 3.16?

bschilder commented 2 years ago

For scanTabix, with a correct seqlevelsStyle in the query,

> scanTabix(file=TabixFile(bgz1), param=gr) -> ii
[E::bgzf_read] Read block operation failed with error -1 after 0 of 8 bytes

I don't understand that.

Getting this error as well now with VCFs from the 1000 Genomes Project, which I previously didn't have any issues with.`

This seems to have the effect of breaking VariantAnnotation::readVcf as well. @vjcitn have you noticed this with other files?

Here are several examples that currently work to varying degrees. I can confirm this because I have these numbers recorded in my unit tests. https://github.com/RajLabMSSM/echotabix/blob/main/tests/testthat/test-query_vcf.R

target_path <- file.path(
"ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/",
"ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz")
param <- GenomicRanges::GRanges("4:14737349-16737284")

## This produces the error message but successfully returns the header
header <- Rsamtools::headerTabix(file = target_path) 

## This does NOT produce the error message and successfully returns the header
header <- VariantAnnotation::scanVcfHeader(file = target_path) 

## These both  produce the error message but return only 468 variants from all of chrom 4
## Was previously returning many many thousands of variants.
vcf <-  VariantAnnotation::readVcf(target_path)
vcf <- Rsamtools::scanTabix(file = target_path)
## Produces the same "Read block operation failed" error message as the other two methods,
## but then fails with an error in R, thus returning no output:
### Error in read.table(con, sep = "\t", ...) : 
###   incomplete final line found by readTableHeader on 'gzcon(ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521//ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz)'
tbx <- Rsamtools::TabixFile(target_path)
out <- rtracklayer::import(tbx) 

## These both produce the error message but return 0 variants
## Was previously returning 24,376 variants.  
vcf <-  VariantAnnotation::readVcf(target_path, param=param)
## Warning: Take a very long time
vcf <- Rsamtools::scanTabix(file = target_path, param=param) 
## Produces the same error message (3 times) but return 0 variants, 
## and then throws an error indicating that there's no input to parse.
tbx <- Rsamtools::TabixFile(target_path)
out <- rtracklayer::import(tbx, which=param) 

Screenshot 2022-04-10 at 11 17 34

Session info

``` R version 4.1.3 (2022-03-10) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.3.1 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] dplyr_1.0.8 echotabix_0.99.5 VariantAnnotation_1.40.0 [4] Rsamtools_2.10.0 Biostrings_2.62.0 XVector_0.34.0 [7] SummarizedExperiment_1.24.0 Biobase_2.54.0 GenomicRanges_1.46.1 [10] GenomeInfoDb_1.30.1 IRanges_2.28.0 S4Vectors_0.32.4 [13] MatrixGenerics_1.6.0 matrixStats_0.61.0 BiocGenerics_0.40.0 loaded via a namespace (and not attached): [1] colorspace_2.0-3 rjson_0.2.21 class_7.3-20 [4] ellipsis_0.3.2 rprojroot_2.0.3 fs_1.5.2 [7] gld_2.6.4 proxy_0.4-26 rstudioapi_0.13 [10] DT_0.22 gh_1.3.0 bit64_4.0.5 [13] AnnotationDbi_1.56.2 fansi_1.0.3 mvtnorm_1.1-3 [16] lubridate_1.8.0 xml2_1.3.3 R.methodsS3_1.8.1 [19] rootSolve_1.8.2.3 cachem_1.0.6 pkgload_1.2.4 [22] jsonlite_1.8.0 dbplyr_2.1.1 png_0.1-7 [25] R.oo_1.24.0 echoconda_0.99.5 echodata_0.99.7 [28] readr_2.1.2 compiler_4.1.3 httr_1.4.2 [31] assertthat_0.2.1 Matrix_1.4-0 fastmap_1.1.0 [34] gargle_1.2.0 cli_3.2.0 htmltools_0.5.2 [37] prettyunits_1.1.1 tools_4.1.3 gtable_0.3.0 [40] glue_1.6.2 lmom_2.8 GenomeInfoDbData_1.2.7 [43] rappdirs_0.3.3 Rcpp_1.0.8.3 vctrs_0.4.0 [46] rtracklayer_1.54.0 stringr_1.4.0 MungeSumstats_1.3.16 [49] brio_1.1.3 openxlsx_4.2.5 testthat_3.1.3 [52] lifecycle_1.0.1 restfulr_0.0.13 XML_3.99-0.9 [55] googleAuthR_2.0.0 scales_1.1.1 MASS_7.3-55 [58] zlibbioc_1.40.0 BSgenome_1.62.0 clisymbols_1.2.0 [61] hms_1.1.1 parallel_4.1.3 expm_0.999-6 [64] Exact_3.1 yaml_2.3.5 curl_4.3.2 [67] memoise_2.0.1 reticulate_1.24-9000 ggplot2_3.3.5 [70] biomaRt_2.50.3 stringi_1.7.6 RSQLite_2.2.11 [73] BiocIO_1.4.0 desc_1.4.1 e1071_1.7-9 [76] GenomicFeatures_1.46.5 filelock_1.0.2 boot_1.3-28 [79] zip_2.2.0 BiocParallel_1.28.3 rlang_1.0.2 [82] pkgconfig_2.0.3 bitops_1.0-7 lattice_0.20-45 [85] scKirby_0.1.0 purrr_0.3.4 GenomicAlignments_1.30.0 [88] htmlwidgets_1.5.4 bit_4.0.4 tidyselect_1.1.2 [91] magrittr_2.0.3 R6_2.5.1 DescTools_0.99.44 [94] generics_0.1.2 piggyback_0.1.1 DelayedArray_0.20.0 [97] DBI_1.1.2 pillar_1.7.0 withr_2.5.0 [100] KEGGREST_1.34.0 RCurl_1.98-1.6 tibble_3.1.6 [103] crayon_1.5.1 utf8_1.2.2 BiocFileCache_2.2.1 [106] tzdb_0.3.0 progress_1.2.2 grid_4.1.3 [109] data.table_1.14.2 blob_1.2.2 digest_0.6.29 [112] tidyr_1.2.0 R.utils_2.11.0 munsell_0.5.0 ```
mtmorgan commented 2 years ago

I'll look into this more closely; I also remember the 1000 genomes VCFs working.

Unrelated, but file.path() is not the right way to assemble URLs, because it uses as separator the default for the platform -- on Windows that's \ rather than /, at least in principle. Also philosophically those things at the end of a URL are not paths, just identifiers, as in object stores such as Amazon S3 where the 'object' just happens to have an identifier that looks like a file path.

bschilder commented 2 years ago

thanks @mtmorgan. ah, hadn't thought of that with URLs, i'll be more careful with those in the future.

bschilder commented 2 years ago

Just a heads up, this also affects rtracklayer, which also has some functionality for importing VCFs (which I never knew it could do!). Tagging the rtracklayer maintainer as well. @sanchit-saini

I've updated the reprex above to demonstrate that the same error occurs with this method.

bschilder commented 2 years ago

Interestingly, seqminer is still able to successfully perform these VCF queries. I believe this is because it does not rely on Rhtslib or Rsamtools. I may need to switch to this method for my packages while the issues with Rsamtools and/or Rhtslib are being resolved, since querying VCFs is pretty crucial to my packages. That said, I'd prefer to use VariantAnnotation once it is working again since it has the added benefit of subsetting VCFs by sample IDs. @vjcitn

Tagging the seqminer maintainers here: @zhanxw @yang-lina

seqminer reprex

 target_path <- paste(
        "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/",
        "ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz",
        sep="/")

out <- seqminer::tabix.read.table(target_path, tabixRange = "4:14737349-16737284")
dim(out)
## [1] 28228  1101

Session info

``` R version 4.1.3 (2022-03-10) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.3.1 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats4 stats graphics grDevices utils datasets methods base other attached packages: [1] echotabix_0.99.5 dplyr_1.0.8 VariantAnnotation_1.40.0 Rsamtools_2.10.0 [5] Biostrings_2.62.0 XVector_0.34.0 SummarizedExperiment_1.24.0 Biobase_2.54.0 [9] GenomicRanges_1.46.1 GenomeInfoDb_1.30.1 IRanges_2.28.0 S4Vectors_0.32.4 [13] MatrixGenerics_1.6.0 matrixStats_0.61.0 BiocGenerics_0.40.0 loaded via a namespace (and not attached): [1] fs_1.5.2 bitops_1.0-7 lubridate_1.8.0 bit64_4.0.5 [5] filelock_1.0.2 progress_1.2.2 httr_1.4.2 rprojroot_2.0.3 [9] gh_1.3.0 tools_4.1.3 utf8_1.2.2 R6_2.5.1 [13] DT_0.22 DBI_1.1.2 withr_2.5.0 tidyselect_1.1.2 [17] prettyunits_1.1.1 bit_4.0.4 curl_4.3.2 compiler_4.1.3 [21] cli_3.2.0 xml2_1.3.3 desc_1.4.1 DelayedArray_0.20.0 [25] rtracklayer_1.54.0 readr_2.1.2 rappdirs_0.3.3 stringr_1.4.0 [29] digest_0.6.29 piggyback_0.1.1 R.utils_2.11.0 htmltools_0.5.2 [33] pkgconfig_2.0.3 echodata_0.99.7 dbplyr_2.1.1 fastmap_1.1.0 [37] BSgenome_1.62.0 htmlwidgets_1.5.4 rlang_1.0.2 rstudioapi_0.13 [41] RSQLite_2.2.12 BiocIO_1.4.0 generics_0.1.2 jsonlite_1.8.0 [45] echoconda_0.99.5 BiocParallel_1.28.3 zip_2.2.0 R.oo_1.24.0 [49] RCurl_1.98-1.6 magrittr_2.0.3 GenomeInfoDbData_1.2.7 Matrix_1.4-1 [53] Rcpp_1.0.8.3 fansi_1.0.3 reticulate_1.24-9000 lifecycle_1.0.1 [57] R.methodsS3_1.8.1 stringi_1.7.6 yaml_2.3.5 zlibbioc_1.40.0 [61] brio_1.1.3 BiocFileCache_2.2.1 grid_4.1.3 blob_1.2.2 [65] parallel_4.1.3 crayon_1.5.1 lattice_0.20-45 GenomicFeatures_1.46.5 [69] hms_1.1.1 KEGGREST_1.34.0 seqminer_8.4 pillar_1.7.0 [73] rjson_0.2.21 clisymbols_1.2.0 biomaRt_2.50.3 pkgload_1.2.4 [77] XML_3.99-0.9 glue_1.6.2 data.table_1.14.2 tzdb_0.3.0 [81] png_0.1-7 vctrs_0.4.0 testthat_3.1.3 tidyr_1.2.0 [85] purrr_0.3.4 assertthat_0.2.1 cachem_1.0.6 openxlsx_4.2.5 [89] restfulr_0.0.13 tibble_3.1.6 GenomicAlignments_1.30.0 AnnotationDbi_1.56.2 [93] memoise_2.0.1 ellipsis_0.3.2 ```
vjcitn commented 2 years ago

Thanks for these reports. I am having a look at migrating Rsamtools/Rhtslib to more current htslib. I do not know how long it will take. @hpages @mtmorgan -- Herve has a number of fixes to htslib 1.7 sources that may or may not need to be migrated to an updated Rhtslib.

vjcitn commented 2 years ago

@bschilder would you consider forking the relevant Bioconductor packages and adding unit tests that exhibit the problems you have identified, and adding these unit tests that will fail until these conditions are fixed?

sanchit-saini commented 2 years ago

Just a heads up, this also affects rtracklayer, which also has some functionality for importing VCFs (which I never knew it could do!). Tagging the rtracklayer maintainer as well. @sanchit-saini

I've updated the reprex above to demonstrate that the same error occurs with this method.

@lawremi I'm just a collaborator tagging the actual maintainer.

bschilder commented 2 years ago

@bschilder would you consider forking the relevant Bioconductor packages and adding unit tests that exhibit the problems you have identified, and adding these unit tests that will fail until these conditions are fixed?

I'm afraid this is a bit more than I can commit to atm, but please feel free to use the example I provided above. I think that should contain all of the information you need. @vjcitn

bschilder commented 2 years ago

Thanks for these reports. I am having a look at migrating Rsamtools/Rhtslib to more current htslib. I do not know how long it will take. @hpages @mtmorgan -- Herve has a number of fixes to htslib 1.7 sources that may or may not need to be migrated to an updated Rhtslib.

Just checking in, has there been progress on fixing this? @hpages @mtmorgan

vjcitn commented 2 years ago

Progress is being made but we need to release 3.15 of the whole ecosystem. After that I will try to deal with this.

hpages commented 2 years ago

An update on this: This is done for Linux and Mac (see https://github.com/Bioconductor/Rhtslib/issues/4#issuecomment-1118098667). We still need to make sure things work properly on Windows.

bschilder commented 2 years ago

thanks for the update @hpages

Updating Rhtslib via GitHub

remotes::install_github('Bioconductor/Rhtslib')

I just tried installing the latest Rhtslib from GitHub (v1.99.2) and rerunning the examples above. Even after restarting R, I still seem to be getting the exact same issues as before.

Updating Rhtslib via Bioc 3.16

BiocManager::install(version='devel')

That said, when I updated from Bioc 3.15 --> 3.16 within a Bioc Docker container, I noticed that the above examples work as expected! As in, they return the correct number of rows back without error.

So i'm wondering if this difference between installation methods might have something to do with the new version of htslib not replacing the old one when I install via GitHub? Or perhaps some other Bioc libraries also need to be upgraded in order for this to work, in which case maybe some minimum versions could be specified in the Rhtslib DESCRIPTION file.

Session info

``` R version 4.2.0 (2022-04-22) Platform: x86_64-apple-darwin17.0 (64-bit) Running under: macOS Monterey 12.3.1 Matrix products: default LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib locale: [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8 attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] bitops_1.0-7 matrixStats_0.62.0 bit64_4.0.5 [4] filelock_1.0.2 progress_1.2.2 httr_1.4.3 [7] GenomeInfoDb_1.32.1 tools_4.2.0 utf8_1.2.2 [10] R6_2.5.1 DT_0.22 DBI_1.1.2 [13] BiocGenerics_0.42.0 tidyselect_1.1.2 prettyunits_1.1.1 [16] bit_4.0.4 curl_4.3.2 compiler_4.2.0 [19] cli_3.3.0 Biobase_2.56.0 basilisk.utils_1.8.0 [22] xml2_1.3.3 DelayedArray_0.22.0 rtracklayer_1.56.0 [25] readr_2.1.2 rappdirs_0.3.3 stringr_1.4.0 [28] digest_0.6.29 Rsamtools_2.12.0 piggyback_0.1.2 [31] R.utils_2.11.0 basilisk_1.8.0 XVector_0.36.0 [34] pkgconfig_2.0.3 htmltools_0.5.2 echodata_0.99.9 [37] MatrixGenerics_1.8.0 dbplyr_2.1.1 fastmap_1.1.0 [40] BSgenome_1.64.0 htmlwidgets_1.5.4 rlang_1.0.2 [43] rstudioapi_0.13 RSQLite_2.2.14 BiocIO_1.6.0 [46] generics_0.1.2 jsonlite_1.8.0 echoconda_0.99.6 [49] BiocParallel_1.30.0 zip_2.2.0 dplyr_1.0.9 [52] R.oo_1.24.0 VariantAnnotation_1.42.0 RCurl_1.98-1.6 [55] magrittr_2.0.3 GenomeInfoDbData_1.2.8 Matrix_1.4-1 [58] Rcpp_1.0.8.3 S4Vectors_0.34.0 fansi_1.0.3 [61] reticulate_1.24 lifecycle_1.0.1 R.methodsS3_1.8.1 [64] stringi_1.7.6 yaml_2.3.5 SummarizedExperiment_1.26.1 [67] zlibbioc_1.42.0 brio_1.1.3 BiocFileCache_2.4.0 [70] grid_4.2.0 blob_1.2.3 parallel_4.2.0 [73] crayon_1.5.1 dir.expiry_1.4.0 lattice_0.20-45 [76] Biostrings_2.64.0 GenomicFeatures_1.48.0 hms_1.1.1 [79] KEGGREST_1.36.0 pillar_1.7.0 GenomicRanges_1.48.0 [82] rjson_0.2.21 biomaRt_2.52.0 stats4_4.2.0 [85] XML_3.99-0.9 glue_1.6.2 data.table_1.14.2 [88] tzdb_0.3.0 png_0.1-7 vctrs_0.4.1 [91] testthat_3.1.4 tidyr_1.2.0 purrr_0.3.4 [94] assertthat_0.2.1 cachem_1.0.6 openxlsx_4.2.5 [97] echotabix_0.99.6 restfulr_0.0.13 tibble_3.1.7 [100] GenomicAlignments_1.32.0 AnnotationDbi_1.58.0 memoise_2.0.1 [103] IRanges_2.30.0 ellipsis_0.3.2 ```
mtmorgan commented 2 years ago

VariantAnnotation needs to be re-installed (it calls Rsamtools' C code from C code). It has had a version bump so should be installable via BiocManager either later today or later on Sunday, all being well...

It looks like Rhtslib is available via BiocManager https://bioconductor.org/packages/3.16/bioc/html/Rhtslib.html.

Packages need to be installed in the correct order (which BiocManager::install() takes care of, once updated versions have successfully propagated...) ... first Rhtslib then Rsamtools then VariantAnnotation. If you installed Rsamtools (using a previous version of Rhtslib), then Rhtslib, Rsamtools will be statically linked to the previous version of Rhtslib, which explains why you see the same behavior. If you install Rhtslib then Rsamtools but don't install VariantAnnotation, the readVcf() etc will result in a segfault because VariantAnnotation is expecting a different version of the Rsamtools C code.

I'm not completely familiar with the macOS build system, but in general it is important that the same compiler and compiler settings are used for each library, so in general one would want to either install all from source, or all as binaries.

hpages commented 2 years ago

The latest Rsamtools (2.13.2) was updated to work with the new Rhtslib (based on htslib 1.15.1). It is now available in BioC 3.16 (current devel) via BiocManager::install(). This should grab the new Windows or Mac binary for Rsamtools 2.13.2 if you are on these platforms.

Can someone confirm that this issue is gone with Rsamtools 2.13.2? We want to make sure that this is tested on Windows before we close. Thanks!

H.

vjcitn commented 2 years ago

I confirm that

vcf <- Rsamtools::scanTabix(file = target_path, param=param) 

from https://github.com/Bioconductor/Rsamtools/issues/33#issuecomment-1088048436 succeeds with

R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] Rsamtools_2.13.2     Biostrings_2.65.0    XVector_0.37.0      
[4] GenomicRanges_1.49.0 GenomeInfoDb_1.33.3  IRanges_2.31.0      
[7] S4Vectors_0.35.0     BiocGenerics_0.43.0 

loaded via a namespace (and not attached):
[1] crayon_1.5.1           bitops_1.0-7           zlibbioc_1.43.0       
[4] BiocParallel_1.31.3    tools_4.2.0            RCurl_1.98-1.6        
[7] parallel_4.2.0         compiler_4.2.0         GenomeInfoDbData_1.2.8
hpages commented 2 years ago

Excellent. Thanks Vince!