Closed: bschilder closed this issue 2 years ago
i think there might actually be a couple things going on here:
Rsamtools. I find that rtracklayer::import can import your data. As you note, the seqlevelsStyle is UCSC. A GRanges without this style can never succeed. After seqlevelsStyle(gr) = "UCSC",
> head(gr)
GRanges object with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr4 15712787 *
[2] chr4 15730146 *
[3] chr4 15730398 *
[4] chr4 15710330 *
[5] chr4 15706790 *
[6] chr4 15737348 *
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> rtracklayer::import("E099_15_coreMarks_dense.bed.bgz", which=gr)
GRanges object with 163 ranges and 4 metadata columns:
seqnames ranges strand | name score itemRgb
<Rle> <IRanges> <Rle> | <character> <numeric> <character>
[1] chr4 14108401-14843200 * | 15_Quies 0 #FFFFFF
[2] chr4 14843201-14843600 * | 7_Enh 0 #FFFF00
[3] chr4 14843601-14844400 * | 5_TxWk 0 #006400
[4] chr4 14844401-14844600 * | 7_Enh 0 #FFFF00
[5] chr4 14844601-14847000 * | 5_TxWk 0 #006400
... ... ... ... . ... ... ...
[159] chr4 16723401-16724400 * | 5_TxWk 0 #006400
[160] chr4 16724401-16724600 * | 7_Enh 0 #FFFF00
[161] chr4 16724601-16725000 * | 5_TxWk 0 #006400
[162] chr4 16725001-16725600 * | 7_Enh 0 #FFFF00
[163] chr4 16725601-16805200 * | 15_Quies 0 #FFFFFF
thick
<IRanges>
[1] 14108401-14843200
[2] 14843201-14843600
[3] 14843601-14844400
[4] 14844401-14844600
[5] 14844601-14847000
... ...
[159] 16723401-16724400
[160] 16724401-16724600
[161] 16724601-16725000
[162] 16725001-16725600
[163] 16725601-16805200
-------
seqinfo: 25 sequences from an unspecified genome; no seqlengths
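To illustrate why the query style matters, here is a minimal base-R sketch of what switching sequence names to UCSC style amounts to. The helper to_ucsc_style() is hypothetical and only covers the common cases; in practice seqlevelsStyle(gr) = "UCSC" does this on the GRanges itself.

```r
# Hypothetical helper (not part of any package): map NCBI/Ensembl-style
# sequence names ("4", "X", "MT") to UCSC-style names ("chr4", "chrX", "chrM").
to_ucsc_style <- function(seqnames) {
  seqnames <- as.character(seqnames)
  # The mitochondrial chromosome is "MT" in NCBI style but "chrM" in UCSC style.
  seqnames[seqnames == "MT"] <- "M"
  # Leave names that already carry the "chr" prefix untouched.
  ifelse(startsWith(seqnames, "chr"), seqnames, paste0("chr", seqnames))
}

to_ucsc_style(c("4", "chr4", "X", "MT"))
```

A tabix index matches sequence names as plain strings, so a query on "4" finds nothing in a file whose records are on "chr4", and vice versa.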
For scanTabix, with a correct seqlevelsStyle in the query,
> scanTabix(file=TabixFile(bgz1), param=gr) -> ii
[E::bgzf_read] Read block operation failed with error -1 after 0 of 8 bytes
I don't understand that.
This is from samtools / htslib with an update to the Bioconductor version required; see https://github.com/Bioconductor/Rsamtools/issues/32#issuecomment-1073116872
A long overdue update! See https://github.com/Bioconductor/Rhtslib/issues/4 and https://github.com/Bioconductor/Rsamtools/issues/8#issuecomment-1076024301. Should we make this a priority for BioC 3.16?
Getting this error as well now with VCFs from the 1000 Genomes Project, which I previously didn't have any issues with.
This seems to have the effect of breaking VariantAnnotation::readVcf as well. @vjcitn have you noticed this with other files?
Here are several examples that currently work to varying degrees. I can confirm this because I have these numbers recorded in my unit tests. https://github.com/RajLabMSSM/echotabix/blob/main/tests/testthat/test-query_vcf.R
target_path <- file.path(
"ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/",
"ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz")
param <- GenomicRanges::GRanges("4:14737349-16737284")
## This produces the error message but successfully returns the header
header <- Rsamtools::headerTabix(file = target_path)
## This does NOT produce the error message and successfully returns the header
header <- VariantAnnotation::scanVcfHeader(file = target_path)
## These both produce the error message but return only 468 variants from all of chrom 4
## Was previously returning many many thousands of variants.
vcf <- VariantAnnotation::readVcf(target_path)
vcf <- Rsamtools::scanTabix(file = target_path)
## Produces the same "Read block operation failed" error message as the other two methods,
## but then fails with an error in R, thus returning no output:
### Error in read.table(con, sep = "\t", ...) :
### incomplete final line found by readTableHeader on 'gzcon(ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521//ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz)'
tbx <- Rsamtools::TabixFile(target_path)
out <- rtracklayer::import(tbx)
## These both produce the error message but return 0 variants
## Was previously returning 24,376 variants.
vcf <- VariantAnnotation::readVcf(target_path, param=param)
## Warning: Takes a very long time
vcf <- Rsamtools::scanTabix(file = target_path, param=param)
## Produces the same error message (3 times) but returns 0 variants,
## and then throws an error indicating that there's no input to parse.
tbx <- Rsamtools::TabixFile(target_path)
out <- rtracklayer::import(tbx, which=param)
I'll look into this more closely; I also remember the 1000 genomes VCFs working.
Unrelated, but file.path() is not the right way to assemble URLs, because it uses as separator the default for the platform -- on Windows that's \ rather than /, at least in principle. Also philosophically those things at the end of a URL are not paths, just identifiers, as in object stores such as Amazon S3 where the 'object' just happens to have an identifier that looks like a file path.
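One way to act on this is to build the URL by plain string concatenation. The sketch below uses a hypothetical helper, join_url(), which also drops any trailing slash on the base so the result does not contain the "//" visible in the read.table error message above.

```r
# Hypothetical helper: join a base URL and a leaf name with exactly one "/",
# independent of the platform's file separator.
join_url <- function(base, leaf) {
  paste0(sub("/+$", "", base), "/", leaf)
}

target_path <- join_url(
  "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/",
  "ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
)
target_path
```

Stripping the trailing slash first means the helper gives the same result whether or not the base ends in "/".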
thanks @mtmorgan. ah, hadn't thought of that with URLs, i'll be more careful with those in the future.
Just a heads up, this also affects rtracklayer, which also has some functionality for importing VCFs (which I never knew it could do!). Tagging the rtracklayer maintainer as well. @sanchit-saini
I've updated the reprex above to demonstrate that the same error occurs with this method.
Interestingly, seqminer is still able to successfully perform these VCF queries. I believe this is because it does not rely on Rhtslib or Rsamtools. I may need to switch to this method for my packages while the issues with Rsamtools and/or Rhtslib are being resolved, since querying VCFs is pretty crucial to my packages. That said, I'd prefer to use VariantAnnotation once it is working again since it has the added benefit of subsetting VCFs by sample IDs. @vjcitn
Tagging the seqminer maintainers here: @zhanxw @yang-lina
target_path <- paste(
"ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/",
"ALL.chr4.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz",
sep="/")
out <- seqminer::tabix.read.table(target_path, tabixRange = "4:14737349-16737284")
dim(out)
## [1] 28228 1101
Thanks for these reports. I am having a look at migrating Rsamtools/Rhtslib to more current htslib. I do not know how long it will take. @hpages @mtmorgan -- Herve has a number of fixes to htslib 1.7 sources that may or may not need to be migrated to an updated Rhtslib.
@bschilder would you consider forking the relevant Bioconductor packages and adding unit tests that exhibit the problems you have identified, i.e. tests that will fail until these conditions are fixed?
@lawremi I'm just a collaborator tagging the actual maintainer.
I'm afraid this is a bit more than I can commit to atm, but please feel free to use the example I provided above. I think that should contain all of the information you need. @vjcitn
Just checking in, has there been progress on fixing this? @hpages @mtmorgan
Progress is being made but we need to release 3.15 of the whole ecosystem. After that I will try to deal with this.
An update on this: This is done for Linux and Mac (see https://github.com/Bioconductor/Rhtslib/issues/4#issuecomment-1118098667). We still need to make sure things work properly on Windows.
thanks for the update @hpages
Rhtslib via GitHub
remotes::install_github('Bioconductor/Rhtslib')
I just tried installing the latest Rhtslib from GitHub (v1.99.2) and rerunning the examples above. Even after restarting R, I still seem to be getting the exact same issues as before.
Rhtslib via Bioc 3.16
BiocManager::install(version='devel')
That said, when I updated from Bioc 3.15 --> 3.16 within a Bioc Docker container, I noticed that the above examples work as expected! As in, they return the correct number of rows back without error.
So i'm wondering if this difference between installation methods might have something to do with the new version of htslib not replacing the old one when I install via GitHub? Or perhaps some other Bioc libraries also need to be upgraded in order for this to work, in which case maybe some minimum versions could be specified in the Rhtslib DESCRIPTION file.
VariantAnnotation needs to be re-installed (it calls Rsamtools' C code from C code). It has had a version bump so should be installable via BiocManager either later today or later on Sunday, all being well...
It looks like Rhtslib is available via BiocManager https://bioconductor.org/packages/3.16/bioc/html/Rhtslib.html.
Packages need to be installed in the correct order (which BiocManager::install() takes care of, once updated versions have successfully propagated...) ... first Rhtslib then Rsamtools then VariantAnnotation. If you installed Rsamtools (using a previous version of Rhtslib), then Rhtslib, Rsamtools will be statically linked to the previous version of Rhtslib, which explains why you see the same behavior. If you install Rhtslib then Rsamtools but don't install VariantAnnotation, the readVcf() etc will result in a segfault because VariantAnnotation is expecting a different version of the Rsamtools C code.
I'm not completely familiar with the macOS build system, but in general it is important that the same compiler and compiler settings are used for each library, so in general one would want to either install all from source, or all as binaries.
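The install order described above can be sketched as a small wrapper. The function reinstall_in_order() and its installer argument are hypothetical names used for illustration; the sketch assumes BiocManager is installed and already pointed at the Bioconductor version you want.

```r
# Each package is compiled against the C code of the one before it, so they
# have to be rebuilt in this order.
link_order <- c("Rhtslib", "Rsamtools", "VariantAnnotation")

# Hypothetical wrapper: force a reinstall of each package, one at a time,
# in link order. Defining the function installs nothing; only calling it does.
reinstall_in_order <- function(pkgs = link_order,
                               installer = BiocManager::install) {
  for (pkg in pkgs) {
    # force = TRUE reinstalls even when the installed version looks current.
    installer(pkg, force = TRUE, update = FALSE, ask = FALSE)
  }
  invisible(pkgs)
}
```

Calling reinstall_in_order() in a fresh R session, before any of the three packages is loaded, should avoid mixing an old Rsamtools build with a new Rhtslib.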
The latest Rsamtools (2.13.2) was updated to work with the new Rhtslib (based on htslib 1.15.1). It is now available in BioC 3.16 (current devel) via BiocManager::install(). This should grab the new Windows or Mac binary for Rsamtools 2.13.2 if you are on these platforms.
Can someone confirm that this issue is gone with Rsamtools 2.13.2? We want to make sure that this is tested on Windows before we close. Thanks!
H.
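Before rerunning the query, it may help to confirm which Rsamtools build is actually installed. A minimal sketch, using a hypothetical helper has_min_version() and the 2.13.2 version number mentioned above:

```r
# Hypothetical helper: TRUE only if the package is installed and its version
# is at least min_version.
has_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

has_min_version("Rsamtools", "2.13.2")
```

This only checks the version number. A package built against an older Rhtslib can still report a recent version, so the install order still matters.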
I confirm that
vcf <- Rsamtools::scanTabix(file = target_path, param=param)
from https://github.com/Bioconductor/Rsamtools/issues/33#issuecomment-1088048436 succeeds with
R version 4.2.0 (2022-04-22 ucrt)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19044)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.utf8
[2] LC_CTYPE=English_United States.utf8
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C
[5] LC_TIME=English_United States.utf8
attached base packages:
[1] stats4 stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] Rsamtools_2.13.2 Biostrings_2.65.0 XVector_0.37.0
[4] GenomicRanges_1.49.0 GenomeInfoDb_1.33.3 IRanges_2.31.0
[7] S4Vectors_0.35.0 BiocGenerics_0.43.0
loaded via a namespace (and not attached):
[1] crayon_1.5.1 bitops_1.0-7 zlibbioc_1.43.0
[4] BiocParallel_1.31.3 tools_4.2.0 RCurl_1.98-1.6
[7] parallel_4.2.0 compiler_4.2.0 GenomeInfoDbData_1.2.8
Excellent. Thanks Vince!
Hello,
So I seem to be having some issues with querying remote tabix files (e.g. from ENCODE). Though I'm not sure if this is strictly related to the file being remote, or some other difference in how the file is formatted.