Closed: romanzenka closed this issue 5 years ago.
Can you illustrate what you are trying to do, using for instance the data that comes with the package and is used in the ?TabixFile example?
This is just an example, but it is something like this:
library(Rsamtools)
library(data.table)
fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools",
                  mustWork=TRUE)
tbx <- TabixFile(fl)
param <- GRanges("chr1", IRanges(1, width=100000))
res <- scanTabix(tbx, param=param)
resAsDataFrame <- fread(paste0(res[[1]], collapse="\n"), fill=TRUE, sep="\t")
Sorry, pushed wrong button!
There are two solutions available. The first is to iterate through the file in chunks, using the yieldSize= argument to TabixFile().
tbx <- open(TabixFile(fl, yieldSize=100))
while (length(chunk <- scanTabix(tbx)[[1]])) {
    message(".")
    ## do work on this chunk
}
close(tbx)
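(A minimal, self-contained sketch of that loop, using the example file from above; the per-chunk "work" here is just counting records, but each chunk could equally be parsed into a data frame and accumulated.)
library(Rsamtools)
fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools",
                  mustWork=TRUE)
tbx <- open(TabixFile(fl, yieldSize=100))
nrec <- 0L
while (length(chunk <- scanTabix(tbx)[[1]])) {
    ## 'chunk' is a character vector of up to 100 raw tab-delimited records
    nrec <- nrec + length(chunk)
}
close(tbx)
nrec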
The second is to provide ranges that tile across the file, perhaps splitting into convenient groups (e.g., by chromosome) for memory management.
param <- GRanges(sprintf("chr%d:1-100000", 1:2))
tbx <- open(TabixFile(fl))
lapply(split(param, seqnames(param)), function(p) {
    message(".")
    res <- scanTabix(tbx, param=p)
    ## do work on this GRange, e.g.,
    res <- do.call(paste, c(res, list(collapse="\n")))
    dim(read.delim(text=res, header=FALSE))
})
close(tbx)
The use of paste() is one alternative; another is
read.delim(textConnection(unlist(res)), header = FALSE)
Using a text connection seems to be faster and more scalable.
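A minimal, self-contained sketch of the text connection approach, reusing the example file and a single chr1 range from earlier in the thread:
library(Rsamtools)
fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools",
                  mustWork=TRUE)
param <- GRanges("chr1", IRanges(1, width=100000))
res <- scanTabix(TabixFile(fl), param=param)
## each element of 'res' is a character vector of tab-delimited records;
## a text connection feeds those lines to read.delim() without first
## building one large pasted string
df <- read.delim(textConnection(unlist(res, use.names=FALSE)), header=FALSE)
dim(df)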
Okay, I will look into read.delim(textConnection()), but I have very large amounts of data to read, so every extra step I can avoid would help my performance. A lot of functions that parse things (including read.delim) take a stream, which seems to be a perfect match here.
Are you opposed to the notion of using streams for some reason besides the expense of implementing them? If not, I might look into whether I can help with the implementation and make a pull request.
Note that fread(text = unlist(res, use.names=FALSE), sep="\t", header=FALSE)
also works, without pasting or using a textConnection.
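And, for completeness, the same read as a self-contained sketch, assuming a data.table version whose fread() supports the text= argument:
library(Rsamtools)
library(data.table)
fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools",
                  mustWork=TRUE)
param <- GRanges("chr1", IRanges(1, width=100000))
res <- scanTabix(TabixFile(fl), param=param)
## fread() consumes the lines directly via 'text='; no pasting and no
## connection object are needed
dt <- fread(text=unlist(res, use.names=FALSE), sep="\t", header=FALSE)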
I'm not really sure why, and this is obviously not the right test dataset, but
> system.time(fread(fl))
user system elapsed
0.048 0.004 0.053
is about half as fast as chunk-wise iteration
> system.time({
+     tbx <- open(TabixFile(fl, yieldSize=100))
+     tbl <- NULL
+     while (length(chunk <- scanTabix(tbx)[[1]])) {
+         tbl <- rbind(tbl, fread(text=chunk, sep="\t"))
+     }
+     close(tbx)
+ })
user system elapsed
0.016 0.005 0.020
Reading larger chunks is even faster:
> system.time(fread(text=scanTabix(fl)[[1]], sep="\t"))
user system elapsed
0.006 0.004 0.009
I did not know that I could do that, and this indeed solves my problem! It sounded like having one parser split by newlines and create all those string objects, only for another parser to do the same thing again, would waste some cycles and memory, but I am not so performance-starved that I cannot survive a minor hit if it keeps the code simple.
Thank you, closing this!
I am having a problem with scanTabix.
I am reading tabix-indexed data that I would like to parse into a data.frame, quickly.
All I expect scanTabix to do is to give me a stream that I can feed into a table parsing function, together with defined columns.
scanTabix instead provides a list of strings. So I have to paste those together to get a parseable format.
That errors out because the paste0 implementation cannot produce a string larger than 2 GB.
Can you please offer a stream option to get data out of scanTabix? You must have a stream under the covers; you just choose to parse it by newlines, which not only does not help me, it actually harms the performance of my code...