Bioconductor / Rsamtools

Binary alignment (BAM), FASTA, variant call (BCF), and tabix file import
https://bioconductor.org/packages/Rsamtools

scanTabix should return a stream #7

Closed by romanzenka 5 years ago

romanzenka commented 5 years ago

I am having a problem with scanTabix.

I am reading tabix-indexed data that I would like to quickly parse into a data.frame.

All I expect scanTabix to do is to give me a stream that I can feed into a table parsing function, together with defined columns.

Instead, scanTabix provides a list of strings, so I have to paste those together to get a parseable format.

That errors out because paste0() cannot produce a single string larger than 2 GB.

Can you please offer a stream option for getting data out of scanTabix? You must have a stream under the covers; you just choose to split it by newlines, which not only does not help me, it actually hurts the performance of my code...

mtmorgan commented 5 years ago

Can you illustrate what you are trying to do, using, for instance, the data that comes with the package and is used in the example on ?TabixFile?

romanzenka commented 5 years ago

This is just an example, but it is something like this:

library(Rsamtools)
library(data.table)

fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools",
                  mustWork=TRUE)
tbx <- TabixFile(fl)

param <- GRanges("chr1", IRanges(1, width=100000))
res <- scanTabix(tbx, param=param)   # list of character vectors, one per range
## collapse the lines into one big string so fread() can parse it;
## this is the step that fails once the string exceeds 2 GB
resAsDataFrame <- fread(paste0(res[[1]], collapse="\n"), fill=TRUE, sep="\t")
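
For context, scanTabix() returns a named list with one character vector of raw tab-delimited lines per query range, which is easy to see on the result above:

str(res)            # named list, one character vector per range in 'param'
length(res[[1]])    # number of records overlapping the first range
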
romanzenka commented 5 years ago

Sorry, pushed wrong button!

mtmorgan commented 5 years ago

There are two solutions available. The first is to iterate through the file in chunks, using the yieldSize= argument to TabixFile().

tbx <- open(TabixFile(fl, yieldSize=100))
while (length(chunk <- scanTabix(tbx)[[1]])) {
    message(".")
    ## do work on this chunk
}
close(tbx)

The second is to provide ranges that tile across the file, perhaps splitting into convenient groups (e.g., by chromosome) for memory management.

param <- GRanges(sprintf("chr%d:1-100000", 1:2))

tbx <- open(TabixFile(fl))
lapply(split(param, seqnames(param)), function(p) {
    message(".")
    res <- scanTabix(tbx, param=p)
    ## do work on this GRange, e.g.,
    res <- do.call(paste, c(res, list(collapse="\n")))
    dim(read.delim(text=res, header=FALSE))
})
close(tbx)

Using paste() is one alternative; another is

read.delim(textConnection(unlist(res)), header = FALSE)

Using a text connection seems to be faster and more scalable.
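
For example, with the res object from the snippet above, that looks roughly like this (a sketch; for very large extracts the chunk-wise or per-chromosome approach above is still advisable):

lines <- unlist(res, use.names = FALSE)   # flatten the per-range vectors
con <- textConnection(lines)              # no single > 2 GB string is built
tab <- read.delim(con, header = FALSE, sep = "\t")
close(con)
dim(tab)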

romanzenka commented 5 years ago

Okay, I will look into read.delim(textConnection()), but I have very large amounts of data to read, so any extra steps I can avoid would help my performance. A lot of functions that parse things (including read.delim) accept a stream, which seems like a perfect match here.

Are you opposed to the notion of streams for some reason beyond the expense of implementing them? If not, I might look into whether I can help with the implementation and submit a pull request.

mtmorgan commented 5 years ago

Note that fread(text = unlist(res, use.names=FALSE), sep="\t", header=FALSE) also works, without pasting or using a textConnection.
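
Put together with the example file, that is roughly the following (a sketch; column names and types depend on the file being read):

library(Rsamtools)
library(data.table)

fl <- system.file("extdata", "example.gtf.gz", package="Rsamtools",
                  mustWork=TRUE)
param <- GRanges("chr1", IRanges(1, width=100000))
res <- scanTabix(TabixFile(fl), param=param)

## fread() accepts a character vector of lines via text=, so neither
## paste() nor a textConnection is needed
dt <- fread(text=unlist(res, use.names=FALSE), sep="\t", header=FALSE)
dim(dt)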

I'm not really sure why (and this is obviously not the right test dataset), but

> system.time(fread(fl))
   user  system elapsed 
  0.048   0.004   0.053 

is only about half as fast as chunk-wise iteration:

> system.time({
+ tbx <- open(TabixFile(fl, yieldSize=100))
+ tbl <- NULL
+ while (length(chunk <- scanTabix(tbx)[[1]])) {
+     tbl <- rbind(tbl, fread(text=chunk, sep="\t"))
+ }
+ close(tbx)
+ })
   user  system elapsed 
  0.016   0.005   0.020 

Reading larger chunks is even faster

> system.time(fread(text=scanTabix(fl)[[1]], sep="\t"))
   user  system elapsed 
  0.006   0.004   0.009 
romanzenka commented 5 years ago

I did not know I could do that, and it indeed solves my problem! It sounded like going through one parser that splits by newlines and creates all those string objects, only for another parser to do the same work again, would waste some cycles and memory, but I am not so performance-starved that I cannot survive a minor hit if it keeps the code simple.

Thank you, closing this!