Rdatatable / data.table

R's data.table package extends data.frame:
http://r-datatable.com
Mozilla Public License 2.0

fpaste: fwrite output as a character vector #4572

Open mrdwab opened 4 years ago

mrdwab commented 4 years ago

Given the speed of fwrite, it can be used in conjunction with fread as an alternative to do.call(paste, ...) to flatten multiple columns into a character vector. It would be nice to be able to capture the output of fwrite directly as a character vector.

It is much faster than some of the other idiomatic approaches that are often considered.

Here's the behavior I'm hoping to be able to replicate:

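library(data.table)

fpaste <- function(dt, sep = ",") {
  # Round-trip through a temporary file: fwrite joins the columns with `sep`,
  # then fread reads each line back as a single character column (V1).
  x <- tempfile()
  on.exit(unlink(x))  # remove the temporary file when the function exits
  fwrite(dt, file = x, sep = sep, col.names = FALSE)
  fread(x, sep = "\n", header = FALSE)
}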

d <- data.frame(a = 1:3, b = c('a','b','c'), c = c('d','e','f'), d = c('g','h','i')) 
cols = c("b", "c", "d")

fpaste(d[cols], "-")
#       V1
# 1: a-d-g
# 2: b-e-h
# 3: c-f-i
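For reference, the do.call(paste, ...) idiom mentioned above can be sketched in base R as follows. This is only an illustration on the same toy data; `baseline` is a name introduced here, and the expectation is that fpaste(d[cols], "-")$V1 returns the same vector.

```r
# Same toy data as in the example above.
d <- data.frame(a = 1:3, b = c("a", "b", "c"), c = c("d", "e", "f"),
                d = c("g", "h", "i"))
cols <- c("b", "c", "d")

# do.call() passes each selected column to paste() as a separate argument,
# so the rows are joined element-wise with the chosen separator.
baseline <- do.call(paste, c(d[cols], sep = "-"))
print(baseline)
# [1] "a-d-g" "b-e-h" "c-f-i"
```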

Here's a comparison with a straightforward paste in a data.table:

set.seed(1) 
d2 <- d[sample(1:3,1e6,TRUE),]
d3 <- as.data.table(d2)

bench::mark(fpaste(d2[cols], "-")$V1, d3[, paste(b, c, d, sep = "-")])
## # A tibble: 2 x 13
##   expression                          min  median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time
##   <bch:expr>                      <bch:t> <bch:t>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm>
## 1 fpaste(d2[cols], "-")$V1         90.2ms  93.8ms     10.8     8.41MB     3.60     3     1      278ms
## 2 d3[, paste(b, c, d, sep = "-")] 220.9ms 223.2ms      4.43   30.55MB     0        3     0      678ms
## # … with 4 more variables: result <list>, memory <list>, time <list>, gc <list>

MichaelChirico commented 4 years ago

For the record, I tried using capture.output instead of disk I/O and it's way, way slower (I gave up running the benchmark).

mrdwab commented 4 years ago

@MichaelChirico I had forgotten to mention in my original post that I tried with capture.output and also gave up, and then tried R.utils::captureOutput, which performed much better but was still slower than fpaste.

ColeMiller1 commented 4 years ago

This sounds impressive. I do not understand how writing to file is faster than manipulating it in RAM. What is happening? Why is capture.output discussed?

jangorecki commented 4 years ago

You can still manipulate it in RAM with fwrite and fread if you use a tempfile with tempdir set to a ramdisk (search NEWS.md for "ramdisk"). I assume that capture.output is discussed because fwrite can print to the console.
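A minimal sketch of the ramdisk idea, assuming a Linux-style RAM-backed directory at /dev/shm. That path and the helper names `ram_tmpdir` and `fpaste_ram` are assumptions made for illustration, not part of data.table. The helper falls back to the regular tempdir() when no such directory is writable.

```r
library(data.table)

# Hypothetical helper: use a RAM-backed directory when one is available.
# "/dev/shm" is an assumption that holds on many Linux systems only.
ram_tmpdir <- function(candidate = "/dev/shm") {
  usable <- dir.exists(candidate) && file.access(candidate, mode = 2) == 0
  if (usable) candidate else tempdir()
}

# Same round trip as fpaste, but the temporary file lives in `tmpdir`.
fpaste_ram <- function(dt, sep = ",", tmpdir = ram_tmpdir()) {
  x <- tempfile(tmpdir = tmpdir)
  on.exit(unlink(x), add = TRUE)  # clean up the temporary file
  fwrite(dt, file = x, sep = sep, col.names = FALSE)
  fread(x, sep = "\n", header = FALSE)
}

d <- data.frame(b = c("a", "b", "c"), c = c("d", "e", "f"))
res <- fpaste_ram(d, "-")$V1
print(res)
```

Setting the TMPDIR environment variable before starting R is another way to move tempdir() itself; the explicit tmpdir argument is used here only so the sketch does not depend on how the session was launched.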

mrdwab commented 4 years ago

@ColeMiller1 My initial thought was to just use fwrite with file = "" and use fread on that. But that just prints the output. capture.output could be used to convert that into a string, but it's really slow.

Using the relevant parts of R.utils::captureOutput I tried:

fpaste2 <- function(dt, sep = ",", envir = parent.frame()) {
  eval({
    file <- rawConnection(raw(0L), open = "w")
    on.exit({
      if (!is.null(file)) close(file)
    })
    capture.output(fwrite(dt, sep = sep, col.names = FALSE), file = file)
    fread(rawToChar(rawConnectionValue(file)), sep = "\n", header = FALSE)
  }, envir = envir, enclos = envir)
}

This performs well. It's at least as fast as, if not faster than, do.call(stringi::stri_join, c(d2[cols], sep = "-")), but not as fast as writing to file and re-reading it.

jangorecki commented 3 years ago

This will not work for sep="" because fwrite expects a separator of non-zero length.

jangorecki commented 3 years ago

My use case for sep="" is to mimic paste0("id",1:1e9). That paste0 call alone takes 40 minutes to evaluate, most probably because of R's global string cache. If I could do fwrite(data.frame(a="id",b=1:1e9), sep=""), I could potentially save those 40 minutes. I actually need to write the result to CSV rather than to the console, so populating R's global cache just to dump it to CSV is really inefficient.
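Until something like that is possible, one base R stopgap is to write the ids in chunks. The sketch below is not benchmarked and does not avoid creating R strings; it only bounds how many need to be held at once. `write_ids_chunked` and its parameters are hypothetical names introduced for illustration.

```r
# Hypothetical helper: write prefix1, prefix2, ..., prefixN to `path`,
# one id per line, building at most `chunk_size` strings at a time.
write_ids_chunked <- function(path, n, prefix = "id", chunk_size = 1e6) {
  con <- file(path, open = "w")
  on.exit(close(con))
  start <- 1
  while (start <= n) {
    end <- min(start + chunk_size - 1, n)
    # start:end yields integers here, so large ids are not formatted as 1e+05.
    writeLines(paste0(prefix, start:end), con)
    start <- end + 1
  }
  invisible(path)
}

# Small demonstration; the real use case would pass n = 1e9.
out <- tempfile(fileext = ".csv")
write_ids_chunked(out, n = 25, chunk_size = 10)
ids <- readLines(out)
print(head(ids, 3))
# [1] "id1" "id2" "id3"
```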

msummersgill commented 3 years ago

Just now seeing this today, but I think there certainly is an opportunity to improve vectorized string concatenation performance with an fpaste() function.

Back in 2018, I had a use case where this was the bottleneck in a data pipeline. I posted to Stack Overflow, https://stackoverflow.com/questions/48233309/fast-concatenation-of-data-table-columns-into-one-string-column , and in the course of investigating, I was surprised to find the same thing others described here - it was faster to fwrite the dataset to disk, use sed to perform the concatenation, and fread to pull it back into memory.

One of the answers, by Martin Modrák, proposed repurposing some of the code from /src/fwrite.c, and it ran 8x faster than the previous best - an optimized sprintf call. From there, I put that code into a single-function package - fastConcat - that we still use in production at my employer. https://github.com/msummersgill/fastConcat

fastConcat::concat() only supports single digit integers (the use case was highly specific), but it is a working proof of concept that the code in /src/fwrite.c could probably be re-purposed to create a data.table::fpaste() with performance at least an order of magnitude better than base::paste().