mrdwab opened this issue 4 years ago
for the record I tried using `capture.output` instead of disk I/O and it's way, way slower (I gave up running the benchmark)
@MichaelChirico I had forgotten to mention in my original post that I tried with `capture.output` and also gave up, and then tried with `R.utils::captureOutput`, which performed much better, but still slower than `fpaste`.
This sounds impressive. I do not understand how writing to file is faster than manipulating it in RAM. What is happening? Why is `capture.output` discussed?
You can still manipulate it in RAM with `fwrite` and `fread` if you use `tempfile()` with `tempdir` pointing at a ramdisk (search NEWS.md for "ramdisk"). I assume that `capture.output` is discussed because `fwrite` can print to the console.
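A minimal sketch of that tempfile round-trip (the helper name `paste_via_file` is hypothetical; to keep the I/O in RAM as described above, point `TMPDIR` at a ramdisk before starting R):

```r
library(data.table)

# Hypothetical helper: concatenate the columns of a data.table row-wise by
# writing to a temp file and reading the lines back as a character vector.
paste_via_file <- function(dt, sep = ",") {
  tmp <- tempfile(fileext = ".csv")
  on.exit(unlink(tmp))
  fwrite(dt, tmp, sep = sep, col.names = FALSE)
  readLines(tmp)
}

dt <- data.table(a = c("x", "y"), b = 1:2)
paste_via_file(dt, sep = "-")
# c("x-1", "y-2")
```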
@ColeMiller1 My initial thought was to just use `fwrite` with `file = ""` and use `fread` on that. But that just prints the output. `capture.output` could be used to convert that into a string, but it's really slow.
Using the relevant parts of `R.utils::captureOutput`, I tried:
```r
fpaste2 <- function(dt, sep = ",", envir = parent.frame()) {
  eval({
    file <- rawConnection(raw(0L), open = "w")
    on.exit({
      if (!is.null(file)) close(file)
    })
    capture.output(fwrite(dt, sep = sep, col.names = FALSE), file = file)
    fread(rawToChar(rawConnectionValue(file)), sep = "\n", header = FALSE)
  }, envir = envir, enclos = envir)
}
```
This performs well. It's at least as fast, if not faster, than `do.call(stringi::stri_join, c(d2[cols], sep = "-"))`, but not as fast as writing to file and re-reading it.
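For reference, a small standalone demonstration of `fpaste2` (the definition is repeated here so the snippet runs on its own; the result is a one-column data.table, as implied by the `fread` call):

```r
library(data.table)

# Definition repeated from above so this snippet is self-contained.
fpaste2 <- function(dt, sep = ",", envir = parent.frame()) {
  eval({
    file <- rawConnection(raw(0L), open = "w")
    on.exit(if (!is.null(file)) close(file))
    # fwrite with no file argument prints to the console; capture.output
    # redirects that into the in-memory raw connection.
    capture.output(fwrite(dt, sep = sep, col.names = FALSE), file = file)
    fread(rawToChar(rawConnectionValue(file)), sep = "\n", header = FALSE)
  }, envir = envir, enclos = envir)
}

dt <- data.table(a = c("x", "y"), b = 1:2)
fpaste2(dt, sep = "-")$V1
# c("x-1", "y-2")
```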
This will not work for `sep=""` because `fwrite` expects a non-empty character separator. My use case for `sep=""` is to mimic `paste0("id", 1:1e9)`.
Just this `paste0` command alone takes 40 minutes to evaluate, most probably due to R's global string cache. If I could do `fwrite(data.frame(a="id", b=1:1e9), sep="")` then I could potentially save 40 minutes. I actually need to write it to a CSV rather than to the console, so populating R's global string cache just to dump it to a CSV is really inefficient.
Just now seeing this today, but I think there is certainly an opportunity to improve vectorized string concatenation performance with an `fpaste()` function.
Back in 2018, I had a use case where this was the bottleneck in a data pipeline. I posted to Stack Overflow, https://stackoverflow.com/questions/48233309/fast-concatenation-of-data-table-columns-into-one-string-column, and in the course of investigating, I was surprised to find the same thing others described here: it was faster to `fwrite` the dataset to disk, use `sed` to perform the concatenation, and `fread` it back into memory.
One of the answers, by Martin Modrák, proposed repurposing some of the code from /src/fwrite.c; it ran 8x faster than the previous best, an optimized `sprintf` call. From there, I put that code into a single-function package, fastConcat, that we still use in production at my employer: https://github.com/msummersgill/fastConcat
`fastConcat::concat()` only supports single-digit integers (the use case was highly specific), but it is a working proof of concept that the code in /src/fwrite.c could probably be repurposed to create a `data.table::fpaste()` with performance at least an order of magnitude better than `base::paste()`.
Given the speed of `fwrite`, it can be used in conjunction with `fread` as an alternative to `do.call(paste, ...)` to flatten multiple columns into a character vector. It would be nice to be able to capture the output of `fwrite` directly as a character vector. It is much faster than some of the other idiomatic approaches that are often considered.
Here's the behavior I'm hoping to be able to replicate:
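The original snippet is not reproduced above; presumably the flattening in question is along these lines (a sketch with made-up data, using the `do.call(paste, ...)` idiom mentioned earlier):

```r
library(data.table)

dt <- data.table(id = c("a", "b"), x = 1:2, y = 3:4)

# Flatten all columns into a single character vector with base paste():
# c(dt, sep = ",") builds the list of columns plus the sep argument.
do.call(paste, c(dt, sep = ","))
# c("a,1,3", "b,2,4")
```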
Here's a comparison with a straightforward `paste` in a `data.table`:
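The comparison itself is not shown above; a sketch of how such a timing might be set up (the data size is arbitrary, and `system.time` stands in for whatever benchmarking tool was originally used):

```r
library(data.table)

n  <- 1e6L
dt <- data.table(a = sample(letters, n, replace = TRUE), b = sample(n))
tmp <- tempfile(fileext = ".csv")

# Straightforward paste within the data.table:
system.time(r1 <- dt[, paste(a, b, sep = "-")])

# fwrite/fread round-trip through a file:
system.time({
  fwrite(dt, tmp, sep = "-", col.names = FALSE)
  r2 <- fread(tmp, sep = "\n", header = FALSE)$V1
})

identical(r1, r2)  # TRUE
unlink(tmp)
```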