jeroen opened 6 years ago
Is it possible to add serialize_pb and unserialize_pb from the protolite pkg to the benchmarks?
Ah, maybe I should add support for disk streaming first. It's all in memory now.
Hi @jeroen, thanks for your request, love all your work! I could definitely add the protolite benchmarks to the list, that would more or less look like:
library(protolite)
# wrapper for writing to file
write_prot <- function(df, path) {
buf <- serialize_pb(df)
writeBin(buf, path)
}
# wrapper for reading from file
read_prot <- function(path) {
raw_vec <- readBin(path, "raw", file.size(path))
unserialize_pb(raw_vec)
}
# sample table
df <- data.frame(X = 1L:10000000L)
# measure write timings
write_timings <- microbenchmark::microbenchmark(
write_prot(df, "1.buf"),
times = 1
)
# restart system here (or empty disk buffers)
# measure timings
read_timings <- microbenchmark::microbenchmark(
read_prot("1.buf"),
times = 1
)
# write speed (in MB/s): object.size() returns bytes, microbenchmark
# reports nanoseconds, and 1 byte/ns equals 1000 MB/s
1000 * as.numeric(object.size(df)) / write_timings$time
#> [1] 22.39255
# read speed (in MB/s)
1000 * as.numeric(object.size(df)) / read_timings$time
#> [1] 55.32596
(Note that the speeds are lower because I'm using the reprex package to calculate these speeds.)
Would that be in line with your request?
Thanks & greetings
You can actually use serialize_pb directly instead of write_prot because it already has a second optional argument that takes a path. Other than that it looks good.
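A minimal sketch of that shortcut (assuming the optional second argument accepts a file path, as described above):
library(protolite)
# write straight to disk, skipping the intermediate raw vector
serialize_pb(df, "1.buf")
# reading still goes through a raw vector
df2 <- unserialize_pb(readBin("1.buf", "raw", file.size("1.buf")))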
I'm going to add C++ level reading/writing from files in the next version. That way it uses less memory I think.
Hi @jeroen, that's very interesting, so you convert the R object to a generic structure that can also be read from other languages with wrappers for the protocol-buffers library?
Is that library multi-threaded? Would it be possible to convert the elements of an R object to the protocol-buffers format in parallel? For most of the types in the format that should not interfere with R's main thread, I think.
Yes, the benefit of serializing to protocol buffers is that it is portable: you can easily read the data in any other language. I use it in opencpu as a binary alternative to JSON.
Not sure about threading, I haven't worried too much about performance so far. But I'm curious to see how it compares.
Nice, if the format is general enough, it might be a nice wrapper for serializing arbitrary list elements in the fst package as well (which is not implemented yet).
I was thinking of marking such elements as a raw blob and just leaving it up to the client what to do with them. But with protocol buffers there would at least be a cross-language format for many types of objects. On the other hand, the native R serialization will probably be faster and therefore more in line with the goals of fst.
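A rough sketch of the raw-blob idea using base R serialization (hypothetical, nothing like this exists in fst yet):
# any list element can be flattened into a raw blob...
blob <- serialize(list(a = 1:3, b = "text"), connection = NULL)
# ...stored as-is, and later restored by a client that knows it's R data
obj <- unserialize(blob)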
Anyway, I will post the protolite benchmarks once I have the numbers!
Hi @jeroen, I've run the same benchmark as was used in the fst README file with protolite. To compare, feather was added (protocol buffers vs. flatbuffers).
As the results below show, protolite becomes faster for larger datasets, but it's slower than feather for reading (and more so for writing).
The tests were done using a mixed-column data frame:
library(data.table)

# the sample generator generates a table with a mixture of common column types
sample_generator <- function(nr_of_rows) {
  data.table(
    # Logical column with mostly TRUE's, some FALSE's and a few NA's
    Logical = sample(c(TRUE, FALSE, NA), nr_of_rows, replace = TRUE, prob = c(0.85, 0.1, 0.05)),
    # Integer column with values between 1 and 100
    Integer = sample(1L:100L, nr_of_rows, replace = TRUE),
    # Real column simulating some prices
    Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),
    # Factor column with city names
    Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
  )
}
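A hedged sketch of a single measurement (the exact harness from the fst README may differ), reusing the write_prot/read_prot wrappers from above:
dt <- sample_generator(1e6)
write_timing <- microbenchmark::microbenchmark(write_prot(dt, "1.buf"), times = 1)
read_timing <- microbenchmark::microbenchmark(read_prot("1.buf"), times = 1)
# speed in MB/s, as before: bytes / nanoseconds * 1000
1000 * as.numeric(object.size(dt)) / write_timing$time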
My laptop has a 4-core/8-thread i7-4710HQ @ 2.5 GHz CPU with a relatively fast SSD (> 1.5 GB/s), so IO is not disk-limited. Both packages compress the data to about 80 percent of the original in-memory size, probably because the logical column is stored efficiently (at least that's what happens in feather).
NrOfRows | Mode | Size (MB) | Time (ms) | FileSize (MB) | Speed (MB/s) | FileSpeed (MB/s)
---|---|---|---|---|---|---
1e+05 | Write | 2 | 161.46 | 1.53 | 12.40 | 9.47
1e+05 | Read | 2 | 160.86 | 1.53 | 12.45 | 9.51
2e+05 | Write | 4 | 173.46 | 3.07 | 23.07 | 17.68
2e+05 | Read | 4 | 177.81 | 3.07 | 22.51 | 17.25
1e+06 | Write | 20 | 271.04 | 15.36 | 73.80 | 56.68
1e+06 | Read | 20 | 412.25 | 15.36 | 48.52 | 37.26
2e+06 | Write | 40 | 412.90 | 31.68 | 96.88 | 76.73
2e+06 | Read | 40 | 688.86 | 31.68 | 58.07 | 45.99
1e+07 | Write | 200 | 1651.12 | 162.64 | 121.13 | 98.50
1e+07 | Read | 200 | 2374.45 | 162.64 | 84.23 | 68.50
2e+07 | Write | 400 | 3212.84 | 326.34 | 124.50 | 101.58
2e+07 | Read | 400 | 4518.45 | 326.34 | 88.53 | 72.22
The FileSpeed column does not really reflect the disk IO speed, because serializing with protolite and feather is a two-step process: first serialization, then writing to disk (with fst both are done in parallel).
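For contrast, with fst the whole pipeline sits behind a single call, so serialization and disk IO can overlap:
dt <- sample_generator(1e6)
fst::write_fst(dt, "1.fst")
dt2 <- fst::read_fst("1.fst", as.data.table = TRUE)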
I hope you can use these results to your benefit!
Hi @jeroen, just one note: if you use compression just before writing to disk, you can compact the file significantly without losing too much speed. Using the sample generator from the previous test:
# compress the serialized data just before writing to disk
wrap_proto_write_compressed <- function(dt, path) {
  raw_vec <- serialize_pb(dt)            # serialize to a raw vector
  raw_com <- fst::compress_fst(raw_vec)  # compress the raw vector
  writeBin(raw_com, path)
}
# decompress after reading back from disk
wrap_proto_read_compressed <- function(path) {
  raw_com <- readBin(path, "raw", file.size(path))
  raw_vec <- fst::decompress_fst(raw_com)
  unserialize_pb(raw_vec)
}
# generate sample and write / read cycle
dt <- sample_generator(1e7)
wrap_proto_write_compressed(dt, "1.pl")
dt2 <- wrap_proto_read_compressed("1.pl")
# 1 / compression ratio
as.numeric(file.size("1.pl") / object.size(dt))
#> [1] 0.280506
So the file is 28 percent of the in-memory size, much smaller than the original 80 percent. For compression, fst::compress_fst() uses a multi-threaded implementation of the ZSTD compressor at a fast setting as its default.
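The compression level can also be raised for denser files at the cost of speed; a hedged sketch (the compression argument ranges 0-100 per the fst docs, the actual trade-off would need measuring):
raw_vec <- serialize_pb(dt)
# slower but smaller than the default fast setting
raw_com <- fst::compress_fst(raw_vec, compressor = "ZSTD", compression = 50)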
The resulting files won't be protocol buffer files obviously, but there might be use-cases where that is less important.
all the best!
Cool, very interesting results, thanks. Protocol Buffers is primarily a format for binary data-structure exchange in networking, rather than storage of large data, but it seems to work for that if needed.
Thanks for pointing me to zstd! I wrapped brotli a while ago but I was pretty disappointed with the performance. I don't understand why it seems so slow for me in comparison to the benchmarks.
PS: I'm regularly at the Uithof --> ☕?
Hi @jeroen, Brotli relies heavily on its internal dictionary, perhaps that makes it slower for small vectors? Also, in your post, you are using the slowest setting of Brotli (11). The slowest setting of ZSTD compresses at a speed of just a few MB/s, so probably even slower than Brotli.
The only use-cases for these settings are networking or perhaps multi-threaded compression if you have a lot of cores. fst uses LZ4 and ZSTD internally after applying some custom (byte-ordering) filters, but the compress_fst() and decompress_fst() methods use the linked libraries directly (without additional filtering). For speed, the multi-threaded (de-)compression mode first splits the data in batches and adds a special header to the result, so the result is not a pure LZ4 or ZSTD compressed vector (the single-core mode is).
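A hedged illustration of that last point, assuming fst::threads_fst() controls the thread count (that the single-threaded result is a plain ZSTD stream follows from the comment above, I haven't verified the bytes):
fst::threads_fst(1)
# with a single thread, no batch header is added to the compressed vector
pure_zstd <- fst::compress_fst(serialize_pb(dt), compressor = "ZSTD")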
I was thinking of comparing Brotli to the dictionary mode of ZSTD and using one of them for text compression of character columns in a fst file, so perhaps I have more benchmarks to share with you in the near future!
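A rough sketch of such a comparison (assuming the brotli package's brotli_compress() with its quality argument; ZSTD's dictionary mode is not exposed here, so this only contrasts the plain settings):
txt <- charToRaw(paste(sample(labels(UScitiesD), 1e5, replace = TRUE), collapse = "\n"))
# compressed size as a fraction of the input, lower is better
length(brotli::brotli_compress(txt, quality = 5)) / length(txt)
length(fst::compress_fst(txt, compressor = "ZSTD")) / length(txt)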
PS: ha, I thought you were permanently based in California now, but good idea, I work less than 5 minutes from the Uithof, happy to meet up sometime (my treat :-))!