fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

Add protolite to benchmarks #134

Open jeroen opened 6 years ago

jeroen commented 6 years ago

Is it possible to add serialize_pb and unserialize_pb from the protolite pkg to the benchmarks?

jeroen commented 6 years ago

Ah maybe I should add support for disk streaming first. It's all in memory now.

MarcusKlik commented 6 years ago

Hi @jeroen, thanks for your request, love all your work!

I could definitely add the protolite benchmarks to the list, that would look more or less like this:


library(protolite)

# wrapper for writing to file
write_prot <- function(df, path) {
  buf <- serialize_pb(df)
  writeBin(buf, path)
}

# wrapper for reading from file
read_prot <- function(path) {
  raw_vec <- readBin(path, "raw", file.size(path))
  unserialize_pb(raw_vec)
}

# sample table
df <- data.frame(X = 1L:10000000L)

# measure write timings
write_timings <- microbenchmark::microbenchmark(
  write_prot(df, "1.buf"),
  times = 1
)

# restart system here (or empty disk buffers)

# measure timings
read_timings <- microbenchmark::microbenchmark(
  read_prot("1.buf"),
  times = 1
)

# write speed (in MB/s)
1000 * as.numeric(object.size(df)) / write_timings$time
#> [1] 22.39255

# read speed (in MB/s)
1000 * as.numeric(object.size(df)) / read_timings$time
#> [1] 55.32596

(note that the speeds are lower than usual because I'm running these measurements through the reprex package).

Would that be in line with your request?

thanks & greetings

jeroen commented 6 years ago

You can actually use serialize_pb directly instead of write_prot because it already has a second optional argument that takes a path. Other than that it looks good.

I'm going to add C++ level reading/writing from files in the next version. That way it uses less memory I think.
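
Taking that suggestion, the write wrapper could collapse to a direct call (a sketch, assuming `serialize_pb`'s optional second argument accepts a file path, as described above):

```r
library(protolite)

# small sample table, mirroring the benchmark above
df <- data.frame(X = 1:100)

# serialize_pb's optional second argument takes a connection/path,
# so the write wrapper from the earlier snippet is not needed
serialize_pb(df, "1.buf")

# reading back still goes through a raw vector
df2 <- unserialize_pb(readBin("1.buf", "raw", file.size("1.buf")))
```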

MarcusKlik commented 6 years ago

Hi @jeroen, that's very interesting, so you convert the R object to a generic structure that can also be read from other languages with wrappers to the protocol-buffers library?

Is that library multi-threaded? Would it be possible to convert the elements of an R object to the protocol-buffers format in parallel? For most of these types in the format that should not interfere with R's main thread I think.

jeroen commented 6 years ago

Yes, the benefit of serializing to protocol buffers is that it is portable. You can easily read the data in any other language. I use it in opencpu as a binary alternative to JSON.

Not sure about threading, I haven't worried too much about performance so far. But I'm curious to see how it compares.

MarcusKlik commented 6 years ago

Nice, if the format is general enough, it might be a nice wrapper for serializing arbitrary list elements in the fst package as well (which is not implemented yet).

I was thinking of marking such elements as a raw blob and just leaving it up to the client what to do with them. But with protocol buffers there would at least be a cross-language format for many types of objects. On the other hand, the native R serialization will probably be faster and therefore more in line with the goals of fst.
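
As a rough sketch of that raw-blob idea using base R's native serialization (just an illustration, not fst code):

```r
# mark an arbitrary list element as an opaque raw blob using
# R's native serialization; the client decides what to do with it
obj <- list(model = lm(dist ~ speed, data = cars), note = "fitted model")
blob <- serialize(obj, connection = NULL)
is.raw(blob)

# the client can restore the element with unserialize()
obj2 <- unserialize(blob)
```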

Anyway, I will post the protolite benchmarks once I have the numbers!

MarcusKlik commented 6 years ago

Hi @jeroen, I've run the same benchmark as was used in the fst README file with protolite. To compare, feather was added (protocol- vs flatbuffers):

[benchmark chart: read and write speeds of protolite vs feather for increasing dataset sizes]

As you can see, protolite becomes faster for larger datasets, but it's slower than feather for reading (and even more so for writing).

The tests were done using a mixed-column data frame:

# the sample generator creates a table with a mixture of common column types
library(data.table)

sample_generator <- function(nr_of_rows) {
  data.table(

    # Logical column with mostly value TRUE's, some FALSE's and few NA's
    Logical = sample(c(TRUE, FALSE, NA), prob = c(0.85, 0.1, 0.05), nr_of_rows, replace = TRUE),

    # Integer column with values between 1 and 100
    Integer = sample(1L:100L, nr_of_rows, replace = TRUE),

    # Real column simulating some prices
    Real = sample(sample(1:10000, 20) / 100, nr_of_rows, replace = TRUE),

    # Factor column
    Factor = as.factor(sample(labels(UScitiesD), nr_of_rows, replace = TRUE))
  )
}

My laptop has a 4-core / 8-thread i7 4710HQ @ 2.5 GHz CPU with a relatively fast SSD (> 1.5 GB/s), so IO is not disk-limited. Both packages compress the data to about 80 percent of the original in-memory size, probably because the logical column is stored efficiently (at least that's what happens in feather).

NrOfRows   Mode  Size (MB)  Time (ms)  FileSize (MB)  Speed (MB/s)  FileSpeed (MB/s)
   1e+05  Write          2     161.46           1.53         12.40             9.47
   1e+05   Read          2     160.86           1.53         12.45             9.51
   2e+05  Write          4     173.46           3.07         23.07            17.68
   2e+05   Read          4     177.81           3.07         22.51            17.25
   1e+06  Write         20     271.04          15.36         73.80            56.68
   1e+06   Read         20     412.25          15.36         48.52            37.26
   2e+06  Write         40     412.90          31.68         96.88            76.73
   2e+06   Read         40     688.86          31.68         58.07            45.99
   1e+07  Write        200    1651.12         162.64        121.13            98.50
   1e+07   Read        200    2374.45         162.64         84.23            68.50
   2e+07  Write        400    3212.84         326.34        124.50           101.58
   2e+07   Read        400    4518.45         326.34         88.53            72.22

The FileSpeed column does not really reflect the disk IO speed, because serializing with protolite and feather is a two-step process: first serialization to memory, then writing to disk (with fst both are done in parallel).
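
The two steps could be timed separately to isolate the serialization cost from the disk write (a sketch, not the exact benchmark code used above):

```r
library(protolite)
library(microbenchmark)

df <- data.frame(X = 1:1000000)

# step 1: serialize to an in-memory raw vector
ser_timing <- microbenchmark(buf <- serialize_pb(df), times = 1)

# step 2: write the raw vector to disk
io_timing <- microbenchmark(writeBin(buf, "1.buf"), times = 1)

# relative cost of serialization versus disk IO
ser_timing$time / io_timing$time
```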

I hope you can use these results to your benefit!

MarcusKlik commented 6 years ago

Hi @jeroen, just one note: if you use compression just before writing to disk, you can compact the file significantly without losing too much speed. Using the sample generator from the previous test:

# serialize, compress, then write to disk
wrap_proto_write_compressed <- function(dt, path) {
  raw_vec <- serialize_pb(dt)
  raw_com <- fst::compress_fst(raw_vec)
  writeBin(raw_com, path)
}

# read from disk, decompress, then unserialize
wrap_proto_read_compressed <- function(path) {
  raw_com <- readBin(path, "raw", file.size(path))
  raw_vec <- fst::decompress_fst(raw_com)
  unserialize_pb(raw_vec)
}

# generate sample and write / read cycle
dt <- sample_generator(1e7)
wrap_proto_write_compressed(dt, "1.pl")
dt2 <- wrap_proto_read_compressed("1.pl")

# 1 / compression ratio
as.numeric(file.size("1.pl") / object.size(dt))
#> [1] 0.280506

So the file is 28 percent of the in-memory size, much smaller than the original 80 percent. For compression, fst::compress_fst() uses a multi-threaded implementation of the ZSTD compressor at a fast setting by default.

The resulting files won't be protocol buffer files obviously, but there might be use-cases where that is less important.
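
For use-cases where file size matters more than speed, compress_fst's compressor and compression arguments can trade speed for ratio (a sketch; the level 70 here is an arbitrary choice for illustration, not a tuned value):

```r
library(fst)
library(protolite)

dt <- data.frame(X = sample(1:100, 1e6, replace = TRUE))
raw_vec <- serialize_pb(dt)

# fast default setting
com_fast <- compress_fst(raw_vec, compressor = "ZSTD")

# higher compression level: smaller result, slower to produce
com_small <- compress_fst(raw_vec, compressor = "ZSTD", compression = 70)

# size of the high-compression result relative to the fast one
length(com_small) / length(com_fast)
```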

all the best!

jeroen commented 6 years ago

Cool, very interesting results, thanks. Protocol Buffers is primarily a format for exchanging binary data structures over a network, rather than for storing large datasets, but it seems to work for that too if needed.

Thanks for pointing me to zstd! I've wrapped brotli a while ago but I was pretty disappointed with the performance. I don't understand why it seems so slow for me in comparison to the benchmarks.

PS: I'm regularly at the Uithof --> ☕?

MarcusKlik commented 6 years ago

Hi @jeroen, Brotli relies heavily on its internal dictionary, perhaps that makes it slower for small vectors? Also, in your post you are using Brotli's slowest setting (11). The slowest setting of ZSTD compresses at a speed of just a few MB/s, so probably even slower than Brotli.

The only use-cases for these settings are networking, or perhaps multi-threaded compression if you have a lot of cores. fst uses LZ4 and ZSTD internally after applying some custom (byte-ordering) filters, but the compress_fst() and decompress_fst() methods use the linked libraries directly (without additional filtering). For speed, the multi-threaded (de-)compression mode first splits the data into batches and adds a special header to the result, so the result is not a pure LZ4 or ZSTD compressed vector (the single-core mode is).
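
That single-core versus multi-threaded difference can be made visible by toggling the thread count with threads_fst() (a sketch; the thread counts are arbitrary):

```r
library(fst)

raw_vec <- serialize(iris, connection = NULL)

# single-core mode: the result is a pure ZSTD compressed vector
threads_fst(1)
com_single <- compress_fst(raw_vec, compressor = "ZSTD")

# multi-threaded mode: data is split into batches and a small
# header is added, so the result is not a plain ZSTD stream
threads_fst(4)
com_multi <- compress_fst(raw_vec, compressor = "ZSTD")

# both variants round-trip through decompress_fst()
identical(decompress_fst(com_single), raw_vec)
identical(decompress_fst(com_multi), raw_vec)
```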

I was thinking of comparing Brotli to the dictionary mode of ZSTD and use one of them for text compression of character columns in a fst file, so perhaps I have more benchmarks to share with you in the near future!

PS: ha, I thought you were permanently based in California now, but good idea, I work less than 5 minutes from the Uithof, I'd be happy to meet up some time (my treat :-))!