fstpackage / fst

Lightning Fast Serialization of Data Frames for R
http://www.fstpackage.org/fst/
GNU Affero General Public License v3.0

API for raw-vector outputs #171

Closed: dipterix closed this issue 6 years ago

dipterix commented 6 years ago

Is there any method that maps R objects to raw vectors in memory instead of writing them to a file? I was wondering if I can avoid using the file system.

Some of the background here:

Recently there is an R package, plumber, that provides RESTful web APIs for R-powered servers. It uses JSON by default to serialize data. However, this method is too slow when the data size exceeds 50 MB.

I want to be able to serialize data (~400 MB) within 5 seconds. The fst package really helps with fast serialization. However, when I moved my server from SSD to HDD, the speed dropped because of the hard drive's throughput limit.

Is there any way to create in-memory raw vectors and map the data.table to that vector?
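
For illustration, the only route I see right now takes a detour through the file system, along these lines (just a sketch):

# current workaround (sketch): write_fst() only accepts a file path, so
# getting an in-memory fst image means a round trip through a temp file
tmp <- tempfile(fileext = ".fst")
fst::write_fst(data.frame(X = 1:10), tmp)
raw_vec <- readBin(tmp, "raw", n = file.size(tmp))  # fst bytes as a raw vector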

Thanks!

MarcusKlik commented 6 years ago

Hi @dipterix, that's a very interesting question!

Currently, fst can only serialize to and from files, but writing to and reading from a contiguous block of memory would definitely be a nice feature to have, and it would be useful for other use cases as well. For example, two applications that implement the fst API would be able to use shared memory to transfer data between each other (e.g. from R to Python), which would require less memory. Applications could also 'share' objects in such a scenario.

It would require substituting all C++ ofstream objects in the fstcore library with a more general stream type that can write to RAM as well. That would definitely make fstcore much more flexible as a library.

So if I understand correctly, you want to materialize a data.table from a raw vector that was transferred via the plumber API, and that raw vector holds data identical to a (maximally) compressed fst file?

The advantage of using the fst format to serialize data, compared to other formats (such as R's native rds format), would be that the stream could be accessed at random and would have a better speed-to-compression ratio (just as with fst files); see the sketch below. But isn't the network bandwidth of your server the real speed bottleneck?
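
To illustrate the random access point with the file API that exists today (a hypothetical in-memory stream would keep this property; the file name is made up):

# fst files can be read partially, by column and by row range
fst::write_fst(data.frame(X = sample(1:100, 1e7, replace = TRUE)), "df.fst")
part <- fst::read_fst("df.fst", from = 5000, to = 6000)  # reads only rows 5000 to 6000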

dipterix commented 6 years ago

Think about the following scenario: I have a data server that requires secure connections, and several logic servers connect to the data server via an intranet. Bandwidth is hence no problem. However, since my data is generated dynamically and is usually large, using a database or any file system would be slow; all that matters is the speed of transferring data from the data server to the logic servers. Internet users only access outputs from the logic servers (usually small in size).

Or to be brief: I want a data server that works like a database, but in R.

I understand that the fst package needs complete access to the files for random access and parallel writing. However, if we split the data into several chunks and serialize each chunk separately, we can stream the data in a nearly sequential way, along the lines of the sketch below.
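
A minimal sketch of that chunking idea, assuming a data.table dt on the data server and a hypothetical send() function that pushes one raw vector over the wire:

# split dt into row chunks and serialize each chunk to its own raw vector
chunk_rows <- 1e6
starts <- seq(1, nrow(dt), by = chunk_rows)
for (s in starts) {
  chunk <- dt[s:min(s + chunk_rows - 1, nrow(dt)), ]
  send(serialize(chunk, NULL))  # hypothetical transport, e.g. an HTTP POST body
}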

Right now I'm developing a project which requires loading, processing and transferring data of several GBs from one central data server to different module servers. Think of video streaming, but with interactive videos.

MarcusKlik commented 6 years ago

Hi @dipterix, thanks for clarifying your setup. I think that in your case a fast serializer is the most important component:

# server generates any object
obj <- list(100, data.frame(X = sample(1:100, 1e7, replace = TRUE)), "anything")

# serialize
serialized_obj_server <- serialize(obj, NULL)

# high speed compression using multithreaded ZSTD at the lowest setting
compressed_obj_server <- fst::compress_fst(serialized_obj_server, "ZSTD", 0)

#
# send the serialized data to the logic server here
#

# decompress
serialized_obj_client <- fst::decompress_fst(compressed_obj_server)

# verify equality
testthat::expect_equal(serialized_obj_server, serialized_obj_client)

# deserialize
obj_client <- unserialize(serialized_obj_client)

# verify equality
testthat::expect_equal(obj, obj_client)

With this scheme you can take advantage of the multithreaded compression that fst offers (with the ZSTD and LZ4 compressors), while still being able to send any type of (generated) object as needed by the client.

And the performance:


bench <- function(statement) {
  # time a single evaluation of the (lazily evaluated) expression
  microbenchmark::microbenchmark(
    statement,
    times = 1
  )
}

bench(obj <- list(100, data.frame(X = sample(1:100, 250e6, replace = TRUE)), "anything"))
#> Unit: seconds
#>       expr      min       lq     mean   median       uq      max neval
#>  statement 6.290586 6.290586 6.290586 6.290586 6.290586 6.290586     1

bench(serialized_obj_server <- serialize(obj, NULL))
#> Unit: seconds
#>       expr     min      lq    mean  median      uq     max neval
#>  statement 5.25887 5.25887 5.25887 5.25887 5.25887 5.25887     1

bench(compressed_obj_server <- fst::compress_fst(serialized_obj_server, "ZSTD", 0))
#> Unit: seconds
#>       expr      min       lq     mean   median       uq      max neval
#>  statement 3.355347 3.355347 3.355347 3.355347 3.355347 3.355347     1

bench(serialized_obj_client <- fst::decompress_fst(compressed_obj_server))
#> Unit: milliseconds
#>       expr      min       lq     mean   median       uq      max neval
#>  statement 676.5218 676.5218 676.5218 676.5218 676.5218 676.5218     1

bench(obj_client <- unserialize(serialized_obj_client))
#> Unit: seconds
#>       expr      min       lq     mean   median       uq      max neval
#>  statement 1.417588 1.417588 1.417588 1.417588 1.417588 1.417588     1

As you can see, serialization of this roughly 1 GB object runs at around 200 MB/s and compression (at this setting) at around 300 MB/s. Decompression and deserialization are a lot faster.

These speeds can be lower than the speeds measured for write_fst() and read_fst(), because compress_fst() has no type-specific compression algorithms; it just uses the default ZSTD compressor (or LZ4). But you have the network speed as your bottleneck anyway (1 Gb/s?), so in this case it's probably best to allow all types of objects to be serialized, to offer a more general API on your data server...
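
For reference, the file-based path can be timed with the same bench() helper; a sketch (timings omitted here) that writes the data.frame component of obj:

# time the type-optimized file API on the data.frame inside obj
path <- tempfile(fileext = ".fst")
bench(fst::write_fst(obj[[2]], path, compress = 0))
bench(df_back <- fst::read_fst(path))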

I'm curious to hear what you think!

MarcusKlik commented 6 years ago

By comparison, it's interesting to see the timings for R's built-in compressors using the same setup:

# use an object 10x smaller (or wait forever)
obj_small <- serialize(list(100,
   data.frame(X = sample(1:100, 25e6, replace = TRUE)), "anything"), NULL)

bench(obj_gzip <- memCompress(obj_small, "gzip"))
#> Unit: seconds
#>       expr      min       lq     mean   median       uq      max neval
#>  statement 17.91412 17.91412 17.91412 17.91412 17.91412 17.91412     1

bench(obj_bzip2 <- memCompress(obj_small, "bzip2"))
#> Unit: seconds
#>       expr      min       lq     mean   median       uq      max neval
#>  statement 7.601339 7.601339 7.601339 7.601339 7.601339 7.601339     1

bench(obj_xz <- memCompress(obj_small, "xz"))
#> Unit: seconds
#>       expr      min       lq     mean   median       uq      max neval
#>  statement 127.9709 127.9709 127.9709 127.9709 127.9709 127.9709     1

and the resulting compression ratios:

# compression ratio fst
as.numeric(object.size(obj) / object.size(compressed_obj_server))
#> [1] 3.491831

# compression ratio gzip
as.numeric(object.size(obj_small) / object.size(obj_gzip))
#> [1] 3.124468

# compression ratio bzip2
as.numeric(object.size(obj_small) / object.size(obj_bzip2))
#> [1] 4.767889

# compression ratio xz
as.numeric(object.size(obj_small) / object.size(obj_xz))
#> [1] 4.592449

So while the compression ratios of bzip2 and xz are higher, the speeds are much lower and impractical for your purposes: from about 13 MB/s for bzip2 down to less than 1 MB/s for xz.

(The comparison is a bit unfair, because the built-in compressors only use a single thread.)
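
For a fairer apples-to-apples comparison, fst's thread count can be pinned to one (sketch, timings omitted):

# restrict fst to a single thread, benchmark, then restore
fst::threads_fst(1)
bench(fst::compress_fst(obj_small, "ZSTD", 0))
fst::threads_fst(parallel::detectCores())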

dipterix commented 6 years ago

Thanks for the comparison. I did almost the same comparison with bzip2, and it's super slow compared to LZ4 (about 100x slower on 800 MB raw vectors, though I was running LZ4 on 8 cores, so single-threaded the gap would be closer to 10x).

I think fst's speed is pretty good. Maybe my requirements are too high. I probably need to split the data into chunks and send them individually.

One thing that caught my attention is that serialize() plus compress_fst() is sometimes slower than write_fst(). Does write_fst() use R's native serialize function, or does it have its own serialization?

Thanks!

MarcusKlik commented 6 years ago

Hi @dipterix, thanks! Yes, the fst format was designed from the ground up to allow maximum throughput of data to and from disk when using multiple threads. It also writes to disk in parallel with the compression of the data, so the actual writing only uses a single core while the remaining cores are available for compression.

The data is compressed with the LZ4 and ZSTD compressors, which have a very good speed-to-compression ratio. Before the actual compression, I'm also using knowledge of the specific type of column data to pre-compress or shuffle the data, which can greatly speed things up. For example, fst reorders the bytes of an integer vector to bring corresponding bytes next to each other (so all least significant bytes of a 4-byte integer go into one bucket, and so on). This allows for much higher compression ratios, especially at the lower (high-speed) compression settings.
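
A toy illustration of that byte-shuffle idea in plain R (fst implements this in C++; the snippet only shows the principle, assuming a little-endian machine):

# group the 1st, 2nd, 3rd and 4th byte of each 32-bit integer together
x <- as.integer(1:4)
bytes <- matrix(writeBin(x, raw()), nrow = 4L)  # one column per integer
shuffled <- as.vector(t(bytes))  # all least significant bytes first, then all 2nd bytes, ...

# small integers have mostly zero high bytes, so shuffling produces long
# zero runs that compress very well at high speed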

Although compress_fst() can also utilize multiple threads, it can't know the type of data being compressed, because a raw vector can contain any kind of data. So the type-dependent optimizations can't be used. The difference in compression speed between write_fst() and compress_fst() is mostly due to those optimizations, and write_fst() can be faster even though it needs one core for the actual writing to disk...

Thanks for your question and comparison!

dipterix commented 6 years ago

Thank you too for solving my problems!