fstpackage / fstlib

A C++ library for lightning fast multi-threaded serialization of tabular data. Home to the `fst` file format.
Mozilla Public License 2.0

License: AGPLv3

The fst format and fstlib library

Overview

The fstlib library is home to the fst storage format for columnar tabular data. It also contains very fast multi-threaded streamers for fst files and a computational framework that allows for effective use of the format's features for parallel calculations on larger-than-memory datasets.

The fst format

The fst format stores columnar tabular data. It uses hashing and compression for stability, correctness and compactness. The format supports a wide range of data types, and tabular data can be compressed with a variety of settings to maximize throughput to storage devices.
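To illustrate how hashing can guard correctness, the sketch below stores a checksum next to a block of bytes and re-checks it after reading. It is a minimal illustration, not fstlib's implementation: the names `fnv1a64` and `block_is_intact` are hypothetical, and FNV-1a is used only as a simple stand-in for whichever hash function the format actually uses.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in hash (64-bit FNV-1a); not the hash used by the fst format.
inline std::uint64_t fnv1a64(const std::vector<std::uint8_t>& bytes) {
    std::uint64_t hash = 14695981039346656037ULL;  // FNV offset basis
    for (std::uint8_t b : bytes) {
        hash ^= b;                  // mix in one byte
        hash *= 1099511628211ULL;   // FNV prime
    }
    return hash;
}

// Hypothetical check: a block is accepted only if the hash recomputed after
// reading matches the hash that was stored when the block was written.
inline bool block_is_intact(const std::vector<std::uint8_t>& block,
                            std::uint64_t stored_hash) {
    return fnv1a64(block) == stored_hash;
}
```

A writer would compute the hash once per block and store it alongside the data; a reader recomputes it and rejects the block on a mismatch, so silent corruption on disk is detected rather than propagated.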

Streaming

The fstlib library is built to access tabular data in the fst format at the highest possible speed. It employs multi-threading for background reading and writing, and can (de-)compress using the full resources of the CPU. Speeds of multiple GB/s can be reached on fast (NVMe SSD) storage devices.

fstlib uses the excellent LZ4 compressor for high-speed compression at lower ratios and the ZSTD compressor for medium-speed compression at higher ratios. Compression is done on small (16 kB) blocks of data, which allows for (almost) random access to the data. Each column uses its own compression scheme, and different compressors can be mixed within a single column. This flexible setup allows for better-optimized and faster compression of data, boosting speeds.
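The link between small blocks and (almost) random access can be sketched with a little index arithmetic. The example assumes a fixed-width column whose uncompressed data is split into blocks of exactly 16 × 1024 bytes; that layout, the `BlockRange` type and the `blocks_for_rows` helper are assumptions made for illustration and are not taken from fstlib's API or file specification.

```cpp
#include <cassert>
#include <cstdint>

// Assumed block size: 16 kB of uncompressed column data per block.
constexpr std::uint64_t kBlockSizeBytes = 16 * 1024;

// Hypothetical result type: the inclusive range of blocks to decompress.
struct BlockRange {
    std::uint64_t first_block;
    std::uint64_t last_block;
};

// Returns the blocks that cover rows [first_row, first_row + row_count)
// of a column whose elements are element_size bytes wide.
inline BlockRange blocks_for_rows(std::uint64_t first_row,
                                  std::uint64_t row_count,
                                  std::uint64_t element_size) {
    assert(row_count > 0 && element_size > 0);
    const std::uint64_t first_byte = first_row * element_size;
    const std::uint64_t last_byte = (first_row + row_count) * element_size - 1;
    return {first_byte / kBlockSizeBytes, last_byte / kBlockSizeBytes};
}
```

Under these assumptions, reading 100 rows from the middle of a column of 8-byte values touches a single block, so only about 16 kB has to be decompressed regardless of the column's total size. The access is "almost" random because a whole block must be decoded even when only one value is needed.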

Computational framework

The fstlib library allows for computations on tabular data blocks during loading and decompression of data. This unique approach to processing compressed tabular data enables high-speed computing on larger-than-memory datasets.
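The idea of computing while data is being loaded can be sketched as a fold over blocks: each block is consumed as soon as it is available, so the full column never has to be in memory at once. Everything named here (`BlockSource`, `Stats`, `accumulate_blocks`, `make_memory_source`) is hypothetical and only illustrates the pattern; it is not fstlib's computational framework, and the in-memory source merely stands in for a reader that decompresses blocks from an fst file.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical block source: writes the next decompressed block into `out`
// and returns false once no blocks remain.
using BlockSource = std::function<bool(std::vector<double>&)>;

struct Stats {
    std::size_t count = 0;
    double sum = 0.0;
};

// Folds every block into running statistics; only one block is held at a time.
inline Stats accumulate_blocks(const BlockSource& next_block) {
    Stats stats;
    std::vector<double> block;
    while (next_block(block)) {
        for (double value : block) {
            stats.sum += value;
            ++stats.count;
        }
    }
    return stats;
}

// Stand-in source that slices an in-memory vector into blocks of block_len
// values. `data` must outlive the returned source.
inline BlockSource make_memory_source(const std::vector<double>& data,
                                      std::size_t block_len) {
    assert(block_len > 0);
    std::size_t pos = 0;
    return [&data, block_len, pos](std::vector<double>& out) mutable {
        if (pos >= data.size()) return false;
        const std::size_t end = std::min(pos + block_len, data.size());
        out.assign(data.begin() + static_cast<std::ptrdiff_t>(pos),
                   data.begin() + static_cast<std::ptrdiff_t>(end));
        pos = end;
        return true;
    };
}
```

Because the aggregate is updated per block, peak memory use depends on the block size rather than on the size of the dataset, which is what makes larger-than-memory processing feasible in this style.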

Goals

The fstlib library was designed with four goals in mind:

Use cases

Currently, the main use case for fstlib is R's fst package. In that package, fstlib provides the backend for accessing fst files at very high speeds, up to multiple GB/s. In the future, fstlib will be part of similar packages for other languages such as Python, Julia, and Rust.