Weijun-H / Read-Some-Paper

This repo is a reading list related to modern data management systems.

An Empirical Evaluation of Columnar Storage Formats #29

Weijun-H opened this issue 6 months ago

Weijun-H commented 6 months ago

Abstract

Columnar storage is a core component of a modern data analytics system. Although many database management systems (DBMSs) have proprietary storage formats, most provide extensive support to open-source storage formats such as Parquet and ORC to facilitate cross-platform data sharing. But these formats were developed over a decade ago, in the early 2010s, for the Hadoop ecosystem. Since then, both the hardware and workload landscapes have changed. In this paper, we revisit the most widely adopted open-source columnar storage formats (Parquet and ORC) with a deep dive into their internals. We designed a benchmark to stress-test the formats’ performance and space efficiency under different workload configurations. From our comprehensive evaluation of Parquet and ORC, we identify design decisions advantageous with modern hardware and real-world data distributions. These include using dictionary encoding by default, favoring decoding speed over compression ratio for integer encoding algorithms, making block compression optional, and embedding finer-grained auxiliary data structures. We also point out the inefficiencies in the format designs when handling common machine learning workloads and using GPUs for decoding. Our analysis identified important considerations that may guide future formats to better fit modern technology trends.

Weijun-H commented 6 months ago

We summarize the lessons learned from our evaluation of Parquet and ORC to guide future innovations in columnar storage formats.

Lesson 1. Dictionary Encoding is effective across data types (even for floating-point values) because most real-world data have low NDV (number of distinct values) ratios. Future formats should continue to apply the technique aggressively, as Parquet does.
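
As a minimal illustration of why this works, here is a sketch using pyarrow (assumed installed; the data and file names are made up). A float column with a low NDV ratio shrinks dramatically once dictionary encoding kicks in, even with block compression disabled:

```python
# Sketch: dictionary encoding on a low-NDV float column (pyarrow assumed).
import os
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical data: 1M floats drawn from only 100 distinct values.
values = [float(i % 100) for i in range(1_000_000)]
table = pa.table({"price": values})

pq.write_table(table, "dict.parquet", use_dictionary=True, compression="none")
pq.write_table(table, "plain.parquet", use_dictionary=False, compression="none")

print(os.path.getsize("dict.parquet"), "vs", os.path.getsize("plain.parquet"))
# Expect the dictionary-encoded file to be far smaller, despite the float type.
```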

Lesson 2. It is important to keep the encoding scheme simple in a columnar format to guarantee competitive decoding performance. Future format designers should pay attention to the performance cost of selecting among many codec algorithms during decoding.
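
The cost in question is codec dispatch on the hot path. The toy sketch below (not Parquet's actual decoder) contrasts a single simple scheme decoded in a tight loop with a decoder that must look up and branch on a codec per block:

```python
# Toy illustration only: per-block codec dispatch vs. one simple scheme.

def decode_rle(runs):
    """Decode a simple run-length encoding: [(value, count), ...]."""
    out = []
    for value, count in runs:
        out.extend([value] * count)  # one tight loop, no dispatch
    return out

def decode_multi(blocks, codecs):
    """Each block names its codec; the lookup/branch lands on the hot path."""
    out = []
    for codec_name, payload in blocks:
        out.extend(codecs[codec_name](payload))
    return out

codecs = {"rle": decode_rle}
print(decode_rle([(7, 3), (9, 2)]))               # [7, 7, 7, 9, 9]
print(decode_multi([("rle", [(1, 4)])], codecs))  # [1, 1, 1, 1]
```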

Lesson 3. The bottleneck of query processing is shifting from storage to (CPU) computation on modern hardware. Future formats should limit the use of block compression and other heavyweight encodings unless the benefits are justified in specific cases.
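
A quick way to see this trade-off, assuming pyarrow on fast local storage (timings are illustrative, not from the paper):

```python
# Sketch: block compression trades CPU for bytes.
import time
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"x": list(range(2_000_000))})
pq.write_table(table, "none.parquet", compression="none")
pq.write_table(table, "zstd.parquet", compression="zstd")

for path in ("none.parquet", "zstd.parquet"):
    start = time.perf_counter()
    pq.read_table(path)
    print(path, f"{time.perf_counter() - start:.3f}s")
# When the scan is CPU-bound rather than I/O-bound, skipping
# decompression usually makes the uncompressed file faster to read.
```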

Lesson 4. The metadata layout in future formats should be centralized and friendly to random access to better support wide (feature) tables common in ML training. The size of the basic I/O block should be optimized for high-latency cloud storage.
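
For context, Parquet already centralizes its metadata in the file footer, so one fetch is enough to plan a projection over a wide table; the row-group size is the knob that controls the basic I/O block. A sketch (pyarrow assumed, file name hypothetical):

```python
# Sketch: one footer read yields all column/row-group metadata.
import pyarrow as pa
import pyarrow.parquet as pq

# A wide table stands in for an ML feature table.
table = pa.table({f"feat_{i}": [0.0, 1.0] for i in range(1000)})
# Larger row groups mean fewer, bigger I/O blocks -- worth tuning
# for high-latency cloud object storage.
pq.write_table(table, "wide.parquet", row_group_size=1_000_000)

meta = pq.ParquetFile("wide.parquet").metadata  # single footer read
print(meta.num_columns, "columns,", meta.num_row_groups, "row group(s)")
```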

Lesson 5. As storage is getting cheaper, future formats could afford to store more sophisticated indexing and filtering structures to speed up query processing.
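
The baseline here is the lightweight zone maps (per-column-chunk min/max statistics) that Parquet already embeds; richer structures such as Bloom filters or bitmap indexes would extend the same idea. A sketch of reading those statistics with pyarrow (assumed installed):

```python
# Sketch: inspect the min/max statistics a reader uses to skip row groups.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"ts": list(range(100))})
pq.write_table(table, "stats.parquet", row_group_size=25)

meta = pq.ParquetFile("stats.parquet").metadata
for rg in range(meta.num_row_groups):
    s = meta.row_group(rg).column(0).statistics
    print(f"row group {rg}: min={s.min} max={s.max}")
# Any row group whose [min, max] range misses the predicate can be skipped.
```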

Lesson 6. Nested data models should be designed with an affinity to modern in-memory formats to reduce the translation overhead.
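
The translation overhead arises because Parquet stores nested data as Dremel-style repetition/definition levels, while in-memory formats like Arrow use offset-based layouts, so every read re-encodes the nesting. A round-trip sketch with pyarrow (assumed installed):

```python
# Sketch: nested data round-trips through Parquet, but the on-disk
# level-based layout is translated to Arrow's offset-based layout on read.
import pyarrow as pa
import pyarrow.parquet as pq

nested = pa.table({
    "user": pa.array(
        [{"name": "a", "tags": [1, 2]}, {"name": "b", "tags": []}],
        type=pa.struct([("name", pa.string()), ("tags", pa.list_(pa.int64()))]),
    )
})
pq.write_table(nested, "nested.parquet")
print(pq.read_table("nested.parquet").schema)
```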

Lesson 7. The characteristics of common machine learning workloads require future formats to support both wide-table projections and low-selectivity selections efficiently. This calls for better metadata organization and more effective indexing. In addition, future formats should allocate separate regions for large binary objects and incorporate compression techniques designed specifically for floats.
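
The two access patterns look like this in pyarrow terms (assumed installed; the table shape and column names are made up):

```python
# Sketch: wide projection and low-selectivity selection on a feature table.
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical feature table: 500 feature columns plus a sample key.
n = 10_000
table = pa.table({"sample_id": list(range(n)),
                  **{f"feat_{i}": [0.0] * n for i in range(500)}})
pq.write_table(table, "features.parquet")

# Wide projection: read hundreds of feature columns at once.
cols = [f"feat_{i}" for i in range(500)]
wide = pq.read_table("features.parquet", columns=cols)

# Low-selectivity selection: fetch one row by key; with only min/max
# statistics, whole row groups are still decoded and filtered afterwards.
row = pq.read_table("features.parquet", filters=[("sample_id", "=", 42)])
print(wide.num_columns, row.num_rows)
```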

Lesson 8. Future formats should consider decoding efficiency on GPUs. This requires not only sufficient parallel data blocks at the file level but also encoding algorithms that are parallelizable enough to fully utilize the computation within a GPU thread block.
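
For a concrete GPU decoding path, RAPIDS cuDF reads Parquet with CUDA kernels; the sketch below assumes an NVIDIA GPU with cudf installed. How well the device is utilized depends on having enough independent row groups/pages to decode in parallel and on encodings that parallelize within a thread block:

```python
# Sketch: GPU-side Parquet decoding via RAPIDS cuDF (GPU + cudf assumed).
import cudf

# Write a small file, then decode it on the GPU.
cudf.DataFrame({"x": list(range(1000))}).to_parquet("data.parquet")
df = cudf.read_parquet("data.parquet")  # decoding runs in CUDA kernels
print(df.head())
```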