Parquet: Explore ways to accelerate table writing

rcaudy commented 3 years ago

Currently, our Parquet table writing is not parallelized at all, other than by writing multiple tables/files at once. We should investigate options here if it becomes a performance bottleneck for users.

malhotrashivam commented 1 year ago

Potential optimization idea (found during #4334): Currently, while using dictionary encoding, if we hit the limit on dictionary size, we discard all the work done so far and fall back to plain encoding for all the pages. A more efficient approach would be to first write a dictionary page with the data collected so far, and then use plain encoding only for the pages that follow.
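To make the idea concrete, here is a minimal pure-Python sketch of the proposed fallback. It is not the Deephaven writer code: `encode_column`, `page_size`, and `max_dict_entries` are hypothetical names, pages are modelled as plain tuples, and the dictionary limit is counted in entries rather than bytes.

```python
def encode_column(values, page_size, max_dict_entries):
    """Split values into pages, dictionary-encoding until the dictionary is full.

    Returns (dictionary, pages). Each page is ("dict", [indices]) or
    ("plain", [values]). Hypothetical model, not a real Parquet writer.
    """
    dictionary = {}   # value -> index; dicts preserve insertion order
    pages = []
    use_dict = True

    for start in range(0, len(values), page_size):
        chunk = values[start:start + page_size]
        if use_dict:
            added = []     # entries introduced by this page only
            indices = []
            for v in chunk:
                idx = dictionary.get(v)
                if idx is None:
                    if len(dictionary) >= max_dict_entries:
                        break  # dictionary is full: this page cannot be dict-encoded
                    idx = len(dictionary)
                    dictionary[v] = idx
                    added.append(v)
                indices.append(idx)
            else:
                # The whole page fit, so keep it dictionary-encoded.
                pages.append(("dict", indices))
                continue
            # Overflow: undo only this page's additions and keep earlier pages.
            for v in added:
                del dictionary[v]
            use_dict = False
        pages.append(("plain", list(chunk)))

    return list(dictionary), pages


dictionary, pages = encode_column(["a", "b", "a", "c", "d", "e"],
                                  page_size=2, max_dict_entries=3)
print(dictionary)  # ['a', 'b', 'c']
print(pages)       # [('dict', [0, 1]), ('dict', [0, 2]), ('plain', ['d', 'e'])]
```

The first two pages stay dictionary-encoded and only the page that overflows the limit, plus everything after it, is written as plain, so none of the earlier encoding work is thrown away.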

malhotrashivam commented 1 year ago

Some more optimization opportunities:

1. In the current writing code, we copy the contents of the table into a buffer (inside the TransferObject class) and then use that buffer for writing to the Parquet file (inside the ColumnWriter class). We can skip the intermediate buffer and write directly to the Parquet file. An example for a long type column is here: #4587.
2. Currently, we use RLE only with dictionary encoding for strings, and we use bit packing for booleans. Other than that, we always use Plain encoding. We should look into using RLE by default for all data types, since that can dramatically reduce the number of bytes written, especially for logical types that hold byte or char values, which are all written as Int32 on disk.
3. As found during #4541, when writing identical content, pyarrow generally writes fewer pages per file than our code. This can improve performance, since each page requires writing additional metadata. One major difference is that pyarrow uses RLE whereas our code uses Plain encoding, which could lead to significantly fewer bytes being written by pyarrow than by Deephaven.
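As a rough illustration of why run-length encoding can matter for byte or char values stored as Int32, here is a small pure-Python sketch comparing sizes. It uses a simplified cost model that is an assumption for this example (a 4-byte run length plus the value packed into `bit_width` bits, rounded up to whole bytes). It is not Parquet's actual RLE/bit-packing hybrid format, and the function names are hypothetical.

```python
def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return [(v, n) for v, n in runs]


def plain_int32_size(values):
    """Bytes needed when every value is written as a 4-byte Int32."""
    return 4 * len(values)


def rle_size_estimate(runs, bit_width=8):
    """Bytes under the simplified model: 4-byte run length + packed value per run."""
    bytes_per_value = (bit_width + 7) // 8
    return len(runs) * (4 + bytes_per_value)


# A byte-valued column with long runs of repeated values.
column = [7] * 1000 + [3] * 500 + [7]
runs = run_length_encode(column)

print(runs)                       # [(7, 1000), (3, 500), (7, 1)]
print(plain_int32_size(column))   # 6004
print(rle_size_estimate(runs))    # 15
```

The gain depends entirely on the data: a column with no repeated neighbours produces one run per value and would be larger than plain under this model, so any real change would need to be measured on representative tables.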

malhotrashivam commented 1 year ago

The first and third suggestions above didn't show much improvement. More details can be found in this doc.

deephaven / deephaven-core

Parquet: Explore ways to accelerate table writing #946