Intermediate Result File Format

We use intermediate result files for recursive planning (CTE or subquery processing), INSERT/SELECT with repartitioning, and repartitioned joins. Hence, improvements to the speed or size of intermediate result files can improve the performance of a wide range of queries.

Currently we use PostgreSQL's COPY format for intermediate result files, which is either CSV or binary, depending on the data types of the columns. The binary format uses the send/receive functions of each data type and tries to be version independent and, to some extent, architecture independent (e.g. the endianness of the architecture matters in the binary format, but most other aspects do not).

We don't care about version or architecture independence for intermediate result files, so we can use an alternative format that is optimized for performance.

In this document we propose a new intermediate result file format.

Goals

Extensible: we might decide in the future to add more statistics or data structures to the format. For example, it should allow adding compression, join data structures, histograms, etc.
Streaming write: we often need to do "tuple source -> in-memory intermediate result encoder -> stream over connection". This rules out formats that need to buffer the entire tuple set in memory. For example, the format shouldn't write the row count in the file header, because the row count is not known until the last tuple has been encoded.
Optional: streaming read. Currently we always read intermediate results from files, so we don't need streaming reads to cover current use cases. In the future, if we want to use pipes, streaming reads will be useful.
Performance.

Proposed format

File format is:

version number (4 bytes)
header metadata length in bytes (4 bytes)
header metadata items
data blocks
footer metadata items
footer metadata length in bytes (4 bytes)

Each metadata item is:

key (4 bytes)
type-id (4 bytes) (to allow rolling upgrades, in case type of a key changes)
value (datum format) (type decided by key)

The header contains metadata that is necessary for parsing the file, and incompatibilities there cause an error. The footer contains other metadata, such as statistics, and incompatibilities there do not cause an error.
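The layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the proposed implementation: the document does not fix a byte order (little-endian is assumed here), and "datum format" is not specified, so the value is stored as a 4-byte length followed by raw bytes. The function names encode_metadata_item and encode_file are hypothetical.

```python
import struct


def encode_metadata_item(key: int, type_id: int, value: bytes) -> bytes:
    """Encode one metadata item: key (4 bytes), type-id (4 bytes), value.

    Assumption: the value is written as a 4-byte length plus raw bytes,
    because the document only says "datum format".
    """
    return struct.pack("<III", key, type_id, len(value)) + value


def encode_file(version, header_items, blocks, footer_items) -> bytes:
    """Assemble a whole file in memory, following the layout in the text.

    header_items / footer_items: lists of (key, type_id, value) tuples.
    blocks: already-encoded data blocks (opaque bytes in this sketch).
    """
    header = b"".join(encode_metadata_item(*item) for item in header_items)
    footer = b"".join(encode_metadata_item(*item) for item in footer_items)
    return (
        struct.pack("<II", version, len(header))  # version, header length
        + header
        + b"".join(blocks)
        + footer
        + struct.pack("<I", len(footer))          # footer length goes last
    )
```

Placing the footer length in the final 4 bytes means a reader can locate the footer by seeking from the end of the file, without scanning the data blocks.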

The simplest data block type is the tuple store, which contains tuples sequentially. Block types that could be added in the future include a columnar format, a hash map format, and so on.

Tuple store block type

A tuple store consists of blocks. We group multiple rows into each block.

Each block is:

type (4 bytes) equal to BLOCK_TYPE_TUPLESTORE
length (8 bytes)
subformat (4 bytes)
metadata_count (4 bytes)
metadata
data

Where data consists of tuple data, which is data for each row.
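As a rough sketch of the block layout, the fixed-size fields can be packed with Python's struct module. Several details here are assumptions rather than part of the proposal: the numeric value of BLOCK_TYPE_TUPLESTORE, little-endian byte order, and the interpretation of length as the number of bytes that follow the length field. The helper names are hypothetical.

```python
import struct

BLOCK_TYPE_TUPLESTORE = 1  # hypothetical value; the document does not define it


def encode_block(subformat: int, metadata_items: list, data: bytes) -> bytes:
    """Encode one tuple store block.

    metadata_items: already-encoded metadata items (opaque bytes here).
    Assumption: `length` counts every byte after the length field itself.
    """
    body = (
        struct.pack("<II", subformat, len(metadata_items))  # subformat, count
        + b"".join(metadata_items)
        + data
    )
    # type (4 bytes) and length (8 bytes) come first, with no padding
    return struct.pack("<IQ", BLOCK_TYPE_TUPLESTORE, len(body)) + body


def read_block_header(buf: bytes, offset: int = 0):
    """Return (type, length, subformat, metadata_count) for a block."""
    block_type, length = struct.unpack_from("<IQ", buf, offset)
    subformat, metadata_count = struct.unpack_from("<II", buf, offset + 12)
    return block_type, length, subformat, metadata_count
```

Because the type and length come first, a reader that does not understand a block type can still skip over it.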

The data for each row depends on the subformat:

subformat = 0: null bitmask, followed by a memcpy of the datums (this is used when there are no variable-length data types, and is fast)
subformat = 1: null bitmask, followed by datums serialized in a way similar to cstore_fdw.
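A minimal sketch of subformat 0 for fixed-width columns follows. It stands in for the memcpy of datums with struct packing, and it makes assumptions the document does not state: one bit per column in the bitmask with a set bit meaning NULL, little-endian byte order, and no bytes written for NULL columns. The function names and the use of struct format characters to describe column types are hypothetical.

```python
import struct


def encode_row_fixed(values, formats) -> bytes:
    """Encode one row as a null bitmask followed by fixed-width datums.

    formats: one struct format character per column, e.g. "i" (int4),
    "q" (int8), "d" (float8).
    """
    bitmask = bytearray((len(values) + 7) // 8)
    body = b""
    for i, (value, fmt) in enumerate(zip(values, formats)):
        if value is None:
            bitmask[i // 8] |= 1 << (i % 8)  # mark column i as NULL
        else:
            body += struct.pack("<" + fmt, value)
    return bytes(bitmask) + body


def decode_row_fixed(buf: bytes, formats, offset: int = 0):
    """Decode one row; return (values, offset just past the row)."""
    nbytes = (len(formats) + 7) // 8
    bitmask = buf[offset:offset + nbytes]
    offset += nbytes
    values = []
    for i, fmt in enumerate(formats):
        if bitmask[i // 8] & (1 << (i % 8)):
            values.append(None)
        else:
            (value,) = struct.unpack_from("<" + fmt, buf, offset)
            values.append(value)
            offset += struct.calcsize("<" + fmt)
    return values, offset
```

For example, encode_row_fixed([7, None, 1.5], ["i", "q", "d"]) produces a 1-byte bitmask followed by 12 bytes of datums, and decode_row_fixed recovers the original values from it.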

Discussion

Why do we need header metadata?

If we want to stream read data, we might need some information before streaming starts.

Why do we need footer metadata?

We want to store statistics like row count. We cannot have this in the header and also allow streaming writes.
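The interaction between streaming writes and the footer can be illustrated with a small sketch. The writer consumes rows one at a time, never holds more than one in memory, and only learns the row count at the end, which is why the count goes in the footer. The key and type-id constants, the fixed 8-byte value encoding, little-endian byte order, and the function names are all assumptions made for illustration; block framing is omitted for brevity.

```python
import io
import struct

KEY_ROW_COUNT = 1   # hypothetical metadata key
TYPE_ID_INT8 = 20   # hypothetical type-id for an 8-byte integer


def write_stream(out, encoded_rows) -> None:
    """Write version, an empty header, the rows, then a row-count footer."""
    out.write(struct.pack("<II", 1, 0))  # version 1, header length 0
    row_count = 0
    for row in encoded_rows:             # rows arrive one by one
        out.write(row)
        row_count += 1
    footer = struct.pack("<IIq", KEY_ROW_COUNT, TYPE_ID_INT8, row_count)
    out.write(footer)
    out.write(struct.pack("<I", len(footer)))


def read_row_count(f) -> int:
    """Find the footer by seeking from the end of a seekable file."""
    f.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", f.read(4))
    f.seek(-(4 + footer_len), io.SEEK_END)
    key, _type_id, value = struct.unpack("<IIq", f.read(footer_len))
    if key != KEY_ROW_COUNT:
        raise ValueError("unexpected footer key")
    return value
```

A reader that only wants the row count reads the last 4 bytes, then the footer, and never touches the data.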

Why divide data into blocks?

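To allow streaming reads that do not require too much buffering. For example, think about columnar storage: without blocks, a reader would have to buffer an entire column before it could produce the first row.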
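A block-at-a-time reader might look like the sketch below. It buffers at most one block, regardless of the total size of the result. It assumes each block starts with a 4-byte type and an 8-byte length that counts the bytes following the length field, little-endian byte order, and a stream that contains only blocks up to end of input; iter_blocks is a hypothetical name.

```python
import io
import struct


def iter_blocks(stream):
    """Yield (block_type, payload) pairs, reading one block at a time."""
    while True:
        header = stream.read(12)
        if not header:
            return  # clean end of input
        if len(header) < 12:
            raise ValueError("truncated block header")
        block_type, length = struct.unpack("<IQ", header)
        payload = stream.read(length)
        if len(payload) < length:
            raise ValueError("truncated block payload")
        yield block_type, payload


# Usage: two small blocks read back from an in-memory stream.
raw = struct.pack("<IQ", 1, 3) + b"abc" + struct.pack("<IQ", 1, 2) + b"de"
blocks = list(iter_blocks(io.BytesIO(raw)))
```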
