citusdata / citus

Distributed PostgreSQL as an extension
https://www.citusdata.com
GNU Affero General Public License v3.0
10.61k stars 671 forks source link

Intermediate Result File Format #3570

Open pykello opened 4 years ago

pykello commented 4 years ago

We use intermediate result files for recursive planning (CTE or subquery processing), INSERT/SELECT with repartitioning, and repartitioned joins. Hence improvements to the speed or size of intermediate result files can improve performance of wide range of queries.

Current we use postgresql's COPY format for intermediate result files, which is either csv or binary, depending on data types used in columns. Binary format uses send/receive functions of the data type, and tries to be version independent, and to some extent architecture independent (e.g. endian-ness of architecture matters in binary format, but most other aspects don't).

We don't care about version/architecture independency for intermediate result file formats, so we can use an alternative format to optimize for performance.

In this document we propose a new intermediate result file format.

Goals

  1. Extensible: we might decide in future we want to add more statistics or data structures to the format. For example, it should allow to add compression, join data structures, histograms, etc.
  2. Streaming write: often we need to do "tuple source -> in-memory intermediate result encoder -> stream over connection". This rules out formats which need to buffer the entire tuple set into the memory. For example, data format shouldn't write the row count in the file header.
  3. Optional: streaming read. currently we always read the intermediate result format from files, so we don't need streaming read to cover current use-cases. In future, if we want to use pipes, having streaming reads will be useful.
  4. Performance.

Proposed format

File format is:

  1. version number (4 bytes)
  2. header metadata length in bytes (4 bytes)
  3. header metadata items
  4. data blocks
  5. footer metadata items
  6. footer metadata length in bytes (4 bytes)

Each metadata item is:

  1. key (4 bytes)
  2. type-id (4 bytes) (to allow rolling upgrades, in case type of a key changes)
  3. value (datum format) (type decided by key)

Header contains metadata that is necessary for parsing the file, and incompatibilities cause error. Footer contains other metadata, like statistics, and incompatibilities doesn't cause error.

Simplest data block type is tuple store, which contains tuples sequentially. Other additions in future can be columnar format, hash map format, ...

Tuple store block type

Consists of blocks. We group multiple rows into a block.

Each block is:

  1. type (4 bytes) equal to BLOCK_TYPE_TUPLESTORE
  2. length (8 bytes)
  3. subformat (4 bytes)
  4. _metadatacount (4 bytes)
  5. metadata
  6. data

Where data consists of tuple data, which is data for each row.

Data for each row for

Discussion

Why do we need header metadata?

If we want to stream read data, we might need some information before streaming starts.

Why do we need footer metadata?

We want to store statistics like row count. We cannot have this in the header and also allow streaming writes.

Why divide data into blocks?

To allow streaming reads which doesn't require too much buffering. For example, think about columnar storage.

marcocitus commented 4 years ago

One thing that would be immediately useful is to use gzip compression on top of the current format. The network transfer cost can be quite significant in some environments.