apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Unencoded Variable Length Column Size Statistics #10703

Open Samrose-Ahmed opened 4 months ago

Samrose-Ahmed commented 4 months ago

Proposed Change

It is not currently possible to determine the uncompressed, unencoded size of variable-length columns. For fixed-length data types the size can be derived from the null count and row count statistics, but this does not work for variable-length types such as string or binary, because the stored data is encoded and compressed. Tracking data size is useful for many purposes, including engine planning and query optimization (e.g. planning a data exchange or join), as well as letting readers estimate the memory needed to read data.
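To illustrate the gap, here is a minimal sketch of the fixed-length estimate described above. The function name and the width table are hypothetical and not part of any Iceberg API, and the sketch assumes null values occupy no space. The same arithmetic has no answer for string or binary columns.

```python
from typing import Optional

# Byte widths for a few fixed-length primitive types (illustrative subset).
FIXED_WIDTH_BYTES = {"int": 4, "long": 8, "float": 4, "double": 8}


def estimate_unencoded_size(type_name: str, row_count: int, null_count: int) -> Optional[int]:
    """Return the unencoded size in bytes, or None if counts alone cannot give it."""
    width = FIXED_WIDTH_BYTES.get(type_name)
    if width is None:
        # Variable-length types (string, binary) have no fixed per-value width,
        # so row and null counts are not enough to derive a size.
        return None
    # Assumption: nulls contribute zero bytes to the unencoded size.
    return (row_count - null_count) * width


print(estimate_unencoded_size("long", 1000, 100))    # 7200
print(estimate_unencoded_size("string", 1000, 100))  # None
```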

We propose adding a new optional property, similar to columnSizes, inside the manifest files. It will be a map from field id to the uncompressed, unencoded size in bytes. It should only be set for variable-length columns (String/Binary).

Add the following to the manifest_entry.data_file struct:

| v1 | v2 | Field id, name | Type | Description |
|----|----|----------------|------|-------------|
| optional | optional | 142 variable_length_column_sizes | map<143: int, 144: long> | Map from column id to the uncompressed unencoded size of all regions that store the column. Only valid for variable length types like string/byte array. |

See also Parquet format SizeStatistics and unencoded_byte_array_data_bytes.
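As a rough sketch of how a writer might populate the proposed map from Parquet-style statistics, the snippet below sums per-row-group values into one per-file total. It assumes the row-group statistics have already been keyed by Iceberg field id. The function name and input shape are hypothetical. Dropping a field when any row group lacks the value is one possible policy, not something the proposal specifies.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional


def file_level_sizes(row_groups: Iterable[Dict[int, Optional[int]]]) -> Dict[int, int]:
    """Sum per-row-group unencoded byte counts into a field id -> bytes map."""
    totals: Dict[int, int] = defaultdict(int)
    missing = set()
    for row_group in row_groups:
        for field_id, nbytes in row_group.items():
            if nbytes is None:
                # The statistic is optional; remember fields with gaps.
                missing.add(field_id)
            else:
                totals[field_id] += nbytes
    # Assumed policy: omit a field entirely if any row group lacked the value,
    # so the map never reports a partial total.
    return {fid: total for fid, total in totals.items() if fid not in missing}


# Two row groups; field 2 has no statistic in the second one.
print(file_level_sizes([{1: 100, 2: 50}, {1: 20, 2: None}]))  # {1: 120}
```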

Proposal document

https://docs.google.com/document/d/189kIZxx_dUloBCDPUz2Fh0BBOZSm2fXHHXWpdpq3DrU

ajantha-bhat commented 4 months ago

Tagging @emkornfield, since he worked on the Parquet proposal.

emkornfield commented 4 months ago

Thanks, left a few comments on the design doc. Overall, it seems pretty reasonable to include this IMO.