It is not currently possible to determine the uncompressed, unencoded size of variable-length columns. For fixed-length data types the size can be derived from the null count and row count statistics, but this does not work for variable-length types such as string or binary, because the stored data is encoded and compressed. Tracking the unencoded size is useful for engine planning and query optimization (for example, planning a data exchange or a join) and for readers that need to estimate how much memory reading the data will take.
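The fixed-length case can be sketched as follows. This is a minimal illustration rather than Iceberg code: the function name and the width table are hypothetical, and it assumes that null values take up no space in the unencoded representation.

```python
# Hypothetical helper (not part of Iceberg): estimate the unencoded size of a
# fixed-width column from row-count and null-count statistics alone.
# Widths are the usual byte sizes of these primitive types.
FIXED_WIDTH_BYTES = {"int": 4, "long": 8, "float": 4, "double": 8}


def fixed_width_size(type_name, record_count, null_count):
    """Return the estimated unencoded size in bytes.

    Assumption: nulls occupy no space, so only non-null values are counted.
    """
    width = FIXED_WIDTH_BYTES[type_name]
    return (record_count - null_count) * width


# 900 non-null long values at 8 bytes each.
print(fixed_width_size("long", 1000, 100))  # 7200
```

No comparable calculation exists for strings or binary values, because their per-value width is not known from the counts alone.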
We propose adding a new optional property, similar to columnSizes, to the manifest files. It is a map from field id to the uncompressed, unencoded size in bytes, and it should be set only for variable-length column types (String/Binary).
Add the following to the manifest_entry.data_file struct:
| v1 | v2 | Field id, name | Type | Description |
|---|---|---|---|---|
| optional | optional | 142 variable_length_column_sizes | map<143: int, 144: long> | Map from column id to the uncompressed, unencoded size of all regions that store the column. Only valid for variable-length types like string/byte array. |
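As a rough sketch of how a planner might consume the proposed field, the snippet below sums the per-column byte counts across the data files of a scan. It models each data_file entry as a plain Python dict; the function name and the dict layout are assumptions made for illustration, not an Iceberg API. Because the field is optional, files without it are counted separately instead of being treated as zero.

```python
from collections import Counter


def total_variable_length_bytes(data_files):
    """Sum unencoded byte counts per field id across data files.

    `data_files` is a list of dicts standing in for manifest data_file
    entries (hypothetical layout). Returns (totals, files_missing), where
    files_missing counts entries that do not carry the optional map.
    """
    totals = Counter()
    files_missing = 0
    for data_file in data_files:
        sizes = data_file.get("variable_length_column_sizes")
        if sizes is None:
            # The field is optional; older writers will not populate it.
            files_missing += 1
            continue
        totals.update(sizes)  # adds the byte counts per field id
    return dict(totals), files_missing


files = [
    {"record_count": 100, "variable_length_column_sizes": {3: 4096, 5: 1200}},
    {"record_count": 50, "variable_length_column_sizes": {3: 2048}},
    {"record_count": 10},  # written without the new field
]
totals, missing = total_variable_length_bytes(files)
print(totals, missing)  # {3: 6144, 5: 1200} 1
```

Reporting the number of files without the map lets the caller decide whether the total is complete enough to use or whether to fall back to another estimate.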
Proposed Change
See also the Parquet format's SizeStatistics and its unencoded_byte_array_data_bytes field.
Relevant GitHub issues:
Proposal document
https://docs.google.com/document/d/189kIZxx_dUloBCDPUz2Fh0BBOZSm2fXHHXWpdpq3DrU
Specifications