Unencoded Variable Length Column Size Statistics

Samrose-Ahmed commented 4 months ago

Proposed Change

It is not currently possible to determine the uncompressed, unencoded size of variable-length columns. For fixed-length data types, the size can be derived from the null count and row count statistics, but this is not possible for variable-length types such as strings or binary, since the stored data is encoded and compressed. Tracking data size is useful for many purposes, including engine planning and query optimization (e.g. planning a data exchange or join) and letting readers estimate the memory needed to read data.
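To illustrate the fixed-length case, here is a minimal sketch of the calculation from existing statistics. The function name, the type-width table, and the choice to count zero bytes for nulls are assumptions made for illustration; they are not part of any existing API.

```python
# Hypothetical helper: estimate the unencoded size of a fixed-length column
# from the statistics that already exist (row count and null count).
# The widths below are an illustrative subset, not an authoritative list.
FIXED_WIDTH_BYTES = {"int": 4, "long": 8, "float": 4, "double": 8}


def fixed_width_unencoded_size(type_name, row_count, null_count):
    """Return the estimated unencoded size in bytes, counting nulls as 0 bytes."""
    width = FIXED_WIDTH_BYTES[type_name]
    non_null_values = row_count - null_count
    return non_null_values * width


# Example: a long column with 1000 rows, 100 of them null.
estimate = fixed_width_unencoded_size("long", row_count=1000, null_count=100)
```

No equivalent calculation exists for string or binary columns, because the per-value length is not recoverable from the row and null counts alone.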

We propose adding a new optional property, similar to columnSizes, inside the manifest files. This will be a map from field id to the uncompressed, unencoded size in bytes. It should only be set for variable-length column types (String/Binary).
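As a rough sketch of how a writer might populate such a map, assuming rows are represented as dicts keyed by field id, that strings are measured as UTF-8 bytes, and that nulls and length prefixes contribute nothing (none of which is specified by this proposal):

```python
from collections import defaultdict


def unencoded_variable_length_sizes(rows, variable_length_field_ids):
    """Hypothetical writer-side accumulator.

    Returns a map from field id to the total unencoded size in bytes
    of the non-null values seen for that field.
    """
    sizes = defaultdict(int)
    for row in rows:
        for field_id in variable_length_field_ids:
            value = row.get(field_id)
            if value is None:
                continue  # assumption: nulls contribute 0 bytes
            if isinstance(value, str):
                value = value.encode("utf-8")  # assumption: strings measured as UTF-8
            sizes[field_id] += len(value)
    return dict(sizes)


# Field 1 is a string column, field 2 is a binary column.
rows = [
    {1: "abc", 2: b"\x00\x01"},
    {1: None, 2: b""},
    {1: "é", 2: None},
]
sizes = unencoded_variable_length_sizes(rows, variable_length_field_ids=[1, 2])
```

Fixed-length fields are left out of the map entirely, matching the intent that the property is only set for variable-length columns.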

Add the following to the manifest_entry.data_file struct: