apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

PARQUET-2241: Update wording of BYTE_STREAM_SPLIT encoding #192

Closed wgtmac closed 1 year ago

wgtmac commented 1 year ago

This PR proposes stating explicitly that no padding is allowed within a data page. That makes it easier for a BYTE_STREAM_SPLIT decoder to decode a page that contains nulls: it can get the number of encoded values simply as total_length_encoded_stream / K (K is 4 for float and 8 for double). Otherwise, it has to decode the def/rep levels to get the exact number of non-null values.
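
To illustrate the counting argument, here is a minimal Python sketch. It is not taken from any Parquet implementation; the function names `byte_stream_split_encode` and `byte_stream_split_decode` are hypothetical, and it assumes little-endian values laid out as K concatenated byte streams with no padding.

```python
import struct


def byte_stream_split_encode(values, fmt="<f"):
    """Encode values as K concatenated byte streams (hypothetical helper)."""
    k = struct.calcsize(fmt)  # 4 for "<f" (float), 8 for "<d" (double)
    raw = [struct.pack(fmt, v) for v in values]
    # Stream j holds byte j of every value; the K streams are concatenated.
    return b"".join(bytes(r[j] for r in raw) for j in range(k))


def byte_stream_split_decode(data, fmt="<f"):
    """Decode without def/rep levels, assuming the page carries no padding."""
    k = struct.calcsize(fmt)
    if len(data) % k != 0:
        raise ValueError("stream length is not a multiple of K")
    n = len(data) // k  # number of non-null values: total length / K
    # Byte j of value i sits at offset j * n + i.
    return [
        struct.unpack(fmt, bytes(data[j * n + i] for j in range(k)))[0]
        for i in range(n)
    ]


encoded = byte_stream_split_encode([1.5, -2.0, 0.25])
decoded = byte_stream_split_decode(encoded)
```

In this sketch, padding that is not a multiple of K is rejected, but K extra bytes would silently change the derived count and scramble the values. That is why the count is only reliable if padding is ruled out.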

wgtmac commented 1 year ago

cc @shangxinli @gszadovszky @ggershinsky @pitrou @emkornfield

pitrou commented 1 year ago

cc @wjones127

mapleFU commented 1 year ago

Should we check that no implementation adds padding? At least the C++, Rust, and parquet-mr implementations do not seem to add padding at the end of the data.

emkornfield commented 1 year ago

Seems OK to me.