apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

PARQUET-2362: Clarify parquet encoding #217

Closed letian-jiang closed 11 months ago

letian-jiang commented 1 year ago

Dictionary page format: the entries in the dictionary - in dictionary order - using the plain encoding.

The dictionary entries are not sorted (or at least not always sorted).
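As a rough illustration of the two statements above, the following sketch serializes a dictionary page body for `BYTE_ARRAY` entries using the plain encoding (a 4-byte little-endian length followed by the raw bytes), keeping the entries in whatever order the writer assigned them. The function names and the sample entries are hypothetical, and the page header is left out.

```python
import struct


def encode_dictionary_page(entries):
    """PLAIN-encode BYTE_ARRAY entries in dictionary order (no sorting)."""
    out = bytearray()
    for entry in entries:
        out += struct.pack("<I", len(entry))  # 4-byte little-endian length prefix
        out += entry                          # raw bytes of the entry
    return bytes(out)


def decode_dictionary_page(buf, num_values):
    """Read num_values PLAIN-encoded BYTE_ARRAY entries back from buf."""
    entries = []
    pos = 0
    for _ in range(num_values):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        entries.append(buf[pos:pos + length])
        pos += length
    return entries


# Hypothetical entries, deliberately not in sorted order.
entries = [b"pear", b"apple", b"fig"]
page = encode_dictionary_page(entries)
```

Decoding `page` with `decode_dictionary_page(page, 3)` returns the entries in the same unsorted order, which is why a reader cannot assume the dictionary is sorted.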

There is no padding between values (except for the last byte, which is padded with 0s).
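A minimal sketch of what "no padding between values" means in practice: values of a fixed bit width are written back to back, and only the final byte is filled with zero bits. The helper names are hypothetical, and the most-significant-bit-first order is an assumption made for illustration; the RLE/bit-packing hybrid encoding uses a different bit order.

```python
def pack_msb_first(values, bit_width):
    """Pack values back to back, MSB first (assumed order); zero-pad the last byte."""
    total_bits = len(values) * bit_width
    acc = 0
    for v in values:
        if v >> bit_width:
            raise ValueError(f"{v} does not fit in {bit_width} bits")
        acc = (acc << bit_width) | v  # append with no gap between values
    pad = (-total_bits) % 8           # zero bits needed to fill the last byte
    acc <<= pad
    return acc.to_bytes((total_bits + pad) // 8, "big")


def unpack_msb_first(data, bit_width, count):
    """Inverse of pack_msb_first; the trailing padding bits are ignored."""
    acc = int.from_bytes(data, "big")
    total_bits = len(data) * 8
    mask = (1 << bit_width) - 1
    return [
        (acc >> (total_bits - (i + 1) * bit_width)) & mask
        for i in range(count)
    ]


# Three 3-bit values occupy 9 bits, so the second byte carries 7 padding zeros.
packed = pack_msb_first([1, 2, 3], 3)
```

Here `packed` is two bytes long: the first byte holds `001 010 01` and the second holds the remaining `1` followed by seven zero bits.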

Minor change.


mapleFU commented 1 year ago

Would you mind first creating an issue like https://issues.apache.org/jira/browse/PARQUET-2299, or using the MINOR prefix?

mapleFU commented 1 year ago

> There is no padding between values (except for the last byte, which is padded with 0s).

This change looks good to me.

> Dictionary page format: the entries in the dictionary - in dictionary order - using the plain encoding.

Does this mean the value in the data page is the same as the position in the dictionary page? 🤔

Also cc @wgtmac @gszadovszky

letian-jiang commented 1 year ago

> Would you mind first creating an issue like https://issues.apache.org/jira/browse/PARQUET-2299, or using the MINOR prefix?

I will create a related issue once my JIRA account request is approved.

> Does this mean the value in the data page is the same as the position in the dictionary page? 🤔

I think so. The data page contains dictionary codes (i.e. the index of each entry in the dictionary page).
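To make the relationship concrete, here is a small sketch, assuming a writer that assigns dictionary indices in first-occurrence order (one possible choice, not something the format requires). The function names and sample column are hypothetical, and in a real file the indices would be further encoded rather than stored as a plain list.

```python
def dictionary_encode(values):
    """Return (dictionary, indices); indices are positions in the dictionary."""
    dictionary = []   # entries in the order they were first seen
    positions = {}    # value -> index in dictionary
    indices = []      # what the data page conceptually stores
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Look each index up in the dictionary to recover the original values."""
    return [dictionary[i] for i in indices]


# Hypothetical column values.
column = ["pear", "apple", "pear", "fig", "apple"]
dictionary, indices = dictionary_encode(column)
```

With this input the dictionary is `["pear", "apple", "fig"]` and the indices are `[0, 1, 0, 2, 1]`, so each value in the data page is exactly the position of its entry in the dictionary page.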

letian-jiang commented 12 months ago

Made some updates. Please take another look. @mapleFU @gszadovszky @JFinis