apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

PARQUET-2218: [Format] Clarify CRC computation #188

Closed pitrou closed 1 year ago

pitrou commented 1 year ago

When trying to implement CRC computation in Parquet C++, we found the wording to be ambiguous.

Clarify that the CRC is computed over the exact binary serialization (replacing a long-winded and confusing elaboration about the v1 and v2 data page layouts).

Also, clarify that CRC computation can apply to all page kinds, not only data pages (for reference, parquet-mr currently supports checksumming v1 data pages as well as dictionary pages).

Also, see discussion on https://github.com/apache/parquet-format/pull/126#issuecomment-1348081137 and below.
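
To illustrate the "exact binary serialization" wording, here is a minimal Python sketch, not taken from the spec or from any Parquet implementation. It assumes the checksum is the standard CRC-32 (the gzip/zlib polynomial) taken over the page bytes exactly as they are written after the page header, and that the result is stored in the Thrift `i32` `crc` field of the page header. The helper names `page_crc32` and `to_thrift_i32` are hypothetical.

```python
import zlib


def page_crc32(serialized_page: bytes) -> int:
    """Return the CRC-32 (zlib/gzip polynomial) of the page bytes as written.

    Assumption: `serialized_page` is the exact byte sequence that follows the
    page header on disk (i.e. after any compression), for any page kind.
    """
    return zlib.crc32(serialized_page) & 0xFFFFFFFF


def to_thrift_i32(crc: int) -> int:
    """Reinterpret an unsigned 32-bit CRC as a signed 32-bit integer.

    Thrift has no unsigned types, so a writer storing the checksum in an
    i32 field needs this kind of reinterpretation.
    """
    return crc - (1 << 32) if crc >= (1 << 31) else crc


def verify_page(serialized_page: bytes, stored_crc_i32: int) -> bool:
    """Check a stored (signed) CRC against the bytes actually read."""
    return to_thrift_i32(page_crc32(serialized_page)) == stored_crc_i32


if __name__ == "__main__":
    payload = b"123456789"  # stand-in for serialized page bytes
    crc = page_crc32(payload)
    print(hex(crc), to_thrift_i32(crc), verify_page(payload, to_thrift_i32(crc)))
```

Because the input is just the bytes as written, the same helper would apply unchanged to v1 data pages, v2 data pages, and dictionary pages, which is the point of dropping the per-layout description.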

pitrou commented 1 year ago

@bbraams @gszadovszky @mapleFU thoughts?

mapleFU commented 1 year ago

The change looks good to me! Thanks a lot!