apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

PARQUET-2473: Clarify records can not be split across v2 pages or PageIndex #244

Closed: alamb closed this 1 month ago

alamb commented 1 month ago

This was sparked by a mailing list discussion: https://lists.apache.org/thread/rd8twnvg4bg3558r507rzpxckcxt5wdn

Several implementors of the Parquet spec have been confused by this point.

Notes

There seems to be clear consensus that a record can't be split across pages when V2 data pages are used or when a page index is present, so let's make that explicit in the spec to avoid future confusion.

There also seemed to be consensus that Row Groups must start on record boundaries, and that the existing spec language was clear on this point, so I did not propose any changes to that language.
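To make the rule concrete, here is a minimal sketch in Python of what "a record can't be split across pages" means in terms of repetition levels. It assumes the usual Parquet convention that a repetition level of 0 marks the first value of a new record, so a page respects the rule when its first repetition level is 0. The function names `split_into_pages` and `pages_start_on_record_boundaries` and the `max_values` parameter are hypothetical and are not part of any Parquet library.

```python
from typing import List, Sequence


def split_into_pages(rep_levels: Sequence[int], max_values: int) -> List[List[int]]:
    """Pack repetition levels into pages without splitting a record.

    Hypothetical helper: a repetition level of 0 is assumed to start a new
    record. A record longer than max_values gets an oversized page of its own.
    """
    # Group the values into records first.
    records: List[List[int]] = []
    for level in rep_levels:
        if level == 0 or not records:
            records.append([])
        records[-1].append(level)

    # Greedily pack whole records into pages.
    pages: List[List[int]] = []
    for record in records:
        if pages and len(pages[-1]) + len(record) <= max_values:
            pages[-1].extend(record)
        else:
            pages.append(list(record))
    return pages


def pages_start_on_record_boundaries(pages: Sequence[Sequence[int]]) -> bool:
    """Return True if every non-empty page begins with repetition level 0."""
    return all(levels[0] == 0 for levels in pages if len(levels) > 0)


# Four records: [0,1,1], [0], [0,1], [0,1,1,1]
rep_levels = [0, 1, 1, 0, 0, 1, 0, 1, 1, 1]

# Record-aware packing keeps each record inside a single page.
good_pages = split_into_pages(rep_levels, max_values=4)

# Naive fixed-size chunking can cut a record in half.
naive_pages = [rep_levels[i:i + 4] for i in range(0, len(rep_levels), 4)]
```

With this input, the record-aware pages all begin with repetition level 0, while the naive chunking produces a final page that begins mid-record, which is the layout the clarified spec language rules out for V2 pages and for files with a page index.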

Jira

https://issues.apache.org/jira/browse/PARQUET-2473

Documentation

This PR is only a clarification

alamb commented 1 month ago

Updated to clarify that records can't be split across pages when an OffsetIndex is present.

alamb commented 1 month ago

I think this PR is now ready to go, @wgtmac. Is there anything else we are waiting on?

alamb commented 1 month ago

> Sorry for missing this. I'll merge it now.

No worries -- thank you!

I'll file a JIRA shortly to make the spec use "row" terminology consistently.

alamb commented 1 month ago

Update: here is a PR to consistently use the "row" terminology: https://github.com/apache/parquet-format/pull/256