alamb closed 1 month ago
Updated to clarify that records can't repeat when an OffsetIndex
is present
I think this PR is now ready to go, @wgtmac. Is there anything else we are waiting on?
Sorry for missing this. I'll merge it now.
No worries -- thank you!
I'll file a JIRA shortly to improve the spec to use row terminology
Update: here is a PR to consistently use the "row" terminology: https://github.com/apache/parquet-format/pull/256
This was sparked by a mailing list discussion: https://lists.apache.org/thread/rd8twnvg4bg3558r507rzpxckcxt5wdn
Several implementors of the Parquet spec have been confused by this point.
Notes
There seems to be clear consensus that records can't span pages (that is, every page must start on a record boundary) in V2 data pages or when a page index is present, so let's make that clear in the spec to avoid future confusion.
There also seemed to be consensus that Row Groups must start on record boundaries, and that the existing spec language is already clear on this point, so I did not propose any changes to that language.
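For implementors who want to see the constraint concretely, here is a minimal sketch. It is not taken from the spec or from any Parquet implementation. It relies only on the fact that a repetition level of 0 marks the start of a new record; the function name and the representation of pages as plain lists of repetition levels are assumptions made for illustration.

```python
from typing import Sequence


def pages_start_on_record_boundaries(pages: Sequence[Sequence[int]]) -> bool:
    """Return True if every non-empty page begins with repetition level 0.

    `pages` is a hypothetical representation: one sequence of repetition
    levels per data page of a single column chunk, in file order.
    """
    return all(page[0] == 0 for page in pages if len(page) > 0)


# Three records of a repeated column: [a, b], [c], [d, e, f]
# Repetition levels for the whole chunk: 0 1 | 0 | 0 1 1

# Split on a record boundary: each page starts a new record.
ok_split = [[0, 1], [0, 0, 1, 1]]

# Split in the middle of the third record: the second page starts
# with repetition level 1, so that record spans two pages.
bad_split = [[0, 1, 0, 0], [1, 1]]

print(pages_start_on_record_boundaries(ok_split))   # True
print(pages_start_on_record_boundaries(bad_split))  # False
```

Under the clarified wording, a layout like `bad_split` would not be allowed for V2 data pages or for any column chunk that has a page index.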
Jira
https://issues.apache.org/jira/browse/PARQUET-2473
Documentation
This PR is only a clarification