apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

PARQUET-2480: Clarify what "page index" means in Parquet.thrift #245

Closed: alamb closed this 1 month ago

alamb commented 1 month ago

https://issues.apache.org/jira/browse/PARQUET-2480

See the proposed update as rendered markdown: https://github.com/alamb/parquet-format/blob/alamb/page-index/PageIndex.md

I have always found it very confusing when people refer to the Parquet "page index" (for example, in this message).

However, the term "page index" is not used in the parquet.thrift file itself; it appears only as the name of the file that describes the ColumnIndex and OffsetIndex, PageIndex.md.

This means I can't search for "page index" in the spec and find out what people are talking about.

Proposed Clarifications

  1. Update the introductory paragraph of PageIndex.md to use the term "page index" explicitly and explain that it is encoded as ColumnIndex and OffsetIndex
  2. Update the description of ColumnIndex and OffsetIndex to include the term "page index" and clarify what those structures are used for.
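
For readers unfamiliar with the two structures, here is a rough sketch of how they fit together. It is illustrative only and is not part of this PR. The field names follow parquet.thrift, but the sketch assumes integer min/max values (the real fields hold encoded binary values) and omits fields such as boundary_order and null_counts. The helper `pages_matching` is a hypothetical name, not an API of any Parquet library.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageLocation:
    offset: int                # byte offset of the page in the file
    compressed_page_size: int  # page size in bytes, including its header
    first_row_index: int       # index of the page's first row within the row group


@dataclass
class OffsetIndex:
    # One entry per data page of a column chunk: where each page lives.
    page_locations: List[PageLocation]


@dataclass
class ColumnIndex:
    # One entry per data page of a column chunk: what each page contains.
    null_pages: List[bool]  # True if the page holds only nulls
    min_values: List[int]   # simplified: the real field is a list of binary values
    max_values: List[int]   # simplified: the real field is a list of binary values


def pages_matching(column_index: ColumnIndex,
                   offset_index: OffsetIndex,
                   value: int) -> List[PageLocation]:
    """Return locations of pages whose [min, max] range may contain `value`."""
    hits = []
    for i, location in enumerate(offset_index.page_locations):
        if column_index.null_pages[i]:
            continue  # an all-null page cannot match an equality predicate
        if column_index.min_values[i] <= value <= column_index.max_values[i]:
            hits.append(location)
    return hits


# Made-up statistics for three pages of one column chunk.
column_index = ColumnIndex(
    null_pages=[False, True, False],
    min_values=[1, 0, 50],
    max_values=[40, 0, 90],
)
offset_index = OffsetIndex(
    page_locations=[
        PageLocation(offset=4, compressed_page_size=100, first_row_index=0),
        PageLocation(offset=104, compressed_page_size=20, first_row_index=1000),
        PageLocation(offset=124, compressed_page_size=110, first_row_index=2000),
    ]
)

# Only the third page can contain 60, so a reader needs to fetch just that page.
candidates = pages_matching(column_index, offset_index, 60)
```

Together the two structures make up the "page index": ColumnIndex decides which pages can be skipped, and OffsetIndex says where the remaining pages are and which rows they cover.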

Documentation

This PR makes no spec changes, only clarifications.

mapleFU commented 1 month ago

I also think that PageIndex currently means "offset index and column index". They are both page-level indexes.

alamb commented 1 month ago

Thank you for the comments @mapleFU @tustvold and @wgtmac - I believe I have implemented your suggestions and I think the PR is much clearer because of it.

wgtmac commented 1 month ago

@gszadovszky @pitrou @emkornfield @julienledem Would you like to take a look?