apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.63k stars 804 forks source link

Consistent naming for Parquet page index structures #6097

Open etseidl opened 4 months ago

etseidl commented 4 months ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do. The Parquet Page Index consists of two thrift structures: a ColumnIndex containing per-page statistics useful for filtering, and an OffsetIndex used to locate individual pages in the Parquet file. Size statistics were recently added to these structures (PARQUET-2261) and these are currently being added to arrow-rs (#5022). When #5022 is completed, the Parquet page index will be represented in the parquet crate by the Index enum for the column index, and a new OffsetIndexMetaData struct for the offset index (replacing the old offset index which simply returned the page_locations member of the thrift OffsetIndex). Somewhat confusingly, the Index encapsulates a vector of PageIndex structs, which contain per-page column index information.

Further, ParquetMetaData packages the page indexes for each row group into type aliased 2-D arrays: ParquetColumnIndex = Vec<Vec<Index>> and ParquetOffsetIndex = Vec<Vec<OffsetIndexMetaData>>.

Describe the solution you'd like It would be desirable to make the naming for all of these structures a bit more consistent, so their intended use is a bit more obvious. One possible mapping could be:

  1. Rename Index to ParquetColumnIndex (and perhaps PageIndex -> ColumnIndexForPage).
  2. Rename OffsetIndexMetaData to ParquetOffsetIndex.
  3. The type aliases could be made plural, so the current ParquetColumnIndex -> ParquetColumnIndexes and ParquetOffsetIndex -> ParquetOffsetIndexes.

Describe alternatives you've considered Maintain the status quo, or some other better names.

Additional context See the comments here and here for more context.

etseidl commented 4 months ago

cc @alamb

alamb commented 4 months ago

Thank you for writing this ticket @etseidl. I think the naming of the various metadata structures in the parquet-rs crate to be quite confusing

I believe the core of my confusion stems from the two sets of structures:

The current naming of the structures does not make the distinction clear and this I think the boundary gets muddled

I think the best thing we can do is to pick a naming convention that makes this distinction clear

alamb commented 4 months ago

What do you think about creating a Rust struct PageIndex that contains all page level metadata in it (along with a nicer API) rather than exactly mirroring the multiple thrift structures?

This would be more in line with how the existing RowGroupMetaDataand ColumnChunkMetaData represent the contents of thrift structures.

So that might look like

  1. Deprecate ParquetOffsetIndex and ParquetColumnIndex indvidually
  2. Add a new struct like PageIndexMetadata to access various parts of the PageIndex
  3. Add methods like PageIndexMetadata::offsets() and PageIndexMetadata::indexes() that returns the page offsets and indexes
  4. Add PageIndexMetadata::unencoded_byte_sizes() to access the byte sizes

Maybe we could do something similar to wrap the BloomFilters so ParquetMetadata also had BloomFilterMetada

🤔

cc @sunchao @adriangb @tustvold @XiangpengHao @Ted-Jiang @liukun4515 @viirya in case you have thoughts about the metadata structures

etseidl commented 4 months ago

What do you think about creating a Rust struct PageIndex that contains all page level metadata in it (along with a nicer API) rather than exactly mirroring the multiple thrift structures?

I'm in favor. I like how the current Index takes the struct-of-arrays ColumnIndex and turns it into an array-of-structs so the info for each page is co-located. I could see doing something similar and fusing PageLocation and unencoded_byte_array_data_bytes into a single new OffsetIndex struct.

sunchao commented 4 months ago

+1. The current state of Index and OffsetIndexMetaData does seem inconsistent. I like the idea of having a PageIndexMetadata struct which holds all information related to page indexes.

adriangb commented 4 months ago

I don't have a strong opinion, I haven't spent as much time as the rest of you with these structs, but I do find them confusing and am in favor of changes to make it easier to understand 😃

Some top-level documentation of how these things interact could also be helpful.

etseidl commented 4 months ago

Some top-level documentation of how these things interact could also be helpful.

Perhaps we can steal the excellent description here. 😉

alamb commented 4 months ago

I could also commission some more monodraw diagrams 😆