Open etseidl opened 4 months ago
cc @alamb
Thank you for writing this ticket @etseidl. I think the naming of the various metadata structures in the parquet-rs crate to be quite confusing
I believe the core of my confusion stems from the two sets of structures:
format::metadata
file::metadata
and map more directly to parquet structuresThe current naming of the structures does not make the distinction clear and this I think the boundary gets muddled
I think the best thing we can do is to pick a naming convention that makes this distinction clear
What do you think about creating a Rust struct PageIndex
that contains all page level metadata in it (along with a nicer API) rather than exactly mirroring the multiple thrift structures?
This would be more in line with how the existing RowGroupMetaDataand ColumnChunkMetaData represent the contents of thrift structures.
So that might look like
ParquetOffsetIndex
and ParquetColumnIndex
indviduallyPageIndexMetadata
to access various parts of the PageIndexPageIndexMetadata::offsets()
and PageIndexMetadata::indexes()
that returns the page offsets and indexesPageIndexMetadata::unencoded_byte_sizes()
to access the byte sizesMaybe we could do something similar to wrap the BloomFilter
s so ParquetMetadata
also had BloomFilterMetada
🤔
cc @sunchao @adriangb @tustvold @XiangpengHao @Ted-Jiang @liukun4515 @viirya in case you have thoughts about the metadata structures
What do you think about creating a Rust struct
PageIndex
that contains all page level metadata in it (along with a nicer API) rather than exactly mirroring the multiple thrift structures?
I'm in favor. I like how the current Index
takes the struct-of-arrays ColumnIndex
and turns it into an array-of-structs so the info for each page is co-located. I could see doing something similar and fusing PageLocation
and unencoded_byte_array_data_bytes
into a single new OffsetIndex
struct.
+1. The current state of Index
and OffsetIndexMetaData
does seem inconsistent. I like the idea of having a PageIndexMetadata
struct which holds all information related to page indexes.
I don't have a strong opinion, I haven't spent as much time as the rest of you with these structs, but I do find them confusing and am in favor of changes to make it easier to understand 😃
Some top-level documentation of how these things interact could also be helpful.
Some top-level documentation of how these things interact could also be helpful.
Perhaps we can steal the excellent description here. 😉
I could also commission some more monodraw diagrams 😆
Is your feature request related to a problem or challenge? Please describe what you are trying to do. The Parquet Page Index consists of two thrift structures: a
ColumnIndex
containing per-page statistics useful for filtering, and anOffsetIndex
used to locate individual pages in the Parquet file. Size statistics were recently added to these structures (PARQUET-2261) and these are currently being added to arrow-rs (#5022). When #5022 is completed, the Parquet page index will be represented in theparquet
crate by theIndex
enum for the column index, and a newOffsetIndexMetaData
struct for the offset index (replacing the old offset index which simply returned thepage_locations
member of the thriftOffsetIndex
). Somewhat confusingly, theIndex
encapsulates a vector ofPageIndex
structs, which contain per-page column index information.Further,
ParquetMetaData
packages the page indexes for each row group into type aliased 2-D arrays:ParquetColumnIndex = Vec<Vec<Index>>
andParquetOffsetIndex = Vec<Vec<OffsetIndexMetaData>>
.Describe the solution you'd like It would be desirable to make the naming for all of these structures a bit more consistent, so their intended use is a bit more obvious. One possible mapping could be:
Index
toParquetColumnIndex
(and perhapsPageIndex
->ColumnIndexForPage
).OffsetIndexMetaData
toParquetOffsetIndex
.ParquetColumnIndex
->ParquetColumnIndexes
andParquetOffsetIndex
->ParquetOffsetIndexes
.Describe alternatives you've considered Maintain the status quo, or some other better names.
Additional context See the comments here and here for more context.