Open alamb opened 1 month ago
I can see two paths forward here. In the near term we could add convenience functions to ParquetMetaDataReader/Writer to allow fetching/writing the bloom filters given an existing ParquetMetaData.
Longer term, I think we'd want to move the bloom filters into the ParquetMetaData struct to enable @alamb's example above. This could be part of the larger refactoring of the metadata (#6129, #6097).
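To make the near-term option concrete, here is a minimal sketch of what "fetch the bloom filter given existing metadata" could look like. It uses stand-in types only: `ColumnChunkSketch` and `read_bloom_filter_bytes` are hypothetical names for illustration, not part of the parquet crate, and the file is modelled as an in-memory byte slice.

```rust
/// Hypothetical stand-in for already-parsed per-column metadata
/// (an assumption for illustration, not the parquet crate's type).
#[derive(Debug, Clone, Default)]
pub struct ColumnChunkSketch {
    /// Byte offset of the bloom filter in the file, if the column has one.
    pub bloom_filter_offset: Option<usize>,
    /// Length in bytes of the bloom filter, if recorded.
    pub bloom_filter_length: Option<usize>,
}

/// Hypothetical convenience function: given existing metadata for one column,
/// return the raw bloom filter bytes from `file` on demand.
/// Returns `None` if the column has no filter or the range is out of bounds.
pub fn read_bloom_filter_bytes<'a>(
    file: &'a [u8],
    column: &ColumnChunkSketch,
) -> Option<&'a [u8]> {
    let start = column.bloom_filter_offset?;
    let len = column.bloom_filter_length?;
    // Guard against overflow before slicing.
    let end = start.checked_add(len)?;
    file.get(start..end)
}
```

Because the filter is fetched per column and only when asked for, nothing is held in memory beyond what the caller keeps.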
I agree with this breakdown -- I think the first (move the code to read/write bloom filters into ParquetMetaDataReader / ParquetMetaDataWriter) would likely be a step towards the second (and would consolidate the code in a single location), so if someone is interested in trying it that would be neat.
As for a major metadata revamp, that is a bit more disruptive in my mind as it would cause significant downstream churn. 🤔
@progval noted in #6505 (comment) that BloomFilters are similar to the PageIndex, but are not currently read/written by the ParquetMetaDataReader
Not that similar though, because Bloom Filters are considerably larger and won't fit in RAM for a large dataset, so they shouldn't be cached (at least not by default).
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Parquet now has the wonderful ParquetMetaDataReader structure from @adriangb and @etseidl. This handles reading the footer metadata as well as the page indexes.
@progval noted in https://github.com/apache/arrow-rs/pull/6505#discussion_r1787879839 that BloomFilters are similar to the PageIndex, but are not currently read/written by the ParquetMetaDataReader.
Describe the solution you'd like
I would like to be able to configure the ParquetMetaDataReader (and writer) to read BloomFilters as well.
Describe alternatives you've considered
This might look something like
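A minimal sketch of such an opt-in configuration, written against stand-in types rather than the real parquet crate. `MetaDataReaderSketch`, `MetaDataSketch`, `with_bloom_filters`, `parse`, and `bloom_filter` are all hypothetical names assumed for illustration. The flag defaults to off, reflecting the concern in this thread that bloom filters can be large and should not be loaded by default.

```rust
/// Hypothetical stand-in for parsed metadata (not the parquet crate's type).
#[derive(Debug, Default)]
pub struct MetaDataSketch {
    /// One entry per column; `None` when filters were not requested or are absent.
    bloom_filters: Vec<Option<Vec<u8>>>,
}

impl MetaDataSketch {
    /// Bloom filter bytes for `column`, if they were loaded.
    pub fn bloom_filter(&self, column: usize) -> Option<&[u8]> {
        self.bloom_filters.get(column)?.as_deref()
    }
}

/// Hypothetical reader with an opt-in bloom filter flag (off by default).
#[derive(Debug, Default)]
pub struct MetaDataReaderSketch {
    read_bloom_filters: bool,
}

impl MetaDataReaderSketch {
    pub fn new() -> Self {
        Self::default()
    }

    /// Builder-style switch; an assumed name, not an existing API.
    pub fn with_bloom_filters(mut self, enabled: bool) -> Self {
        self.read_bloom_filters = enabled;
        self
    }

    /// Pretend to parse a file whose columns carry the given filters.
    /// Filters are copied into the metadata only when the flag is set.
    pub fn parse(&self, filters_in_file: &[Option<Vec<u8>>]) -> MetaDataSketch {
        let bloom_filters = if self.read_bloom_filters {
            filters_in_file.to_vec()
        } else {
            vec![None; filters_in_file.len()]
        };
        MetaDataSketch { bloom_filters }
    }
}
```

With this shape, `MetaDataReaderSketch::new().parse(..)` leaves every filter unloaded, while adding `.with_bloom_filters(true)` makes them available through `bloom_filter(column)`.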
Additional context