apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Consider adding BloomFilter reading support to `ParquetMetaDataReader` #6514

Open alamb opened 1 month ago

alamb commented 1 month ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Parquet now has the wonderful ParquetMetaDataReader structure from @adriangb and @etseidl

This handles reading the footer metadata as well as the page indexes.

@progval noted in https://github.com/apache/arrow-rs/pull/6505#discussion_r1787879839 that BloomFilters are similar to the PageIndex, but are not currently read/written by the ParquetMetaDataReader.

Describe the solution you'd like

I would like to be able to configure the ParquetMetaDataReader (and writer) to read BloomFilters as well.

Describe alternatives you've considered

This might look something like:

// read parquet metadata including bloom filters
let file = open_parquet_file("some_path.parquet");
let mut reader = ParquetMetaDataReader::new()
    .with_bloom_filters(true);
reader.try_parse(&file).unwrap();
let metadata = reader.finish().unwrap();
// Somehow get access to the bloom filters (not sure what that API would look like)

Additional context

etseidl commented 1 month ago

I can see two paths forward here. In the near term we could add convenience functions to ParquetMetaDataReader/Writer to allow fetching/writing the bloom filters given an existing ParquetMetaData.

Longer term, I think we'd want to move the bloom filters into the ParquetMetaData struct to enable @alamb's example above. This could be part of the larger refactoring of the metadata (#6129, #6097).
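A minimal sketch of what the near-term convenience function might look like, assuming the caller already has the bloom filter offset and length that the footer records for a column chunk. `BloomFilterLocation` and `read_bloom_filter_bytes` are hypothetical stand-ins for illustration, not existing parquet crate APIs, and the sketch only fetches the raw bytes rather than decoding the filter:

```rust
use std::io::{self, Read, Seek, SeekFrom};

/// Hypothetical stand-in for the bloom filter location stored in a column
/// chunk's footer metadata. This is NOT a type from the parquet crate.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct BloomFilterLocation {
    /// Byte offset of the bloom filter from the start of the file, if any.
    pub offset: Option<u64>,
    /// Length in bytes of the serialized bloom filter, if recorded.
    pub length: Option<u32>,
}

/// Fetch the raw bloom filter bytes for one column chunk on demand.
///
/// Returns `Ok(None)` when the metadata records no offset or no length.
/// (Assumption: a real implementation would fall back to parsing the
/// bloom filter header when the length is missing.)
pub fn read_bloom_filter_bytes<R: Read + Seek>(
    reader: &mut R,
    location: BloomFilterLocation,
) -> io::Result<Option<Vec<u8>>> {
    let (offset, length) = match (location.offset, location.length) {
        (Some(offset), Some(length)) => (offset, length as usize),
        _ => return Ok(None),
    };
    // Seek to the recorded offset and read exactly `length` bytes.
    reader.seek(SeekFrom::Start(offset))?;
    let mut buf = vec![0u8; length];
    reader.read_exact(&mut buf)?;
    Ok(Some(buf))
}
```

Because the function takes the location as an argument, it works against an already-parsed footer and reads nothing unless the caller asks for a specific column chunk.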

alamb commented 1 month ago

> I can see two paths forward here. In the near term we could add convenience functions to ParquetMetaDataReader/Writer to allow fetching/writing the bloom filters given an existing ParquetMetaData.
>
> Longer term, I think we'd want to move the bloom filters into the ParquetMetaData struct to enable @alamb's example above. This could be part of the larger refactoring of the metadata (#6129, #6097).

I agree with this breakdown -- I think the first (move the code to read/write bloom filters into ParquetMetaDataReader / ParquetMetaDataWriter) would likely be a step towards the second (and would consolidate the code in a single location), so if someone is interested in trying it that would be neat.

As for a major metadata revamp, that is a bit more disruptive in my mind as it would cause significant downstream churn. 🤔

progval commented 1 month ago

> @progval noted in #6505 (comment) that BloomFilters are similar to the PageIndex, but are not currently read/written by the ParquetMetaDataReader

Not that similar though, because Bloom Filters are considerably larger and won't fit in RAM for a large dataset, so they shouldn't be cached (at least not by default).
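One way to express "not cached by default" is a loader that fetches each filter on demand and only keeps it in memory when the caller opts in. The sketch below is a rough illustration under that assumption; `BloomFilterLoader`, `with_cache`, and the `fetch` callback are hypothetical names, not part of the parquet crate:

```rust
use std::collections::HashMap;

/// Hypothetical on-demand loader keyed by (row group, column) index.
/// Caching is off unless the caller opts in via `with_cache(true)`.
pub struct BloomFilterLoader<F>
where
    F: FnMut(usize, usize) -> Vec<u8>,
{
    /// Callback that reads the serialized filter for one column chunk.
    fetch: F,
    /// `None` means every `get` goes back to the underlying storage.
    cache: Option<HashMap<(usize, usize), Vec<u8>>>,
}

impl<F> BloomFilterLoader<F>
where
    F: FnMut(usize, usize) -> Vec<u8>,
{
    /// Create a loader with caching disabled (the default).
    pub fn new(fetch: F) -> Self {
        Self { fetch, cache: None }
    }

    /// Opt in to (or out of) keeping fetched filters in memory.
    pub fn with_cache(mut self, enabled: bool) -> Self {
        self.cache = if enabled { Some(HashMap::new()) } else { None };
        self
    }

    /// Return the filter bytes for one column chunk, fetching if needed.
    pub fn get(&mut self, row_group: usize, column: usize) -> Vec<u8> {
        let key = (row_group, column);
        if let Some(cache) = &self.cache {
            if let Some(hit) = cache.get(&key) {
                return hit.clone();
            }
        }
        let bytes = (self.fetch)(row_group, column);
        if let Some(cache) = &mut self.cache {
            cache.insert(key, bytes.clone());
        }
        bytes
    }
}
```

With the default settings memory use stays bounded by the filters currently in flight, at the cost of re-reading a filter each time it is probed.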