Open alamb opened 4 days ago
This sounds quite a lot like https://docs.rs/parquet/latest/parquet/arrow/async_reader/struct.MetadataLoader.html ?
That is quite similar, thank you. Some differences might be that this would also offer a normal (non-async) API as well as an equivalent encoder.
Yes, I think the asyncness would be an important difference. Also, the existing APIs kind of want to load from an entire file. I suppose you could give it a "file" with just the footer and tell it to load just that range... but it feels a bit forced? Same with the asyncness. For my use case I could do some pointless async work (as in, make an async file-like thing that just points to a Vec<u8>), but in general unnecessary async work is not ideal. My general experience is that it's nice to decouple IO from encoding / decoding logic.
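To make the "decouple IO from decoding" point concrete, here is a minimal sketch of what a sans-IO first step could look like. It relies only on the Parquet footer layout (the file ends with a 4-byte little-endian metadata length followed by the magic bytes PAR1). The names metadata_range and FooterError are hypothetical and are not part of the parquet crate; the caller is assumed to have fetched the last 8 bytes however it likes (a Vec<u8>, a sync file read, or an async range request).

```rust
use std::ops::Range;

/// Length of the footer tail: 4-byte little-endian metadata length + 4-byte magic.
const FOOTER_SIZE: u64 = 8;
const MAGIC: [u8; 4] = *b"PAR1";

/// Hypothetical error type for this sketch (not from the parquet crate).
#[derive(Debug, PartialEq)]
enum FooterError {
    BadMagic,
    OutOfBounds,
}

/// Hypothetical helper: given the total file length and the final 8 bytes,
/// compute the byte range that holds the encoded metadata.
/// It performs no IO, so it works equally well from sync or async callers.
fn metadata_range(file_len: u64, tail: [u8; 8]) -> Result<Range<u64>, FooterError> {
    if tail[4..] != MAGIC[..] {
        return Err(FooterError::BadMagic);
    }
    let len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as u64;
    // The metadata sits immediately before the 8-byte tail.
    let end = file_len
        .checked_sub(FOOTER_SIZE)
        .ok_or(FooterError::OutOfBounds)?;
    let start = end.checked_sub(len).ok_or(FooterError::OutOfBounds)?;
    Ok(start..end)
}
```

With this split, the caller decides how to fetch the returned range and then hands the bytes to whatever decoder is available, so no fake async file is needed for an in-memory buffer.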
Yes I agree this would be ideal. Having two things:
MetadataLoader
seems to do)
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
There are several cases where we would like to have more control over the encoding/decoding of Parquet metadata:
At the time of writing, the currently exposed API, decode_metadata, offers no way to exercise finer-grained control.
Describe the solution you'd like
I would like an API that allows more fine-grained control over reading/writing metadata and that permits adding additional features over time in a backwards-compatible way.
Describe alternatives you've considered
Here is one potential idea: create Encoder/Decoder structs that can encode and decode the metadata along with various configuration options.
Ideally these structs would be integrated into the rest of the crate, e.g. used in SerializedFileWriter?
Similarly for decoding
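As a rough illustration of the Encoder/Decoder idea, the sketch below shows how builder-style options could allow new settings to be added later without breaking existing callers. Everything here is an assumption made for illustration: Metadata is a tiny stand-in for the real metadata structure, MetadataEncoder, MetadataDecoder and with_skip_key_value are hypothetical names, and the wire format is a toy length-prefixed layout rather than the Thrift encoding that Parquet actually uses.

```rust
/// Stand-in for the real metadata structure, which is far richer.
#[derive(Debug, Clone, PartialEq)]
struct Metadata {
    num_rows: u64,
    key_value: Vec<(String, String)>,
}

/// Hypothetical encoder. Options are set builder-style, so new ones can be
/// added over time without changing existing call sites.
#[derive(Debug, Default)]
struct MetadataEncoder {
    skip_key_value: bool,
}

impl MetadataEncoder {
    fn new() -> Self {
        Self::default()
    }

    fn with_skip_key_value(mut self, skip: bool) -> Self {
        self.skip_key_value = skip;
        self
    }

    /// Toy format (NOT Thrift): u64 row count, u32 pair count,
    /// then length-prefixed UTF-8 strings.
    fn encode(&self, md: &Metadata) -> Vec<u8> {
        let mut out = md.num_rows.to_le_bytes().to_vec();
        let pairs: &[(String, String)] = if self.skip_key_value { &[] } else { &md.key_value };
        out.extend_from_slice(&(pairs.len() as u32).to_le_bytes());
        for (k, v) in pairs {
            for s in [k, v] {
                out.extend_from_slice(&(s.len() as u32).to_le_bytes());
                out.extend_from_slice(s.as_bytes());
            }
        }
        out
    }
}

/// Hypothetical decoder with a matching option. It only ever sees bytes,
/// so it is independent of how those bytes were read.
#[derive(Debug, Default)]
struct MetadataDecoder {
    skip_key_value: bool,
}

impl MetadataDecoder {
    fn new() -> Self {
        Self::default()
    }

    fn with_skip_key_value(mut self, skip: bool) -> Self {
        self.skip_key_value = skip;
        self
    }

    /// Returns None on truncated or malformed input.
    fn decode(&self, buf: &[u8]) -> Option<Metadata> {
        let mut pos = 0usize;
        let mut rows = [0u8; 8];
        rows.copy_from_slice(take(buf, &mut pos, 8)?);
        let mut key_value = Vec::new();
        if !self.skip_key_value {
            for _ in 0..read_u32(buf, &mut pos)? {
                let k = read_string(buf, &mut pos)?;
                let v = read_string(buf, &mut pos)?;
                key_value.push((k, v));
            }
        }
        Some(Metadata { num_rows: u64::from_le_bytes(rows), key_value })
    }
}

/// Bounds-checked read of `n` bytes, advancing `pos` on success.
fn take<'a>(buf: &'a [u8], pos: &mut usize, n: usize) -> Option<&'a [u8]> {
    let end = (*pos).checked_add(n)?;
    let bytes = buf.get(*pos..end)?;
    *pos = end;
    Some(bytes)
}

fn read_u32(buf: &[u8], pos: &mut usize) -> Option<u32> {
    let mut raw = [0u8; 4];
    raw.copy_from_slice(take(buf, pos, 4)?);
    Some(u32::from_le_bytes(raw))
}

fn read_string(buf: &[u8], pos: &mut usize) -> Option<String> {
    let len = read_u32(buf, pos)? as usize;
    String::from_utf8(take(buf, pos, len)?.to_vec()).ok()
}
```

The point of the sketch is the shape of the API rather than the format: a caller who wants defaults writes MetadataDecoder::new().decode(bytes), and a caller who wants to skip part of the work opts in through a with_ method.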
Additional context
This ticket is based on the discussion with @adriangb here https://github.com/apache/arrow-rs/discussions/5988
There are a bunch of discussions on metadata speed here https://github.com/apache/arrow-rs/issues/5770
Here is a PR with a proposed 'encode_metadata' function: https://github.com/apache/arrow-rs/pull/6000