Open trungda opened 3 months ago
Seems reasonable to me -- the key would be to add the API and document it sufficiently so it isn't hard
I believe this idea is similar to the APIs provided in https://github.com/jorgecarleitao/parquet2 (now unmaintained) which might be interesting to look at for inspiration
Thanks! I also learnt the hard way that `data_page_offset` doesn't necessarily point to the first data page (it could actually point to the dictionary page 🤷). The actual source of truth is in the page index.
> I also learnt the hard way that `data_page_offset` doesn't necessarily point to the first data page

Is that a known bug in some specific parquet writer? It would be really unexpected, since there is a separate `dictionary_page_offset`, which I think is guaranteed to point to before the first data page.
> `dictionary_page_offset`

I think there was a bug in Java's parquet-mr implementation a while back: https://stackoverflow.com/questions/55225108/why-is-dictionary-page-offset-0-for-plain-dictionary-encoding
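To make the edge case concrete, here is a minimal sketch of how a reader could defensively pick the start of a column chunk from the two offsets. The `ChunkOffsets` struct and `chunk_start` function are hypothetical names made up for illustration, not the actual API of any parquet crate, and the rule itself is an assumption based on the behaviour described in this thread.

```rust
/// Hypothetical, simplified view of the two offsets stored in the
/// column chunk metadata (not a real parquet crate type).
struct ChunkOffsets {
    /// May be absent, or 0 in files written by the buggy writer discussed above.
    dictionary_page_offset: Option<i64>,
    /// May actually point at the dictionary page in such files.
    data_page_offset: i64,
}

/// Returns the offset of the first page in the chunk.
/// Assumption: the dictionary offset is trusted only when it is positive
/// and lies before the data page offset; otherwise we fall back to
/// `data_page_offset`, which then marks the first page of the chunk.
fn chunk_start(o: &ChunkOffsets) -> i64 {
    match o.dictionary_page_offset {
        Some(d) if d > 0 && d < o.data_page_offset => d,
        _ => o.data_page_offset,
    }
}
```

With this rule, a file that reports `dictionary_page_offset = 0` is read starting at `data_page_offset`, which in that situation is where the dictionary page actually lives.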
At least for me, this info wasn't easy to find and it was very confusing. Maybe it's worth documenting this somewhere in the parquet writer, at least for the Rust implementation? @alamb
Thank you, that is good to know! I'm continuously surprised how many of these edge cases lurk in such a standardized format.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We want the ability to read an arbitrary page in a column chunk. `SerializedPageReader` has almost everything we need. The only catch is an implicit constraint: we have to pass the first data page in the `page_locations` argument of the `new_with_properties` constructor. This is fine, but it makes working with this reader less ergonomic (you have to skip a page to get to the page you actually want to read).
Describe the solution you'd like
Remove this constraint on the `page_locations` argument. Internally, we can read the page index (if available) to infer the dictionary page size.
Describe alternatives you've considered
Additional context
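A rough sketch of the idea, assuming the page index is available. `PageLoc` is a hypothetical stand-in that mirrors the three fields of the Parquet `PageLocation` struct; `dictionary_range` and `page_for_row` are illustrative helper names, not existing functions of the reader.

```rust
/// Hypothetical stand-in mirroring the fields of Parquet's `PageLocation`.
#[derive(Debug, Clone, PartialEq)]
struct PageLoc {
    offset: i64,
    compressed_page_size: i32,
    first_row_index: i64,
}

/// Infers the dictionary page byte range as (start, length).
/// Assumption: the page index lists data pages only, so any gap between
/// the start of the column chunk and the first listed page is the
/// dictionary page.
fn dictionary_range(chunk_start: i64, locations: &[PageLoc]) -> Option<(i64, i64)> {
    let first = locations.first()?;
    if first.offset > chunk_start {
        Some((chunk_start, first.offset - chunk_start))
    } else {
        None
    }
}

/// Returns the index of the data page that contains `row`.
/// Rows past the end of the chunk map to the last page, because the
/// page locations alone do not record the total row count.
fn page_for_row(locations: &[PageLoc], row: i64) -> Option<usize> {
    if row < 0 {
        return None;
    }
    // Number of pages whose first row is at or before `row`.
    let n = locations.partition_point(|p| p.first_row_index <= row);
    n.checked_sub(1)
}
```

With helpers like these, a caller could pass only the page it wants, and the reader could still locate the dictionary page on its own when one exists.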