apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

DRAFT: Alternative Strawman proposal for a new V3 footer format in Parquet #250

Open emkornfield opened 1 month ago

emkornfield commented 1 month ago

As a point of discussion, this is a slightly different version showing how column metadata could be decoupled from FileMetadata, compared to https://github.com/apache/parquet-format/pull/242

In particular this takes a slightly different approach:

  1. It introduces a new random-access encoding for Parquet to store serialized data instead of relying on a one-off index scheme based in the Thrift structure. This approach gives implementations flexibility to further balance size vs. compute trade-offs and can potentially make use of any further encoding improvements in the future. Two downsides of this approach are that it requires a little more work up-front and has slightly more overhead than doing this directly as Thrift structures.
  2. It places the serialized data page completely outside of the Thrift metadata and instead provides an offset within the footer. This is mostly a micro-optimization (likely not critical) to allow Parquet implementors to avoid unnecessary copies of string data if the Thrift library they use does not allow avoiding them. There is no reason the pages could not be inlined as a "binary" field in FileMetadata, as is done in https://github.com/apache/parquet-format/pull/242
  3. Moves a few other fields out of FileMetadata into a metadata page and raises discussion points on others.
  4. Re-uses existing Thrift objects in an attempt to make the transition easier for implementors.
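
The random-access layout described in point 1 could be sketched roughly as follows. This is purely illustrative Python, not from the proposal: `build_metadata_region` and `read_column` are hypothetical names, and real serialized column metadata would stand in for the placeholder byte strings.

```python
# Hypothetical sketch: each column's metadata is serialized independently,
# and the footer stores only an offset index into the concatenated region,
# so a reader can decode a single column without touching the rest.

def build_metadata_region(serialized_columns):
    """Concatenate per-column serialized metadata and record offsets."""
    offsets, blob = [], bytearray()
    for data in serialized_columns:
        offsets.append(len(blob))
        blob += data
    offsets.append(len(blob))  # end sentinel so lengths can be derived
    return bytes(blob), offsets

def read_column(blob, offsets, i):
    """Slice out one column's serialized metadata for lazy decoding."""
    return blob[offsets[i]:offsets[i + 1]]

region, index = build_metadata_region([b"col0-meta", b"col1-meta", b"col2-meta"])
assert read_column(region, index, 1) == b"col1-meta"
```

The offset index is what buys the size/compute flexibility mentioned above: the region itself can be compressed or re-encoded without changing how readers address individual columns.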

Things it does not do:

  1. Enumerate all fields that should be deprecated. https://github.com/apache/parquet-format/pull/242 is a good start, and a final list can be consolidated once a general approach is chosen.
  2. Incorporate changes from https://github.com/apache/parquet-format/pull/248; these also likely make sense but can be incorporated into any final proposal.
  3. Micro-optimizations to separate scan use cases from filter-evaluation use cases (the ColumnChunk structure could potentially be further split apart to give finer-grained access to elements that are only needed in one case or the other).
emkornfield commented 1 month ago

I would prefer absolute file location rather than an offset to the start of the metadata; but it's probably harder to calculate, correct?

Probably just as easy. However, it means the footer is not self-contained for systems that might want to cache it (they would need to store an additional size value for the file).
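
The trade-off above comes down to simple arithmetic. Illustrative only (names and numbers are not from the spec): with a relative offset, the cached footer bytes are meaningful on their own; an absolute position has to be recomputed from the file size whenever the footer is used.

```python
# Converting a footer-relative offset to an absolute file position requires
# knowing the total file size, which is why a footer using absolute offsets
# is not self-contained for caching.
def absolute_position(file_size, footer_size, relative_offset):
    # relative_offset is measured from the start of the footer region
    footer_start = file_size - footer_size
    return footer_start + relative_offset

assert absolute_position(file_size=1_000_000, footer_size=4_096, relative_offset=128) == 996_032
```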

emkornfield commented 1 month ago

Finally, is there anything needed to support an extended par3 trailing footer in future?

Sorry I missed this. I think the bitmap gives flexibility for extending par3. Or was there something else you were concerned with?

emkornfield commented 1 month ago

By splitting metadata we are fixing (1) but we are requiring a secondary fetch from the object store that depends on the first fetch and we make (2) twice as long. At this point we are back to square 1...

@alkis I think I might have misphrased or misdesigned something here. Where do you see the required second fetch? I imagine performant readers will still read a significant portion of the tail of the file to avoid the two fetches.

alkis commented 1 month ago

@alkis I think I might have misphrased or misdesigned something here. Where do you see the required second fetch? I imagine performant readers will still read a significant portion of the tail of the file to avoid the two fetches.

Apologies that was my misunderstanding. I conflated #242 which has offsets for metadata elsewhere.

Now that I understand this better it seems that the main trick we are employing is converting list<ExpensiveToDecodePerColumnStruct> to something like list<binary> (with optional compression) to make deserialization lazy.

  1. How is this going to affect workloads that read all the columns (and thus require all the metadata). Are they getting any benefit?
  2. The effort to fix readers is not going to be trivial. Perhaps it will be similar if we revamp the schema substantially. Why should we stick to thrift if we are going through the effort to refactor readers?
emkornfield commented 1 month ago

Now that I understand this better it seems that the main trick we are employing is converting list<ExpensiveToDecodePerColumnStruct> to something like list<binary> (with optional compression) to make deserialization lazy.

  1. How is this going to affect workloads that read all the columns (and thus require all the metadata). Are they getting any benefit?

No, they get no benefit. How much regression is present is something we need to benchmark.

  2. The effort to fix readers is not going to be trivial. Perhaps it will be similar if we revamp the schema substantially. Why should we stick to thrift if we are going through the effort to refactor readers?

I think we might have different priors on the effort involved and the scope of changes for readers. I was hoping that with these changes, existing readers could update to use it, without the benefit of laziness, in O(hours) - O(days) (e.g. if they happen to be exposing Thrift structures as a public API in their system, like DuckDB). If systems already have abstractions in place for metadata, making use of the laziness should also be pretty quick. I see changing to a different encoding format as more effort, which, multiplied across the number of custom Parquet readers, makes ecosystem penetration less likely, or at least slower. I don't want to rule out moving to flatbuffers in the future (this is yet another reason for the feature bitmap).

CC @JFinis, I think your perspective might be valuable here.

steveloughran commented 1 month ago

(e.g. if they happen to be exposing thrift structures as a public API)

what do we know here about this? which apps do/don't?

wgtmac commented 1 month ago

(e.g. if they happen to be exposing thrift structures as a public API)

what do we know here about this? which apps do/don't?

@steveloughran There is a preliminary investigation: https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit#heading=h.wkbvu2wh1ahj

emkornfield commented 1 month ago

@emkornfield I have a prototype with flatbuffers that generates footers of similar size to the current thrift protocol. I left some notes in this PR with some of the optimizations I did to achieve this.

I tried to respond to all the comments.

That said I feel the end state with flatbuffers is much better. The parse time (with verification) is already at 2.5ms/0.8ms for footer1/footer2 respectively, without substantially deviating from the current thrift object model. There is a lot of room for improvements if we deviate.

I tend to agree. I think we should pose this question on the ML thread to see what people's thoughts are on going directly to flatbuffers vs. something more iterative on Thrift. Deviation might be fine, but in terms of development effort I think we should be aiming for something that we can convert back and forth to the original footer without loss of data. This would likely make it much easier for implementations to adopt flatbuffers and simply translate them back to Thrift structures as a bare minimum of support.

Should we join forces and iterate to make flatbuffers even better, while we collect footers from the fleet to compile a benchmark database to further validate the results? I feel there is a lot of room for shrinking Statistics further at which point we will have both smaller and >10x faster to parse footers.

I'm happy to collaborate on the flatbuffers approach. Do you want to open up a draft PR with your current state (maybe updated to include the responses here)? I think the last major sticking point might be EncodingStatistics, and I wonder how much that affects flatbuffers, or if we can use the trick here to embed those within their own string, either as flatbuffers or kept as Thrift.

emkornfield commented 1 month ago

@alkis posted a question to the ML to get more thoughts on Thrift vs. flatbuffers. Before we commit fully to flatbuffers, I might try to prototype a more shredded version of this that effectively uses mini Parquet files for each of the core structs to see what that would look like.

emkornfield commented 4 weeks ago

@alkis posted a question to the ML to get more thoughts on Thrift vs. flatbuffers. Before we commit fully to flatbuffers, I might try to prototype a more shredded version of this that effectively uses mini Parquet files for each of the core structs to see what that would look like.

It seems like we might be headed towards flatbuffers as a consensus, so I will put this on the back-burner.