Open emkornfield opened 1 month ago
I would prefer absolute file location rather than an offset to the start of the metadata; but it's probably harder to calculate, correct?
Probably just as easy. However, it means the footer is not self-contained for systems that might want to cache it (they would need to store an additional size value for the file).
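To illustrate the self-containedness argument, here is a hedged sketch (all numbers and names invented, not from the spec): with an offset relative to the metadata start, a cached footer can be interpreted on its own, while an absolute file position forces the cache to also record the file size.

```python
# Hypothetical sketch of the offset trade-off; values are invented for illustration.
FILE_SIZE = 1_000_000            # known only if the reader stats the file
METADATA_LEN = 10_000            # recorded in the footer itself
METADATA_START = FILE_SIZE - METADATA_LEN  # absolute position of the metadata region

# Relative encoding: the cached footer alone is enough to locate a structure,
# since the offset is interpreted within the metadata region.
rel_offset = 1_234
absolute_from_rel = METADATA_START + rel_offset

# Absolute encoding: detached from the file, the cache must additionally store
# FILE_SIZE (an extra size value per file) to recompute the same position.
absolute_stored = METADATA_START + 1_234
assert absolute_from_rel == absolute_stored
```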
Finally, is there anything needed to support an extended par3 trailing footer in future?
Sorry I missed this. I think the bitmap gives flexibility for extending par3. Or was there something else you were concerned with?
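A feature bitmap supports this kind of extension because readers can test for capabilities they understand and ignore (or reject) bits they do not. A minimal sketch, with entirely hypothetical bit assignments:

```python
# Hypothetical feature-bitmap check; these bit assignments are invented for
# illustration and are not part of any spec.
FEATURE_PAR3_FOOTER = 1 << 0
FEATURE_ALTERNATE_METADATA = 1 << 1

def supports(bitmap: int, feature: int) -> bool:
    # A reader checks the bits it knows about; unknown set bits can signal
    # that the file uses a format extension the reader cannot handle.
    return (bitmap & feature) != 0

footer_features = FEATURE_PAR3_FOOTER
assert supports(footer_features, FEATURE_PAR3_FOOTER)
assert not supports(footer_features, FEATURE_ALTERNATE_METADATA)
```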
By splitting metadata we are fixing (1) but we are requiring a secondary fetch from the object store that depends on the first fetch and we make (2) twice as long. At this point we are back to square 1...
@alkis I think I might have misphrased or misdesigned something here. Where do you see the second fetch as a requirement? I imagine performant readers will still read a significant portion of the tail of the file to avoid the two fetches.
Apologies that was my misunderstanding. I conflated #242 which has offsets for metadata elsewhere.
Now that I understand this better, it seems that the main trick we are employing is converting `list<ExpensiveToDecodePerColumnStruct>` to something like `list<binary>` (with optional compression) to make deserialization lazy.
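The laziness trick can be sketched in rough Python terms (this is an analogy, not the actual Parquet structures; `decode_column` stands in for the expensive per-column Thrift deserialization):

```python
# Rough analogy of the list<ExpensiveToDecodePerColumnStruct> -> list<binary>
# trick: each column's metadata stays an opaque blob until a reader actually
# touches that column. json is only a stand-in serialization for the sketch.
import json

def decode_column(blob: bytes) -> dict:
    # Stand-in for the expensive per-column decode.
    return json.loads(blob)

class LazyColumnMetadata:
    def __init__(self, blobs):
        self._blobs = blobs      # list<binary>: cheap to materialize up front
        self._decoded = {}       # per-column cache, filled only on demand

    def column(self, i):
        if i not in self._decoded:
            self._decoded[i] = decode_column(self._blobs[i])
        return self._decoded[i]

blobs = [json.dumps({"name": f"col{i}"}).encode() for i in range(3)]
meta = LazyColumnMetadata(blobs)
assert meta.column(1)["name"] == "col1"
assert len(meta._decoded) == 1   # only the touched column was decoded
```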
- How is this going to affect workloads that read all the columns (and thus require all the metadata). Are they getting any benefit?
No, they do not. How much of a regression this introduces is something we need to benchmark.
- The effort to fix readers is not going to be trivial. Perhaps it will be similar if we revamp the schema substantially. Why should we stick to thrift if we are going through the effort to refactor readers?
I think we might have different priors on the effort involved and the scope of changes for readers. I was hoping that with these changes, existing readers could update to use it, without the benefit of laziness, in O(hours) - O(days) (e.g. if they happen to be exposing thrift structures as a public API in their system, like DuckDB). If systems already have abstractions in place for metadata, making use of the laziness should also be pretty quick. I see changing to a different encoding format as more effort, and multiplied across the number of custom parquet readers, that makes ecosystem penetration less likely, or at least slower. I don't want to rule out moving to flatbuffers in the future (this is yet another reason for the feature bitmap).
CC @JFinis, I think your perspective might be valuable here.
(e.g. if they happen to be exposing thrift structures as a public API)
what do we know here about this? which apps do/don't?
@steveloughran There is a preliminary investigation: https://docs.google.com/document/d/1PQpY418LkIDHMFYCY8ne_G-CFpThK15LLpzWYbc7rFU/edit#heading=h.wkbvu2wh1ahj
@emkornfield I have a prototype with flatbuffers that generates footers of similar size to the current thrift protocol. I left some notes in this PR with some of the optimizations I did to achieve this.
I tried to respond to all the comments.
That said I feel the end state with flatbuffers is much better. The parse time (with verification) is already at 2.5ms/0.8ms for footer1/footer2 respectively, without substantially deviating from the current thrift object model. There is a lot of room for improvements if we deviate.
I tend to agree. I think we should pose this question on the ML thread to see what people's thoughts are on going directly to flatbuffers vs something more iterative on thrift. Deviation might be fine, but in terms of development effort, I think we should aim for something that can round-trip with the original footer without loss of data. That would likely make it much easier for implementations to adopt flatbuffers and simply translate them back to Thrift structures as a bare minimum of support.
Should we join forces and iterate to make flatbuffers even better, while we collect footers from the fleet to compile a benchmark database to further validate the results? I feel there is a lot of room for shrinking Statistics further, at which point we will have footers that are both smaller and >10x faster to parse.
I'm happy to collaborate on the flatbuffers approach. Do you want to open a draft PR with your current state (maybe updated to include the responses here)? I think the last major sticking point might be EncodingStatistics, and I wonder how much that affects flatbuffers, or if we can use the trick here to embed those within their own string (either as flatbuffers or kept as thrift).
@alkis posted a question to the ML to get more thoughts on thrift vs flatbuffers. I think before we commit fully to flatbuffers, I might try to prototype a more shredded version of this that effectively uses mini parquet files for each of the core structs, to see what that might look like.
It seems like we might be headed towards flatbuffers as a consensus, so I will put this on the back-burner.
As a point of discussion, this shows a slightly different way of decoupling column metadata from FileMetadata than https://github.com/apache/parquet-format/pull/242.
In particular, it takes a slightly different approach:
Things it does not do: