apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0

Reduce Allocations When Reading Parquet Metadata #5775

Open tustvold opened 3 months ago

tustvold commented 3 months ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

Currently, when reading the parquet metadata (in particular in decode_metadata), we read into structures within parquet::format that are generated by the thrift compiler. Despite reading from a fixed slice of memory, these allocate new buffers for all variable-width data, including all statistics. This is especially unfortunate as these thrift data structures are temporary and are quickly parsed into more efficient representations.
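
As a rough sketch of where those buffers come from (the field names here are illustrative, loosely modeled on the generated statistics type rather than copied from parquet::format):

```rust
// Sketch only: field names are illustrative, not the exact generated code.
// Each variable-width field owns a fresh heap allocation, even though the
// bytes it copies already live in the metadata slice being decoded.
pub struct StatisticsOwned {
    pub min_value: Option<Vec<u8>>, // allocated per column chunk
    pub max_value: Option<Vec<u8>>, // allocated per column chunk
}

// A borrowing variant could instead point back into that slice:
pub struct StatisticsBorrowed<'a> {
    pub min_value: Option<&'a [u8]>,
    pub max_value: Option<&'a [u8]>,
}
```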

For example, when reading the schema of a 26-column parquet file we see almost 10,000 allocations, most of them a result of the thrift data structures.

[screenshot: allocation profile when reading the metadata]

Almost all of them are very small

[screenshot: allocation size distribution]

Describe the solution you'd like

The optimal solution would be for the parquet decoder to borrow data rather than allocating new data structures. This would avoid the vast majority of these allocations, and it is possible because the thrift compact protocol encodes variable-width data as size-prefixed byte strings - https://github.com/apache/thrift/blob/master/doc/specs/thrift-compact-protocol.md#binary-encoding

Given that a lot of the allocations are small, deploying a small-string optimisation might also be valuable, but simply borrowing string slices would be optimal.
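
As a sketch of what borrowed decoding could look like (a hypothetical helper, not the existing parquet crate API), assuming the whole metadata footer is available as a single in-memory slice:

```rust
/// Sketch: decode a compact-protocol string (ULEB128 length prefix followed by
/// UTF-8 bytes) as a slice borrowed from `buf`, with no new allocation.
fn read_borrowed_string<'a>(buf: &'a [u8], pos: &mut usize) -> Option<&'a str> {
    // Read the ULEB128 varint length prefix.
    let (mut len, mut shift) = (0usize, 0u32);
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        len |= ((byte & 0x7f) as usize) << shift;
        if byte & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    // Hand out a view into the original buffer rather than copying it.
    let end = (*pos).checked_add(len)?;
    let bytes = buf.get(*pos..end)?;
    *pos = end;
    std::str::from_utf8(bytes).ok()
}
```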

Describe alternatives you've considered

Additional context

alamb commented 3 months ago

I think @jorgecarleitao already took a shot at updating the parquet format in https://crates.io/crates/parquet-format-safe (though I am not sure if it performs any better or is just a safer Rust API)

I am not sure, but it seems polars (derived from parquet2) also has its own thrift reader: https://github.com/pola-rs/polars/tree/main/crates/polars-parquet/src/parquet/schema/io_thrift

thinkharderdev commented 3 months ago

What would really be cool is to be able to skip decoding entirely for metadata not needed for a particular operation (eg skip decoding of ColumnChunks that are not being projected by the reader)

tustvold commented 3 months ago

> What would really be cool is to be able to skip decoding entirely for metadata not needed for a particular operation (eg skip decoding of ColumnChunks that are not being projected by the reader)

This would effectively achieve that by making the decoding step as inexpensive as skipping over data. The nature of variable-length encodings means you have to scan through regardless.

thinkharderdev commented 3 months ago

> > What would really be cool is to be able to skip decoding entirely for metadata not needed for a particular operation (eg skip decoding of ColumnChunks that are not being projected by the reader)
>
> This would effectively achieve that by making the decoding step as inexpensive as skipping over data. The nature of variable-length encodings means you have to scan through regardless.

Yeah true, looking at the compact encoding again I guess there is no way to skip variable-length structs/list elements without decoding them.
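
For illustration, a rough sketch (not the actual reader code; long-form field ids are handled, lists/sets/maps are omitted) of why skipping a compact-protocol struct still means scanning every field header and length prefix:

```rust
/// Sketch: "skip" a compact-protocol struct without materialising any values.
/// Even so, every field header, varint and length prefix must still be read.
fn skip_struct(buf: &[u8], pos: &mut usize) -> Option<()> {
    fn read_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
        let (mut value, mut shift) = (0u64, 0u32);
        loop {
            let byte = *buf.get(*pos)?;
            *pos += 1;
            value |= u64::from(byte & 0x7f) << shift;
            if byte & 0x80 == 0 {
                return Some(value);
            }
            shift += 7;
        }
    }
    loop {
        let header = *buf.get(*pos)?;
        *pos += 1;
        if header == 0 {
            return Some(()); // STOP byte: end of struct
        }
        if header & 0xf0 == 0 {
            read_varint(buf, pos)?; // long-form field id follows the header
        }
        match header & 0x0f {
            1 | 2 => {}                          // bools live in the type nibble
            3 => *pos += 1,                      // i8
            4..=6 => { read_varint(buf, pos)?; } // i16 / i32 / i64 zigzag varints
            7 => *pos += 8,                      // double
            8 => {                               // binary / string
                let len = read_varint(buf, pos)? as usize;
                let end = (*pos).checked_add(len)?;
                if end > buf.len() {
                    return None;
                }
                *pos = end;
            }
            12 => skip_struct(buf, pos)?,        // nested struct: recurse
            _ => return None,                    // lists / sets / maps omitted here
        }
    }
}
```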

marcin-krystianc commented 3 months ago

> > > What would really be cool is to be able to skip decoding entirely for metadata not needed for a particular operation (eg skip decoding of ColumnChunks that are not being projected by the reader)
> >
> > This would effectively achieve that by making the decoding step as inexpensive as skipping over data. The nature of variable-length encodings means you have to scan through regardless.
>
> Yeah true, looking at the compact encoding again I guess there is no way to skip variable-length structs/list elements without decoding them.

I looked at this problem a few months ago, and my conclusion was that half of the time is spent decoding the thrift data and the other half is spent on object allocation and initialisation (this was in C++). So without changing the metadata format and without using a different encoding, only about a 2x improvement is possible.

tustvold commented 3 months ago

Agree 100%, this is only going to be an incremental benefit; however, every little bit helps, and that's 2x on top of something that is already pretty fast (faster than full flatbuffer validation).

jhorstmann commented 3 months ago

I had an attack of "not invented here" syndrome the last few days 😅 and worked on an alternative code generator for thrift that would allow me to try out changes to the generated code more easily. The repo can be found at https://github.com/jhorstmann/compact-thrift/ and the output for parquet.thrift at https://github.com/jhorstmann/compact-thrift/blob/main/src/main/rust/tests/parquet.rs.

The current output still allocates for string and binary, but running the benchmarks from https://github.com/tustvold/arrow-rs/tree/thrift-bench shows some nice improvements. This is the comparison against the current arrow-rs code, so both versions should be doing the same number of allocations:

decode metadata      time:   [32.592 ms 32.645 ms 32.702 ms]

decode metadata new  time:   [17.440 ms 17.476 ms 17.532 ms]

So incidentally very close to that 2x improvement.

The main difference in the code should be avoiding most of the abstractions from TInputProtocol and avoiding stack moves by directly writing into default-initialized structs instead of moving from local variables.
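
A minimal sketch of that second idea (the struct, field ids, and function names below are hypothetical, not the generated parquet::format code): decode each field directly into a default-initialized struct rather than into locals that are moved at the end.

```rust
/// Hypothetical target struct; only meant to illustrate the decoding pattern.
#[derive(Default)]
struct FileMetaDataSketch {
    version: i32,
    num_rows: i64,
    created_by: Option<String>,
}

/// Style resembling the stock thrift-compiler output: decode every field into
/// a local Option, then move the locals into the struct when the loop ends.
fn decode_via_locals(fields: &[(i16, i64)]) -> FileMetaDataSketch {
    let (mut version, mut num_rows) = (None, None);
    for &(id, value) in fields {
        match id {
            1 => version = Some(value as i32),
            3 => num_rows = Some(value),
            _ => {} // unknown field: skipped
        }
    }
    FileMetaDataSketch {
        version: version.unwrap_or_default(),
        num_rows: num_rows.unwrap_or_default(),
        created_by: None,
    }
}

/// Alternative style: write each decoded value straight into a
/// default-initialized struct, avoiding the extra locals and the final moves.
fn decode_in_place(fields: &[(i16, i64)]) -> FileMetaDataSketch {
    let mut out = FileMetaDataSketch::default();
    for &(id, value) in fields {
        match id {
            1 => out.version = value as i32,
            3 => out.num_rows = value,
            _ => {}
        }
    }
    out
}
```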

tustvold commented 3 months ago

https://github.com/apache/arrow-rs/pull/5777 is very much a hack and I'm really not a major fan of the current code, for much the same reasons you identified. Switching to a custom code generator that lets us better control these things is definitely interesting, especially if it can yield major performance returns

alamb commented 3 months ago

2x for decoding parquet metadata, often cited as the long pole for handling parquet files with large schemas, seems a very good justification for a custom code generator (or just hard coded implementation) to me

alamb commented 3 months ago

> I had an attack of "not invented here" syndrome the last few days 😅 and worked on an alternative code generator for thrift,

I filed https://github.com/apache/arrow-rs/issues/5854 to track this work, thank you @jhorstmann

I also started collecting various ideas to improve the parquet metadata decoding process here: https://github.com/apache/arrow-rs/issues/5853