apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Benchmark the assembly of thrift objects, and possibly create a more efficient ReplayingTProtocol #1391

Open asfimport opened 10 years ago

asfimport commented 10 years ago

The current parquet-thrift implementation creates a TProtocol instance for each value of each record and builds a stack of these events, which is then replayed back to the TBase.

I'd be curious to benchmark this and, if it's slow, try building a "ReplayingTProtocol" that, instead of holding a stack of TProtocol instances, holds one primitive array per value type. As events are fed into this replaying TProtocol, it would just append the primitives to its buffers, and the TBase would then drain them. This would effectively let us stream values into the TBase without allocating an object for each value.

The buffers could have a fixed size. If they fill up (which they shouldn't in most cases), the TBase could drain the protocol until it is empty again, at which point the TProtocol would block the TBase from draining further while the Parquet record assembly feeds it more events.
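
A minimal sketch of the buffering idea, to make the proposal concrete. Everything here is hypothetical: `PrimitiveReplayBuffer` and its methods are illustrative names, not part of parquet-thrift or libthrift, and the sketch leaves out the TProtocol plumbing, field and struct boundary events, and the blocking hand-off. It only shows fixed-size primitive ring buffers, one per type, that the assembly side fills and the TBase side drains without allocating per value.

```java
import java.util.NoSuchElementException;

/** Hypothetical sketch only: not an existing parquet-thrift or Thrift class. */
final class PrimitiveReplayBuffer {
    // One fixed-size primitive ring buffer per value type, so no boxing occurs.
    private final int[] i32s;
    private final long[] i64s;
    private final double[] doubles;
    private int i32Head, i32Count;
    private int i64Head, i64Count;
    private int doubleHead, doubleCount;

    PrimitiveReplayBuffer(int capacity) {
        if (capacity <= 0) {
            throw new IllegalArgumentException("capacity must be positive");
        }
        i32s = new int[capacity];
        i64s = new long[capacity];
        doubles = new double[capacity];
    }

    /** Producer side: returns false when full, signalling that the consumer must drain. */
    boolean offerI32(int v) {
        if (i32Count == i32s.length) return false;
        i32s[(i32Head + i32Count) % i32s.length] = v;
        i32Count++;
        return true;
    }

    boolean offerI64(long v) {
        if (i64Count == i64s.length) return false;
        i64s[(i64Head + i64Count) % i64s.length] = v;
        i64Count++;
        return true;
    }

    boolean offerDouble(double v) {
        if (doubleCount == doubles.length) return false;
        doubles[(doubleHead + doubleCount) % doubles.length] = v;
        doubleCount++;
        return true;
    }

    /** Consumer side: what a replaying protocol's readI32() could delegate to. */
    int readI32() {
        if (i32Count == 0) throw new NoSuchElementException("no buffered i32");
        int v = i32s[i32Head];
        i32Head = (i32Head + 1) % i32s.length;
        i32Count--;
        return v;
    }

    long readI64() {
        if (i64Count == 0) throw new NoSuchElementException("no buffered i64");
        long v = i64s[i64Head];
        i64Head = (i64Head + 1) % i64s.length;
        i64Count--;
        return v;
    }

    double readDouble() {
        if (doubleCount == 0) throw new NoSuchElementException("no buffered double");
        double v = doubles[doubleHead];
        doubleHead = (doubleHead + 1) % doubles.length;
        doubleCount--;
        return v;
    }

    /** True when every per-type buffer has been fully drained. */
    boolean isEmpty() {
        return i32Count == 0 && i64Count == 0 && doubleCount == 0;
    }
}
```

In this sketch a `false` return from an `offer` method stands in for the "buffers filled up" case: the caller would stop feeding events and let the TBase drain before continuing.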

This is all moot if it turns out not to be a bottleneck though :)

Reporter: Alex Levenson / @isnotinvain

Note: This issue was originally created as PARQUET-33. Please see the migration documentation for further details.

asfimport commented 9 years ago

Dmitriy V. Ryaboy / @dvryaboy: I have the same gut feeling as you, that the protocol stacking is costing us.

asfimport commented 9 years ago

Alex Levenson / @isnotinvain: I haven't had a chance to do a thorough benchmark, but I did run a gig of data through it on my laptop and, if I'm reading YourKit right, 15% of the time was spent in record assembly. I'm not sure whether that includes Parquet's assembly algorithm or just the Thrift layer. I started a prototype for this but haven't had a chance to try it out.