apache / accumulo

Apache Accumulo
https://accumulo.apache.org
Apache License 2.0

RFile writes should utilize multiple threads #4124

Open FineAndDandy opened 10 months ago

FineAndDandy commented 10 months ago

Is your feature request related to a problem? Please describe. The write operations to an rfile are serialized. When writing large rfiles in map reduce jobs, this can produce very long tails for the jobs. The bottleneck is often compression rather than I/O.

Describe the solution you'd like Utilizing multiple threads to process multiple blocks in parallel could dramatically improve write performance. A dedicated thread would still be needed to write completed blocks in order, but that should be possible. The number of blocks processed in parallel could be scaled based on the memory available for buffering.
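To make the idea concrete, here is a minimal sketch of that shape, not Accumulo code. Everything in it is an assumption for illustration: the class name `OrderedParallelBlockWriter`, its `append` method, and the `threads` and `maxInFlight` parameters are hypothetical, and gzip from the JDK stands in for whatever codec the rfile is configured with. Blocks are compressed on a thread pool, and a single writer thread takes the results in submission order, so the output order matches the input order. The bounded queue caps how many blocks are buffered at once.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPOutputStream;

/** Hypothetical sketch: parallel block compression with one in-order writer thread. */
class OrderedParallelBlockWriter implements AutoCloseable {
  // Sentinel marking the end of input; compared by identity, never written.
  private static final Future<byte[]> END = CompletableFuture.completedFuture(new byte[0]);

  private final ExecutorService compressors;
  private final BlockingQueue<Future<byte[]>> inFlight;
  private final Thread writer;
  private volatile Exception failure;

  OrderedParallelBlockWriter(OutputStream out, int threads, int maxInFlight) {
    this.compressors = Executors.newFixedThreadPool(threads);
    // Bounded queue: append() blocks once maxInFlight blocks are pending,
    // which limits the memory used for buffering.
    this.inFlight = new ArrayBlockingQueue<>(maxInFlight);
    this.writer = new Thread(() -> drain(out), "block-writer");
    this.writer.start();
  }

  /** Compresses one block. Gzip is an assumption standing in for the real codec. */
  static byte[] compress(byte[] block) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(buf)) {
      gz.write(block);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return buf.toByteArray();
  }

  /** Submits a block for compression; queue order fixes the eventual write order. */
  void append(byte[] block) {
    Callable<byte[]> task = () -> compress(block);
    enqueue(compressors.submit(task));
  }

  private void enqueue(Future<byte[]> f) {
    try {
      inFlight.put(f);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while queueing a block", e);
    }
  }

  /** Writer thread body: waits for each block in order and writes it. */
  private void drain(OutputStream out) {
    try {
      for (Future<byte[]> next = inFlight.take(); next != END; next = inFlight.take()) {
        if (failure != null) {
          continue; // after a failure, keep draining so append() cannot block forever
        }
        try {
          out.write(next.get());
        } catch (IOException | ExecutionException e) {
          failure = e;
        }
      }
    } catch (InterruptedException e) {
      failure = e;
    }
  }

  /** Flushes all pending blocks, stops the threads, and reports any write failure. */
  @Override
  public void close() {
    enqueue(END);
    try {
      writer.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    compressors.shutdown();
    if (failure != null) {
      throw new IllegalStateException("block write failed", failure);
    }
  }
}
```

A caller would wrap this in try-with-resources and call `append` once per finished block. Because the writer waits on the oldest outstanding block, one slow block can stall output while later blocks sit compressed in memory, which is why `maxInFlight` would need to be sized against available heap.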

Describe alternatives you've considered Adding pipelining to the existing code could be a smaller lift and still yield a big performance improvement.

keith-turner commented 10 months ago

Prototyped having another thread write in this branch. Wrote this little test program; running it without the thread takes ~1300ms, and with the separate write thread it takes ~760ms. To use the separate write thread, I manually modified the RFileWriter class to use the new ThreadedFileSKVWriter class. While it was using the separate write thread, I ran top and noticed the Java process was using 200% CPU.