Open FineAndDandy opened 10 months ago
Prototyped having another thread do the writes in this branch. I wrote a small test program: without the separate write thread it takes ~1300ms, and with it ~760ms. To use the separate write thread, I manually modified the RFileWriter class to use the new ThreadedFileSKVWriter class. While the test ran with the separate write thread, I ran top and noticed the Java process was using 200% CPU.
Is your feature request related to a problem? Please describe. Write operations to an rfile are serialized. When writing large rfiles in map reduce jobs, this can produce very long tails in the jobs. The bottleneck is often compression rather than I/O.
Describe the solution you'd like Using multiple threads to process multiple blocks in parallel could dramatically improve write performance. A dedicated thread would still be needed to write completed blocks in order, but that should be possible. The number of blocks in flight could be scaled based on the memory available for buffering.
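A minimal sketch of the idea above, not Accumulo code: blocks are compressed on a thread pool while the output order is preserved by consuming the futures in submission order. The class name `ParallelBlockWriter`, the use of gzip, and the `maxInFlight` bound (standing in for the memory budget) are all assumptions for illustration. Here the calling thread does the ordered writes; a dedicated writer thread could take over that role.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.ArrayDeque;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.zip.GZIPOutputStream;

// Hypothetical illustration class; it is not part of Accumulo.
public class ParallelBlockWriter {

  // Compress one block independently of all the others.
  static byte[] compress(byte[] block) throws IOException {
    ByteArrayOutputStream bos = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bos)) {
      gz.write(block);
    }
    return bos.toByteArray();
  }

  // Compress blocks on `threads` workers and write the results to `out`
  // in the original block order. At most `maxInFlight` blocks are
  // buffered at once, which bounds memory use.
  static void writeBlocks(List<byte[]> blocks, OutputStream out,
      int threads, int maxInFlight)
      throws IOException, InterruptedException, ExecutionException {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    ArrayDeque<Future<byte[]>> inFlight = new ArrayDeque<>();
    try {
      for (byte[] block : blocks) {
        if (inFlight.size() == maxInFlight) {
          // Wait for the oldest block so output stays in order.
          out.write(inFlight.removeFirst().get());
        }
        inFlight.addLast(pool.submit(() -> compress(block)));
      }
      // Drain whatever is still being compressed.
      while (!inFlight.isEmpty()) {
        out.write(inFlight.removeFirst().get());
      }
    } finally {
      pool.shutdownNow();
    }
  }
}
```

With this shape, compression of later blocks overlaps with writing earlier ones, and raising `maxInFlight` trades memory for more parallelism.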
Describe alternatives you've considered Adding pipelines to the existing code could be a smaller lift and still give a big performance improvement.
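One way the pipelining alternative might look, as a rough sketch under assumptions: the producer hands finished buffers to a bounded queue and a single background thread writes them in order. `PipelinedWriter` and its methods are hypothetical names for illustration and do not reflect the actual RFileWriter or ThreadedFileSKVWriter code.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical illustration class; it is not part of Accumulo.
public class PipelinedWriter implements AutoCloseable {
  // Sentinel compared by identity to signal end of input.
  private static final byte[] POISON = new byte[0];

  private final BlockingQueue<byte[]> queue;
  private final Thread writer;
  private volatile IOException failure;

  public PipelinedWriter(OutputStream out, int capacity) {
    // The bounded queue applies back-pressure to the producer.
    queue = new ArrayBlockingQueue<>(capacity);
    writer = new Thread(() -> {
      try {
        for (byte[] buf = queue.take(); buf != POISON; buf = queue.take()) {
          // After a failure, keep draining so the producer never blocks.
          if (failure == null) {
            try {
              out.write(buf);
            } catch (IOException e) {
              failure = e;
            }
          }
        }
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }, "pipelined-writer");
    writer.start();
  }

  // Queue a buffer for writing; blocks while the queue is full.
  public void write(byte[] buf) throws IOException, InterruptedException {
    if (failure != null) {
      throw failure;
    }
    queue.put(buf);
  }

  // Flush everything queued so far, stop the thread, and report errors.
  @Override
  public void close() throws IOException, InterruptedException {
    queue.put(POISON);
    writer.join();
    if (failure != null) {
      throw failure;
    }
  }
}
```

A single FIFO queue with one consumer keeps buffers in order without extra bookkeeping, which is why this is a smaller change than parallel compression.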