janrueth opened 6 years ago
After thinking about it further, I came to the conclusion that I should manually buffer my writes via a bufio.Writer.
However, I'm unsure what buffer size I should choose; can you give a recommendation? I read that at the highest compression level bzip2 compresses blocks of 900k bytes. Should that be my buffer size, or can bzip2 benefit from even larger bulk writes?
The Python bzip2 module is a wrapper around the C bzip2 library. My benchmarks show the C library is typically 2x faster, but it being 3x faster on certain workloads is definitely possible.
Unlike most other compression algorithms, which rely on LZ77 as their primary backbone, bzip2 is unique in that it uses the Burrows-Wheeler Transform (BWT) as the primary mechanism for deduplicating strings. Since the BWT is the slowest part of bzip2, attempts to improve encoding performance should tackle the problem at that stage.
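To make the BWT step concrete, here is a tiny, naive sketch (not what this repo does, and far too slow for real use) that builds the transform by sorting all rotations of the input. Note how the last column groups equal characters into runs, which is what bzip2's later stages (MTF/RLE/Huffman) exploit:

```go
package main

import (
	"fmt"
	"sort"
)

// naiveBWT returns the last column of the sorted rotation matrix of s and the
// index of the original string among the sorted rotations. This is only an
// illustration of the transform; real encoders use suffix-array construction.
func naiveBWT(s string) (string, int) {
	n := len(s)
	rot := make([]string, n)
	for i := 0; i < n; i++ {
		rot[i] = s[i:] + s[:i] // i-th cyclic rotation
	}
	sort.Strings(rot)

	last := make([]byte, n)
	idx := 0
	for i, r := range rot {
		last[i] = r[n-1] // last character of each sorted rotation
		if r == s {
			idx = i
		}
	}
	return string(last), idx
}

func main() {
	bwt, idx := naiveBWT("banana")
	fmt.Println(bwt, idx) // "nnbaaa" 3: equal characters cluster into runs
}
```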
There are many approaches to performing the BWT:
In attempting to implement BWT, there are several questions:
Suffix array construction algorithms (SACAs) are a large area of active research. I'm not sure which algorithm the C implementation uses, but other implementations, like the Java one, use DivSufSort.
The algorithm used in this repo is called SAIS, and is a transliteration of the C implementation here.
I chose SAIS because it is simple (~600 lines) and decently fast for its simplicity. It is actually one of the faster algorithms since it is O(n), while many other SACAs like DivSufSort are O(n*log n). Unfortunately, the bzip2 format limits blocks to 900k, so SAIS is unable to show its O(n) advantage over DivSufSort since n is not large enough.
I did not use DivSufSort like the Java implementation because I was not comfortable with maintaining the complexity of the DivSufSort approach.
I don't know. It's possible, but I haven't had the bandwidth to pick apart any specific SACA deeply enough to figure out how to alter it to output the BWT required by bzip2 directly.
I don't have the bandwidth these days to keep working on bzip2, so I don't have any plans to improve the bzip2 encoder performance.
I think it would be nice to keep the algorithm as SAIS, but if you can figure out how to have SAIS directly output BWT, that would increase performance immediately by 2x since you won't need to duplicate the input string.
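For concreteness, here is a rough sketch (not this repo's code) of how the BWT falls out of a suffix array over the doubled input. The sort.Slice call is only a naive stand-in for SAIS, and the ss buffer is exactly the duplicated copy of the input mentioned above:

```go
package main

import (
	"bytes"
	"fmt"
	"sort"
)

// bwtViaDoubling returns the BWT of s and the row index of the original
// string among the sorted rotations (the pointer bzip2 stores per block).
func bwtViaDoubling(s []byte) ([]byte, int) {
	n := len(s)
	ss := append(append([]byte{}, s...), s...) // s+s: the duplicated input

	// Sort rotation start positions by their first n characters in ss.
	// (In the real encoder this ordering is what the SACA produces.)
	sa := make([]int, n)
	for i := range sa {
		sa[i] = i
	}
	sort.Slice(sa, func(a, b int) bool {
		return bytes.Compare(ss[sa[a]:sa[a]+n], ss[sa[b]:sa[b]+n]) < 0
	})

	// Last column of the sorted rotation matrix = character preceding
	// each rotation start.
	bwt := make([]byte, n)
	ptr := 0
	for i, p := range sa {
		bwt[i] = s[(p+n-1)%n]
		if p == 0 {
			ptr = i
		}
	}
	return bwt, ptr
}

func main() {
	bwt, ptr := bwtViaDoubling([]byte("banana"))
	fmt.Printf("%s %d\n", bwt, ptr) // nnbaaa 3
}
```

Having the SACA produce the `bwt[i] = s[(sa[i]+n-1)%n]` mapping directly, without ever materializing the doubled buffer, is the optimization being described here.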
After thinking about it further, I came to the conclusion that I should manually buffer my writes via a bufio.Writer.
I'm not sure how a bufio.Writer benefits your use case since the bzip2.Writer does buffering internally.
Since the BWT algorithm used in this repo is O(n), the block size has little impact on speed. Thus, it probably makes sense to use level 9 compression. It will cost you more memory, but will produce smaller output files.
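For reference, a minimal sketch of requesting level 9 when constructing the writer; the NewWriter/WriterConfig constructor form shown here is an assumption, as is the file handling around it:

```go
package main

import (
	"log"
	"os"

	"github.com/dsnet/compress/bzip2"
)

func main() {
	f, err := os.Create("out.json.bz2") // illustrative destination
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// Level 9 means 900k blocks: more memory, smaller output, and with an
	// O(n) BWT the block size has little effect on throughput.
	// (Assumes the WriterConfig{Level: ...} form of the constructor.)
	zw, err := bzip2.NewWriter(f, &bzip2.WriterConfig{Level: bzip2.BestCompression})
	if err != nil {
		log.Fatal(err)
	}
	defer zw.Close()

	if _, err := zw.Write([]byte(`{"hello":"world"}` + "\n")); err != nil {
		log.Fatal(err)
	}
}
```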
I'm not sure how a bufio.Writer benefits your use case since the bzip2.Writer does buffering internally.
Yeah, I also realized that when I tested buffering manually...
Unfortunately, apart from having no time, I'm not an expert in compression, so I'm not up to improving this myself.
I now ended up streaming my data through a bzip2 process.
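For reference, the workaround looks roughly like this; the bzip2 flags and the file destination are specific to my setup, and the stdin source is just a stand-in for where the lines actually come from:

```go
package main

import (
	"io"
	"log"
	"os"
	"os/exec"
)

func main() {
	out, err := os.Create("out.bz2")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	cmd := exec.Command("bzip2", "-9", "-c") // compress stdin to stdout
	cmd.Stdout = out
	cmd.Stderr = os.Stderr

	stdin, err := cmd.StdinPipe()
	if err != nil {
		log.Fatal(err)
	}
	if err := cmd.Start(); err != nil {
		log.Fatal(err)
	}

	// Feed whatever source produces the lines; os.Stdin stands in here.
	if _, err := io.Copy(stdin, os.Stdin); err != nil {
		log.Fatal(err)
	}
	stdin.Close() // signal EOF so bzip2 flushes and exits

	if err := cmd.Wait(); err != nil {
		log.Fatal(err)
	}
}
```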
I'm currently converting a Python tool that uses bzip2 for compression to Go.
I noticed that Go is missing bzip2 compression and found your package (thanks for implementing it!).
Yet, bzip2 compression speed seems to be way lower than with my Python variant. It seems that the speed is independent of the compression level that I apply. According to pprof, most of the time is spent in computeSA_byte, which in turn spends most of its time in sortLMS2_int and induceSA_int. My Python implementation is at least 3x faster.
Do you have any hints on how I can speed it up?
My application feeds lines of (JSON) data to the bzip2.Writer, i.e., basically a loop over the lines calling the Write method for each line (the lines are actually streamed from somewhere else, so I don't know when the input ends). In my Python implementation I can manually call a flush method when I expect no more data (basically after a timeout without receiving further lines); in contrast, the Go implementation seems to flush automatically after each call to Write (I have only briefly looked at the code, but the functions mentioned above seem to originate from flush). I also found that your implementation produces larger files than my Python equivalent, which I suppose might also come from prematurely flushing data?!
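For concreteness, the write loop is roughly the following sketch; the function name and the stdin/stdout stand-ins are just for illustration, and in the real tool w is the bzip2.Writer and the lines arrive from the network:

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"os"
)

// writeLines copies lines from src to w, one Write call per line, until the
// source ends (which, in my case, I only detect via a timeout upstream).
func writeLines(w io.Writer, src io.Reader) error {
	sc := bufio.NewScanner(src)
	for sc.Scan() {
		// Copy the line and re-append the newline before writing.
		line := append(append([]byte{}, sc.Bytes()...), '\n')
		if _, err := w.Write(line); err != nil {
			return err
		}
	}
	return sc.Err()
}

func main() {
	// Stand-in: here we just echo stdin to stdout instead of compressing.
	if err := writeLines(os.Stdout, os.Stdin); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```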
Thanks for any help,
Jan