james-rms opened 2 years ago
Maybe worth a look at SIMD implementations? https://chromium.googlesource.com/chromium/src/+/HEAD/third_party/zlib/crc32_simd.c
Some comments on this post claim a 14x speedup for SIMD CRC over zlib's crc32: https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html
@wkalt notes:
For the chunk CRC (not the data CRC), it looks like we do an update every time a message gets written, but we also have the full uncompressed chunk data in hand when we finalize the chunk. If that understanding is correct, I wonder whether there would be a benefit to computing the uncompressed CRC for the chunk in one shot when the chunk is finalized, rather than updating on each message we write. If the issue is the lookup table falling out of CPU cache, that could potentially help. It also seems like it would eliminate sensitivity to message size, though I'm not sure we actually see that in the charts. Oddly, the charts seem to show that mixed message sizes are quicker than either small or large alone, though it's hard to say.
Related to the above, I observe some improvement between the two cases in this Go program: https://gist.github.com/wkalt/22dbfad2a353443b4a812fe950b5bb2d simulating per-message CRC (kilobyte-sized updates) vs per-chunk CRC (one 5 MB update).
It's a 34% speedup on a single-core DigitalOcean VM. On my much more capable laptop it narrows to a bit over 7%. This isn't testing the same C++ code, but it might help validate the strategy.
CRC calculation in the writer can be a bottleneck in some situations. Here lies the ticket to track making it faster. Relates to: #707 #706