foxglove / mcap

MCAP is a modular, performant, and serialization-agnostic container file format, useful for pub/sub and robotics applications.
https://mcap.dev
MIT License

perf: C++ CRC calculation speed is suboptimal #708

Open james-rms opened 2 years ago

james-rms commented 2 years ago

CRC calculation in the writer can be a bottleneck in some situations. Here lies the ticket to track making it faster. Relates to: #707 #706

foxymiles commented 2 years ago

Maybe worth a look at SIMD implementations? https://chromium.googlesource.com/chromium/src/+/HEAD/third_party/zlib/crc32_simd.c
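One caveat worth noting: the SSE4.2 CRC32 instruction computes CRC-32C (the Castagnoli polynomial), not the zlib-style CRC-32 benchmarked elsewhere in this thread, so matching the standard polynomial at SIMD speed needs something like the PCLMULQDQ folding used in the linked Chromium file. Purely as a sketch of what the hardware path looks like (hypothetical `crc32c_hw` helper, not a drop-in for the MCAP writer):

```cpp
#include <nmmintrin.h>  // SSE4.2 intrinsics; compile with -msse4.2
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch only: CRC-32C via the SSE4.2 CRC32 instruction. This uses a different
// polynomial than the zlib-style CRC-32, so it illustrates available hardware
// throughput rather than a compatible replacement.
inline uint32_t crc32c_hw(const uint8_t* data, size_t len, uint32_t crc = 0) {
  crc = ~crc;
  // Consume 8 bytes per instruction while enough input remains.
  while (len >= 8) {
    uint64_t word;
    std::memcpy(&word, data, sizeof(word));
    crc = static_cast<uint32_t>(_mm_crc32_u64(crc, word));
    data += 8;
    len -= 8;
  }
  // Remaining tail bytes one at a time.
  while (len--) {
    crc = _mm_crc32_u8(crc, *data++);
  }
  return ~crc;
}
```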

foxymiles commented 2 years ago

Some comments on this post claim 14x speedup for SIMD over zlib crc32: https://fastcompression.blogspot.com/2019/03/presenting-xxh3.html

james-rms commented 2 years ago

@wkalt notes:

for the chunk CRC (not the data CRC), it looks like we do an update every time a message gets written, but we also have the full uncompressed chunk data in hand when we finalize the chunk. If that understanding is correct, I wonder if there would be a benefit to computing the uncompressed CRC for the chunk in one shot when the chunk gets finalized rather than updating on each message we write. If the issue is the lookup table falling out of CPU cache, it seems like that could help. It also seems like it would eliminate sensitivity to message size, though I'm not sure we actually see that in the charts. Weirdly, the charts seem to show that mixed message sizes are quicker than either small or large, though it's hard to say.
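For illustration, a minimal sketch of the two strategies described above (hypothetical `ChunkSketch` class, with zlib's `crc32` standing in for the writer's own CRC routine; this is not the actual MCAP writer code): updating a running CRC on every message write versus a single pass over the buffered uncompressed chunk at finalize time.

```cpp
#include <zlib.h>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch only: contrasts per-message CRC updates with a one-shot CRC computed
// when the chunk is finalized.
class ChunkSketch {
public:
  // Strategy A: update the running CRC on every message write. With small
  // writes, the CRC lookup tables may fall out of cache between calls.
  void writeIncremental(const uint8_t* data, size_t len) {
    buffer_.insert(buffer_.end(), data, data + len);
    runningCrc_ = crc32(runningCrc_, data, static_cast<uInt>(len));
  }

  // Strategy B: only buffer on write...
  void writeBuffered(const uint8_t* data, size_t len) {
    buffer_.insert(buffer_.end(), data, data + len);
  }

  // ...and compute the CRC of the whole uncompressed chunk in one pass when
  // the chunk is finalized.
  uint32_t finalizeOneShot() const {
    return static_cast<uint32_t>(
        crc32(crc32(0L, Z_NULL, 0), buffer_.data(), static_cast<uInt>(buffer_.size())));
  }

  uint32_t finalizeIncremental() const { return static_cast<uint32_t>(runningCrc_); }

private:
  std::vector<uint8_t> buffer_;
  uLong runningCrc_ = crc32(0L, Z_NULL, 0);  // zlib's initial CRC value
};
```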

wkalt commented 2 years ago

Related to the above, I observe some improvement between the two cases in this Go program: https://gist.github.com/wkalt/22dbfad2a353443b4a812fe950b5bb2d which simulates per-message CRC (kilobyte-sized updates) vs per-chunk CRC (one 5 MB update).

It's a 34% speedup on a single-core DigitalOcean VM. On my much more capable laptop the improvement narrows to a bit over 7%. This isn't testing the same C++ code, but it might help validate the strategy.
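For anyone wanting to reproduce this comparison on the C++ side, a rough counterpart to the gist might look like the following. This is only a sketch: it uses zlib's `crc32` as the checksum rather than MCAP's own implementation, and the 1 KB / 5 MB sizes are assumptions mirroring the gist's description.

```cpp
#include <zlib.h>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
  constexpr size_t kChunkSize = 5 * 1024 * 1024;  // ~5 MB chunk (assumed)
  constexpr size_t kMessageSize = 1024;           // ~1 KB messages (assumed)
  std::vector<uint8_t> chunk(kChunkSize, 0xAB);

  using clock = std::chrono::steady_clock;

  // Per-message: update the CRC once per 1 KB slice.
  auto t0 = clock::now();
  uLong perMessage = crc32(0L, Z_NULL, 0);
  for (size_t off = 0; off < chunk.size(); off += kMessageSize) {
    perMessage = crc32(perMessage, chunk.data() + off, static_cast<uInt>(kMessageSize));
  }
  auto t1 = clock::now();

  // Per-chunk: one pass over the full buffer.
  uLong perChunk =
      crc32(crc32(0L, Z_NULL, 0), chunk.data(), static_cast<uInt>(chunk.size()));
  auto t2 = clock::now();

  auto us = [](auto d) {
    return std::chrono::duration_cast<std::chrono::microseconds>(d).count();
  };
  std::printf("per-message: %lld us, per-chunk: %lld us, crcs match: %d\n",
              static_cast<long long>(us(t1 - t0)),
              static_cast<long long>(us(t2 - t1)),
              perMessage == perChunk);
  return 0;
}
```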