Open kevinzs2048 opened 1 year ago
If checksum is the bottleneck, the first thing I'd recommend is moving away from crc32, which, even fully optimized on x86_64, is less than 1/4th the speed of hash algorithms designed for speed, such as XXH3. XXH3 in particular should be well optimized for AArch64.
It does appear that ARM has equivalent hardware instructions for CRC32 hashing; we just haven't implemented support for them yet since we haven't needed it.
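For anyone who wants to experiment, here is a rough sketch of what a hardware-accelerated path could look like. This is not folly's code, and the function names (`crc32_sketch`, `crc32_sw`, `crc32_hw`) are made up for illustration. It assumes a little-endian AArch64 target built with the CRC extension enabled (e.g. `-march=armv8-a+crc`), in which case the compiler defines `__ARM_FEATURE_CRC32` and `<arm_acle.h>` provides the `__crc32b` and `__crc32d` intrinsics. On any other target it falls back to a plain bitwise loop. It computes the standard CRC-32 (IEEE polynomial, initial value and final inversion of all ones); folly's seeding conventions may differ, so don't expect it to be a drop-in replacement.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>  // ACLE CRC32 intrinsics (__crc32b, __crc32d)
#endif

// Portable fallback: bitwise CRC-32 using the reflected IEEE polynomial.
// Slow, but useful as a reference and on targets without CRC instructions.
inline uint32_t crc32_sw(const uint8_t* data, size_t n, uint32_t crc) {
  for (size_t i = 0; i < n; ++i) {
    crc ^= data[i];
    for (int bit = 0; bit < 8; ++bit) {
      // If the low bit is set, shift and XOR in the polynomial; else just shift.
      crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
  }
  return crc;
}

#if defined(__ARM_FEATURE_CRC32)
// Hypothetical hardware path: 8 bytes per instruction, then a byte-wise tail.
// Assumes little-endian byte order.
inline uint32_t crc32_hw(const uint8_t* data, size_t n, uint32_t crc) {
  while (n >= 8) {
    uint64_t chunk;
    std::memcpy(&chunk, data, sizeof(chunk));  // avoids unaligned-access UB
    crc = __crc32d(crc, chunk);
    data += 8;
    n -= 8;
  }
  while (n > 0) {
    crc = __crc32b(crc, *data++);
    --n;
  }
  return crc;
}
#endif

// Standard CRC-32: start from all ones, invert the result at the end.
inline uint32_t crc32_sketch(const void* buf, size_t n) {
  const uint8_t* data = static_cast<const uint8_t*>(buf);
  uint32_t crc = 0xFFFFFFFFu;
#if defined(__ARM_FEATURE_CRC32)
  crc = crc32_hw(data, n, crc);
#else
  crc = crc32_sw(data, n, crc);
#endif
  return ~crc;
}
```

Both paths should agree on the usual check string: `crc32_sketch("123456789", 9)` is the well-known CRC-32 check value `0xCBF43926`. A production version would likely also want runtime CPU feature detection rather than a compile-time switch, and several interleaved CRC streams to hide instruction latency.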
The checksum implementation in folly is not optimized for Arm64 with Neon, so checksum performance there is quite slow.
./folly/hash/detail/ChecksumDetail.h
CacheLib relies heavily on Folly to compute checksums.
According to perf top, when CacheLib runs with a hybrid cache configuration, checksum computation consumes a large share of CPU time and has become a bottleneck.