facebook / folly

An open-source C++ library developed and used at Facebook.
https://groups.google.com/forum/?fromgroups#!forum/facebook-folly
Apache License 2.0

Checksum performance is slow on Arm64 #2027

Open kevinzs2048 opened 1 year ago

kevinzs2048 commented 1 year ago

The checksum implementation in folly is not optimized for Arm64 (NEON), so its performance there is quite slow.

./folly/hash/detail/ChecksumDetail.h

CacheLib relies heavily on Folly for its checksums.

According to perf top, in CacheLib with a hybrid cache configuration the checksum consumes a large share of CPU time and has become a bottleneck.

Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz
$ checksum_benchmark --bm_min_usec=10000
============================================================================
folly/hash/test/ChecksumBenchmark.cpp           relative  time/iter  iters/s
============================================================================
crc32_512                                                   55.73ns   17.94M
crc32_1024                                                  85.15ns   11.74M
crc32_2048                                                 116.29ns    8.60M
crc32_4096                                                 191.03ns    5.23M
crc32_8192                                                 341.44ns    2.93M
crc32_16384                                                627.76ns    1.59M
crc32_32768                                                  1.21us  827.16K
============================================================================
Comparison on Arm64:

============================================================================
[...]folly/hash/test/ChecksumBenchmark.cpp     relative  time/iter   iters/s
============================================================================
crc32_512                                                   1.80us   554.82K
crc32_1024                                                  3.58us   279.35K
crc32_2048                                                  7.14us   140.13K
crc32_4096                                                 14.25us    70.18K
crc32_8192                                                 28.47us    35.12K
crc32_16384                                                56.93us    17.57K
crc32_32768                                               113.83us     8.79K
============================================================================

Orvid commented 1 year ago

If checksumming is the bottleneck, the first thing I'd recommend is moving away from crc32, which, even fully optimized on x86_64, runs at less than 1/4 the speed of hash algorithms designed for speed, such as XXH3. XXH3 in particular should be well optimized for AArch64.

It does appear that ARM has equivalent hardware instructions for CRC32 hashing; we just haven't implemented support for them yet because we haven't needed it.
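
For illustration only, here is a rough sketch of what using those ARM instructions could look like through the ACLE intrinsics `__crc32d` and `__crc32b` from `<arm_acle.h>`. This is not folly's implementation: `crc32_sketch` is a hypothetical name, the code assumes a little-endian target, and it applies the conventional CRC-32 initial value and final inversion, which may not match folly's conventions. On builds without `__ARM_FEATURE_CRC32` (for example, AArch64 compiled without something like `-march=armv8-a+crc`, or any x86 build) it falls back to a slow bitwise loop so the snippet still compiles and gives the same result.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#if defined(__ARM_FEATURE_CRC32)
#include <arm_acle.h>  // ACLE CRC32 intrinsics
#endif

// Hypothetical helper, not part of folly. Computes the standard CRC-32
// (IEEE polynomial, reflected form 0xEDB88320) over `len` bytes.
inline uint32_t crc32_sketch(const uint8_t* data, size_t len,
                             uint32_t seed = ~0U) {
  uint32_t crc = seed;
#if defined(__ARM_FEATURE_CRC32)
  // Hardware path: consume 8 bytes per instruction.
  // Assumes little-endian byte order for the 64-bit loads.
  while (len >= 8) {
    uint64_t word;
    std::memcpy(&word, data, sizeof(word));  // safe unaligned load
    crc = __crc32d(crc, word);
    data += 8;
    len -= 8;
  }
  // Handle the remaining tail one byte at a time.
  while (len--) {
    crc = __crc32b(crc, *data++);
  }
#else
  // Portable fallback: one bit at a time, slow but equivalent.
  while (len--) {
    crc ^= *data++;
    for (int bit = 0; bit < 8; ++bit) {
      crc = (crc >> 1) ^ (0xEDB88320U & (0U - (crc & 1U)));
    }
  }
#endif
  return ~crc;  // conventional final inversion
}
```

A production version would likely interleave several independent CRC streams or use carry-less multiply (PMULL) folding to hide instruction latency; the single-stream loop above only shows the basic idea.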