bytedance / sonic-cpp

A fast JSON serializing & deserializing library, accelerated by SIMD.
Apache License 2.0
835 stars, 101 forks

arm: optimize decoder on Arm SVE2 platform #92

Closed: cyb70289 closed this 4 weeks ago

cyb70289 commented 2 months ago

This patch improves the performance of the sonic JSON decoder on Arm SVE2 CPUs. It uses the SVMATCH instruction to locate multiple tokens in a vector efficiently.
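To illustrate what "locating multiple tokens in a vector" means, here is a portable scalar sketch of the result a MATCH-style operation produces for one 128-bit block. It is not the code from this patch: `match_any`, `make_tokens`, and `load_block` are hypothetical names, and the loops only model the semantics that the hardware evaluates in a single instruction.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helpers, not part of sonic-cpp. They model one 128-bit block.
constexpr std::size_t kVecBytes = 16;
using Block = std::array<std::uint8_t, kVecBytes>;

// Bit i of the result is set when data[i] equals ANY byte in `tokens`.
// A MATCH-style instruction yields this information as a predicate.
inline std::uint16_t match_any(const Block& data, const Block& tokens) {
  std::uint16_t mask = 0;
  for (std::size_t i = 0; i < kVecBytes; ++i) {
    for (std::size_t j = 0; j < kVecBytes; ++j) {
      if (data[i] == tokens[j]) {
        mask |= static_cast<std::uint16_t>(1u << i);
        break;
      }
    }
  }
  return mask;
}

// Fills a block by repeating `chars` (must hold 1 to 16 characters).
inline Block make_tokens(const std::string& chars) {
  Block t{};
  for (std::size_t i = 0; i < kVecBytes; ++i) {
    t[i] = static_cast<std::uint8_t>(chars[i % chars.size()]);
  }
  return t;
}

// Copies up to 16 bytes of `s` into a block; the remainder stays zero.
inline Block load_block(const std::string& s) {
  Block b{};
  for (std::size_t i = 0; i < kVecBytes && i < s.size(); ++i) {
    b[i] = static_cast<std::uint8_t>(s[i]);
  }
  return b;
}
```

For example, calling `match_any(load_block(text), make_tokens("\"\\"))` flags every quote and backslash in the first 16 bytes of `text` in one pass, where a compare-based approach needs one comparison per token.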

Enable this feature with the cmake option "-DENABLE_SVE2_128=ON". Please note that the resulting binary can only run on hardware that supports SVE2 with a 128-bit vector length, such as Neoverse-N2. On any other hardware, the behaviour is undefined.
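Because a wrong vector length leads to undefined behaviour, a caller might want to check it at startup. The sketch below is an assumption, not something this patch provides: `sve_vector_bytes` and `vector_length_is_supported` are hypothetical names. It only checks the vector length; `svcntb()` is itself an SVE instruction, so detecting whether SVE2 is present at all needs an OS facility first (for example `getauxval` on Linux).

```cpp
#include <cstddef>

#if defined(__ARM_FEATURE_SVE)
#include <arm_sve.h>
#endif

// Hypothetical guard, not part of sonic-cpp.
constexpr std::size_t kExpectedVectorBytes = 16;  // 128 bits

// Returns the SVE vector length in bytes, or 0 when this translation unit
// was compiled without SVE support.
inline std::size_t sve_vector_bytes() {
#if defined(__ARM_FEATURE_SVE)
  return static_cast<std::size_t>(svcntb());
#else
  return 0;
#endif
}

// True only for the 128-bit vector length that the SVE2 build assumes.
inline bool vector_length_is_supported(std::size_t vector_bytes) {
  return vector_bytes == kExpectedVectorBytes;
}
```

A program could call `vector_length_is_supported(sve_vector_bytes())` once and refuse to start, or fall back to a NEON build, when it returns false.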

The table below shows results from a Bluewhale server. The sonic decoder benchmarks improve by roughly 3% to 24% on every input except canada, which is essentially unchanged (-0.31%). No side effects were observed in the other benchmarks.

Benchmark                        Original   SVE2      Improvement
gsoc-2018/Decode_SonicDyn        2.38736    2.76677   15.89%
citm_catalog/Decode_SonicDyn     1.41729    1.76191   24.32%
otfcc/Decode_SonicDyn            399.916    413.417   3.38%
fgo/Decode_SonicDyn              691.597    716.301   3.57%
twitter/Decode_SonicDyn          1.33604    1.58737   18.81%
twitterescaped/Decode_SonicDyn   1.24759    1.30216   4.37%
github_events/Decode_SonicDyn    1.38961    1.65635   19.20%
canada/Decode_SonicDyn           526.145    524.517   -0.31%
poet/Decode_SonicDyn             2.06297    2.40383   16.52%
lottie/Decode_SonicDyn           419.902    438.824   4.51%
book/Decode_SonicDyn             456.615    487.196   6.70%

xiegx94 commented 2 months ago

@cyb70289 What's the unit of your benchmark results? Is higher better (HIB) or lower better (LIB)?

cyb70289 commented 2 months ago

> @cyb70289 What's the unit of your benchmark results? Is higher better (HIB) or lower better (LIB)?

The unit is bytes per second, reported as Gi/s or Mi/s, so higher is better.

As an example:

$ build/benchmark/bench --benchmark_filter=Decode_Sonic
gsoc-2018/Decode_SonicDyn         1299148 ns      1299146 ns          537 bytes_per_second=2.38563Gi/s testdata/gsoc-2018.json
citm_catalog/Decode_SonicDyn      1136378 ns      1136290 ns          617 bytes_per_second=1.41565Gi/s testdata/citm_catalog.json
otfcc/Decode_SonicDyn           158508828 ns    158472460 ns            4 bytes_per_second=399.646Mi/s testdata/otfcc.json
fgo/Decode_SonicDyn              67084470 ns     67084360 ns            9 bytes_per_second=692.246Mi/s testdata/fgo.json
......

xiegx94 commented 2 months ago

See #56. Please support SVE as a separate arch.

cyb70289 commented 2 months ago

Thanks, I will try to refactor following that PR. Instead of adding a complete SVE implementation, I'm thinking about "inheriting" from NEON and only overriding the code that can benefit from SVE. It looks to me like much of the code will be the same for NEON and SVE.
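
One way such "inheritance" could look is sketched below, using namespaces and using-declarations. This is only an illustration of the idea, not the layout the PR ended up with: the namespace and function names are hypothetical, and the overridden function uses a standard-library call as a stand-in for real SVE2 code.

```cpp
#include <cstddef>
#include <string>

// All names here are hypothetical, not sonic-cpp's actual layout.
namespace neon {

// Returns the index of the first non-whitespace character at or after pos.
inline std::size_t skip_space(const std::string& s, std::size_t pos) {
  while (pos < s.size() && (s[pos] == ' ' || s[pos] == '\n' ||
                            s[pos] == '\t' || s[pos] == '\r')) {
    ++pos;
  }
  return pos;
}

// Returns the index of the next quote or backslash, or s.size() if none.
inline std::size_t find_string_end(const std::string& s, std::size_t pos) {
  while (pos < s.size() && s[pos] != '"' && s[pos] != '\\') ++pos;
  return pos;
}

}  // namespace neon

namespace sve2_128 {

// "Inherited": reused from the NEON back end without changes.
using neon::skip_space;

// "Overridden": a real back end would use SVE2 here; this stand-in only
// shows that the function can be replaced without touching the rest.
inline std::size_t find_string_end(const std::string& s, std::size_t pos) {
  const std::size_t hit = s.find_first_of("\"\\", pos);
  return hit == std::string::npos ? s.size() : hit;
}

}  // namespace sve2_128
```

Callers select a back end by namespace, so `sve2_128::skip_space` resolves to the NEON code while `sve2_128::find_string_end` resolves to the specialised version.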

cyb70289 commented 2 months ago

@xiegx94, the sve2-128 implementation is added. The Arm common code is moved to common/arm_common/. I checked the sonic decoder benchmarks and found no performance regression.

cyb70289 commented 2 months ago

Is there a convenient way to run the clang-format job locally?

xiegx94 commented 2 months ago

> Is there a convenient way to run the clang-format job locally?

Could you install clang on your machine? If you have clang-format, run git clang-format.

cyb70289 commented 2 months ago

> > Is there a convenient way to run the clang-format job locally?
>
> Could you install clang on your machine? If you have clang-format, run git clang-format.

Thanks, the formatting should be fixed now.

cyb70289 commented 2 months ago

The "Test coverage" job runs successfully on my local x86 server, so I'm not sure why the CI job fails. It looks like it only runs on x86?

xiegx94 commented 1 month ago

@cyb70289 Please update cmake/set_arch_flags.cmake.

xiegx94 commented 1 month ago

#93 FYI @cyb70289

cyb70289 commented 1 month ago

> @cyb70289 Please update cmake/set_arch_flags.cmake.

@xiegx94 Updated.