bytedance / sonic-cpp

A fast JSON serializing & deserializing library, accelerated by SIMD.
Apache License 2.0

feat: support neon, sse simd and dynamic dispatch #56

Closed xiegx94 closed 1 year ago

xiegx94 commented 1 year ago

Main changes

xiegx94 commented 1 year ago

This PR provides two ways to support multi-architecture dispatch: dispatch at compile time (static dispatch) and dispatch at runtime (dynamic dispatch). Dynamic dispatch is implemented with GCC/Clang function multiversioning, which prevents the dispatched functions from being inlined at compile time, so performance is worse than with static dispatch.
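
As a rough illustration of the function multiversioning mechanism mentioned above, the sketch below defines several versions of one function with the GCC/Clang `target` attribute. It assumes x86-64 Linux with glibc and falls back to a single plain function elsewhere. The names `demo::selected_kernel`, `demo::run` and `DEMO_HAS_MULTIVERSIONING` are invented for this example and are not part of sonic-cpp.

```cpp
#include <cstring>

// Assumption: multiversioning via the target attribute is only enabled here
// on x86-64 Linux with glibc and a GCC-compatible compiler.
#if defined(__x86_64__) && defined(__linux__) && defined(__GLIBC__) && \
    defined(__GNUC__)
#define DEMO_HAS_MULTIVERSIONING 1
#else
#define DEMO_HAS_MULTIVERSIONING 0
#endif

namespace demo {

#if DEMO_HAS_MULTIVERSIONING
// One definition per target. The version matching the running CPU is
// selected when the program starts, not when it is compiled.
__attribute__((target("avx2"))) const char* selected_kernel() {
  return "avx2";
}
__attribute__((target("sse4.2"))) const char* selected_kernel() {
  return "sse4.2";
}
__attribute__((target("default"))) const char* selected_kernel() {
  return "default";
}
#else
// Fallback for platforms where the sketch does not use multiversioning.
const char* selected_kernel() { return "default"; }
#endif

// The caller never names a target; it simply calls the function.
const char* run() { return selected_kernel(); }

}  // namespace demo
```

Because the version is chosen when the program starts rather than at compile time, the compiler cannot inline the call into `demo::run`. That is the cost the description above attributes to dynamic dispatch.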

Structure of the arch folder

├── avx2
│   ├── base.h
│   ├── itoa.h
│   ├── quote.h
│   ├── simd.h
│   ├── skip.h
│   ├── str2int.h
│   └── unicode.h
├── common
│   ├── quote_common.h
│   ├── quote_tables.h
│   ├── skip_common.h
│   ├── unicode_common.h
│   └── x86_common
│       ├── itoa.h
│       ├── quote.inc.h
│       └── skip.inc.h
├── neon
│   ├── base.h
│   ├── itoa.h
│   ├── quote.h
│   ├── simd.h
│   ├── skip.h
│   ├── str2int.h
│   └── unicode.h
├── simd_base.h
├── simd_dispatch.h
├── simd_itoa.h
├── simd_quote.h
├── simd_skip.h
├── simd_str2int.h
├── sonic_cpu_feature.h
├── sse
│   ├── base.h
│   ├── itoa.h
│   ├── quote.h
│   ├── simd.h
│   ├── skip.h
│   ├── str2int.h
│   └── unicode.h
├── target_macro.h
└── x86_ifuncs
    ├── base.h
    ├── ifunc_macro.h
    ├── itoa.h
    ├── quote.h
    ├── skip.h
    └── str2int.h

How to add a new function

To add a new SIMD function called foo, follow these steps:

  1. Implement foo for every architecture, for example:

```c++
namespace sonic_json {
namespace internal {
namespace avx2 {

void foo() { return; }

}  // namespace avx2
}  // namespace internal
}  // namespace sonic_json
```

  2. Provide dynamic dispatch functions for x86 (or other platforms):

```c++
namespace sonic_json {
namespace internal {

__attribute__((target(HASWELL))) inline void foo() { return avx2::foo(); }
__attribute__((target(WESTMERE))) inline void foo() { return sse::foo(); }
__attribute__((target("default"))) inline void foo() { return sse::foo(); }

}  // namespace internal
}  // namespace sonic_json
```

  3. If you implement foo in a new header file foo.h, provide that file for every architecture and for x86_ifuncs, then add a new file simd_foo.h in the arch folder:

```c++
#pragma once

#include "simd_dispatch.h"

#include INCLUDE_ARCH_FILE(foo.h)

namespace sonic_json {
namespace internal {

SONIC_USING_ARCH_FUNC(foo);

}  // namespace internal
}  // namespace sonic_json
```
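
To make the static dispatch side of these steps concrete, here is a minimal sketch of how a using-declaration can bind one per-architecture implementation at compile time. It is an assumption-laden stand-in, not sonic-cpp's code: the `demo` namespace, `DEMO_USING_ARCH_FUNC`, `DEMO_ARCH_NAME` and `call_foo` are hypothetical, and the per-architecture functions contain no intrinsics so that the example compiles on any platform.

```cpp
#include <string>

namespace demo {
namespace internal {

// Stand-ins for what avx2/foo.h and sse/foo.h would each provide.
namespace avx2 {
inline std::string foo() { return "avx2"; }
}  // namespace avx2

namespace sse {
inline std::string foo() { return "sse"; }
}  // namespace sse

// Static dispatch: the preprocessor picks exactly one namespace.
#if defined(__AVX2__)
#define DEMO_USING_ARCH_FUNC(func) using avx2::func
#define DEMO_ARCH_NAME "avx2"
#else
#define DEMO_USING_ARCH_FUNC(func) using sse::func
#define DEMO_ARCH_NAME "sse"
#endif

// After this line, internal::foo names the selected implementation.
DEMO_USING_ARCH_FUNC(foo);

// Callers use the unqualified name, so the compiler sees an ordinary call
// to a known inline function.
inline std::string call_foo() { return foo(); }

}  // namespace internal
}  // namespace demo
```

With this arrangement `demo::internal::foo()` is an ordinary function chosen before code generation, which leaves the compiler free to inline it.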

How to add a new architecture

If there is a new architecture named Y86, do the following:

  1. Write a new rule in sonic_cpu_feature.h to detect the Y86 macro (provided by GCC/Clang):

```c++
#if defined(__Y86__)
#define SONIC_HAVE_Y86
#endif
```

  2. Write a new dispatch rule in simd_dispatch.h:

```c++
#if defined(SONIC_STATIC_DISPATCH)
#if defined(SONIC_HAVE_Y86)
// Static dispatch: bring the y86 implementation into scope.
#define SONIC_USING_ARCH_FUNC(func) using y86::func
#define INCLUDE_ARCH_FILE(file) SONIC_STRINGIFY(y86/file)
#endif
#elif defined(SONIC_DYNAMIC_DISPATCH)
#if defined(SONIC_HAVE_Y86)
// Dynamic dispatch: the ifunc headers define func directly, so no using-declaration is needed.
#define SONIC_USING_ARCH_FUNC(func)
#define INCLUDE_ARCH_FILE(file) SONIC_STRINGIFY(y86_ifuncs/file)
#endif
#endif
```

  3. Create a y86 folder and implement all SIMD functions.
  4. Create a y86_ifuncs folder and implement all multiversioning functions.
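
The include-path macro in the dispatch rule can be illustrated with a small sketch. It assumes that the stringify helper is a conventional two-level macro, which the PR text does not show. `DEMO_STRINGIFY`, `DEMO_INCLUDE_ARCH_FILE` and `demo::arch_header_path` are hypothetical names used only here.

```cpp
#include <cstring>

// Two-level stringification: the outer macro expands its argument first,
// the inner macro turns the expanded tokens into a string literal.
#define DEMO_STRINGIFY_IMPL(x) #x
#define DEMO_STRINGIFY(x) DEMO_STRINGIFY_IMPL(x)

// Hypothetical counterpart of INCLUDE_ARCH_FILE for a folder named y86.
#define DEMO_INCLUDE_ARCH_FILE(file) DEMO_STRINGIFY(y86/file)

namespace demo {

// Returns the path that `#include DEMO_INCLUDE_ARCH_FILE(foo.h)` would name.
inline const char* arch_header_path() {
  return DEMO_INCLUDE_ARCH_FILE(foo.h);
}

// True when the generated path points into the given folder.
inline bool path_is_in_folder(const char* folder) {
  const char* path = arch_header_path();
  std::size_t n = std::strlen(folder);
  return std::strncmp(path, folder, n) == 0 && path[n] == '/';
}

}  // namespace demo
```

Since the folder name becomes part of the include path, the spelling used in the macro has to match the folder on disk exactly, including letter case on case-sensitive file systems.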

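sonic's multi-architecture design supports both selecting the instruction set at compile time and selecting suitable instructions at runtime according to the platform the program runs on. Both modes are supported because runtime selection prevents functions and interfaces that use SIMD from being inlined at compile time, which causes some performance loss.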

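In the arch directory, avx2, sse and neon contain the SIMD implementations for each specific architecture, common contains shared implementations, and x86_ifuncs contains the dynamic dispatch code for the x86 platform.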

How to add a new function

  1. Add the new function under every architecture, for example:

```c++
namespace sonic_json {
namespace internal {
namespace avx2 {

void foo() { return; }

}  // namespace avx2
}  // namespace internal
}  // namespace sonic_json
```

  2. Add x86 dynamic dispatch support under x86_ifuncs:

```c++
namespace sonic_json {
namespace internal {

__attribute__((target(HASWELL))) inline void foo() { return avx2::foo(); }
__attribute__((target(WESTMERE))) inline void foo() { return sse::foo(); }
__attribute__((target("default"))) inline void foo() { return sse::foo(); }

}  // namespace internal
}  // namespace sonic_json
```

  3. (Optional) If you added a new header file, add simd_foo.h under arch and add foo.h under each architecture folder. simd_foo.h looks like this:

```c++
#pragma once

#include "simd_dispatch.h"

#include INCLUDE_ARCH_FILE(foo.h)

namespace sonic_json {
namespace internal {

SONIC_USING_ARCH_FUNC(foo);

}  // namespace internal
}  // namespace sonic_json
```

How to add a new architecture

Suppose there is a new architecture called Y86 and sonic needs SIMD support for it:

  1. Detect the Y86 macro in sonic_cpu_feature.h:

```c++
#if defined(__Y86__)
#define SONIC_HAVE_Y86
#endif
```

  2. Add the dispatch rule in simd_dispatch.h:

```c++
#if defined(SONIC_STATIC_DISPATCH)
#if defined(SONIC_HAVE_Y86)
#define SONIC_USING_ARCH_FUNC(func) using y86::func
#define INCLUDE_ARCH_FILE(file) SONIC_STRINGIFY(y86/file)
#endif
#elif defined(SONIC_DYNAMIC_DISPATCH)
#if defined(SONIC_HAVE_Y86)
#define SONIC_USING_ARCH_FUNC(func)
#define INCLUDE_ARCH_FILE(file) SONIC_STRINGIFY(y86_ifuncs/file)
#endif
#endif
```

  3. Add a y86 folder containing the y86 implementation of every SIMD function.
  4. Add a y86_ifuncs folder containing the y86 multiversioning-function implementations.

codecov-commenter commented 1 year ago

Codecov Report

Merging #56 (9980dc1) into master (80cdba0) will increase coverage by 0.84%. The diff coverage is 91.61%.

@@            Coverage Diff             @@
##           master      #56      +/-   ##
==========================================
+ Coverage   95.04%   95.88%   +0.84%     
==========================================
  Files          22       21       -1     
  Lines        2785     2431     -354     
==========================================
- Hits         2647     2331     -316     
+ Misses        138      100      -38     

| Impacted Files | Coverage Δ |
| --- | --- |
| include/sonic/allocator.h | 90.43% <ø> (ø) |
| include/sonic/dom/dynamicnode.h | 96.08% <ø> (ø) |
| include/sonic/dom/serialize.h | 93.39% <ø> (ø) |
| include/sonic/internal/arch/avx2/base.h | 100.00% <ø> (ø) |
| include/sonic/internal/ftoa.h | 97.34% <ø> (ø) |
| include/sonic/internal/itoa.h | 100.00% <ø> (ø) |
| include/sonic/internal/arch/simd_skip.h | 89.23% <89.23%> (ø) |
| include/sonic/dom/handler.h | 99.04% <100.00%> (ø) |
| include/sonic/dom/parser.h | 94.23% <100.00%> (ø) |
| include/sonic/internal/arch/avx2/simd.h | 100.00% <100.00%> (ø) |
| ... and 4 more | |

... and 3 files with indirect coverage changes

xiegx94 commented 1 year ago

Performance

| test case | master (haswell) | sse | haswell | dynamic dispatch |
| --- | --- | --- | --- | --- |
| book/Decode_SonicDyn | 980 ns | 813 ns | 849 ns | 1128 ns |
| gsoc-2018/Decode_SonicDyn | 1406878 ns | 1470898 ns | 1339296 ns | 1752588 ns |
| fgo/Decode_SonicDyn | 129952490 ns | 112165070 ns | 117719364 ns | 150338769 ns |
| lottie/Decode_SonicDyn | 948184 ns | 805143 ns | 842756 ns | 1187414 ns |
| canada/Decode_SonicDyn | 4068896 ns | 3756878 ns | 3789520 ns | 4085432 ns |
| github_events/Decode_SonicDyn | 42468 ns | 39368 ns | 39716 ns | 54755 ns |
| otfcc/Decode_SonicDyn | 321242929 ns | 292141676 ns | 320360184 ns | 377427578 ns |
| poet/Decode_SonicDyn | 1611831 ns | 1572923 ns | 1534339 ns | 1743444 ns |
| citm_catalog/Decode_SonicDyn | 1212610 ns | 1137476 ns | 1217241 ns | 1439325 ns |
| twitter/Decode_SonicDyn | 194191 ns | 181451 ns | 185673 ns | 260165 ns |
| twitterescaped/Decode_SonicDyn | 572412 ns | 492546 ns | 555098 ns | 671564 ns |
| book/Encode_SonicDyn | 598 ns | 619 ns | 631 ns | 616 ns |
| gsoc-2018/Encode_SonicDyn | 702591 ns | 796307 ns | 680203 ns | 672574 ns |
| fgo/Encode_SonicDyn | 75789720 ns | 75432301 ns | 76930374 ns | 75568517 ns |
| lottie/Encode_SonicDyn | 846441 ns | 858753 ns | 839591 ns | 871623 ns |
| canada/Encode_SonicDyn | 6078378 ns | 6152427 ns | 6009102 ns | 6074922 ns |
| github_events/Encode_SonicDyn | 21617 ns | 22432 ns | 21035 ns | 20380 ns |
| otfcc/Encode_SonicDyn | 222879330 ns | 155490041 ns | 159626245 ns | 160975575 ns |
| poet/Encode_SonicDyn | 864328 ns | 840780 ns | 720925 ns | 721330 ns |
| citm_catalog/Encode_SonicDyn | 600309 ns | 533967 ns | 560394 ns | 554867 ns |
| twitter/Encode_SonicDyn | 94764 ns | 97690 ns | 93807 ns | 89528 ns |
| twitterescaped/Encode_SonicDyn | 281830 ns | 284284 ns | 263186 ns | 269042 ns |

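
One way to read these timings is as a ratio of dynamic dispatch to master (haswell). The sketch below computes that ratio for four rows copied from the table above. It assumes the dynamic dispatch column is also in nanoseconds, and `demo::Row`, `demo::slowdown`, `demo::sample_rows` and `demo::print_slowdowns` are names invented for this example.

```cpp
#include <cstdio>
#include <vector>

namespace demo {

// One benchmark row: timings in nanoseconds, copied from the table.
struct Row {
  const char* name;
  double master_ns;
  double dynamic_ns;
};

// A ratio above 1.0 means dynamic dispatch is slower than master (haswell).
inline double slowdown(const Row& r) { return r.dynamic_ns / r.master_ns; }

// A small subset of the rows, two decode and two encode cases.
inline std::vector<Row> sample_rows() {
  return {
      {"book/Decode_SonicDyn", 980.0, 1128.0},
      {"twitter/Decode_SonicDyn", 194191.0, 260165.0},
      {"book/Encode_SonicDyn", 598.0, 616.0},
      {"twitter/Encode_SonicDyn", 94764.0, 89528.0},
  };
}

// Prints one line per row with the ratio rounded to two decimals.
inline void print_slowdowns() {
  for (const Row& r : sample_rows()) {
    std::printf("%-26s %.2fx\n", r.name, slowdown(r));
  }
}

}  // namespace demo
```

For instance, `book/Decode_SonicDyn` comes out at roughly 1.15x (1128 / 980), while `twitter/Encode_SonicDyn` is below 1.0x (89528 / 94764).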
liuq19 commented 1 year ago

Performance

| test case | sse | haswell | dynamic dispatch |
| --- | --- | --- | --- |
| book/Decode_SonicDyn | 813 ns | 849 ns | 1128 ns |
| gsoc-2018/Decode_SonicDyn | 1470898 ns | 1339296 ns | 1752588 ns |
| fgo/Decode_SonicDyn | 112165070 ns | 117719364 ns | 150338769 ns |
| lottie/Decode_SonicDyn | 805143 ns | 842756 ns | 1187414 ns |
| canada/Decode_SonicDyn | 3756878 ns | 3789520 ns | 4085432 ns |
| github_events/Decode_SonicDyn | 39368 ns | 39716 ns | 54755 ns |
| otfcc/Decode_SonicDyn | 292141676 ns | 320360184 ns | 377427578 ns |
| poet/Decode_SonicDyn | 1572923 ns | 1534339 ns | 1743444 ns |
| citm_catalog/Decode_SonicDyn | 1137476 ns | 1217241 ns | 1439325 ns |
| twitter/Decode_SonicDyn | 181451 ns | 185673 ns | 260165 ns |
| twitterescaped/Decode_SonicDyn | 492546 ns | 555098 ns | 671564 ns |
| book/Encode_SonicDyn | 619 ns | 631 ns | 616 ns |
| gsoc-2018/Encode_SonicDyn | 796307 ns | 680203 ns | 672574 ns |
| fgo/Encode_SonicDyn | 75432301 ns | 76930374 ns | 75568517 ns |
| lottie/Encode_SonicDyn | 858753 ns | 839591 ns | 871623 ns |
| canada/Encode_SonicDyn | 6152427 ns | 6009102 ns | 6074922 ns |
| github_events/Encode_SonicDyn | 22432 ns | 21035 ns | 20380 ns |
| otfcc/Encode_SonicDyn | 155490041 ns | 159626245 ns | 160975575 ns |
| poet/Encode_SonicDyn | 840780 ns | 720925 ns | 721330 ns |
| citm_catalog/Encode_SonicDyn | 533967 ns | 560394 ns | 554867 ns |
| twitter/Encode_SonicDyn | 97690 ns | 93807 ns | 89528 ns |
| twitterescaped/Encode_SonicDyn | 284284 ns | 263186 ns | 269042 ns |

It would be best to post the performance of the current branch relative to master separately for static mode and dynamic mode. That should make things clearer.

xiegx94 commented 1 year ago

Updated.