Blosc / c-blosc2

A fast, compressed, persistent binary data store library for C.
https://www.blosc.org
Other
447 stars 84 forks source link

`shuffle.c` is compiled with `-mavx512*` which results in `SIGILL` on lower CPUs #621

Closed mgorny closed 4 months ago

mgorny commented 5 months ago

Describe the bug When compiler supports SSE2/AVX2/AVX512 instructions, the respective -m flags are added to the flags for shuffle.c. As a result, the compiler can compile arbitrary code with these instruction sets. For me, GCC 14 creates a library that uses AVX512 instructions on unsupported CPU and effectively crashes with SIGILL, e.g.:

$ ninja test
[0/1] Running tests...
Test project /tmp/c-blosc2/build
          Start    1: test_plugin_test_ndlz
   1/1746 Test    #1: test_plugin_test_ndlz .....................................***Exception: Illegal  0.11 sec
          Start    2: test_plugin_test_zfp_acc_float
   2/1746 Test    #2: test_plugin_test_zfp_acc_float ............................***Exception: Illegal  0.10 sec
          Start    3: test_plugin_test_zfp_prec_float

For example, plugins/codecs/ndlz/test_ndlz crashes in get_shuffle_implementation:

─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│   0x555555582115 <get_shuffle_implementation+37>  jne    0x5555555821d0 <get_shuffle_implementation+224>                            │
│   0x55555558211b <get_shuffle_implementation+43>  vmovq  0x10ae55(%rip),%xmm0        # 0x55555568cf78                               │
│   0x555555582123 <get_shuffle_implementation+51>  vmovq  0x10adad(%rip),%xmm1        # 0x55555568ced8                               │
│   0x55555558212b <get_shuffle_implementation+59>  lea    0xe250a(%rip),%rcx        # 0x55555566463c                                 │
│   0x555555582132 <get_shuffle_implementation+66>  lea    0xa2807(%rip),%rax        # 0x555555624940 <unshuffle_generic>             │
│   0x555555582139 <get_shuffle_implementation+73>  lea    0xa2fc0(%rip),%rdx        # 0x555555625100 <bshuf_untrans_bit_elem_scal>   │
│   0x555555582140 <get_shuffle_implementation+80>  vpinsrq $0x1,%rdx,%xmm1,%xmm1                                                     │
│   0x555555582146 <get_shuffle_implementation+86>  vpinsrq $0x1,%rax,%xmm0,%xmm0                                                     │
│   0x55555558214c <get_shuffle_implementation+92>  mov    %rcx,(%rdi)                                                                │
│   0x55555558214f <get_shuffle_implementation+95>  mov    %rdi,%rax                                                                  │
│  >0x555555582152 <get_shuffle_implementation+98>  vinserti32x4 $0x1,%xmm1,%ymm0,%ymm0                                               │
│   0x555555582159 <get_shuffle_implementation+105> vmovdqu %ymm0,0x8(%rdi)                                                           │
│   0x55555558215e <get_shuffle_implementation+110> vzeroupper                                                                        │
│   0x555555582161 <get_shuffle_implementation+113> ret                                                                               │
│   0x555555582162 <get_shuffle_implementation+114> data16 cs nopw 0x0(%rax,%rax,1)                                                   │
│   0x55555558216d <get_shuffle_implementation+125> nopl   (%rax)                                                                     │
│   0x555555582170 <get_shuffle_implementation+128> cmp    $0x208000,%edx                                                             │
│   0x555555582176 <get_shuffle_implementation+134> je     0x5555555821a0 <get_shuffle_implementation+176>                            │
│   0x555555582178 <get_shuffle_implementation+136> vmovq  0x10ad68(%rip),%xmm0        # 0x55555568cee8                               │
│   0x555555582180 <get_shuffle_implementation+144> vmovq  0x10ae30(%rip),%xmm1        # 0x55555568cfb8                               │
└─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
multi-thre Thread 0x7ffff7caa7 In: get_shuffle_implementation                                                 L306  PC: 0x555555582152 

To Reproduce

cmake -G Ninja ${srcdir} -DCMAKE_C_FLAGS='-march=znver2 -O2 -pipe' -DCMAKE_BUILD_TYPE=RelWithDebInfo
ninja
ninja test

Expected behavior Not crashing, i.e. not using SSE2/AVX2/AVX512 instructions outside specific files intended for them :-).

Logs CMake output: cmake.txt Ninja output (for build): build.txt

System information:

Additional context n/a

FrancescAlted commented 5 months ago

Yes, shuffle.c is designed to statically include many SIMD instructions and then dynamically select the correct version at runtime. This worked pretty well for many years now.

I don't have an AMD Zen handy, but I have tried this out with an Intel i13900K and an assortment of different compilers (gcc-12, gcc-13, gcc-14, clang-16, clang-17 and clang-18), and only gcc-14 exhibits this issue. Could it be a bug introduced in gcc-14?

FWIW, removing "-march=znver2" in your example works for me.

mgorny commented 5 months ago

It's not a bug. If you tell compiler to use AVX512, then it is permitted to use AVX512. That's what the flag does, by design. If anything, it just means that GCC 14 improved the optimizer that it finds new optimization possibilities.

FrancescAlted commented 5 months ago

Ok. Then I'm curious on why removing "-march=znver2" in your example works, as blosc is still telling the compiler to use AVX512.

mgorny commented 5 months ago

It probably triggers a different optimization profile.

thesamesam commented 4 months ago

Indeed, see https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-mmmx (scroll down a bit);

These options enable GCC to use these extended instructions in generated code, even without -mfpmath=sse. Applications that perform run-time CPU detection must compile separate files for each supported architecture, using the appropriate flags. In particular, the file containing the CPU detection code should be compiled without these options.