alpaka-group / llama

A Low-Level Abstraction of Memory Access
https://llama-doc.rtfd.io/
Mozilla Public License 2.0

SEGV in `std::simd` tests #777

Closed bernhardmgruber closed 11 months ago

bernhardmgruber commented 11 months ago

The CI often, but not deterministically, crashes inside a test using std::simd, with the following errors:

Randomness seeded to: 4119962326
AddressSanitizer:DEADLYSIGNAL

=================================================================
==8399==ERROR: AddressSanitizer: SEGV on unknown address 0x03e9000020cf (pc 0x55b54af5f43c bp 0x7ffd8de6d430 sp 0x7ffd8de6b780 T0)
==8399==The signal is caused by a READ memory access.
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
tests is a Catch2 v3.4.0 host application.
Run with -? for options

-------------------------------------------------------------------------------
simd.Simd.stdsimd
-------------------------------------------------------------------------------
/home/runner/work/llama/llama/tests/simd.cpp:139
...............................................................................

/home/runner/work/llama/llama/tests/simd.cpp:141: FAILED:
  {Unknown expression after the reported line}
due to a fatal error condition:
  SIGSEGV - Segmentation violation signal

===============================================================================
test cases:   646 |   645 passed | 1 failed
assertions: 86551 | 86550 passed | 1 failed

AddressSanitizer:DEADLYSIGNAL
AddressSanitizer: nested bug in the same thread, aborting.
Error: Process completed with exit code 1.

This only happens on g++-12 and g++-13. I cannot reproduce it locally, though.

bernhardmgruber commented 11 months ago

I used the Debugging with tmate GH action to ssh to a runner and debug the situation. The failure occurs in constructField, called from the default constructor of a llama::Simd, when the field type is std::native_simd<float> on machines with AVX512. The crashing instruction is a vmovaps to an address that is not a multiple of 64 and is therefore misaligned.

The crash only happens when ASAN is turned on. With ASAN turned off on the same machine, the tests pass. The address on which placement new is called inside constructField is also appropriately aligned, including in ASAN builds.
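For readers who want to see what "appropriately aligned" means here, the following is a minimal sketch of checking an address before placement new. It is not LLAMA's code: `Vec16f`, `isAligned`, `constructAt`, and `demo` are hypothetical names, and `Vec16f` is only a stand-in for a 64-byte vector type such as a 16-lane float SIMD register.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical stand-in for a 64-byte vector of 16 floats (assumption:
// this mimics the size and alignment of an AVX512 float vector).
struct alignas(64) Vec16f {
    float lanes[16];
};

// Returns true if p is a multiple of `alignment`.
// `alignment` must be a power of two.
inline bool isAligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper: constructs a Vec16f in caller-provided storage.
// The empty braces value-initialize the aggregate, so all lanes are zero.
inline Vec16f* constructAt(void* storage) {
    assert(isAligned(storage, alignof(Vec16f)));
    return ::new (storage) Vec16f{};
}

// Builds one vector in suitably aligned stack storage and checks it.
inline bool demo() {
    alignas(Vec16f) unsigned char storage[sizeof(Vec16f)];
    Vec16f* v = constructAt(storage);
    bool allZero = true;
    for (float x : v->lanes)
        allZero = allZero && (x == 0.0f);
    v->~Vec16f();
    return allZero;
}
```

An aligned vector store such as vmovaps requires the target address to pass a check like `isAligned(p, 64)` for 64-byte vectors. In the situation described above, the storage handed to placement new passes this check, so the misaligned store has to come from somewhere else.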

Upon further inspection: it looks like ASAN generates additional code around placement new, including the misaligned store.

bernhardmgruber commented 11 months ago

PR #783 tried to fix this issue, but more cases have since appeared where g++ generates wrong code when ASan is enabled and SIMD is involved.