Blosc / c-blosc2

A fast, compressed, persistent binary data store library for C.
https://www.blosc.org

Multiple test failures when running tests -j12 #432

Open mgorny opened 1 year ago

mgorny commented 1 year ago

Describe the bug

When I run the test suite with ctest -j12 (i.e. 12 parallel jobs), 2-3 different tests fail in each run. Over a few runs, the following tests failed:

    289 - test_fill_special (Failed)
    291 - test_frame_get_offsets (SEGFAULT)
    706 - test_schunk_frame (Failed)
    707 - test_schunk_header (Failed)
    709 - test_sframe (Failed)
    710 - test_sframe_lazychunk (Failed)

Segfaults are especially concerning.

To Reproduce

mkdir build
cd build
cmake .. -G Ninja -DCMAKE_INSTALL_PREFIX=/usr -DBUILD_STATIC=OFF -DBUILD_TESTS=yes -DBUILD_BENCHMARKS=OFF -DBUILD_EXAMPLES=OFF -DBUILD_FUZZERS=OFF -DDEACTIVATE_ZLIB=no -DDEACTIVATE_ZSTD=no -DPREFER_EXTERNAL_LZ4=ON -DPREFER_EXTERNAL_ZLIB=ON -DPREFER_EXTERNAL_ZSTD=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo
ninja
ctest -j12

Expected behavior

Tests should pass when run in parallel.

Logs

LastTest.log from the last run: LastTest.log

System information:

DimitriPapadopoulos commented 1 year ago

I am able to reproduce segfaults even with plain ctest, without -j12:

$ ctest
Test project /my/path/c-blosc2/build
[...]
          Start 1736: b2nd_example_serialize
1736/1736 Test #1736: b2nd_example_serialize ....................................   Passed    0.00 sec

99% tests passed, 1 tests failed out of 1736

Label Time Summary:
b2nd    =   0.50 sec*proc (8 tests)

Total Test time (real) =  53.04 sec

The following tests FAILED:
    1703 - test_lz4_bitshuffle_n (SEGFAULT)
Errors while running CTest
Output from these tests are in: /my/path/c-blosc2/build/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
$ 
$ ctest --rerun-failed --output-on-failure
Test project /my/path/c-blosc2/build
    Start 1703: test_lz4_bitshuffle_n
1/1 Test #1703: test_lz4_bitshuffle_n ............   Passed    0.41 sec

100% tests passed, 0 tests failed out of 1

Total Test time (real) =   0.45 sec
$ 

As you can see, in my case, errors seem to differ between ctest runs. Do tests fail consistently for you, or “randomly” as in my case?
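To tell consistent failures from random ones, the failure summaries of several runs can be tallied per test name. The following is only a minimal sketch: the function name tally_failed_tests is hypothetical, and it assumes the summary lines keep the "NUMBER - NAME (STATUS)" layout shown in the logs in this thread.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of c-blosc2 or CTest): read ctest output
# on stdin and count how often each test appears in a
# "The following tests FAILED:" summary.
tally_failed_tests() {
    # Summary lines look like: "    291 - test_frame_get_offsets (SEGFAULT)"
    # Tests are keyed by name and status, because the test numbers can
    # differ between c-blosc2 versions.
    awk '$1 ~ /^[0-9]+$/ && $2 == "-" && $NF ~ /^\(.*\)$/ { count[$3 " " $NF]++ }
         END { for (k in count) print count[k], k }' | sort -k1,1nr -k2
}

# Usage, from the build directory (five runs is an arbitrary choice):
#   for i in 1 2 3 4 5; do ctest -j12; done 2>&1 | tally_failed_tests
```

A test that shows up in every run points to a deterministic problem, while counts spread thinly over many tests suggest a race between tests that run at the same time.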

FrancescAlted commented 1 year ago

Today we fixed something that may have caused this: https://github.com/Blosc/c-blosc2/commit/ca9d7c6f42e9c95d78b896ebd875bcf54b2affce

Could you give it another go?

mgorny commented 1 year ago

I can still reproduce.

FrancescAlted commented 1 year ago

Sorry, I was not explicit enough; I meant without parallelism (just ctest). Fixing ctest -j12 will likely require more work (and it is not a high priority).

DimitriPapadopoulos commented 1 year ago

I no longer see segfaults without -j12, although in that case they were only sporadic to begin with.

bnavigator commented 1 year ago

Still an issue with 2.7.1 and -j$N with N>1

keszybz commented 1 year ago

I'm seeing this too, with c51d050dfa154411d776d84771fd74ca83bd232b and v2.9.1. Most of the time there are test failures, but occasionally segfaults. I haven't captured a coredump yet.

The following tests FAILED:
    302 - test_copy (Failed)
    311 - test_frame_offset (Failed)
    726 - test_schunk_header (Failed)
    1722 - test_example_frame_offset (Failed)

The following tests FAILED:
    302 - test_copy (Failed)
    308 - test_fill_special (Failed)
    310 - test_frame_get_offsets (Failed)
    311 - test_frame_offset (Failed)
    1315 - test_example_frame_simple (Failed)

The following tests FAILED:
    11 - test_b2nd_copy (Failed)
    302 - test_copy (Failed)

The failure rate is 100% (i.e. at least one test fails in every run) on multiple machines.

DimitriPapadopoulos commented 1 year ago

Tests could be modified to be run in a debugger. To get GDB to automatically print a backtrace in case of a crash:

gdb --batch --ex run --ex bt --args ./myprogram "$@" > gdb-backtrace.txt 2>&1

The above runs GDB in batch mode (--batch) and tells it to run the program (--ex run) and print a backtrace (--ex bt) if it crashes. The output is redirected to a file called gdb-backtrace.txt.
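The command above could be wrapped in a small shell function so that any test binary can be run under GDB. This is only a sketch under stated assumptions: the names run_with_backtrace, GDB and BT_LOG are hypothetical, and the --return-child-result option is an addition not present in the command above. It makes GDB exit with the status of the program being debugged, so a crashing test is still seen as a failure by the caller.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run a program under GDB in batch mode and save
# the output, including a backtrace if the program crashes.
#   GDB     debugger executable to use (assumed default: gdb)
#   BT_LOG  file that receives the output (assumed default: gdb-backtrace.txt)
run_with_backtrace() {
    # --return-child-result: GDB exits with the debugged program's status
    # instead of its own, so a failure is not hidden from the caller.
    "${GDB:-gdb}" --batch --return-child-result \
        --ex run --ex bt --args "$@" > "${BT_LOG:-gdb-backtrace.txt}" 2>&1
}

# Usage, with a test binary name taken from the failure lists in this thread
# (the path to the binary depends on the build layout):
#   BT_LOG=test_copy-backtrace.txt run_with_backtrace ./test_copy
```

Giving each test its own BT_LOG matters when tests run in parallel, since otherwise concurrent runs would overwrite the same gdb-backtrace.txt.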

That said: