PRUNERS / archer

Archer, a data race detection tool for large OpenMP applications
https://pruners.github.io/archer
Apache License 2.0

Incorrect results with Archer #42

Open rolandschulz opened 7 years ago

rolandschulz commented 7 years ago

With Archer (dc4e363) built out of source with LLVM 4.0 and OMP-TR4, running the GROMACS unit tests:

git init gromacs && cd gromacs
git fetch https://gerrit.gromacs.org/gromacs refs/changes/48/6648/1 && git checkout FETCH_HEAD
mkdir build && cd build
CC=clang-archer CXX=clang-archer++ cmake -GNinja -DGMX_OPENMP_MAX_THREADS=256 -DGMX_BUILD_HELP=OFF -DBUILD_SHARED_LIBS=yes -DCMAKE_BUILD_TYPE=RelWithDebInfo -DGMX_HWLOC=no .. -DGMX_SIMD=None -DTMPI_ATOMICS_DISABLED=yes -DGMX_MPI=on
ninja check

Produces:

The following tests FAILED:
          7 - EwaldUnitTests (Failed)
         12 - MdrunUtilityMpiUnitTests (Failed)
         23 - CorrelationsTest (Failed)

When compiling and running with clang 4.0 without Archer (with or without OMP-TR4), all tests pass. All unit tests also pass with VS2015, ICC 16 and 17, GCC 4.8-7.1, and a few other compilers we check less regularly. Thus it is highly unlikely that the unit test failures are caused by problems in the GROMACS source code.

dongahn commented 7 years ago

@rolandschulz: thank you for using Archer on your build-and-test system.

Could you describe more specifically how those unit tests failed? A common cause of something like this in our environment is that Archer/TSan detects an error (directly in the unit test code or in some other tester component) and makes the process exit with return code 66. https://github.com/google/sanitizers/wiki/ThreadSanitizerFlags

Could you set the environment variable TSAN_OPTIONS="exitcode=0" before you run these unit tests and see whether it makes any difference?
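
A minimal sketch of that check, assuming a POSIX shell and the default TSan exit code of 66 mentioned above. `run_and_classify` is a hypothetical helper written for illustration; it is not part of Archer or TSan.

```shell
# Hypothetical helper: run a command and report how it exited.
# Assumes TSan's default exit code (66); a custom exitcode= value in
# TSAN_OPTIONS would change which status signals a TSan report.
run_and_classify() {
    "$@" >/dev/null 2>&1      # run the command, discard its output
    status=$?
    case "$status" in
        0)  echo "pass" ;;
        66) echo "tsan-report" ;;
        *)  echo "other-failure" ;;
    esac
}
```

If a test prints "tsan-report" when run normally but "pass" when run as `TSAN_OPTIONS="exitcode=0" run_and_classify <test binary>`, the failure came only from a TSan report. If it prints "other-failure" in both cases, the test itself is failing.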

rolandschulz commented 7 years ago

All 3 failures are incorrect results, not TSan errors. All 3 also still occur with OMP_NUM_THREADS=1. Thus it seems that the LLVM pass added by Archer causes incorrect results. When compiling and running all unit tests with clang 4.0 with TSan (without Archer), all unit tests give the correct result (some tests produce false-positive TSan warnings, but these are different tests from the ones which fail with Archer). E.g., running the first failing test by itself gives

./bin/ewald-test --gtest_filter=SaneInput1/PmeBSplineModuliCorrectnessTest.ReproducesValues/0
Note: Google Test filter = SaneInput1/PmeBSplineModuliCorrectnessTest.ReproducesValues/0
[==========] Running 1 test from 1 test case.
[----------] Global test environment set-up.
[----------] 1 test from SaneInput1/PmeBSplineModuliCorrectnessTest
[ RUN      ] SaneInput1/PmeBSplineModuliCorrectnessTest.ReproducesValues/0
../src/testutils/refdata.cpp:900: Failure
  In item: /X/Length
   Actual: '-1782689792'
Reference: '64'
Google Test trace:
../src/gromacs/ewald/tests/pmebsplinetest.cpp:95: Testing B-spline moduli creation (plain) for PME order 3, grid size 64 32 64
../src/testutils/refdata.cpp:900: Failure
  In item: /Y/Length
   Actual: '-1782689792'
Reference: '32'
Google Test trace:
../src/gromacs/ewald/tests/pmebsplinetest.cpp:95: Testing B-spline moduli creation (plain) for PME order 3, grid size 64 32 64
../src/testutils/refdata.cpp:900: Failure
  In item: /Z/Length
   Actual: '-1782689792'
Reference: '64'
Google Test trace:
../src/gromacs/ewald/tests/pmebsplinetest.cpp:95: Testing B-spline moduli creation (plain) for PME order 3, grid size 64 32 64
[  FAILED  ] SaneInput1/PmeBSplineModuliCorrectnessTest.ReproducesValues/0, where GetParam() = (12-byte object <40-00 00-00 20-00 00-00 40-00 00-00>, 3, 4-byte object <00-00 00-00>) (52 ms)
[----------] 1 test from SaneInput1/PmeBSplineModuliCorrectnessTest (55 ms total)

[----------] Global test environment tear-down
[==========] 1 test from 1 test case ran. (56 ms total)
[  PASSED  ] 0 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SaneInput1/PmeBSplineModuliCorrectnessTest.ReproducesValues/0, where GetParam() = (12-byte object <40-00 00-00 20-00 00-00 40-00 00-00>, 3, 4-byte object <00-00 00-00>)

 1 FAILED TEST

dongahn commented 7 years ago

Thanks. This seems like something that @simoatze should look into. @rolandschulz: how should we reproduce these failures?

rolandschulz commented 7 years ago

I provided the git, cmake, and ninja commands in my first message. Those should let you reproduce the error. If you have difficulty reproducing it, I'm happy to help.

simoatze commented 7 years ago

@rolandschulz thanks for reporting these issues. I'll look into it as soon as I can and get back to you.