Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

clang DirectoryWatcherTest unittest hangs on some filesystems, but not others #45794

Open Quuxplusone opened 4 years ago

Quuxplusone commented 4 years ago
Bugzilla Link PR46825
Status NEW
Importance P enhancement
Reported by Brian Friesen (bfriesen@lbl.gov)
Reported on 2020-07-23 09:58:35 -0700
Last modified on 2020-07-23 14:30:52 -0700
Version trunk
Hardware PC Linux
CC daniel@zuster.org, llvm-bugs@lists.llvm.org, naromero@anl.gov
Fixed by commit(s)
Attachments
Blocks
Blocked by
See also

Greetings,

I am very close to having a working x86 flang buildbot - everything about the bot's workflow succeeds except for a single lit test, 'DirectoryWatcherTest'. This test reads and writes some temporary files/directories in TMPDIR.

On the system where this buildbot is running (a Cray XC40), I have observed the following behaviors (all results tested on commit 724bf4ee23a of llvm-project):

On "login nodes", the test runs without any issues. TMPDIR is set to /tmp, which is a ramdisk:

cori10:DirectoryWatcher> ./DirectoryWatcherTests 
[==========] Running 8 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 8 tests from DirectoryWatcherTest
[ RUN      ] DirectoryWatcherTest.InitialScanSync
[       OK ] DirectoryWatcherTest.InitialScanSync (7 ms)
[ RUN      ] DirectoryWatcherTest.InitialScanAsync
[       OK ] DirectoryWatcherTest.InitialScanAsync (8 ms)
[ RUN      ] DirectoryWatcherTest.AddFiles
[       OK ] DirectoryWatcherTest.AddFiles (8 ms)
[ RUN      ] DirectoryWatcherTest.ModifyFile
[       OK ] DirectoryWatcherTest.ModifyFile (16 ms)
[ RUN      ] DirectoryWatcherTest.DeleteFile
[       OK ] DirectoryWatcherTest.DeleteFile (29 ms)
[ RUN      ] DirectoryWatcherTest.DeleteWatchedDir
[       OK ] DirectoryWatcherTest.DeleteWatchedDir (0 ms)
[ RUN      ] DirectoryWatcherTest.InvalidatedWatcher
[       OK ] DirectoryWatcherTest.InvalidatedWatcher (7 ms)
[ RUN      ] DirectoryWatcherTest.InvalidatedWatcherAsync
[       OK ] DirectoryWatcherTest.InvalidatedWatcherAsync (16 ms)
[----------] 8 tests from DirectoryWatcherTest (91 ms total)

[----------] Global test environment tear-down
[==========] 8 tests from 1 test case ran. (91 ms total)
[  PASSED  ] 8 tests.
cori10:DirectoryWatcher> strace -f -e open ./DirectoryWatcherTests  2>&1|grep open
[pid 49990] open("/tmp/dirwatcher-adcb34/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
open("/tmp/dirwatcher-adcb34", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
open("/tmp/dirwatcher-adcb34/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 49995] open("/tmp/dirwatcher-0cf5ec/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 7
open("/tmp/dirwatcher-0cf5ec", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
open("/tmp/dirwatcher-0cf5ec/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 49990] open("/tmp/dirwatcher-ab6ab0/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
open("/tmp/dirwatcher-ab6ab0", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
open("/tmp/dirwatcher-ab6ab0/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 49990] open("/tmp/dirwatcher-f06c15/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
open("/tmp/dirwatcher-f06c15", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
open("/tmp/dirwatcher-f06c15/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 49990] open("/tmp/dirwatcher-3204e2/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
open("/tmp/dirwatcher-3204e2", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
open("/tmp/dirwatcher-3204e2/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 49990] open("/tmp/dirwatcher-390f43/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
[pid 49990] open("/tmp/dirwatcher-390f43/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 7
open("/tmp/dirwatcher-390f43", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
[pid 49990] open("/tmp/dirwatcher-34f29d/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 6
[pid 49990] open("/tmp/dirwatcher-34f29d", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
open("/tmp/dirwatcher-34f29d/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4
[pid 50018] open("/tmp/dirwatcher-f385d3/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 7
[pid 49990] open("/tmp/dirwatcher-f385d3", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
open("/tmp/dirwatcher-f385d3/watch", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 4

However, the same test on an XC "compute node" hangs:

nid00049:DirectoryWatcher> ./DirectoryWatcherTests  
[==========] Running 8 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 8 tests from DirectoryWatcherTest
[ RUN      ] DirectoryWatcherTest.InitialScanSync
[       OK ] DirectoryWatcherTest.InitialScanSync (5 ms)
[ RUN      ] DirectoryWatcherTest.InitialScanAsync
[       OK ] DirectoryWatcherTest.InitialScanAsync (8 ms)
[ RUN      ] DirectoryWatcherTest.AddFiles
[       OK ] DirectoryWatcherTest.AddFiles (8 ms)
[ RUN      ] DirectoryWatcherTest.ModifyFile
[       OK ] DirectoryWatcherTest.ModifyFile (0 ms)
[ RUN      ] DirectoryWatcherTest.DeleteFile
[       OK ] DirectoryWatcherTest.DeleteFile (16 ms)
[ RUN      ] DirectoryWatcherTest.DeleteWatchedDir
/path/to/llvm-project/clang/unittests/DirectoryWatcher/DirectoryWatcherTest.cpp:247: Failure
Value of: WaitForExpectedStateResult.wait_for(std::chrono::seconds(3)) == std::future_status::ready
  Actual: false
Expected: true
The expected result state wasn't reached before the time-out.
/path/to/llvm-project/clang/unittests/DirectoryWatcher/DirectoryWatcherTest.cpp:250: Failure
Value of: TestConsumer.result().hasValue()
  Actual: false
Expected: true
Expected initial events: 
Expected non-initial events: 
WatchedDirRemoved 
WatcherGotInvalidated 
Expected but not seen non-initial events: 
WatchedDirRemoved 
WatcherGotInvalidated 

The program hangs on that last line and never returns. lldb shows the following:

nid00049:DirectoryWatcher> lldb ./DirectoryWatcherTests 
(lldb) target create "./DirectoryWatcherTests"
Current executable set to './DirectoryWatcherTests' (x86_64).
(lldb) run
Process 19658 launched: '/path/to/build/tools/clang/unittests/DirectoryWatcher/DirectoryWatcherTests' (x86_64)
[==========] Running 8 tests from 1 test case.
[----------] Global test environment set-up.
[----------] 8 tests from DirectoryWatcherTest
[ RUN      ] DirectoryWatcherTest.InitialScanSync
[       OK ] DirectoryWatcherTest.InitialScanSync (8 ms)
[ RUN      ] DirectoryWatcherTest.InitialScanAsync
[       OK ] DirectoryWatcherTest.InitialScanAsync (16 ms)
[ RUN      ] DirectoryWatcherTest.AddFiles
[       OK ] DirectoryWatcherTest.AddFiles (16 ms)
[ RUN      ] DirectoryWatcherTest.ModifyFile
[       OK ] DirectoryWatcherTest.ModifyFile (16 ms)
[ RUN      ] DirectoryWatcherTest.DeleteFile
[       OK ] DirectoryWatcherTest.DeleteFile (16 ms)
[ RUN      ] DirectoryWatcherTest.DeleteWatchedDir
/path/to/llvm-project/clang/unittests/DirectoryWatcher/DirectoryWatcherTest.cpp:247: Failure
Value of: WaitForExpectedStateResult.wait_for(std::chrono::seconds(3)) == std::future_status::ready
  Actual: false
Expected: true
The expected result state wasn't reached before the time-out.
/path/to/llvm-project/clang/unittests/DirectoryWatcher/DirectoryWatcherTest.cpp:250: Failure
Value of: TestConsumer.result().hasValue()
  Actual: false
Expected: true
Expected initial events: 
Expected non-initial events: 
WatchedDirRemoved 
WatcherGotInvalidated 
Expected but not seen non-initial events: 
WatchedDirRemoved 
WatcherGotInvalidated 
Process 19658 stopped
* thread #1, name = 'DirectoryWatche', stop reason = signal SIGSTOP
    frame #0: 0x00002aaaaace1339 libpthread.so.0`__pthread_cond_destroy + 105
libpthread.so.0`__pthread_cond_destroy:
->  0x2aaaaace1339 <+105>: cmpq   $-0x1000, %rax            ; imm = 0xF000 
    0x2aaaaace133f <+111>: jbe    0x2aaaaace1320            ; <+80>
    0x2aaaaace1341 <+113>: leal   0xb(%rax), %ecx
    0x2aaaaace1344 <+116>: cmpl   $0xb, %ecx
(lldb) bt
error: DirectoryWatcherTests :: Class '_Alloc_hider' has a base class 'std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::allocator_type' which does not have a complete definition.
error: DirectoryWatcherTests :: Try compiling the source file with -fstandalone-debug.
* thread #1, name = 'DirectoryWatche', stop reason = signal SIGSTOP
  * frame #0: 0x00002aaaaace1339 libpthread.so.0`__pthread_cond_destroy + 105
    frame #1: 0x000000000040b666 DirectoryWatcherTests`(anonymous namespace)::VerifyingConsumer::~VerifyingConsumer(this=0x00007fffffff5f78) at DirectoryWatcherTest.cpp:100:8
    frame #2: 0x000000000040d62a DirectoryWatcherTests`DirectoryWatcherTest_DeleteWatchedDir_Test::TestBody(this=0x000000000082f8e0) at DirectoryWatcherTest.cpp:430:1
    frame #3: 0x000000000054a634 DirectoryWatcherTests`void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(object=0x000000000082f8e0, method=21 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00, location="the test body")(), char const*) at gtest.cc:2402:10
    frame #4: 0x0000000000537a82 DirectoryWatcherTests`void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(object=0x000000000082f8e0, method=21 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00, location="the test body")(), char const*) at gtest.cc:2455:12
    frame #5: 0x0000000000525a26 DirectoryWatcherTests`testing::Test::Run(this=0x000000000082f8e0) at gtest.cc:2474:5
    frame #6: 0x00000000005262bb DirectoryWatcherTests`testing::TestInfo::Run(this=0x000000000082dfc0) at gtest.cc:2656:11
    frame #7: 0x0000000000526854 DirectoryWatcherTests`testing::TestCase::Run(this=0x000000000082d660) at gtest.cc:2774:28
    frame #8: 0x000000000052be25 DirectoryWatcherTests`testing::internal::UnitTestImpl::RunAllTests(this=0x000000000082d1e0) at gtest.cc:4649:43
    frame #9: 0x000000000054dc34 DirectoryWatcherTests`bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(object=0x000000000082d1e0, method=50 bb 52 00 00 00 00 00 00 00 00 00 00 00 00 00, location="auxiliary test code (environments or event listeners)")(), char const*) at gtest.cc:2402:10
    frame #10: 0x00000000005396e2 DirectoryWatcherTests`bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(object=0x000000000082d1e0, method=50 bb 52 00 00 00 00 00 00 00 00 00 00 00 00 00, location="auxiliary test code (environments or event listeners)")(), char const*) at gtest.cc:2455:12
    frame #11: 0x000000000052bb2a DirectoryWatcherTests`testing::UnitTest::Run(this=0x000000000081a218) at gtest.cc:4257:10
    frame #12: 0x000000000051df31 DirectoryWatcherTests`RUN_ALL_TESTS() at gtest.h:2233:46
    frame #13: 0x000000000051df1a DirectoryWatcherTests`main(argc=1, argv=0x00007fffffff6608) at TestMain.cpp:50:10
    frame #14: 0x00002aaaabc5334a libc.so.6`__libc_start_main + 234
    frame #15: 0x000000000040a54a DirectoryWatcherTests`_start at start.S:120

I have been told that the /tmp ramdisk on "compute nodes" is configured slightly different than the /tmp ramdisk on "login nodes," which may explain the difference in behavior. But I don't know what the difference in configuration actually is.

I also tried setting TMPDIR to different file systems, including Lustre and GPFS. I found that when TMPDIR is a GPFS file system, the test succeeds on any kind of node (login and compute), but when TMPDIR is a Lustre filesystem, it hangs on both login nodes and compute nodes, in the same way as when TMPDIR is set to the /tmp ramdisk.

So to summarize:

/tmp ramdisk on XC login nodes: PASS Lustre on XC login nodes: (hangs) GPFS on XC login nodes: PASS

/tmp ramdisk on XC compute nodes: (hangs) Lustre on XC compute nodes: (hangs) GPFS on XC compute nodes: PASS

Any ideas how to make this test less sensitive to the kind of file system it's running on?

Thanks.

Quuxplusone commented 4 years ago

I ran this unittest through valgrind, and observed that, when TMPDIR is set to one of the file systems which causes the program to hang as I described originally, the helgrind threading tool in valgrind reports the following:

==12182== Thread #1: pthread_cond_destroy: destruction of condition variable being waited upon
==12182==    at 0x4C34882: ??? (in /usr/lib64/valgrind/vgpreload_helgrind-amd64-linux.so)
==12182==    by 0x40B665: (anonymous namespace)::VerifyingConsumer::~VerifyingConsumer() (llvm-project/clang/unittests/DirectoryWatcher/DirectoryWatcherTest.cpp:100)
==12182==    by 0x40D629: DirectoryWatcherTest_DeleteWatchedDir_Test::TestBody() (llvm-project/clang/unittests/DirectoryWatcher/DirectoryWatcherTest.cpp:430)
==12182==    by 0x54A633: void testing::internal::HandleSehExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (llvm-project/llvm/utils/unittest/googletest/src/gtest.cc:2402)
==12182==    by 0x537A81: void testing::internal::HandleExceptionsInMethodIfSupported<testing::Test, void>(testing::Test*, void (testing::Test::*)(), char const*) (llvm-project/llvm/utils/unittest/googletest/src/gtest.cc:2455)
==12182==    by 0x525A25: testing::Test::Run() (llvm-project/llvm/utils/unittest/googletest/src/gtest.cc:2474)
==12182==    by 0x5262BA: testing::TestInfo::Run() (llvm-project/llvm/utils/unittest/googletest/src/gtest.cc:2656)
==12182==    by 0x526853: testing::TestCase::Run() (llvm-project/llvm/utils/unittest/googletest/src/gtest.cc:2774)
==12182==    by 0x52BE24: testing::internal::UnitTestImpl::RunAllTests() (llvm-project/llvm/utils/unittest/googletest/src/gtest.cc:4649)
==12182==    by 0x54DC33: bool testing::internal::HandleSehExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (llvm-project/llvm/utils/unittest/googletest/src/gtest.cc:2402)
==12182==    by 0x5396E1: bool testing::internal::HandleExceptionsInMethodIfSupported<testing::internal::UnitTestImpl, bool>(testing::internal::UnitTestImpl*, bool (testing::internal::UnitTestImpl::*)(), char const*) (llvm-project/llvm/utils/unittest/googletest/src/gtest.cc:2455)
==12182==    by 0x52BB29: testing::UnitTest::Run() (llvm-project/llvm/utils/unittest/googletest/src/gtest.cc:4257)
==12182==

It does not report that error when TMPDIR is set to a file system which allows the program to complete without hanging.