Quuxplusone / LLVMBugzillaTest

0 stars 0 forks source link

std::this_thread::sleep_until may block forever #42808

Open Quuxplusone opened 5 years ago

Quuxplusone commented 5 years ago
Bugzilla Link PR43838
Status NEW
Importance P normal
Reported by Philip Metzler (philip@kodiak.ai)
Reported on 2019-10-29 10:28:12 -0700
Last modified on 2020-03-06 06:06:25 -0800
Version 7.0
Hardware PC Linux
CC eric@efcs.ca, hans@chromium.org, llvm-bugs@lists.llvm.org, mclow.lists@gmail.com, shubhamrawat1100@gmail.com
Fixed by commit(s)
Attachments main.cpp (1301 bytes, text/x-c++src)
Blocks
Blocked by
See also
Created attachment 22742
Example code to hit the race condition. May need adjustment of loop_delay.
Compile with e.g. clang++ -o test -O2 -stdlib=libc++ main.cpp on a machine with
an affected libcxx version.

There is a race condition in condition_variable::wait_until in __mutex_base
that that can result in an infinite wait.
Potentially all released libcxx versions are affected.

The issue surfaces quite easily (and often) when using high resolution
(nanosecond resolution) clocks and made us ban all uses of standard library
functions that can hit the code in question.

Code in question is (__mutex_base, version 7.0):

378 template <class _Clock, class _Duration>
379 cv_status
380 condition_variable::wait_until(unique_lock<mutex>& __lk,
381                                const chrono::time_point<_Clock, _Duration>&
__t)
382 {
383     using namespace chrono;
384     wait_for(__lk, __t - _Clock::now());
385     return _Clock::now() < __t ? cv_status::no_timeout : cv_status::timeout;
386 }

where, depending on the clock representation, an underflow can occur in line
384 (time is advancing, and if __t was too close to the current time point, the
underflow will hit). Seems like all released version (up to 9.0) have this
piece of code. There is a commit on master that will prevent the underflow
happening that's not yet in the released versions.

Example stack traces based on the example in the attachment:

(lldb) bt
* thread #1, name = 'test', stop reason = signal SIGSTOP
    frame #0: 0x00007fb5875bff85 libpthread.so.0`__pthread_cond_timedwait at futex-internal.h:205
    frame #1: 0x00007fb5875bff60 libpthread.so.0`__pthread_cond_timedwait at pthread_cond_wait.c:539
    frame #2: 0x00007fb5875bfe00 libpthread.so.0`__pthread_cond_timedwait(cond=0x00007ffdcca62780, mutex=0x00007ffdcca627b0, abstime=0x00007ffdcca61a50) at pthread_cond_wait.c:667
    frame #3: 0x00007fb5883e7f15 libc++.so.1`std::__1::condition_variable::__do_timed_wait(std::__1::unique_lock<std::__1::mutex>&, std::__1::chrono::time_point<std::__1::chrono::system_clock, std::__1::chrono::duration<long long, std::__1::ratio<1l, 1000000000l> > >) + 101
  * frame #4: 0x00000000004036da test`std::__1::cv_status std::__1::condition_variable::wait_for<unsigned long long, std::__1::ratio<1l, 1000000000l> >(this=0x00007ffdcca62780, __lk=0x00007ffdcca62770, __d=0x00007ffdcca624b8) at __mutex_base:418
    frame #5: 0x000000000040286c test`std::__1::cv_status std::__1::condition_variable::wait_until<tai_clock, std::__1::chrono::duration<unsigned long long, std::__1::ratio<1l, 1000000000l> > >(this=0x00007ffdcca62780, __lk=0x00007ffdcca62770, __t=0x00007ffdcca629a0) at __mutex_base:384
    frame #6: 0x0000000000401a20 test`void std::__1::this_thread::sleep_until<tai_clock, std::__1::chrono::duration<unsigned long long, std::__1::ratio<1l, 1000000000l> > >(__t=0x00007ffdcca629a0) at thread:461
    frame #7: 0x00000000004015a5 test`main(argc=1, argv=0x00007ffdcca62c88) at main.cpp:41
    frame #8: 0x00007fb5877f2b97 libc.so.6`__libc_start_main(main=(test`main at main.cpp:36), argc=1, argv=0x00007ffdcca62c88, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007ffdcca62c78) at libc-start.c:310
    frame #9: 0x000000000040113a test`_start + 42

With the following stack frame variables (in wait_for, called by wait_until):

(lldb) frame info
frame #4: 0x00000000004036da test`std::__1::cv_status
std::__1::condition_variable::wait_for<unsigned long long, std::__1::ratio<1l,
1000000000l> >(this=0x00007ffdcca62780, __lk=0x00007ffdcca62770,
__d=0x00007ffdcca624b8) at __mutex_base:418

(lldb) frame variable
(std::__1::condition_variable *) this = 0x00007ffdcca62780
(std::__1::unique_lock<std::__1::mutex> &) __lk = 0x00007ffdcca62770: {
  __m_ = 0x00007ffdcca627b0
  __owns_ = true
}
(const std::__1::chrono::duration<unsigned long long, std::__1::ratio<1,
1000000000> > &) __d = 0x00007ffdcca624b8 (__rep_ = 18446744073709549768)
(__sys_tpf) _Max = {
  __d_ = (__rep_ = 9223372036854775807)
}
(std::__1::chrono::steady_clock::time_point) __c_now = {
  __d_ = (__rep_ = 141408845845668)
}
(std::__1::chrono::system_clock::time_point) __s_now = {
  __d_ = (__rep_ = 1572277999316767)
}

This occurs quite frequently in our setup that uses a nanosecond resolution
clock with a uint64_t representation type (see example code) and we had to ban
all uses of functions that can hit the wait_until eventually.

The standard system_clock/steady_clock looks to be a microsecond resolution
clock. For those clocks the probability of the underflow happening is pretty
low (but is non-zero).
Quuxplusone commented 5 years ago

Attached main.cpp (1301 bytes, text/x-c++src): Example code to hit the race condition. May need adjustment of loop_delay. Compile with e.g. clang++ -o test -O2 -stdlib=libc++ main.cpp on a machine with an affected libcxx version.

Quuxplusone commented 4 years ago

Marshall, Eric, have you had a chance to look at this?

(Not new, so not a 10.0.0 blocker, but sounds like something that would be nice to fix.)