viboes closed this issue 6 years ago.
Hi Vicente,
Here is the latest test file and the results against the latest test code in this branch.
I added a "long" time jump forward in addition to the existing "short" time jump forward to test how the functions behave when the time jumps forward past the time when they are supposed to expire.
test_boost_sleep.cpp.txt linux_results_fixed_no_monotonic.txt linux_results_fixed_yes_monotonic.txt windows_results_fixed_no_monotonic.txt windows_results_fixed_yes_monotonic.txt
These tests require root (Linux) or administrator (Windows) privileges in order to run, so they can change the system time. With that limitation would you still be able to run these tests on your virtual machines? I don't know how to update the Jamfile so I would need some help with that.
I'm working on sending you a pull request using a sub-branch of this branch with some more changes I would like to propose.
In src/pthread/thread.cpp, why does hidden::sleep_for() use a condition variable when get_current_thread_data() is not NULL but uses no_interruption_point::hidden::sleep_for() when it is NULL? The condition variable in thread_info doesn't seem to be special. It seems to me the condition variable could simply be a local variable that is used regardless of whether or not thread_info is NULL, but perhaps there's something I'm missing.
And thanks for letting me know that sleep_for/sleep_until are interruption points and therefore must use a condition variable. I didn't fully understand that.
BTW, all time-based functions are (or should be) documented as interruption points except the sleeps inside the no_interruption_point namespace, and as a consequence they must use a condition variable.
I'll try to first fix the non-monotonic issues for the chrono-related API.
boost::this_thread::sleep_for():
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 580 ms, Actual: 810 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (short): Expected: 580 ms, Actual: 350 ms, Returned: timeout, FAILED: TOO SHORT
While system clock jumps forward (long): Expected: 580 ms, Actual: 110 ms, Returned: timeout, FAILED: TOO SHORT
boost::this_thread::sleep_until(), steady time:
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 580 ms, Actual: 810 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (short): Expected: 580 ms, Actual: 351 ms, Returned: timeout, FAILED: TOO SHORT
While system clock jumps forward (long): Expected: 580 ms, Actual: 110 ms, Returned: timeout, FAILED: TOO SHORT
boost::thread::try_join_for():
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 580 ms, Actual: 810 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (short): Expected: 580 ms, Actual: 350 ms, Returned: timeout, FAILED: TOO SHORT
While system clock jumps forward (long): Expected: 580 ms, Actual: 110 ms, Returned: timeout, FAILED: TOO SHORT
boost::shared_mutex::try_lock_for():
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 580 ms, Actual: 810 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (short): Expected: 580 ms, Actual: 350 ms, Returned: timeout, FAILED: TOO SHORT
While system clock jumps forward (long): Expected: 580 ms, Actual: 110 ms, Returned: timeout, FAILED: TOO SHORT
boost::shared_mutex::timed_lock(), posix duration:
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 580 ms, Actual: 810 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (short): Expected: 580 ms, Actual: 350 ms, Returned: timeout, FAILED: TOO SHORT
While system clock jumps forward (long): Expected: 580 ms, Actual: 110 ms, Returned: timeout, FAILED: TOO SHORT
boost::upgrade_mutex::try_lock_for():
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 580 ms, Actual: 810 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (short): Expected: 580 ms, Actual: 350 ms, Returned: timeout, FAILED: TOO SHORT
While system clock jumps forward (long): Expected: 580 ms, Actual: 110 ms, Returned: timeout, FAILED: TOO SHORT
Maybe, if you want, you could concentrate on the DateTime issues for the monotonic case.
boost::thread::try_join_until(), system time:
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 810 ms, Actual: 810 ms, Returned: timeout, Passed
While system clock jumps forward (short): Expected: 350 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (long): Expected: 110 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
boost::thread::try_join_until(), custom time:
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 810 ms, Actual: 810 ms, Returned: timeout, Passed
While system clock jumps forward (short): Expected: 350 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (long): Expected: 110 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
boost::thread::timed_join(), posix system time:
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 810 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO SHORT
While system clock jumps forward (short): Expected: 350 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (long): Expected: 110 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
boost::condition_variable::timed_wait(), posix system time:
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 810 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO SHORT
While system clock jumps forward (short): Expected: 350 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (long): Expected: 110 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
boost::condition_variable::timed_wait(), with predicate, posix system time:
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 810 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO SHORT
While system clock jumps forward (short): Expected: 350 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (long): Expected: 110 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
boost::condition_variable_any::timed_wait(), posix system time:
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 810 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO SHORT
While system clock jumps forward (short): Expected: 350 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (long): Expected: 110 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
boost::condition_variable_any::timed_wait(), with predicate, posix system time:
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 810 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO SHORT
While system clock jumps forward (short): Expected: 350 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (long): Expected: 110 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
boost::shared_mutex::timed_lock(), posix system time:
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 810 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO SHORT
While system clock jumps forward (short): Expected: 350 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (long): Expected: 110 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
boost::upgrade_mutex::timed_lock(), posix system time:
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 810 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO SHORT
While system clock jumps forward (short): Expected: 350 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (long): Expected: 110 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
boost::timed_mutex::timed_lock(), posix system time:
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 810 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO SHORT
While system clock jumps forward (short): Expected: 350 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (long): Expected: 110 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
boost::recursive_timed_mutex::timed_lock(), posix system time:
While system clock remains stable: Expected: 580 ms, Actual: 580 ms, Returned: timeout, Passed
While system clock jumps back: Expected: 810 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO SHORT
While system clock jumps forward (short): Expected: 350 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
While system clock jumps forward (long): Expected: 110 ms, Actual: 580 ms, Returned: timeout, FAILED: TOO LONG
I'll have to deal with the DateTime issues for the non-monotonic case and the windows issues.
I'll also add a branch for my modifications. It would be great to have short PRs.
Let me know if you have a better idea of how to organize the work.
Your plan sounds good. I don't have any better ideas about how to organize the work. If we can get the feature/timespec_clocks branch into a well-tested state where all of the tests pass, then I think we'll be good.
As I have time I'll work on the DateTime issues in the monotonic case and create small PRs for each issue.
I went ahead and sent you PR https://github.com/boostorg/thread/pull/151 just because I'd already started working on it. If it overlaps with anything you're already working on, feel free to ignore it. Thanks.
I added a long system clock jump backwards to each test, and included the results below. Without a monotonic clock, some of these new tests fail, but there's nothing we can do about that. When a condition variable is waiting 100 milliseconds on the system clock, and the system clock jumps backwards by 900 milliseconds, the condition variable won't return for 1 full second. So even when we're polling every 100 milliseconds, we can still end up waiting too long, which is why some of these new tests fail.
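The polling workaround can be sketched as follows (the function name and structure are my illustration, not the branch's actual code): each iteration waits toward a short per-slice deadline computed from the system clock, so a backward jump inflates at most the one slice that is in flight when it happens, rather than the entire remaining wait.

```cpp
#include <chrono>
#include <thread>

// Hypothetical sketch (not the branch's actual code) of the 100 ms polling
// loop used when no monotonic clock is available. Each iteration computes a
// short per-slice deadline from the system clock. If the system clock jumps
// backward while one slice is in flight, that single slice lasts roughly
// "jump + slice", but later iterations recompute from the new clock value,
// so the total oversleep is about one jump, not one jump per iteration.
bool poll_until_system_time(std::chrono::system_clock::time_point deadline)
{
    const std::chrono::milliseconds slice(100);
    for (;;) {
        const auto now = std::chrono::system_clock::now();
        if (now >= deadline)
            return true; // reached the requested system time
        const auto next = now + slice;
        // A real implementation would wait on a condition variable here so
        // the wait remains an interruption point; sleep_until keeps the
        // sketch self-contained.
        std::this_thread::sleep_until(next < deadline ? next : deadline);
    }
}
```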
test_boost_sleep.cpp.txt linux_results_fixed_no_monotonic.txt linux_results_fixed_yes_monotonic.txt windows_results_fixed_no_monotonic.txt
If we are able to do a sleep_until to a system time correctly in every jump case (non-monotonic), couldn't sleep_for and sleep_until to a steady time then be done correctly on top of the sleep_until to a system time?
I mean that we should poll on system_clock::now() + milliseconds(100).
No, that won't work. sleep_until() with a system time works correctly only because its contract requires it to wait longer when the system clock jumps backward. But sleep_for() and sleep_until() with a steady time are required not to wait longer, even when the system clock jumps backward. So even if we poll on system_clock::now() + milliseconds(100), we will still end up waiting longer than we should.
Right. However, I remember you said that waiting on a timed condition would finish when the system time is changed. IIUC, this is not the case when waiting on a system time.
Ah, yes, I do remember saying that. What I failed to mention is that that is only true when the system clock jumps forward and crosses the timeout point. It is not true when the system clock jumps backward.
This is the one situation where the only solution is to enable BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC.
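For reference, what that option enables boils down to a single pthread call. Below is a minimal Linux-only sketch (my own, assuming glibc with pthread_condattr_setclock available; error handling omitted) of a condition-variable timeout measured against CLOCK_MONOTONIC, which system-clock jumps in either direction cannot stretch or shrink:

```cpp
#include <cerrno>
#include <ctime>
#include <pthread.h>

// Linux-only sketch: make pthread_cond_timedwait() measure its timeout
// against CLOCK_MONOTONIC instead of the default CLOCK_REALTIME, so setting
// the system time can neither lengthen nor shorten the wait. Returns true
// on timeout; error handling is omitted for brevity.
bool monotonic_timed_wait_ms(long ms)
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
    pthread_condattr_t attr;
    pthread_cond_t c;
    pthread_condattr_init(&attr);
    pthread_condattr_setclock(&attr, CLOCK_MONOTONIC); // the key call
    pthread_cond_init(&c, &attr);
    pthread_condattr_destroy(&attr);

    timespec abstime;
    clock_gettime(CLOCK_MONOTONIC, &abstime); // deadline on the same clock
    abstime.tv_sec  += ms / 1000;
    abstime.tv_nsec += (ms % 1000) * 1000000L;
    if (abstime.tv_nsec >= 1000000000L) { ++abstime.tv_sec; abstime.tv_nsec -= 1000000000L; }

    pthread_mutex_lock(&m);
    int rc = 0;
    while (rc != ETIMEDOUT) // nothing signals in this sketch, so wait it out
        rc = pthread_cond_timedwait(&c, &m, &abstime);
    pthread_mutex_unlock(&m);
    pthread_cond_destroy(&c);
    pthread_mutex_destroy(&m);
    return rc == ETIMEDOUT;
}
```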
Thanks for reviewing and merging all of my updates so far. I have some other things that I need to focus on for the next week or two, so I don't plan on submitting any more PRs until I free up again. However, I will still be available to review and/or test any updates you make to the timespec_clocks branch.
Looking forward to hopefully seeing these updates in Boost 1.66 in December!
I'll try to do my best while you're busy ;-)
We will surely have all this fixed for Boost 1.66.
I've committed a minor change to try to fix condition_variable::timed_wait for linux. The code specific to BOOST_THREAD_HAS_CONDATTR_SET_CLOCK_MONOTONIC has not been compiled :(
Please, could you check my changes and report the results.
I've added a fixme, because I don't understand why the commented change doesn't work. Any help appreciated.
Bad news: we need tests for time-related boost::future/shared_future functions also :(
Do we need to add tests for v2/shared_mutex.hpp and/or for pthread/shared_mutex_assert.hpp? It doesn't look like either are being used currently, but both contain time-related code.
v2/shared_mutex.hpp can be used on windows, but I believe I should remove pthread/shared_mutex_assert.hpp.
I finally moved all of the test code and the latest results onto GitHub. Here are the latest results with the updates you've made so far.
It is a pleasure to work with you.
Thanks so much for all your contributions and your reports.
You're very welcome. I'm glad to help.
I have had some issues with the shared_mutex implementation on windows. I've moved to the incomplete v2/shared_mutex.hpp version for linux for the time being. The win32 version introduces a lot of regressions on the generic regression tests.
If we apply the same kind of fixes that we have applied so far for future (which is generic), we will have the same issue that I have found on windows for shared_mutex (which was also generic in some way) as we don't have the equivalent of the timespec_clocks.
Temporarily, I could duplicate the concerned parts of the file and fix futures on pthread. But we need some kind of timespec_clock for windows to be able to fix all the other files.
I think we just need to update timespec.hpp so that real_timespec_clock::now() and mono_timespec_clock::now() work on Windows as well as Linux. Hopefully we can just copy the necessary code from Boost Chrono. Then we can move timespec.hpp out of the pthread directory since it will be common to both.
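Something along those lines could look like the sketch below. The class names mirror the branch; the bodies are only a portable stand-in built on std::chrono, not the actual implementation (the real Windows code would presumably borrow from Boost.Chrono, e.g. GetSystemTimeAsFileTime and QueryPerformanceCounter):

```cpp
#include <chrono>
#include <ctime>

namespace sketch {

// Stand-in for the two clocks in timespec.hpp. The class names mirror the
// branch; the std::chrono-backed bodies are illustrative only, not the
// actual implementation.
template <class Clock>
inline timespec now_as_timespec()
{
    using namespace std::chrono;
    const auto d = Clock::now().time_since_epoch();
    const auto s = duration_cast<seconds>(d);
    timespec ts;
    ts.tv_sec  = static_cast<std::time_t>(s.count());
    ts.tv_nsec = static_cast<long>(duration_cast<nanoseconds>(d - s).count());
    return ts;
}

// Wall-clock time: jumps when the system time is set.
struct real_timespec_clock {
    static timespec now() { return now_as_timespec<std::chrono::system_clock>(); }
};

// Steady time: never jumps; suitable for measuring durations.
struct mono_timespec_clock {
    static timespec now() { return now_as_timespec<std::chrono::steady_clock>(); }
};

} // namespace sketch
```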
Updated test results. All Linux test results that are passable now pass, including futures. No change on the Windows results.
https://github.com/austin-beer/test-time-jumps/blob/master/linux_results_fixed_no_monotonic.txt https://github.com/austin-beer/test-time-jumps/blob/master/linux_results_fixed_yes_monotonic.txt https://github.com/austin-beer/test-time-jumps/blob/master/windows_results_fixed_no_monotonic.txt
FYI, the current Windows test is running without BOOST_THREAD_PROVIDES_GENERIC_SHARED_MUTEX_ON_WIN or BOOST_THREAD_PROVIDES_SHARED_MUTEX_UPWARDS_CONVERSIONS defined, so it's not using the POSIX implementation. If you need me to run the Windows test with those defined, let me know.
Great. We achieved the first goal.
Now we need to fix the windows failures (functions returning no-timeout) and improve the QOI.
BTW, I believe that you need to define BOOST_THREAD_DONT_PROVIDE_GENERIC_SHARED_MUTEX_ON_WIN if you don't want the generic shared mutex, and I don't see this on the Makefile.
// GENERIC_SHARED_MUTEX_ON_WIN
#if ! defined BOOST_THREAD_DONT_PROVIDE_GENERIC_SHARED_MUTEX_ON_WIN \
&& ! defined BOOST_THREAD_PROVIDES_GENERIC_SHARED_MUTEX_ON_WIN
#define BOOST_THREAD_PROVIDES_GENERIC_SHARED_MUTEX_ON_WIN
#endif
The same for
#if ! defined BOOST_THREAD_DONT_PROVIDE_SHARED_MUTEX_UPWARDS_CONVERSIONS \
&& ! defined BOOST_THREAD_PROVIDES_SHARED_MUTEX_UPWARDS_CONVERSIONS
#define BOOST_THREAD_PROVIDES_SHARED_MUTEX_UPWARDS_CONVERSIONS
#endif
I'm using the default BOOST_THREAD_VERSION, which is 2. That version doesn't automatically define BOOST_THREAD_PROVIDES_GENERIC_SHARED_MUTEX_ON_WIN or BOOST_THREAD_PROVIDES_SHARED_MUTEX_UPWARDS_CONVERSIONS, so I have to explicitly define them if I want them.
Here, these test results might be more helpful on Windows.
https://github.com/austin-beer/test-time-jumps/blob/master/windows_results_fixed_no_generic_shared.txt https://github.com/austin-beer/test-time-jumps/blob/master/windows_results_fixed_yes_generic_shared.txt
Here is the list of changes that I believe need to be made to allow the generic shared_mutex.hpp and future.hpp to work with the win32 code using the updated classes in timespec.hpp.

- Replace the do_wait() functions in win32/condition_variable.hpp with public do_wait_until() functions that have identical signatures to their counterparts in pthread/condition_variable_fwd.hpp. The new functions would all have internal_timespec_timepoint parameters.
- Replace basic_cv_list_entry::wait() in win32/condition_variable.hpp with basic_cv_list_entry::do_wait_until(), which has an internal_timespec_timepoint parameter.
- Update interruptible_wait() and non_interruptible_wait() in src/win32/thread.cpp to have an internal_timespec_timepoint parameter.
- Update interruptible_wait() and non_interruptible_wait() in win32/thread_data.hpp to match.
- Change do_try_join_for_noexcept() in src/win32/thread.cpp back to do_try_join_until_noexcept(), which has an internal_timespec_timepoint parameter. This makes it identical to the do_try_join_until_noexcept() function in src/pthread/thread.cpp.
- Update detail/thread.hpp to use the pthread join code for win32 as well, since there will no longer be any need for separate code paths for each.

I may have some time in the next week or so to work on this, if you don't get to it first.
Please Austin, could you run the tests once again?
I will try to fix the remaining issues on timed_mutex/recursive_timed_mutex and shared_mutex for windows. Once we have those fixed, I will merge this PR.
If you have time to add tests for the sync_queue time-related functions, we could fix any possible issues as well.
When I get the chance I'll run the tests again and add the sync_queue tests. I'm having trouble building on Linux ever since the threadapi commits. I'm hoping the solution is just to pull in newer versions of some of the other libraries in my boost workspace. Here's the error I'm getting in case you already know what the issue is:
Performing configuration checks
- 32-bit : no
- 64-bit : yes
- arm : no
- mips1 : no
- power : no
- sparc : no
- x86 : yes
- symlinks supported : yes
- lockfree boost::atomic_flag : yes
error: Name clash for '<pstage/lib>libboost_system.a'
error:
error: Tried to build the target twice, with property sets having
error: these incompatible properties:
error:
error: - none
error: - <threadapi>pthread
error:
error: Please make sure to have consistent requirements for these
error: properties everywhere in your project, especially for install
error: targets.
Also, you may want to hold off on merging this PR for a while. I discovered on Friday that our changes (specifically the polling every 100 milliseconds in condition_variable) have introduced a "lost wakeup" bug in condition_variable. The reason is the following statement in http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_cond_timedwait.html:
It is important to note that when pthread_cond_wait() and pthread_cond_timedwait() return without error, the associated predicate may still be false. Similarly, when pthread_cond_timedwait() returns with the timeout error, the associated predicate may be true due to an unavoidable race between the expiration of the timeout and the predicate state change.
When we're polling every 100 milliseconds, and pthread_cond_timedwait times out (or the equivalent on Windows), we're assuming that the predicate has not changed and so we're going ahead and polling for another 100 milliseconds. But because of the above statement that is not a valid assumption.
Because of this, I think some of our changes need to be reverted. But not all of the changes need to be reverted. Only the ones where we don't have access to the predicate (i.e. some of the wait_() functions in condition_variable). All of the sleep(), try_join(), and try_lock() functions should still be fine because we have access to the predicate in those situations. We just need to review each one to make sure we haven't introduced the "lost wakeup" bug into them either.
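For the cases where the predicate is available, the safe shape of the loop is to let the predicate, not the inner wait's timeout result, decide when to return. A sketch (my own illustration, not the branch's code):

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>

// Sketch (my illustration, not the branch's code): a 100 ms polling wait
// that cannot lose a wakeup, because the predicate is re-evaluated after
// every inner wait, including one that reports a timeout. POSIX allows the
// underlying timed wait to report a timeout even though the predicate just
// became true, so the predicate, not the inner wait's return value, must
// decide whether to keep waiting.
template <class Predicate>
bool poll_wait_for(std::unique_lock<std::mutex>& lk,
                   std::condition_variable& cv,
                   std::chrono::milliseconds total,
                   Predicate pred)
{
    const std::chrono::milliseconds slice(100);
    const auto deadline = std::chrono::steady_clock::now() + total;
    while (!pred()) { // re-checked after every inner wait, timeout or not
        const auto now = std::chrono::steady_clock::now();
        if (now >= deadline)
            return pred(); // one last check when the overall wait times out
        const auto next = now + slice;
        cv.wait_until(lk, next < deadline ? next : deadline);
    }
    return true;
}
```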
Austin, I agree with you that we could test the predicate each time the poll expires, but I don't see this as a defect, rather as a QOI issue.
Fortunately we don't have a lot of time-related functions with predicates. IIRC only condition variables support them.
I've followed the
Please try doing a pull of all the libraries on master and let me know what you get.
Alternatively, remove bin.v2 and build everything.
BTW, the sync_queue issue is due to the elapsed time being a little bit less than required.
elapsed = 492676000ms
libs\thread\test\sync\mutual_exclusion\sync_pq\pq_single_thread_pass.cpp(59): test 'diff < milliseconds(650) && diff >= milliseconds(500)' failed in function 'void __cdecl test_pull_for(void)'
I suspect that there was time slicing between the now taken at the start and the pull_for. I will test for > 400ms only.
Ok, I'll try updating to master and seeing what happens.
As far as the "lost wakeup" bug is concerned, is the following behavior acceptable?
Right now with our changes this behavior is possible, but I believe it shouldn't be.
I removed bin.v2, made sure I'd checked out and pulled the latest boost/master, and updated all submodules to the SHA-1s specified in boost/master. That builds fine. But as soon as I check out thread/develop into libs/thread, I get the same threadapi error. I also checked out the latest develop branch of each submodule to see if that would help, but it did not. I still got the same error. Something must be different between the way you're building and the way I'm building. Thanks in advance for any help you can provide!
How "Thread A misses the notification and waits the full 10 seconds before returning"?
From my side I just run the tests from the test directory as
b2 toolset=<tool> -jn -l300
How do you build Boost.Thread?
If thread A is right at the end of one of its 100 millisecond loops when notify_*() is called, do_wait_until() will return that a timeout occurred even though it received the signal from pthread_cond_signal() or pthread_cond_broadcast(). In our loop, if do_wait_until() reports that a timeout occurred (returns false), we go around and start waiting for another 100 milliseconds.
I use the following command to build just the libraries that I need for my test application:
b2 --build-dir=dist --with-chrono --with-thread --with-system toolset=gcc link=static variant=release stage
I'll try the command you use myself on a new repository.
I also run the following first, just FYI:
b2 headers toolset=gcc
But if "do_wait_until() reports that a timeout occurred (returns false)", isn't it because the thread has not received the signal? Or are you talking about the case where the signal was sent just after the thread received a timeout and before calling the wait again? If that's the case, I believe this is the normal behavior for wait_until/wait_for without a predicate. However, for the signatures with a predicate, we should evaluate the predicate after each timeout. Am I missing something evident?
Yes, I'm talking about the case where the signal was sent just after the thread received a timeout but before calling wait again. And yes, that is normal behavior. The problem is because we call do_wait_until() multiple times within a single call to wait_until(). For example:

1. The user calls wait_until(1 second).
2. wait_until() calls do_wait_until(100 milliseconds).
3. do_wait_until() times out, and a signal is sent.
4. do_wait_until() returns a timeout to wait_until().
5. wait_until() sees that do_wait_until() timed out, assumes that no signal was received, and calls do_wait_until(100 milliseconds) again instead of returning to the user.

So as far as the user is concerned the signal was lost, even though do_wait_until() behaved normally.
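The sequence can even be simulated deterministically; everything below is hypothetical scaffolding, not Boost code. A fake inner wait reports a timeout while a notification lands in the gap right after it, and the outer loop trusts that timeout and keeps polling:

```cpp
// Deterministic simulation of the sequence above (hypothetical scaffolding,
// not Boost code). The inner wait stands in for do_wait_until(): it reports
// a timeout even though a notification arrives in the gap right after it
// timed out. The outer loop stands in for wait_until(): it trusts the
// timeout, never consults the notification flag, and keeps polling.
struct fake_inner_wait {
    bool* notified; // set when the "signal" arrives
    int calls = 0;
    bool operator()() { // returns true on "timeout"
        ++calls;
        if (calls == 1)
            *notified = true; // the signal lands just after the timeout...
        return true;          // ...but we still report a timeout
    }
};

// Outer loop: polls up to 'slices' times; returns true only if it ever
// noticed the wakeup.
bool outer_wait(fake_inner_wait& inner, int slices)
{
    for (int i = 0; i < slices; ++i) {
        const bool timed_out = inner();
        if (!timed_out)
            return true; // never taken: the timeout is trusted blindly
    }
    return false; // wakeup lost: the flag was set, yet we polled to the end
}
```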
Well, the problem is that we don't know whether the user predicate has changed or not while the time has not yet elapsed or been reached, and IMO we cannot return from the wait_until/wait_for function as we don't know if the signal was lost or not. We could of course respond as if that were the case, but then what is the added value of having a duration or timepoint parameter? The user has asked to wait until the condition has been notified or the time is reached. Our algorithm works well when the notification is done while we are waiting and when the time is reached.
IIRC, we added the polling to stay as close as possible to the timeout. Maybe this was an error, and we should prioritize not losing notifications :(
If we have a steady condition variable, it is not necessary to poll if the user is waiting for a steady timepoint, and it is better not to. Are we polling in this case?
If however the user requests to wait on the system clock, and we poll/don't poll, what are the possible behaviors when the system time changes?
When we don't have steady condition variables, what are the possible behaviors if we poll/don't poll?
Have you tried adding threadapi=pthread to the command line on linux?
Adding threadapi=pthread to the command line fixes the issue for me. Thank you.
Branch to fix the time related issues with non-steady clocks.
Thanks to Austin Beer for his work on PR https://github.com/boostorg/thread/pull/141 and helping me to test this branch.