valgrind|cygwin: Some unit tests are failing on exceeding elapsed duration

jeking3 commented 5 years ago

A number of unit tests are failing in various situations because they place an upper bound on elapsed time. There are problems with this logic:

In all cases where duration is specified, for example "this_thread::sleep" or "try_lock_for", the specified duration is a "guaranteed minimum". The operation will wait "at least" that long. No guarantees are made that it will not wait longer.
The issue is exacerbated by running a more stressful test such as within valgrind or by running tests on an overloaded Appveyor build slave.

Here are some examples of this in action:

On a cygwin (32-bit) job: https://ci.appveyor.com/project/jeking3/thread/builds/20434396/job/nrsami3xyqhhp56v#L10016

====== BEGIN OUTPUT ======
libs\thread\test\sync\mutual_exclusion\shared_mutex\try_lock_for_pass.cpp(53): test '(d).count() < (ns(max_diff)).count()' ('101460100' < '75000000') failed in function 'void f2()'
1 error detected.

EXIT STATUS: 1 
====== END OUTPUT ======

On a valgrind job: https://travis-ci.org/jeking3/thread/jobs/457490787#L4453

testing.capture-output ../../bin.v2/libs/thread/test/recursive_mutex__try_lock_p.test/clang-linux-6.0/debug/cxxstd-11-iso/threadapi-pthread/threading-multi/visibility-hidden/recursive_mutex__try_lock_p.run
====== BEGIN OUTPUT ======
==26698== Memcheck, a memory error detector
==26698== Copyright (C) 2002-2013, and GNU GPL'd, by Julian Seward et al.
==26698== Using Valgrind-3.10.1 and LibVEX; rerun with -h for copyright info
==26698== Command: ../../bin.v2/libs/thread/test/recursive_mutex__try_lock_p.test/clang-linux-6.0/debug/cxxstd-11-iso/threadapi-pthread/threading-multi/visibility-hidden/recursive_mutex__try_lock_p
==26698== 
../../libs/thread/test/sync/mutual_exclusion/recursive_mutex/try_lock_pass.cpp(89): test '(d_ms).count() < (max_diff).count()' ('91' < '75') failed in function 'int main()'
../../libs/thread/test/sync/mutual_exclusion/recursive_mutex/try_lock_pass.cpp(90): test '(d_ns).count() < (ns(max_diff)).count()' ('91304790' < '75000000') failed in function 'int main()'
2 errors detected.

Other jobs have shown up to 175ms unexpected delay. I am planning on changing the file (misspelled) "timming.hpp" to use 250 for all cases. That will clear up most of these issues in CI.

viboes commented 5 years ago

Hi, this is not a solution. I would prefer to inhibit those test on some circumstances, changing them wouldn't test anything and will postpone the issue.

viboes commented 5 years ago

An alternative is to change the timing depending on the platform or environment.

jeking3 commented 5 years ago

I restored the "develop" copy of the timming.hpp header and I changed 75 to 125, since the clang-6.0 test I referenced took 90 and the cygwin test I referenced took 105, and left 250 alone. We'll see how that handles it.

jeking3 commented 5 years ago

I changed it to 100 for the baseline (due to the 90 I saw on clang-6), and added cygwin to win32 at 250.

viboes commented 5 years ago

I believe that you should enter the VALGRIND macro on the definition of those values, without changing the default. What do you think?

jeking3 commented 5 years ago

Sure, that makes sense.

jeking3 commented 5 years ago

This issue will be resolved by the CI work.

viboes commented 5 years ago

Where we are concerning this issue?

jeking3 commented 5 years ago

I sync'd with the latest boost-ci and re-running to see where we're at post-1.70.0.

jeking3 commented 5 years ago

Yeah, the unit tests are too much for CI. We need to start honoring the BOOST_NO_STRESS_TEST setting so CI has a chance of passing.

boostorg / thread

valgrind|cygwin: Some unit tests are failing on exceeding elapsed duration #253