DegreeOfParallelism > ProcessorCount causes "The test step was orphaned by the test runner!"

GoogleCodeExporter commented 8 years ago

What steps will reproduce the problem?
1. Set [assembly: DegreeOfParallelism(X)] where X > ProcessorCount.
2. Add [Parallelizable(TestScope.Descendants)] or 
[Parallelizable(TestScope.All)] to a test fixture containing many tests. (Mine 
had several data driven tests, some over a hundred rows, for a total of 263 
tests.)
3. Run the aforementioned fixture (I ran it with TD.NET).
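
A minimal sketch of a fixture that follows the steps above, assuming the MbUnit v3 attributes and the TestScope enum resolve from the MbUnit.Framework namespace. The fixture name, test body, and row values are placeholders, not the reporter's real tests. A real repro would need many more rows (the report mentions 263 tests in total).

```csharp
using MbUnit.Framework;

// Step 1: request more worker threads than the machine has processors.
// 16 is the value that failed immediately on the reporter's 4-processor system.
[assembly: DegreeOfParallelism(16)]

// Step 2: let everything in the fixture run in parallel.
[TestFixture]
[Parallelizable(TestScope.All)]
public class MySmokeTestFixture
{
    // Data-driven test: each Row becomes its own test step.
    [Test]
    [Row("/MyApp/", true)]
    [Row("/MyApp/Placeholder", true)] // hypothetical row, for illustration only
    public void WebPageSmokeTest(string path, bool expectSuccess)
    {
        // Placeholder body; the real test presumably requested the page.
        Assert.IsNotNull(path);
        Assert.IsTrue(expectSuccess);
    }
}
```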

What is the expected output? What do you see instead?

Expected outcome is for all tests to pass.

Actual outcome is that at some point during the test run, 1+ tests will fail 
with the exception listed below. It appears that when there are more test 
execution worker threads than processors, it is possible for a data-driven test 
to "finish" before all of the test's steps have finished. Note that how readily 
this symptom reproduces appears to be related to the amount by which the DoP 
exceeds the ProcessorCount. On my system (4 processors), using a DoP of 5 or 6 
usually resulted in 10%-40% of the 263 tests/steps passing before the error 
occurred. When using a DoP of 16, the error occurred immediately (i.e. before a 
single test/step had passed) every time. Once the error occurred, it appeared 
as if all remaining test steps also failed with the same error.

What version of the product are you using? On what operating system?

Gallio TestDriven.Net Runner - Version 3.2 build 601
WinXP
VS 2008

Please provide any additional information below.

TestCase '.../MySmokeTestFixture/WebPageSmokeTest/WebPageSmokeTest("/MyApp/", 
true)'
failed: 
    The test step was orphaned by the test runner!

Error: Internal error: An unhandled exception occurred while running a 
parallelizable action.
The exception occurred while test step 
'.../MySmokeTestFixture/WebPageSmokeTest/WebPageSmokeTest' was running.
System.InvalidOperationException: Cannot finish a step unless the test step is 
running.
   at Gallio.Model.Contexts.ObservableTestContext.FinishStep(TestOutcome outcome, Nullable`1 actualDuration, Boolean isDisposing) in c:\Server\Projects\MbUnit v3.2\Work\src\Gallio\Gallio\Model\Contexts\ObservableTestContext.cs:line 287
   at Gallio.Model.Contexts.ObservableTestContext.FinishStep(TestOutcome outcome, Nullable`1 actualDuration) in c:\Server\Projects\MbUnit v3.2\Work\src\Gallio\Gallio\Model\Contexts\ObservableTestContext.cs:line 236
Reported by: 
UnhandledExceptionPolicy
   at Gallio.Common.Concurrency.WorkScheduler.ReportUnhandledException(Exception ex) in c:\Server\Projects\MbUnit v3.2\Work\src\Gallio\Gallio\Common\Concurrency\WorkScheduler.cs:line 236
   at System.Threading._ThreadPoolWaitCallback.WaitCallback_Context(Object state)
   at System.Threading.ExecutionContext.runTryCode(Object userData)
   at System.Runtime.CompilerServices.RuntimeHelpers.ExecuteCodeWithGuaranteedCleanup(TryCode code, CleanupCode backoutCode, Object userData)
   at System.Threading.ExecutionContext.RunInternal(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
   at System.Threading._ThreadPoolWaitCallback.PerformWaitCallbackInternal(_ThreadPoolWaitCallback tpWaitCallBack)
   at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback(Object state)

Original issue reported on code.google.com by justin.w...@gmail.com on 22 Sep 2010 at 10:09

GoogleCodeExporter commented 8 years ago

Original comment by Yann.Tre...@gmail.com on 23 Sep 2010 at 7:37

Changed state: Accepted
Added labels: Component-MbUnit, Milestone-3.3, Priority-Medium, Type-Defect

GoogleCodeExporter commented 8 years ago

Original comment by Yann.Tre...@gmail.com on 14 Jun 2011 at 5:53

Added labels: Milestone-IdeaPool

GoogleCodeExporter commented 8 years ago

This happens for our test setup even if we don't set [assembly: 
DegreeOfParallelism(X)] where X is larger than Environment.ProcessorCount.

Original comment by scottle...@gmail.com on 10 Dec 2011 at 11:38

GoogleCodeExporter commented 8 years ago

I think I isolated the problem to some very subtle race condition in 
Gallio.Common.Concurrency.WorkScheduler, but only when used recursively, such 
as when using lots of Row attributes when the whole TestFixture is 
Parallelizable.  The Run method would return before all actions in the work set 
had fully executed. The problem occurs more often with a faster machine or at 
least one with more logical processors.  

Anyway, the race condition is so subtle that I couldn't figure out why it was 
occurring exactly :) even after quite a few hours of debugging and fiddling 
with the code.  So I just rewrote the WorkScheduler in a different way.  It 
fixed my problems and passes the original unit tests.  

Unfortunately, I can't prove 100% I fixed the problem and not just hid the 
problem with different timings, but it works for me...

Original comment by scottle...@gmail.com on 13 Dec 2011 at 7:00

Attachments:

WorkScheduler.cs.patch

GoogleCodeExporter commented 8 years ago

Attached is a very simple sample test project that demonstrates the problem.  
I've reproduced the problem on 3 different 4-core machines with hyper-threading 
enabled by setting DegreeOfParallelism to 8 or above.  If the problem doesn't 
occur just increase the number of rows or tests and it should occur eventually.

Original comment by scottle...@gmail.com on 23 Jan 2012 at 2:12

Attachments:

DegreesOfParallelismTest.zip

GoogleCodeExporter commented 8 years ago

The patched implementation has some issues.

Suppose that the Run() thread starts up #DOP threads.  Then it will itself run 
the next action.

Meanwhile, one or more of those other threads might finish up their job.  
However, no new work can be scheduled until the Run() thread finishes its 
action and notices that it needs to schedule new work.  So we get less 
utilization of processor cores than we would expect.

The original implementation of WorkScheduler does not have this problem.  New 
work can be scheduled on worker threads regardless of whether the Run() thread 
is currently busy.  That's because each worker thread is able to pick up 
additional work for itself.

I suggest you attempt a more targeted fix to the termination condition of the 
original implementation.

Unfortunately, I can't find the bug.  As far as I can tell, the termination 
condition of the Run() method guarantees that it will only exit when its work 
set has no more pending actions and none are in progress.
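
For readers following along, here is a rough sketch of the scheme described above: worker threads pull further work for themselves, and Run() exits only when nothing is pending and nothing is in progress. This is not Gallio's WorkScheduler; the PullingScheduler class and all of its members are hypothetical and only illustrate the termination condition.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical illustration only; not the Gallio implementation.
public sealed class PullingScheduler
{
    private readonly int degreeOfParallelism;

    public PullingScheduler(int degreeOfParallelism)
    {
        if (degreeOfParallelism < 1)
            throw new ArgumentOutOfRangeException("degreeOfParallelism");
        this.degreeOfParallelism = degreeOfParallelism;
    }

    // Runs every action; returns only when none is pending or in progress.
    public void Run(IEnumerable<Action> actions)
    {
        var pending = new Queue<Action>(actions);
        var gate = new object();
        int inProgress = 0;

        WaitCallback worker = _ =>
        {
            while (true)
            {
                Action next;
                lock (gate)
                {
                    if (pending.Count == 0)
                        return; // no more work for this worker
                    next = pending.Dequeue();
                    inProgress++; // dequeue and count under the same lock
                }
                try
                {
                    next();
                }
                catch (Exception)
                {
                    // A real scheduler would report this; the sketch swallows it.
                }
                finally
                {
                    lock (gate)
                    {
                        inProgress--;
                        Monitor.PulseAll(gate); // wake Run() to re-check
                    }
                }
            }
        };

        lock (gate)
        {
            int workers = Math.Min(degreeOfParallelism, pending.Count);
            for (int i = 0; i < workers; i++)
                ThreadPool.QueueUserWorkItem(worker);

            // Termination condition: nothing pending and nothing in progress.
            while (pending.Count > 0 || inProgress > 0)
                Monitor.Wait(gate);
        }
    }
}
```

Because an action is dequeued and counted as in progress under one lock, Run() cannot observe an empty queue while that action is still unaccounted for. The sketch says nothing about what happens when Run() is called recursively from inside an action, which is the case reported in this issue.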

Do you have any idea where things go wrong here?

Original comment by jeff.br...@gmail.com on 25 Mar 2012 at 12:06

GoogleCodeExporter commented 8 years ago

Could it be that any of these actions are throwing ThreadAbortException?

Original comment by jeff.br...@gmail.com on 25 Mar 2012 at 12:10

GoogleCodeExporter commented 8 years ago

Ah, I see what you mean.  My implementation definitely underutilizes the 
threads, so my original fears are probably true: I just hid the problem with 
different timings.  I really have no clue where things are going wrong.  

I'll investigate if there is any way a ThreadAbortException could be occurring 
as well as just make one more pass to see if I can figure out any other 
possible problems.

Original comment by scottle...@gmail.com on 29 Mar 2012 at 5:26

GoogleCodeExporter commented 8 years ago

It doesn't appear to be related to a ThreadAbortException.  I'm leaning towards 
the problem being somewhere in Gallio.Model.Contexts.ObservableTestContext 
or Gallio.Framework.Pattern.PatternTestExecutor, but I haven't ruled out the 
WorkScheduler. 

With the original WorkScheduler I can get all my tests to run by just 
commenting out the Dispose call in 
ObservableTestContext.HandleParentFinishedBeforeThisContext, but that should 
only be called if the parent test context finished and I can't see how that's 
happening.  Anyone else have any ideas?

Original comment by sle...@xignite.com on 30 Mar 2012 at 4:22

GoogleCodeExporter commented 8 years ago

Is there a workaround for this, or how far off is the fix?

Original comment by mmussm...@gmail.com on 25 May 2012 at 1:19

GoogleCodeExporter commented 8 years ago

The only workaround I have found is to place the tests in their own test 
fixture and not parallelize the test fixtures.  

I have also improved this using Jenkins to run multiple fixtures in parallel by 
running each fixture in a separate job like so:

Gallio.Echo.exe "<MyTestAssembly>.dll" "/f:Type:<FixtureClassName>" 
/report-type:xml-inline /runner:IsolatedProcess /verbosity:debug 
/report-name-format:TestReport

Obviously this defeats the purpose of parallelism in MbUnit, but we have a few 
test methods that run hundreds of test cases from XML, so it works for us: the 
test cases run in parallel, which is the important thing for us.

Original comment by tvarg...@gmail.com on 25 May 2012 at 2:09

GoogleCodeExporter commented 8 years ago

This problem is still not fixed. 
Using a DOP of 8 yields at best 4 tests running simultaneously. Increasing it 
causes random orphaned test failures (8 CPUs available). The only way I was 
able to get it to run DOP number of tests without failing was to not use ANY 
row, XML or factory tag in any test and shove them all in the same fixture. Why 
are fixtures eating up threads anyway?

Original comment by patrickb...@gmail.com on 18 Sep 2012 at 1:22

GoogleCodeExporter commented 8 years ago

Please fix this issue as soon as possible. Thanks

Original comment by raymond....@gmail.com on 5 Dec 2012 at 1:17

MudassarRasool / mb-unit

DegreeOfParallelism > ProcessorCount causes "The test step was orphaned by the test runner!" #732