dougbinks / enkiTS

A permissively licensed C and C++ Task Scheduler for creating parallel programs. Requires C++11 support.
zlib License
1.74k stars 145 forks source link

Crash when executing a small dependency graph in tight loop #78

Closed Liemarzac closed 1 year ago

Liemarzac commented 1 year ago

Hi Doug, First of all, thanks for your enkiTS lib. I enjoyed playing with it in the past week. I have noticed a rare crash by using the dependencies and managed to get a minimal code sample which will trigger it in few minutes running it. Sometimes it will take a while, but it will eventually crash. The sample contains a StartTask which triggers two TaskA. After completion the TaskAs will trigger a EndTask. Here is a simple topology of the task graph:

StartTask --> TaskA x 2 in parallel --> EndTask

By running the sample in a very tight loop, it will eventually crash deep in the TaskManager. This is the line which will crash

ENKI_ASSERT( prevDeps < pDependent->pTaskToRunOnCompletion->m_DependenciesCount ); where pTaskToRunOnCompletion is an invalid pointer.

It's entirely possible that I am not using the dependencies correctly, but I tried to stay as close as possible from the sample code provided. Because it seems to be timing specific, I suspect maybe a race condition in the LockLessMultireadPipe.

The sample I have provided should crash faster when compiled in Release.

enkiTSTest.zip

dougbinks commented 1 year ago

Many thanks for this test and bug report!

I was able to replicate this, and find the source of the bug - TaskScheduler::TaskComplete was accessing pDependentCurr->pTaskToRunOnCompletion in a thread when another thread called OnDependenciesComplete which marked the task as complete and testDependencies() exiting which deleted the task.

This likely doesn't occur for many programs as it requires the task to be deleted immediately after WaitforTask, and the examples and tests normally reuse the same tasks when in a loop.

I have a fix on branch dev_dependencies_fix_78 which I'm testing, and I'll also add a similar test later.

Let me know if this resolves the issue for you.

Liemarzac commented 1 year ago

Thanks for your quick turnaround on this issue. I am going to test it in the afternoon. It's likely that I hit this bug because the tasks are scheduled in a physics lib which runs many steps in quick succession. And you made a good point, the task graph should be built only once and stay around between each call. At the moment I just implemented a quick replacement of a parrallel_for loop,

Liemarzac commented 1 year ago

No more issues after grabbing that fix. Thanks Doug.

dougbinks commented 1 year ago

Excellent, thanks! It's been working well here on my projects and tests too so I shall merge it to main soon.