computablee / DotMP

A collection of powerful abstractions for parallel programming in .NET with an OpenMP-like API.
https://computablee.github.io/DotMP/
GNU Lesser General Public License v2.1

[BUG] No way to wait on child tasks from within a task without deadlock #49

Closed by computablee 10 months ago

computablee commented 1 year ago

Basically, the current behavior of DotMP.Parallel.Taskwait is to wait for all tasks in the queue to complete. This means that if you have nested tasks that you want to wait on from within a task, there is no good way to do so, as calling Taskwait from within a task causes a deadlock.

I am having a hard time thinking of a way to do this with the current DAG data structure without inducing additional overhead (which I would really like to avoid). Perhaps it would be worth implementing an overload of Taskwait that takes a params TaskUUID[] and waits on just those tasks to complete. It would be a bit more effort from the user, but much less bookkeeping behind the scenes, which keeps overhead down. I would like to discuss this before settling on how to solve this problem.
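
For illustration, the user-facing side of that overload might look something like this (the params overload is only the proposal above, and DoWork/DoMoreWork are hypothetical placeholders):

DotMP.Parallel.Task(() =>
{
    // spawn a child task and keep its UUID
    var child = DotMP.Parallel.Task(() =>
    {
        DoWork();
    });

    // proposed overload: wait on just this child, not on every task in the queue
    DotMP.Parallel.Taskwait(child);

    DoMoreWork();
});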

HarryHeres commented 11 months ago

Been thinking about this and came up with one idea. What if we introduced a new Task class that extends System.Action and adds a flag that is set only if the Thread is part of the DAG? This would also give us the benefit of being able to modify the Thread infrastructure, if we ever need to.

Although if we are not able to use inheritance, I don't think there's a better way than to check the UUIDs of the tasks against the DAG entries.

Also, is this something we should address at all? It will inherently add some overhead, and it could be left to the user (programmer) to make sure they do not introduce a deadlock in their application. It is hard to say whether the added safety is worth the performance slowdown.

Does OpenMP introduce a mechanism for this?

computablee commented 11 months ago

OpenMP makes no guarantees about avoiding deadlock. The default behavior from OpenMP is that, in addition to the DAG which represents dependencies between tasks, there's a tree which represents parent/child tasks. For example, if I had some code like:

#pragma omp task
{
    printf("In parent task.\n");

    #pragma omp task
    {
        printf("In child task.\n");
    }

    #pragma omp task
    {
        printf("In another child task.\n");
    }
}

One task is submitted to the queue. Then, when it runs, that task itself submits two child tasks to the queue. This behavior is already supported by DotMP.

What is not supported is something like this:

#pragma omp task
{
    printf("In parent task.\n");

    #pragma omp task
    {
        printf("In child task.\n");
        do_work();
    }

    //waits ONLY on child tasks, not for all tasks to complete
    #pragma omp taskwait

    do_more_work();
}

If we had equivalent code in DotMP, you would hit a deadlock because the parent task ends up waiting on itself to complete. In OpenMP, taskwait only waits on child tasks, so in the above example, do_more_work is not called until the child task performing do_work is completed.

So implementing this behavior is what I'm concerned with.

Regarding your comment, seeing if a task is part of the DAG is not a concern. The real concern is figuring out a way to implement this behavior without greatly exacerbating the existing tasking overhead, and while keeping it user-friendly.

HarryHeres commented 11 months ago

Hm, okay, thanks for the clarification! Will take a deeper look inside the codebase tomorrow and see how this could be taken care of.

HarryHeres commented 11 months ago

@computablee Oh yeah, I believe I get it. When calling DotMP.Parallel.Taskwait, the parent is also a part of the TaskingContainer (that is established when creating the container using the Region injection), which means that when calling TaskingContainer.GetNextTask(), it would give back the same thread and therefore result in a deadlock, because it would be waiting on itself. Is that correct?

computablee commented 11 months ago

@HarryHeres Almost.

Basically, what taskwait does is continuously poll GetNextTask until it reports 0 tasks remaining (through the out int tasks_remaining parameter). tasks_remaining is calculated through DAG.tasks_remaining. When tasks are added (either through Parallel.Task or Parallel.Taskloop), tasks_remaining is incremented. It is only decremented when a task is "retired" or "completed" via a call to DAG.CompleteItem.

So, adapting the earlier OpenMP example to DotMP, if we had:

DotMP.Parallel.ParallelMaster(() =>
{
    DotMP.Parallel.Task(() =>
    {
        Console.WriteLine("Parent task.");

        DotMP.Parallel.Task(() =>
        {
            Console.WriteLine("Child task.");
        });

        DotMP.Parallel.Taskwait();
    });
});

The basic control flow (supposing num_threads == 1, for simplicity) is this:

  1. The parent task is added to the queue. tasks_remaining == 1
  2. The end of the parallel region (or in this case, the parallel master region) is hit, causing an implicit taskwait to execute.
  3. taskwait ends up requesting a task from DAG.GetNextItem.
  4. The parent task is removed from the DAG, but not yet marked as complete. It begins execution.
  5. During execution of the parent task, the child task is added to the queue. tasks_remaining == 2
  6. Another taskwait is hit, this time inside the parent task, prompting the current task to suspend and begin working through the task queue.
  7. taskwait ends up requesting a task from DAG.GetNextItem.
  8. The child task is executed.
  9. The child task completes, prompting it to retire via DAG.CompleteItem. tasks_remaining == 1
  10. A deadlock is hit. Because the parent task has already been pulled from DAG.GetNextItem, it cannot be returned again; it remains in a state of "currently executing." However, due to the semantics of taskwait, the parent task cannot retire until tasks_remaining == 0, and tasks_remaining stays at 1 until the parent task retires. Thus, the program spins in taskwait indefinitely.

What should happen, assuming DotMP follows OpenMP's behavior on this, is that taskwait shouldn't spin until tasks_remaining == 0. It should spin only until all child tasks are complete (where a child task is defined as a task B that was created from another task A, in which case A is the parent of B and B is the child of A). This relationship of parent/children tasks is not yet captured in the existing data structures.
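
To make the idea concrete, here is a minimal sketch of the kind of parent/child bookkeeping this might require; none of this exists in the current DAG, and every name below is illustrative only:

using System.Collections.Concurrent;

// Hypothetical helper: per-task count of unfinished children.
static class ChildTracker
{
    private static readonly ConcurrentDictionary<ulong, int> unfinished_children =
        new ConcurrentDictionary<ulong, int>();

    // would be called when the task with parent_uuid creates a child
    public static void ChildCreated(ulong parent_uuid) =>
        unfinished_children.AddOrUpdate(parent_uuid, 1, (_, count) => count + 1);

    // would be called when a child of parent_uuid retires (i.e., via DAG.CompleteItem)
    public static void ChildCompleted(ulong parent_uuid) =>
        unfinished_children.AddOrUpdate(parent_uuid, 0, (_, count) => count - 1);

    // a taskwait issued from inside a task would spin only while this is true,
    // executing other ready tasks in the meantime
    public static bool HasUnfinishedChildren(ulong parent_uuid) =>
        unfinished_children.TryGetValue(parent_uuid, out int count) && count > 0;
}

With something like this, a Taskwait() called from inside task A would loop on GetNextItem only while A still has unfinished children, instead of until tasks_remaining == 0.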

HarryHeres commented 11 months ago

Hi and sorry,

I have been quite busy so I took some time before responding to this.

> Basically, what taskwait does is continuously poll GetNextTask until it reports 0 tasks remaining (through the out int tasks_remaining parameter). tasks_remaining is calculated through DAG.tasks_remaining. When tasks are added (either through Parallel.Task or Parallel.Taskloop), tasks_remaining is incremented. It is only decremented when a task is "retired" or "completed" via a call to DAG.CompleteItem.

> The end of the parallel region (or in this case, the parallel master region) is hit, causing an implicit taskwait to execute.

The rest of the description of the algorithm is very understandable, thank you so much!

Now, since it's already late today, I will be getting to it tomorrow.

PS: I'm sorry for taking my time here, but I wanted to be 100% sure I understand the principles of OpenMP as much as possible so we can discuss the most effective solution. I hope this won't bother you much?

computablee commented 11 months ago

There is an implicit taskwait at the end of every parallel region. It is injected here upon creation of the parallel region's Thread objects.

Also, feel free to ask questions! It doesn't bother me at all, I like discussing the internals of this project :)

HarryHeres commented 11 months ago

Awesome, thank you!

So for this problem, I also created a drawing to better analyze the deadlock. Now I believe I have an idea of how we could implement a countermeasure.

What if we changed this code in DotMP.Parallel.Taskwait:

do
{
    if (tc.GetNextTask(out Action do_action, out ulong uuid, out tasks_remaining))
    {
        do_action();
        tc.CompleteTask(uuid);
    }
}
while (tasks_remaining > 0);

to

bool task;
while (true)
{
    task = tc.GetNextTask(out Action do_action, out ulong uuid, out tasks_remaining);
    if (!task)
    {
        break;
    }
    do_action();
    tc.CompleteTask(uuid);
}

or alternatively

bool task;
do
{
    task = tc.GetNextTask(out Action do_action, out ulong uuid, out tasks_remaining);
    if (task)
    {
        do_action();
        tc.CompleteTask(uuid);
    }
}
while (task);

This way, we would only loop through the actual DAG items, rather than the presumed number of remaining tasks. At the end of the region, we could also make sure that tasks_remaining has reached zero.

What do you think?

computablee commented 11 months ago

The problem with that is if there's some task dependency chain that looks like:

Task[0..m-1] -> Task[m] -> Task[m+1..n]

where A -> B represents that A is a prerequisite for B. In other words, tasks m+1 through n depend on task m, and task m depends on tasks 0 through m-1.

What ends up happening in this scenario with the current code is that all threads work through tasks 0 through m-1 first. Then one thread executes task m, and when it completes, all threads begin executing tasks m+1 through n. With your solution, once task m is reached, all threads except the one executing task m retire, meaning that only one thread is left in the loop to execute tasks m+1 through n. The rest of the threads exit the loop because no task is immediately available, despite there being other tasks in the dependency chain that will become available soon.

HarryHeres commented 11 months ago

What exactly is a prerequisite? I suppose that if A->B->C, then to calculate B, you need to have A done and to calculate C, you need to have B done. Or is it the other way around?

Also, what about introducing a barrier to the dependencies?

computablee commented 11 months ago

Yes, so if task B depends on task A, then A is considered a "prerequisite" for B. In essence, DotMP guarantees that A finishes execution before B begins, if such a dependency is specified. Barriers are not relevant here, because this is part of the tasking scheduling system and everything happens inside taskwait (from the user's perspective).

Just to reiterate, tasks are not simply pulled from a single "pool." A significant amount of the overhead with task scheduling is resolving dependencies and determining which tasks are eligible to execute and which must wait. Some example code that might generate this chain of dependencies:

Task[0..m-1] -> Task[m] -> Task[m+1..n]

might be:

DotMP.Parallel.ParallelRegion(() =>
{
    DotMP.Parallel.Master(() =>
    {
        // generate Task[0..m-1]
        // the tasks generated from this taskloop are ready to be executed
        var t0 = DotMP.Parallel.Taskloop(A, B, i =>
        {
            do_some_work();
        });

        // generate Task[m]
        // this task will not be scheduled until the tasks generated from the prior taskloop have completed
        // this is guaranteed by DotMP by the semantics of "depends"
        var t1 = DotMP.Parallel.Task(depends: t0, action: () =>
        {
            do_different_work();
        });

        // generate Task[m+1..n]
        // these tasks will not be scheduled until the prior task is completed
        // again, this is guaranteed by DotMP
        DotMP.Parallel.Taskloop(C, D, depends: t1, action: i =>
        {
            do_even_different_work();
        });
    });

    DotMP.Parallel.Taskwait(); // begin executing the tasks
});

Now note that this partial ordering of tasks is only guaranteed because of the use of the depends parameter, which specifies dependencies between tasks. If the depends parameter were not used in the above code, all tasks would be immediately available for execution and may be executed in any order.

HarryHeres commented 11 months ago

Just to update - I'm checking out OpenMP in my free time to see the differences and understand the basic foundation DotMP is building on :)

Let me get back to this at the end of the week, when I have a bit more free time to discuss it further.

HarryHeres commented 11 months ago

Well, after some more thought, I've come up with yet another idea to address all the stuff we have discussed.

bool nextTask;
do
{
    nextTask = tc.GetNextTask(out Action do_action, out ulong uuid, out tasks_remaining);
    if (nextTask)
    {
        do_action();
        tc.CompleteTask(uuid);
    }
    else // No tasks in the queue, at the moment
    {
        if (tasks_running == 0) // No tasks in the queue but also no tasks that are being executed, which could fill the queue with dependencies
        {
            break;
        }
    }

}
while (tasks_remaining > 0);

The idea is that we would loop through all the tasks as we do now, but we would introduce a new variable/attribute in the DAG that tracks the number of tasks still running. The tasks_running attribute would be incremented each time a task's action begins executing and decremented each time it finishes. This way, the while loop would no longer deadlock, and we would also avoid the issue of dependent tasks not yet being in the queue.
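
A sketch of where that counter would be maintained, on top of the loop above (the tasks_running field and the Interlocked calls are assumptions, not existing DotMP code; assumes using System.Threading):

bool nextTask;
do
{
    nextTask = tc.GetNextTask(out Action do_action, out ulong uuid, out tasks_remaining);
    if (nextTask)
    {
        Interlocked.Increment(ref tasks_running);   // the task's action starts executing
        do_action();
        tc.CompleteTask(uuid);
        Interlocked.Decrement(ref tasks_running);   // the task has finished and retired
    }
    else if (Interlocked.CompareExchange(ref tasks_running, 0, 0) == 0)
    {
        break; // nothing queued and nothing running that could enqueue more tasks
    }
}
while (tasks_remaining > 0);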

What do you think? Do you think this would be possible, theoretically?

PS: The implementation is just illustrative of the concept

computablee commented 11 months ago

The problem with this solution is that it still doesn't solve the original problem: a deadlock is still hit because tasks_running stays at 1 in the example here.

Other issues include the Barrier() calls within Taskwait(). Currently taskwait acts as an implicit barrier, meaning that if it is called from within a task, it will still deadlock upon encountering a barrier because only the thread executing the task hits the barrier (since a task is only executed by a single thread). It's not as simple as removing the barriers -- those are there for a reason!
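
The underlying issue is just team-barrier semantics: every participant has to arrive before any of them can proceed. A tiny standalone illustration using System.Threading.Barrier (not DotMP code, and not how Taskwait() is implemented internally):

using System;
using System.Threading;

class BarrierIllustration
{
    static void Main()
    {
        // a team barrier expecting 4 participants
        var team = new Barrier(4);

        // only this thread ever arrives; the other 3 participants never call
        // SignalAndWait(), so the wait can never complete -- analogous to a
        // single thread reaching a barrier inside Taskwait() from within a task
        bool released = team.SignalAndWait(TimeSpan.FromSeconds(1));
        Console.WriteLine(released); // prints False: the barrier never opened
    }
}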

What I would recommend you do is to fork the repo and write a few integration tests to replicate the linked example and ensure that they fail (you'll probably wanna write a test that "times out" after some duration of time... for the linked example, 1 second is probably sufficient to check for a deadlock). Then, try your solutions in some dev branch and see if you can get the integration tests to pass. From there, we can inspect them and see if there might be any undesired behavior that we should write more tests for, or if it's a genuine solution.
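
For example, such a timeout-based test might look roughly like this (assuming an xUnit-style test project; the framework and test names here are assumptions, not existing DotMP tests):

using System;
using System.Threading.Tasks;
using Xunit;

public class NestedTaskwaitTests
{
    [Fact]
    public void Nested_Taskwait_should_not_deadlock()
    {
        var region = Task.Run(() =>
        {
            DotMP.Parallel.ParallelMaster(() =>
            {
                DotMP.Parallel.Task(() =>
                {
                    DotMP.Parallel.Task(() => Console.WriteLine("Child task."));
                    DotMP.Parallel.Taskwait(); // should only wait on the child task
                });
            });
        });

        // if the region doesn't finish within 1 second, assume it deadlocked
        Assert.True(region.Wait(TimeSpan.FromSeconds(1)), "Nested Taskwait deadlocked.");
    }
}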

I've spent the past several days working on the gpu branch implementing some very rudimentary proofs-of-concept for GPU acceleration in DotMP. Now that I have those proofs-of-concept, I'll turn my attention to this issue for the time being. This is a big issue I want to get resolved for DotMP v1.5.0, which I'm hoping to release early November. So, I'll do some testing on my own and see if I can come up with something.

HarryHeres commented 11 months ago

Okay then, seems like this will require way more inspection and trial than originally expected. I'll do some testing and see if there's a solution that would suffice. I just wanted to try and see whether a simple solution could be found by applying simple logic in my propositions; seems like not. Thank you for your patience!

PS: the proposed solutions were just adjusting the while loop while leaving the rest of the method the same as the original implementation; I didn't mean to replace the whole method. Apparently, I did not show that properly, sorry! I probably should've done some testing before making the propositions. Well, a valuable lesson for the future.

computablee commented 11 months ago

No worries. Yeah, this is the hardest open issue right now, and it genuinely has me conflicted on how I want to approach it. It won't be a simple change; it will almost certainly involve adding a couple of new data structures to the tasking runtime.

> PS: the proposed solutions were just adjusting the while loop while leaving the rest of the method the same as the original implementation; I didn't mean to replace the whole method.

No, I got that you meant this, no worries! I just wanted to make clear that there are other parts of Taskwait() that also need changing, aside from the changes you proposed.

HarryHeres commented 11 months ago

Awesome, thank you for the patience! Really appreciate that! :)