acts-project / traccc

Demonstrator tracking chain on accelerators

Platform-specific algorithm composition #52

Open stephenswat opened 3 years ago

stephenswat commented 3 years ago

Right now, our algorithms are structured as function objects (or, as C++ likes to confusingly call them, functors) which wrap some algorithmic code. This works fairly well for CPU code, and it will likely work well for other heterogeneous code in the near future, but there is a significant problem with this strategy: it assumes that function composition works the same way on every platform. Since these function objects are called sequentially on the CPU side, each algorithm is followed by an implicit synchronisation point, even on platforms where that is undesirable. For example, CUDA algorithms will wait unnecessarily, and we will not be able to exploit the asynchronous properties of the CUDA programming model.
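
To make the implicit-barrier problem concrete, here is a minimal, hypothetical sketch (the kernel and functor names are made up; this is not traccc code). Because each functor must synchronise before handing back its result, the two kernels can never overlap or be reordered:

#include <cuda_runtime.h>

__global__ void kernel_f(float * data) { /* first stage */ }
__global__ void kernel_g(float * data) { /* second stage */ }

struct f_algorithm {
    void operator()(float * data) {
        kernel_f<<<64, 256>>>(data);
        // Implicit barrier: the functor cannot hand back a "ready" result
        // without waiting, so the device idles until the host resumes.
        cudaDeviceSynchronize();
    }
};

struct g_algorithm {
    void operator()(float * data) {
        kernel_g<<<64, 256>>>(data);
        cudaDeviceSynchronize(); // a second, equally forced barrier
    }
};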

I am opening this issue because I think that we will, in the close-to-medium future, need to come up with a model for algorithm composition that is flexible towards the properties of specific platforms. We will want to encode in C++ the different behaviours we want to have for different platforms, preferably with as little additional code as possible: an implicit synchronisation point after every call on the CPU, for example, versus stream-based asynchronous execution in CUDA.

I am not sure C++ allows us to solve this problem in a truly elegant way, but we might be able to expand our definition of what constitutes an algorithm by requiring two methods instead of one (a rough interface sketch follows the list):

  1. A private asynchronous method that allows us to compose actions across streams and other models of asynchronous computation.
  2. A public synchronous method that inserts some sort of platform-specific synchronisation barrier after calling the private asynchronous method, so we retain the correct cause-and-effect relationship on the CPU side.
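
As a rough illustration, the base interface might look something like this. Everything here is a sketch, not settled design; note in particular that the CUDA variant below cannot actually satisfy a single virtual run_async signature, which is already a hint of the type-system trouble discussed further down:

template<typename I, typename R>
class algorithm {
public:
    virtual ~algorithm() = default;

    // Public entry point: guarantees the result is ready when it returns.
    virtual R run_sync(I & i) = 0;

protected:
    // Asynchronous building block used for composition; what "asynchronous"
    // means (streams, queues, nothing at all) is platform-specific, and so,
    // problematically, is the parameter list it needs.
    virtual R run_async(I & i) = 0;
};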

Then, on top of that, we would obviously need to formalize the composition itself. Essentially, we would be building a monad. For example, the function composition g ∘ f might look something like this on a CPU (in C++-like pseudocode):

template<typename I, typename M, typename R>
class cpu_composition : public algorithm<I, R> {
public:
    cpu_composition(algorithm<I, M> & f, algorithm<M, R> & g) : f1(f), f2(g) {}

    R run_sync(I & i) {
        // On the CPU there is nothing to wait for: the synchronous call is
        // simply the asynchronous one.
        return run_async(i);
    }

private:
    R run_async(I & i) {
        return f2.run_async(f1.run_async(i));
    }

    algorithm<I, M> & f1;
    algorithm<M, R> & f2;
};
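
A hypothetical usage sketch, assuming two concrete CPU algorithms with matching input and output types exist (every type and variable name here is invented for illustration):

// clusterizer   : algorithm<cell_collection, cluster_collection>
// measure_maker : algorithm<cluster_collection, measurement_collection>
cpu_composition<cell_collection, cluster_collection, measurement_collection>
    composed(clusterizer, measure_maker);

measurement_collection out = composed.run_sync(cells);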

...but it would look very different on the CUDA side...

template<typename I, typename M, typename R>
class cuda_composition : public algorithm<I, R> {
public:
    cuda_composition(algorithm<I, M> & f, algorithm<M, R> & g) : f1(f), f2(g) {}

    R run_sync(I & i) {
        cudaStream_t s;
        cudaStreamCreate(&s);
        R r = run_async(i, s);
        // The only barrier in the chain: wait for everything enqueued on
        // the stream before handing the result back to the caller.
        cudaStreamSynchronize(s);
        cudaStreamDestroy(s);
        return r;
    }

private:
    R run_async(I & i, cudaStream_t & s) {
        f1.run_async(i, s);
        // Passing return data between these two is iffy, because CUDA always
        // uses output parameters which are hard to model, but that is an
        // implementation detail.
        return f2.run_async(..., s);
    }

    algorithm<I, M> & f1;
    algorithm<M, R> & f2;
};
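
One conceivable way to model those output parameters (purely illustrative, not a settled design) is to let intermediate results travel along the stream as owning device buffers rather than as host-side return values:

// Hypothetical: a handle to device memory that becomes valid only once the
// work enqueued on its stream has completed.
template<typename T>
struct device_result {
    T * ptr;         // device allocation holding the payload
    cudaStream_t s;  // stream on which the data becomes valid
};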

The thing is, I am not sure this would be the best solution. On the plus side, it requires minimal boilerplate code, which is great. On the other hand, it is a serious pain in the behind for the type system, and it would require us to think long and hard about what we want to decide at compile time and what we want to defer to run time, because that choice will shape much of the further design of traccc.

Anyhow, this was mostly a brain dump, and I would be very curious to hear everyone else's opinion on how we should proceed here.

aaronj0 commented 1 year ago

Hello everyone, I'm Aaron and I'm quite interested in contributing to the traccc project.

We can utilize the fact that kernels launched on the same stream execute sequentially. So if a function g launches multiple kernels with a dependency between them (for example, g computes f2(f1(x))), we can run the whole function on one stream; but if a function g1 does not depend on g2, we can run the two on separate CUDA streams.
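
A minimal sketch of that pattern (kernel names hypothetical): dependent stages share a stream, independent stages get their own.

cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);

kernel_f1<<<64, 256, 0, s1>>>(/* ... */); // f1 then f2: same stream, so the
kernel_f2<<<64, 256, 0, s1>>>(/* ... */); // runtime serialises them in order
kernel_g2<<<64, 256, 0, s2>>>(/* ... */); // independent work, may overlap s1

cudaStreamSynchronize(s1);
cudaStreamSynchronize(s2);
cudaStreamDestroy(s1);
cudaStreamDestroy(s2);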

In the example provided, run_async is not asynchronous with respect to the other function calls: since they share a CUDA stream, the kernels will be launched one after the other. It is asynchronous with respect to other cuda_composition instances.

I am unsure how we can implement error synchronisation, since checking for errors is atomic: if a function results in an error, its kernel will have to be aborted without blocking the rest of the work.
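
For reference, the usual CUDA pattern (placement here is hypothetical) illustrates the tension: launch errors can be checked immediately, but execution errors only surface at a synchronisation point.

kernel_f<<<64, 256, 0, s>>>(/* ... */);
// Configuration/launch errors are visible right away, without blocking:
cudaError_t launch_err = cudaPeekAtLastError();
// ... more asynchronous work can be enqueued here ...
// Errors raised *during* execution only surface once we synchronise:
cudaError_t exec_err = cudaStreamSynchronize(s);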

stephenswat commented 1 year ago

Hi Aaron, sorry for responding to your comment in such a tardy fashion. I appreciate it; in the time since I opened this issue we have made relatively little progress on this front, but I have started work on composable CUDA graphs (#307), which allow the runtime to restructure the execution graph as it sees fit while keeping dependencies in place. I'd be interested to hear your further opinions.
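
For anyone following along, stream capture is the standard way to build such graphs; a minimal sketch (not the actual #307 implementation, and with hypothetical kernels) looks like this:

cudaStream_t s;
cudaStreamCreate(&s);

// Record the enqueued work into a graph instead of executing it directly.
cudaGraph_t graph;
cudaStreamBeginCapture(s, cudaStreamCaptureModeGlobal);
kernel_f<<<64, 256, 0, s>>>(/* ... */);
kernel_g<<<64, 256, 0, s>>>(/* ... */);
cudaStreamEndCapture(s, &graph);

// Instantiate once, then replay; the runtime is free to optimise execution
// as long as the recorded dependencies are respected.
cudaGraphExec_t exec;
cudaGraphInstantiate(&exec, graph, nullptr, nullptr, 0);
cudaGraphLaunch(exec, s);
cudaStreamSynchronize(s);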

If you would still like to contribute, please contact me by e-mail and we can get you started.