NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] triple buffering/pipelining for SQL #11343

Open revans2 opened 1 month ago

revans2 commented 1 month ago

Is your feature request related to a problem? Please describe.

The "happy path" for GPU SQL processing is for a task to have one batch of input and, after the computation is done, one batch of output. This way the GPU semaphore can let a task onto the GPU, and the task computes everything in one go. It then copies the result back to the CPU, releases the semaphore with nothing left in GPU memory, and writes out the result.
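To make the happy path concrete, here is a minimal sketch in Python. It is not the plugin's code: `gpu_semaphore`, `run_happy_path_task`, `compute`, and `write_output` are hypothetical names, a plain `threading.Semaphore` stands in for the GPU semaphore, and a `list(...)` call stands in for the device-to-host copy.

```python
import threading

# Hypothetical stand-in for the GPU semaphore: at most 2 tasks on the GPU.
gpu_semaphore = threading.Semaphore(2)


def run_happy_path_task(input_batch, compute, write_output):
    """One batch in, one batch out; the semaphore is held only for GPU work."""
    gpu_semaphore.acquire()                 # wait to be let onto the GPU
    try:
        gpu_result = compute(input_batch)   # all GPU work for the task, in one go
        host_result = list(gpu_result)      # stand-in for the copy back to the CPU
    finally:
        gpu_semaphore.release()             # nothing is left in GPU memory here
    write_output(host_result)               # output I/O runs without the semaphore
    return host_result
```

The point of the ordering is that the write happens only after the release, so a task never blocks other tasks from the GPU while it is doing output I/O.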

But the real world is messy and very few paths are the "happy path". To deal with these cases we require that whenever an operator calls next or hasNext, or returns a result from an iterator, all of the GPU memory it references is spillable. This is because the GPU semaphore might be released at any of those points. That lets us keep running, but it can also result in a lot of spilling. Many operators hold onto memory between batches because they have no way to recompute it, or because recomputing it is expensive. This is especially a problem when we release the semaphore on the chance that we might do I/O: other tasks are then let onto the GPU, and those tasks increase the memory pressure, resulting in more spilling.
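The spillable requirement can be pictured with a small Python sketch. Everything here is an assumption made for illustration: `SpillableBatch`, `RunningSumIterator`, and the `on_upstream_call` hook are hypothetical names, and a boolean flag stands in for whether a buffer currently lives in GPU memory.

```python
class SpillableBatch:
    """Hypothetical wrapper: the data is either on the GPU or spilled."""

    def __init__(self, data):
        self._data = list(data)
        self.on_gpu = True

    def spill(self):
        # Stand-in for moving the buffer off the GPU under memory pressure.
        self.on_gpu = False

    def materialize(self):
        # Stand-in for bringing the buffer back onto the GPU before use.
        self.on_gpu = True
        return self._data


class RunningSumIterator:
    """Operator that keeps state (a running total) between batches."""

    def __init__(self, upstream, on_upstream_call=None):
        self._upstream = iter(upstream)
        self._state = SpillableBatch([0])
        # Hook used only to simulate memory pressure while we wait upstream.
        self._on_upstream_call = on_upstream_call or (lambda state: None)

    def __iter__(self):
        return self

    def __next__(self):
        # State held across the upstream call is referenced only through a
        # SpillableBatch, so it may be spilled while the semaphore is released.
        self._on_upstream_call(self._state)
        batch = next(self._upstream)
        total = self._state.materialize()[0] + sum(batch)
        self._state = SpillableBatch([total])
        return total
```

Passing `on_upstream_call=lambda state: state.spill()` simulates the worst case, where the held state is spilled on every upstream call and has to be brought back each time. That round trip is the cost the paragraph above describes.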

Currently, releasing the semaphore is up to the operator when it calls next or hasNext. This leads to problems and inconsistent behavior: we can end up doing I/O with the semaphore held, or we can release the semaphore while holding onto a lot of GPU memory, only to find out that there is nothing more for us to process.

In an ideal world we want.

This feels like a lot, but I think we can do it with a few changes at a time.

revans2 commented 4 days ago

We probably also want to come up with a set of standardized benchmarks to cover this use case, as NDS does not cover it well.

https://github.com/NVIDIA/spark-rapids/pull/11376#issuecomment-2400253511 is a comment I made about them, but I will file a formal issue to create them.