[FEA] triple buffering/pipelining for SQL

Is your feature request related to a problem? Please describe.

The "happy path" for GPU SQL processing is a task that takes one batch of input and, once the computation is done, produces one batch of output. In that case the GPU semaphore lets the task onto the GPU, and the task computes everything in one go. It then copies the result back to the CPU, releases the semaphore with nothing left in GPU memory, and writes out the result.
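The ordering of the happy path can be sketched as follows. This is only an illustration: a plain `java.util.concurrent.Semaphore` stands in for the GPU semaphore, an array copy stands in for the device-to-host copy, and `HappyPathSketch`, `computeOnGpu`, and `runTask` are hypothetical names, not spark-rapids APIs.

```java
import java.util.concurrent.Semaphore;

// Hypothetical sketch; none of these names come from spark-rapids.
final class HappyPathSketch {
    // Stand-in for the GPU semaphore: limits how many tasks use the GPU at once.
    static final Semaphore GPU_SEMAPHORE = new Semaphore(1);

    // Stand-in for the GPU computation: doubles every value in the batch.
    static long[] computeOnGpu(long[] batch) {
        long[] out = new long[batch.length];
        for (int i = 0; i < batch.length; i++) {
            out[i] = batch[i] * 2;
        }
        return out;
    }

    // One batch in, one batch out, with the semaphore held only for the GPU part.
    static long[] runTask(long[] inputBatch) {
        long[] hostResult;
        GPU_SEMAPHORE.acquireUninterruptibly();
        try {
            long[] deviceResult = computeOnGpu(inputBatch);
            // Stand-in for copying the result back to the CPU.
            hostResult = deviceResult.clone();
        } finally {
            // Nothing is left "on the GPU" at this point, so releasing is safe.
            GPU_SEMAPHORE.release();
        }
        // Writing out the result would happen here, after the release.
        return hostResult;
    }

    public static void main(String[] args) {
        long[] result = runTask(new long[] {1, 2, 3});
        System.out.println(java.util.Arrays.toString(result));
    }
}
```

The point of the sketch is only the ordering: the write happens after the semaphore is released, and no GPU state outlives the release.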

But the real world is messy and very few paths are the "happy path". To deal with these cases we require that whenever an operator calls next or hasNext, or returns a result from an iterator, all of the GPU memory it is referencing is spillable. This is because the GPU semaphore might be released at any of those points. This lets us continue to run, but it can also result in a lot of spilling. Many operators hold onto memory between batches because they have no way to recompute it, or because it is expensive to recompute. This is especially a problem when we release the semaphore on the chance that we might do I/O. Other tasks are then let onto the GPU, and those tasks increase the memory pressure on the GPU, resulting in more spilling.

Currently releasing the semaphore is up to the operator when it calls next or hasNext. This can result in a lot of problems and inconsistent behavior. We can end up doing I/O with the semaphore held. We can end up releasing the semaphore and holding onto a lot of GPU memory just to find out that there is nothing more for us to process.

In an ideal world we want:

- The GPU semaphore is released only when blocking I/O is required to complete an operation or when the task is done using the GPU.
- I/O is done in the background as much as possible to avoid releasing the semaphore.
- I/O and GPU computation have flow control built in so that we can
  - keep I/O as busy as possible
  - keep the GPU as busy as possible
  - not use too much GPU memory
- Consistent priority on all computation, spilling, and I/O to reduce context switching of processing on the GPU and memory pressure on the GPU.
- No deadlocks or livelocks.
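One way to picture these goals together is a three-stage pipeline: a background reader, a compute loop that holds the semaphore, and a background writer, connected by small bounded queues. The sketch below is an assumption-laden illustration, not a proposed implementation. `PipelineSketch`, the queue capacity of 2, and the doubling "computation" are all made up, and a real version would also have to make any held GPU memory spillable before each release.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Semaphore;

// Hypothetical sketch; none of these names are spark-rapids APIs.
final class PipelineSketch {
    static final Semaphore GPU_SEMAPHORE = new Semaphore(1);
    private static final long[] END = new long[0]; // end marker, compared by identity

    // Take the next input, giving up the GPU only if the read would block.
    private static long[] nextInput(BlockingQueue<long[]> in) throws InterruptedException {
        long[] batch = in.poll(); // non-blocking: is a batch already buffered?
        if (batch == null) {
            GPU_SEMAPHORE.release(); // blocking I/O is now required
            try {
                batch = in.take();
            } finally {
                GPU_SEMAPHORE.acquireUninterruptibly();
            }
        }
        return batch;
    }

    // Hand a result to the writer, giving up the GPU only if the queue is full.
    private static void emit(BlockingQueue<long[]> out, long[] batch) throws InterruptedException {
        if (!out.offer(batch)) {
            GPU_SEMAPHORE.release();
            try {
                out.put(batch);
            } finally {
                GPU_SEMAPHORE.acquireUninterruptibly();
            }
        }
    }

    List<Long> run(List<long[]> source) {
        // Bounded queues are the flow control: they cap how much is buffered.
        BlockingQueue<long[]> in = new ArrayBlockingQueue<>(2);
        BlockingQueue<long[]> out = new ArrayBlockingQueue<>(2);
        List<Long> written = new ArrayList<>();

        Thread reader = new Thread(() -> { // background "file/shuffle read"
            try {
                for (long[] b : source) in.put(b);
                in.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        Thread writer = new Thread(() -> { // background "file/shuffle write"
            try {
                for (long[] b = out.take(); b != END; b = out.take()) {
                    for (long v : b) written.add(v);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        reader.setDaemon(true);
        writer.setDaemon(true);
        reader.start();
        writer.start();

        GPU_SEMAPHORE.acquireUninterruptibly();
        try {
            for (long[] b = nextInput(in); b != END; b = nextInput(in)) {
                long[] result = new long[b.length];
                for (int i = 0; i < b.length; i++) result[i] = b[i] * 2; // "GPU" work
                emit(out, result);
            }
            emit(out, END);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            GPU_SEMAPHORE.release(); // the task is done using the GPU
        }
        try {
            writer.join(); // wait for the remaining writes without holding the GPU
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return written;
    }
}
```

With this shape, one batch can be read while another is being computed and a third is being written, which is the "triple buffering" in the title. The queue capacities are the knob that trades I/O and GPU utilization against buffered memory. The sketch does not address consistent priorities across tasks.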

This feels like a lot, but I think we can do it with a few changes at a time.

[ ] https://github.com/NVIDIA/spark-rapids/issues/11575 - Benchmarks so we can measure our progress towards fixing this issue.
[ ] https://github.com/NVIDIA/spark-rapids/issues/8301 - This reduces the memory pressure in cases where we are computation bound on the GPU today, but end up releasing the semaphore to do really fast I/O
[ ] https://github.com/NVIDIA/spark-rapids/issues/11341 - This is to help put shuffle writes in a background thread so that we can overlap them with computation.
[ ] https://github.com/NVIDIA/spark-rapids/issues/11344 - This is to help with the shuffle reads so that we can see whether an end-to-end solution is really worth pursuing.
[ ] https://github.com/NVIDIA/spark-rapids/issues/11342 - This is the same as the shuffle write, but for putting file writes in a background thread to try and reduce the number of times we release the semaphore. This assumes that the experiments worked out well.
[ ] https://github.com/NVIDIA/spark-rapids/issues/1815 - If the shuffle results look good, then we want to try and do the same kind of thing for file reads. The first round would just be to use the existing code and release the semaphore when a blocking I/O would need to take place.
[ ] https://github.com/NVIDIA/spark-rapids/issues/11345 - This is to do the hard work of getting the file readers to play nicely with the read coordinator and, in some cases, to read data ahead and buffer it.


[FEA] triple buffering/pipelining for SQL #11343