NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

Spill framework refactor for better performance and extensibility #11747

Open abellina opened 10 hours ago

abellina commented 10 hours ago

This is a very large PR that I'd like some :eyes: on. I marked it as a draft because I still have some TODOs around adding more tests. The PR is NOT going to go into 24.12; it's just that we don't have a 25.02 available.

The main file to focus on is SpillFramework.scala (yep, one file; let me know if you want me to break it into multiple files). SpillFramework.scala has a comment describing how things should work, so please take a look at that.

The main contribution here is a simplification of the framework: we replace the idea of a RapidsBuffer that has to be acquired and unacquired with the idea of a handle that just knows how to materialize. There is no concept of acquisition in the new framework.
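To make the "handle that just knows how to materialize" idea concrete, here is a minimal sketch. It is not taken from SpillFramework.scala: `MaterializingHandle` and `BytesHandle` are hypothetical names, and a plain byte array stands in for device memory. The point it illustrates is that callers never acquire or release anything; they ask the handle for the data and own whatever comes back.

```scala
// Hypothetical sketch: these names are illustrative and are NOT the plugin's API.
trait MaterializingHandle[T] {
  /** Return a usable copy of the data, wherever it currently lives. */
  def materialize(): T
}

// A byte array stands in for a device buffer; a real handle would wrap GPU memory.
final class BytesHandle(initial: Array[Byte]) extends MaterializingHandle[Array[Byte]] {
  private val data: Array[Byte] = initial.clone()
  private var onDevice: Boolean = true

  /** Pretend to move the data off the device; returns the number of bytes freed. */
  def spill(): Long = synchronized {
    if (onDevice) {
      onDevice = false
      data.length.toLong
    } else {
      0L // already spilled, nothing more to free
    }
  }

  def isSpilled: Boolean = synchronized(!onDevice)

  /** No acquire/unacquire pairing: the caller gets its own copy every time. */
  override def materialize(): Array[Byte] = synchronized(data.clone())
}
```

With this shape, code that previously had to pair an acquire with a release only needs a single `materialize()` call, and it works the same whether or not the handle has been spilled in the meantime.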

There is a SpillableColumnarBatch API and a lazy-spillable API for Join that I left untouched on purpose. We can start to remove that API and create spillable handles that replicate the lazy behavior we wanted in lazy spillable, or the recomputing behavior we want for broadcasts. This is the second contribution of the PR: handles decide how to spill, not the framework.
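A rough sketch of what "handles decide how to spill" could look like follows. All names (`SpillableHandle`, `CopyingHandle`, `RecomputingHandle`, `SimpleSpiller`) are hypothetical and do not come from the PR, and the 4-bytes-per-element size is an assumption made only to have something to count. One handle keeps a host copy when spilled, the other drops its data and recomputes it on demand (loosely the behavior described for broadcasts), while the spiller only iterates and asks.

```scala
// Hypothetical sketch: illustrative names only, not the plugin's API.
trait SpillableHandle {
  def spillable: Boolean
  def spill(): Long // bytes freed
  def materialize(): Vector[Int]
}

// Strategy 1: on spill, keep the data in a "host" slot.
final class CopyingHandle(initial: Vector[Int]) extends SpillableHandle {
  private var device: Option[Vector[Int]] = Some(initial)
  private var host: Option[Vector[Int]] = None

  override def spillable: Boolean = synchronized(device.isDefined)

  override def spill(): Long = synchronized {
    device match {
      case Some(v) =>
        host = Some(v)
        device = None
        v.size * 4L // assumed size: 4 bytes per Int
      case None => 0L
    }
  }

  override def materialize(): Vector[Int] = synchronized(device.orElse(host).get)
}

// Strategy 2: on spill, drop the data; recompute it the next time it is needed.
final class RecomputingHandle(compute: () => Vector[Int]) extends SpillableHandle {
  private var cached: Option[Vector[Int]] = None

  override def spillable: Boolean = synchronized(cached.isDefined)

  override def spill(): Long = synchronized {
    val freed = cached.map(_.size * 4L).getOrElse(0L)
    cached = None
    freed
  }

  override def materialize(): Vector[Int] = synchronized {
    cached.getOrElse {
      val v = compute()
      cached = Some(v)
      v
    }
  }
}

// The "framework" side knows nothing about either strategy.
object SimpleSpiller {
  def spillUntil(handles: Seq[SpillableHandle], targetBytes: Long): Long = {
    var freed = 0L
    val it = handles.iterator
    while (freed < targetBytes && it.hasNext) {
      val h = it.next()
      if (h.spillable) freed += h.spill()
    }
    freed
  }
}
```

The design choice this is meant to show is that adding a new spill behavior means adding a new handle type, with no change to the loop in `SimpleSpiller`.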

There is one easily fixable shortcoming today in the multiple-spiller case, which I will fix in a follow-up PR. While we are spilling a handle, the handle holds a lock, and the same lock is used to figure out whether the handle is spillable. A second thread that is trying to spill may therefore have to wait for this lock to be released (that is, for the spill to finish) just to find out whether it needs to spill that handle. We can make this more straightforward by handling the spill state separately from the materialization/data state, but I'd like to submit that work as an improvement.
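One possible shape for separating the spill state from the data state is sketched below. This is an assumption about how such a follow-up might look, not the planned implementation: `TwoStateHandle` and the `SpillState` objects are hypothetical names, and the atomic compare-and-set is just one way to let a second spiller check or claim a handle without waiting on the data lock.

```scala
import java.util.concurrent.atomic.AtomicReference

// Hypothetical sketch: illustrative names only, not the plugin's API.
sealed trait SpillState
case object Resident extends SpillState
case object Spilling extends SpillState
case object Spilled extends SpillState

final class TwoStateHandle(initial: Array[Byte]) {
  // Spill state lives outside the data lock, so checking it never blocks.
  private val state = new AtomicReference[SpillState](Resident)
  private val dataLock = new Object
  private val data: Array[Byte] = initial.clone()

  /** Non-blocking check: false while another thread is spilling or after a spill. */
  def spillable: Boolean = state.get() == Resident

  /** Only the thread that wins the compare-and-set does the (slow) spill. */
  def trySpill(): Long = {
    if (state.compareAndSet(Resident, Spilling)) {
      val freed = dataLock.synchronized {
        // A real handle would copy to host or disk here, while holding the data lock.
        data.length.toLong
      }
      state.set(Spilled)
      freed
    } else {
      0L // someone else is spilling or has spilled this handle; move on
    }
  }

  def materialize(): Array[Byte] = dataLock.synchronized(data.clone())
}
```

In this sketch a second spilling thread sees `spillable == false` immediately and can move on to the next handle instead of queuing behind the first thread's copy.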

I have run this against NDS @ 3TB in our perf cluster and I don't see regressions. I have also run it against spill-prone cases, where I am able to see multiple threads in the "spill path" and no deadlocks. I'll post more results when I can run them.