NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Stop doing stupid things when moving project ops to the GPU #7057

Open revans2 opened 1 year ago

revans2 commented 1 year ago

Is your feature request related to a problem? Please describe.

There are a lot of cases where we end up doing really dumb things on the GPU. Typically this involves moving data to the GPU for an operation that would be trivial on the CPU, and then moving the data back to the CPU afterwards. We tried to address this with a cost-based optimizer (CBO) in the past and even got code for it checked in, but it did not work out all that well and it is off by default.

In the meantime we have spent a lot of effort making delta lake log queries always run on the CPU, because their performance on the GPU is really bad.

Describe the solution you'd like

Develop a number of test situations where we do something stupid today. At a minimum we need a test that starts on the CPU, tries to reorder or drop a column on the GPU (something that is almost free on the CPU), and then goes back to the CPU for more processing. We should also look at delta lake log processing for ideas and examples of things that are bad on the GPU. We should keep in mind that many variables impact performance: things like the number of rows being processed and the schema of the columns can have a huge impact and should be part of the tests. We should also come up with a set of cases, if we can find any, where a project on the GPU can pay for the cost of the transitions.
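As a back-of-the-envelope way to think about "paying for the transitions", the round trip only wins when the compute saved on the GPU exceeds the cost of copying the batch both ways. A minimal sketch, assuming a made-up PCIe bandwidth figure (the function name and default are illustrative, not measured or from the plugin):

```python
def gpu_round_trip_pays_off(cpu_time_s, gpu_time_s, bytes_moved,
                            pcie_bw_bytes_per_s=12e9):
    """Return True if running the op on the GPU, including both
    host<->device copies, beats just staying on the CPU.
    pcie_bw_bytes_per_s is an assumed effective transfer rate."""
    transfer_s = 2 * bytes_moved / pcie_bw_bytes_per_s  # to GPU and back
    return gpu_time_s + transfer_s < cpu_time_s

# A column drop is ~free on the CPU, so it can never amortize the copies.
print(gpu_round_trip_pays_off(cpu_time_s=0.0001, gpu_time_s=0.00005,
                              bytes_moved=1 << 30))  # False
```

This is why the schema and row count matter so much in the tests: `bytes_moved` scales with both, while a trivial projection's `cpu_time_s` barely does.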

If we cannot find any project operation where going to the GPU just for the project and back to the CPU is worth it, then things are simple: we write a hand-coded rule to never do this, just like we have for shuffle. We need to be careful here and might need a config to disable the rule for testing, because our integration tests trigger this pattern a lot while checking functionality/correctness rather than performance.
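A minimal sketch of what such a hand-coded rule could look like, using a hypothetical plan representation (these are not the plugin's real classes): demote a GPU Project back to the CPU whenever both its parent and all of its children run on the CPU, so we never pay two transitions for a nearly-free operation.

```python
from dataclasses import dataclass, field

@dataclass
class PlanNode:
    name: str
    on_gpu: bool
    children: list = field(default_factory=list)

def demote_isolated_gpu_projects(node, parent_on_gpu=False):
    """Post-order walk: any GPU 'Project' whose parent and all
    children are on the CPU is moved back to the CPU."""
    for child in node.children:
        demote_isolated_gpu_projects(child, parent_on_gpu=node.on_gpu)
    if (node.name == "Project" and node.on_gpu
            and not parent_on_gpu
            and all(not c.on_gpu for c in node.children)):
        node.on_gpu = False
    return node

plan = PlanNode("Filter", on_gpu=False, children=[
    PlanNode("Project", on_gpu=True, children=[
        PlanNode("Scan", on_gpu=False)])])
demote_isolated_gpu_projects(plan)
print(plan.children[0].on_gpu)  # False
```

A real rule in the plugin would run over Catalyst's `SparkPlan` tree during plan conversion, but the shape of the check is the same, and a config flag could simply skip the demotion step for testing.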

If there are cases where the transitions are worth it, then we need to test a few alternatives.

  1. A hand-coded rule where we always fall back to the CPU.
  2. A hand-coded rule where we only fall back in obviously bad cases (the CPU is doing no computation, just dropping columns, reordering columns, adding scalars, etc.).
  3. The current CBO code (we might just not have tested it thoroughly enough).
  4. If we see some promise in the current CBO code, but it is not great, we might want to look for bugs in how it is implemented.
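The "obviously bad" test in option 2 can be sketched as a check over the projection's output expressions. This is a toy model, the expression kinds are invented for illustration: if every output is a bare column reference, a rename, or a literal, the projection does no real computation and falling back is safe.

```python
# Hypothetical expression kinds; a real check would pattern-match
# Catalyst expression classes instead.
CHEAP_KINDS = {"column_ref", "literal", "alias"}

def project_is_trivial(output_exprs):
    """True when every output expression is a bare column reference,
    a rename, or a literal scalar, i.e. no computation at all."""
    return all(kind in CHEAP_KINDS for kind, _ in output_exprs)

# Dropping/reordering columns and attaching a scalar: trivial.
print(project_is_trivial([("column_ref", "b"), ("column_ref", "a"),
                          ("literal", 1)]))            # True
# An actual computation: worth weighing against the transitions.
print(project_is_trivial([("udf_call", "expensive")])) # False
```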

If CBO or a modified CBO does look like a good solution, we need to discuss whether it goes in as is, or whether we have a way to restrict it to just the situation this issue covers: a ProjectExec that could be on the GPU but is surrounded by CPU operations.

If we find a good solution for ProjectExec, then we should look at other operators too and expand the solution. FilterExec is a good example because it is very similar to ProjectExec. After that we can start to look at other Exec operators that can pay for themselves; without those, we would never do anything on the GPU.

Describe alternatives you've considered

Keep what we are doing today.

Additional context

revans2 commented 1 month ago

We might also want to look at limiting the number of transitions that we allow for any single exec, as they can get to be very expensive.
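Counting those transitions is cheap if we model plan placement in execution order. A small sketch (the list-of-booleans plan shape is hypothetical) that a rule could use to reject plans over some budget:

```python
def count_transitions(placements):
    """placements: booleans in execution order, True = GPU.
    Each adjacent CPU/GPU mismatch is one host<->device boundary."""
    return sum(1 for a, b in zip(placements, placements[1:]) if a != b)

# CPU -> GPU -> CPU -> GPU -> CPU: four expensive boundary crossings.
print(count_transitions([False, True, False, True, False]))  # 4
```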