NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Explore CPU execution of specific expressions #6955

Open revans2 opened 2 years ago

revans2 commented 2 years ago

Is your feature request related to a problem? Please describe.
There are some expressions where a GPU implementation is going to be very expensive to build, or possibly impossible in the general case. Our current solution is to fall back to the CPU at the operator level: we fall back for an entire Project if there is a single expression in that Project we cannot do on the GPU. This produces the correct answer, but it may not be the best-performing solution. We should explore what a partial fallback, at the expression level, might look like.
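To make the idea concrete, here is a minimal sketch of what splitting a single projected expression could look like. It assumes a hypothetical canRunOnGpu check and a hypothetical CPU "pre-project" that evaluates the unsupported subtrees; this is not the plugin's planning code, just an illustration of the shape of the problem.

```scala
import org.apache.spark.sql.catalyst.expressions.{Alias, Expression, NamedExpression}

object ProjectSplitSketch {
  // Hypothetical support check; the real plugin consults per-expression metadata.
  def canRunOnGpu(e: Expression): Boolean = ???

  // Rewrite one projected expression so every subtree the GPU cannot handle is
  // replaced by a reference to a column that a CPU pre-project would evaluate.
  // Returns the rewritten (GPU-side) expression plus the aliases the CPU side
  // has to compute first.
  def splitForGpu(e: Expression): (Expression, Seq[NamedExpression]) = {
    if (!canRunOnGpu(e)) {
      // The whole subtree falls back here. A cost model might choose to cut
      // the tree higher or lower than this simple rule does.
      val cpuAlias = Alias(e, s"_cpu_fallback_${e.semanticHash()}")()
      (cpuAlias.toAttribute, Seq(cpuAlias))
    } else if (e.children.isEmpty) {
      (e, Nil)
    } else {
      val (newChildren, cpuAliases) = e.children.map(splitForGpu).unzip
      (e.withNewChildren(newChildren), cpuAliases.flatten)
    }
  }
}
```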

The cost-based optimizations I was thinking about would probably look mostly at the cost of data movement, but we need to do some experiments to see if there are other things that would be good to add in too, like a rough estimate (SWAG) of the cost of doing an operation on the GPU vs. the CPU.

For example, let's say we have an expression like regexp_extract(FOO, 'SOMETHING THAT DOES NOT WORK ON GPU') = 'BAR'. We know that we cannot do the regexp_extract on the GPU, and that copying its string result back to the GPU afterwards would mean copying at least the offsets (an int per row, but potentially a lot more). We also know that copying a boolean back is going to be much cheaper than copying a string, so we should probably put the equality check on the CPU too, even though we could do it on the GPU.
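Some back-of-the-envelope arithmetic for that example, with an assumed row count, an assumed average match length, and a one-byte boolean representation (none of these numbers are measurements):

```scala
object TransferCostSketch {
  // Assumed figures for the regexp_extract example above.
  val rows = 1000000L            // 1 million rows in a batch
  val avgExtractedChars = 16L    // assumed average length of the extracted string

  // Copying the string result back to the GPU means at least a 4-byte offset
  // per row plus the character data; the boolean result of the equality check
  // is assumed to be one byte per row.
  val stringColumnBytes  = rows * 4 + rows * avgExtractedChars
  val booleanColumnBytes = rows * 1

  def main(args: Array[String]): Unit = {
    println(f"string result: ${stringColumnBytes / 1e6}%.1f MB, " +
            f"boolean result: ${booleanColumnBytes / 1e6}%.1f MB")
    // With these assumptions the boolean is ~20x less data to move, which is
    // why pushing the equality check to the CPU as well can pay off.
  }
}
```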


revans2 commented 2 years ago

@GregoryKimball had another great idea for an optimization when the fallback is part of a filter. If the filter is in the form A AND B AND C AND D, and C contains an expression that only works on the CPU, then we could process some of the predicates before we fall back: first do a filter on A, B, and D on the GPU, then do the special project to help produce the CPU parts of C, and finally do a filter for C at the end. We should add part of this into the issue we file to look at FilterExec.
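A rough sketch of that split, again assuming a hypothetical canRunOnGpu check and leaving out the "special project" step; it only shows how the conjuncts might be partitioned:

```scala
import org.apache.spark.sql.catalyst.expressions.{And, Expression}

object FilterSplitSketch {
  // Hypothetical support check, as above.
  def canRunOnGpu(e: Expression): Boolean = ???

  // Flatten a condition of the form A AND B AND C AND D into its predicates.
  def splitConjuncts(cond: Expression): Seq[Expression] = cond match {
    case And(left, right) => splitConjuncts(left) ++ splitConjuncts(right)
    case other            => Seq(other)
  }

  // Partition the predicates so the GPU filters on everything it supports
  // first, shrinking the data before the CPU-only predicates run.
  // Returns (gpuCondition, cpuCondition); either side may be empty.
  def splitFilter(cond: Expression): (Option[Expression], Option[Expression]) = {
    val (gpuPreds, cpuPreds) = splitConjuncts(cond).partition(canRunOnGpu)
    (gpuPreds.reduceOption((l, r) => And(l, r)),
     cpuPreds.reduceOption((l, r) => And(l, r)))
  }
}
```

With a split like this, the plan described above would become a GPU filter on A AND B AND D, then the project that produces the CPU parts of C, then a final filter on C.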

revans2 commented 2 months ago

Update: spark.rapids.sql.rowBasedUDF.enabled is almost never worth enabling (see https://github.com/NVIDIA/spark-rapids/issues/7873#issuecomment-1675372638). I am going to try to better understand the use cases where it can be a win, but I think we need to take a very different approach to how we might do a "dynamic fallback". The only way to make this fast is if we release the semaphore while we process things on the CPU. The problem is that we want to keep the CPU portion as small as possible, but unless we can know whether an expression has side effects, we might be stuck processing even more operators on the CPU in order to not break things when we try to isolate the expressions as much as possible.
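To illustrate the "release the semaphore while we process things on the CPU" idea, here is a sketch using a plain java.util.concurrent.Semaphore as a stand-in; the plugin's actual GpuSemaphore API is different, and the side-effect and spill concerns mentioned above are not modeled.

```scala
import java.util.concurrent.Semaphore

object CpuFallbackSketch {
  // Stand-in for the GPU semaphore that limits how many tasks hold GPU
  // resources at once; the plugin's real GpuSemaphore is tied to the task.
  val gpuSemaphore = new Semaphore(4)

  // Run the CPU-only part of a fallback without holding the GPU semaphore, so
  // other tasks can use the GPU while this task is busy on the CPU. Any
  // GPU-resident data needed later would have to be spillable or copied to
  // host memory first, which this sketch does not show.
  def runCpuFallback[T](cpuWork: => T): T = {
    gpuSemaphore.release()
    try {
      cpuWork
    } finally {
      // Re-acquire before touching GPU memory again.
      gpuSemaphore.acquire()
    }
  }
}
```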