Open revans2 opened 2 years ago
@GregoryKimball had another great idea for an optimization when the fallback is part of a filter. If the filter is of the form `A AND B AND C AND D`, we could process some of the expressions on the GPU before we fall back to the CPU. For example, if `C` contains an expression that only works on the CPU, we could first do a filter on `A AND B AND D` on the GPU, then do the special project to help produce the CPU parts of `C`, and finally do a filter for `C` at the end. We should add part of this to the issue we file to look at `FilterExec`.
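The filter-split idea above can be sketched roughly as follows. This is an illustrative sketch, not spark-rapids code: the `Leaf`/`And` classes and the `gpu_supported` flag are stand-ins for the real expression tree and the plugin's supportability checks.

```python
# Hypothetical sketch: split a conjunctive filter predicate into the
# conjuncts the GPU supports and those that must fall back to the CPU.
# Leaf/And and gpu_supported are illustrative stand-ins, not real
# spark-rapids APIs.
from dataclasses import dataclass

@dataclass
class Leaf:
    name: str
    gpu_supported: bool

@dataclass
class And:
    left: object
    right: object

def split_conjuncts(e):
    """Flatten nested ANDs into a list of top-level conjuncts."""
    if isinstance(e, And):
        return split_conjuncts(e.left) + split_conjuncts(e.right)
    return [e]

def plan_filter(pred):
    """Partition conjuncts: run the GPU-capable ones first, leaving the
    rest to a CPU fallback over the (hopefully far fewer) surviving rows."""
    conjuncts = split_conjuncts(pred)
    gpu = [c for c in conjuncts if c.gpu_supported]
    cpu = [c for c in conjuncts if not c.gpu_supported]
    return gpu, cpu

# A AND B AND C AND D, where only C is CPU-only.
pred = And(And(Leaf("A", True), Leaf("B", True)),
           And(Leaf("C", False), Leaf("D", True)))
gpu_part, cpu_part = plan_filter(pred)
# gpu_part holds A, B, D (filtered on the GPU first); cpu_part holds C
```

The payoff is that the CPU-only conjunct only sees rows that already passed the GPU filter, which shrinks the expensive CPU step.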
Update: `spark.rapids.sql.rowBasedUDF.enabled` is almost never worth enabling; see https://github.com/NVIDIA/spark-rapids/issues/7873#issuecomment-1675372638. I am going to try to understand the use cases where it can be a win, but I think we need to take a very different approach to how we might do a "dynamic fallback". I think the only way to make this fast is to release the semaphore while we process things on the CPU. The problem is that we want to keep the CPU portion as small as possible, but unless we can know whether an expression has side effects we might be stuck and have to process even more operators on the CPU in order not to break things when we try to isolate the expressions as much as possible.
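The side-effect concern can be shown with a small example, illustrative only (plain Python standing in for per-row expression evaluation, not spark-rapids code). Suppose the CPU-only expression is the guarded branch of `CASE WHEN d != 0 THEN x / d ELSE 0 END`: isolating it and evaluating it eagerly on every row changes observable behavior.

```python
# Illustrative example of why side effects make expression isolation
# hard. The guard protects the division; evaluating the "CPU-only"
# branch eagerly on all rows raises where the guard would have blocked.

rows = [(10, 2), (4, 0)]  # (x, d) pairs; the second row has d == 0

def safe_eval(x, d):
    # Faithful per-row evaluation: the guard protects the division.
    return x / d if d != 0 else 0

def naive_isolated_eval(rows):
    # A partial fallback might be tempted to evaluate the isolated
    # expression for every row first, then apply the guard afterwards.
    divided = [x / d for x, d in rows]  # raises ZeroDivisionError on row 2
    return [v if d != 0 else 0 for v, (_, d) in zip(divided, rows)]

assert [safe_eval(x, d) for x, d in rows] == [5.0, 0]
try:
    naive_isolated_eval(rows)
    raised = False
except ZeroDivisionError:
    raised = True
assert raised  # isolating the expression changed observable behavior
```

This is why, without side-effect information, we may have to pull the enclosing conditional (or even more of the plan) to the CPU rather than just the one unsupported expression.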
**Is your feature request related to a problem? Please describe.**
There are some expressions where a GPU implementation is going to be very expensive, or possibly impossible, to do in the general case. Our current solution is to fall back to the CPU at the operator level: we fall back for an entire `Project` if it contains even a single expression that we cannot do on the GPU. This produces the correct answer, but may not be the best performing solution. We should explore what a partial fallback solution might look like.
We already have a config, `spark.rapids.sql.rowBasedUDF.enabled`, that lets us put a UDF on the CPU, but it does so with the GPU semaphore held. We should look at what the performance impact is for something like this but for a more concrete operator, like `regexp_extract`. We know that there are a lot of regular expression cases that we cannot currently support, so regular expressions are a good starting point for this.

From there we could build a `ProjectExec` that can run CPU expressions intelligently. It would pull the input columns to the CPU, make the original input spillable, release the semaphore, run the CPU expressions and collect the results on the CPU, grab the semaphore, and copy the resulting columns back down to the GPU. We should do this for UDFs as well as `regexp_extract` as a part of the experiments.

The same approach could then extend to a `FilterExec`, a `HashAggregateExec`, or any of the join operations. These should probably be separate issues filed for each exec that we think could have these types of expressions in them.

The cost based optimizations I was thinking about would probably look mostly at the cost of data movement, but we need to do some experiments to see if there are other things that would be good to add in too, like a SWAG at the cost of doing an operation on the GPU vs the CPU.
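The hybrid-`ProjectExec` flow described above (pull columns to the CPU, make the input spillable, release the semaphore, run the CPU expressions, reacquire, copy back) can be sketched like this. Every name here (`GpuSemaphore`, `to_cpu`, `to_gpu`, `make_spillable`) is a hypothetical placeholder, not a spark-rapids API.

```python
# Minimal sketch of a hybrid ProjectExec step ordering. All helper
# names are hypothetical stand-ins for the real plugin machinery; the
# point is the sequencing around the GPU semaphore.
log = []  # records the order of operations for illustration

class GpuSemaphore:
    def acquire(self): log.append("acquire")
    def release(self): log.append("release")

def to_cpu(batch):
    log.append("to_cpu")
    return batch

def make_spillable(batch):
    log.append("spillable")

def to_gpu(cols):
    log.append("to_gpu")
    return cols

def hybrid_project(batch, cpu_exprs, sem):
    cpu_cols = to_cpu(batch)        # 1. pull needed input columns to host
    make_spillable(batch)           # 2. let the original GPU data spill
    sem.release()                   # 3. give up the GPU while on the CPU
    results = [e(cpu_cols) for e in cpu_exprs]  # 4. run CPU-only exprs
    sem.acquire()                   # 5. take the GPU back
    return to_gpu(results)          # 6. copy results down to the device

out = hybrid_project({"FOO": ["a1", "b2"]},
                     [lambda cols: [s[-1] for s in cols["FOO"]]],
                     GpuSemaphore())
```

The key design point is step 3: because the input was made spillable first, other tasks can use the GPU (and its memory) while this task is stuck doing row-based CPU work.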
For example, let's say we have an expression like

```
regexp_extract(FOO, 'SOMETHING THAT DOES NOT WORK ON GPU') = 'BAR'
```

We know that the `regexp_extract` cannot be done on the GPU, and copying its result back to the GPU afterwards would mean at least copying the offsets back (an int per row, but it could be a lot larger). We also know that copying a boolean back is going to be much cheaper than copying a string, so we should probably put the equality check on the CPU too, even though we could do it on the GPU.