Open revans2 opened 2 years ago
@GregoryKimball had another great idea for an optimization when the fallback is part of a filter. If the filter is of the form `A AND B AND C AND D`, we could process some of the expressions on the GPU before we fall back to the CPU. For example, if `C` contains an expression that only works on the CPU, we could first do a filter on `A AND B AND D` on the GPU, then do the special project to help produce the CPU parts of `C`, and finally do a filter for `C` at the end. We should add part of this to the issue we file to look at `FilterExec`.
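The filter-split idea above can be sketched roughly as follows. This is an illustrative sketch, not spark-rapids code: the `Leaf`/`And` classes and the `gpu_supported` flag are stand-ins for the real expression tree and the plugin's supportability checks.

```python
# Hypothetical sketch: split a conjunctive filter predicate into the
# conjuncts the GPU supports and those that must fall back to the CPU.
# Leaf/And and gpu_supported are illustrative stand-ins, not real
# spark-rapids APIs.
from dataclasses import dataclass

@dataclass
class Leaf:
    name: str
    gpu_supported: bool

@dataclass
class And:
    left: object
    right: object

def split_conjuncts(e):
    """Flatten nested ANDs into a list of top-level conjuncts."""
    if isinstance(e, And):
        return split_conjuncts(e.left) + split_conjuncts(e.right)
    return [e]

def plan_filter(pred):
    """Partition conjuncts: run the GPU-capable ones first, leaving the
    rest to a CPU fallback over the (hopefully far fewer) surviving rows."""
    conjuncts = split_conjuncts(pred)
    gpu = [c for c in conjuncts if c.gpu_supported]
    cpu = [c for c in conjuncts if not c.gpu_supported]
    return gpu, cpu

# A AND B AND C AND D, where only C is CPU-only.
pred = And(And(Leaf("A", True), Leaf("B", True)),
           And(Leaf("C", False), Leaf("D", True)))
gpu_part, cpu_part = plan_filter(pred)
# gpu_part holds A, B, D (filtered on the GPU first); cpu_part holds C
```

The payoff is that the CPU-only conjunct only sees rows that already passed the GPU filter, which shrinks the expensive CPU step.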
Update: `spark.rapids.sql.rowBasedUDF.enabled` is almost never worth enabling; see https://github.com/NVIDIA/spark-rapids/issues/7873#issuecomment-1675372638. I am going to try to understand the use cases where it can be a win, but I think we need to take a very different approach to how we might do a "dynamic fallback". I think the only way to make this fast is to release the semaphore while we process things on the CPU. The problem is that we want to keep the CPU portion as small as possible, but unless we can know whether an expression has side effects we might be stuck and have to process even more operators on the CPU in order not to break things when we try to isolate the expressions as much as possible.
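The side-effect concern can be shown with a small example, illustrative only (plain Python standing in for per-row expression evaluation, not spark-rapids code). Suppose the CPU-only expression is the guarded branch of `CASE WHEN d != 0 THEN x / d ELSE 0 END`: isolating it and evaluating it eagerly on every row changes observable behavior.

```python
# Illustrative example of why side effects make expression isolation
# hard. The guard protects the division; evaluating the "CPU-only"
# branch eagerly on all rows raises where the guard would have blocked.

rows = [(10, 2), (4, 0)]  # (x, d) pairs; the second row has d == 0

def safe_eval(x, d):
    # Faithful per-row evaluation: the guard protects the division.
    return x / d if d != 0 else 0

def naive_isolated_eval(rows):
    # A partial fallback might be tempted to evaluate the isolated
    # expression for every row first, then apply the guard afterwards.
    divided = [x / d for x, d in rows]  # raises ZeroDivisionError on row 2
    return [v if d != 0 else 0 for v, (_, d) in zip(divided, rows)]

assert [safe_eval(x, d) for x, d in rows] == [5.0, 0]
try:
    naive_isolated_eval(rows)
    raised = False
except ZeroDivisionError:
    raised = True
assert raised  # isolating the expression changed observable behavior
```

This is why, without side-effect information, we may have to pull the enclosing conditional (or even more of the plan) to the CPU rather than just the one unsupported expression.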
**Is your feature request related to a problem? Please describe.**
There are some expressions where a GPU implementation is going to be very expensive, or possibly impossible, to do in the general case. Our current solution is to fall back to the CPU at the operator level: we fall back for an entire `Project` if it contains even a single expression that we cannot do on the GPU. This produces the correct answer, but may not be the best performing solution. We should explore what a partial fallback solution might look like.
We already have a config, `spark.rapids.sql.rowBasedUDF.enabled`, that lets us put a UDF on the CPU, but it does so with the GPU semaphore held. We should look at what the performance impact is for something like this but for a more concrete operator, like `regexp_extract`. We know that there are a lot of regular expression cases that we cannot currently support, so regular expressions are a good starting point for this.

From there we could build a `ProjectExec` that can run CPU expressions intelligently. It would pull the input columns to the CPU, make the original input spillable, release the semaphore, run the CPU expressions and collect the results on the CPU, grab the semaphore, and copy the resulting columns back down to the GPU. We should do this for UDFs as well as `regexp_extract` as a part of the experiments.

The same approach could then extend to a `FilterExec`, a `HashAggregateExec`, or any of the join operations. These should probably be separate issues filed for each exec that we think could have these types of expressions in them.

The cost based optimizations I was thinking about would probably look mostly at the cost of data movement, but we need to do some experiments to see if there are other things that would be good to add in too, like a SWAG at the cost of doing an operation on the GPU vs the CPU.
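The hybrid-`ProjectExec` flow described above (pull columns to the CPU, make the input spillable, release the semaphore, run the CPU expressions, reacquire, copy back) can be sketched like this. Every name here (`GpuSemaphore`, `to_cpu`, `to_gpu`, `make_spillable`) is a hypothetical placeholder, not a spark-rapids API.

```python
# Minimal sketch of a hybrid ProjectExec step ordering. All helper
# names are hypothetical stand-ins for the real plugin machinery; the
# point is the sequencing around the GPU semaphore.
log = []  # records the order of operations for illustration

class GpuSemaphore:
    def acquire(self): log.append("acquire")
    def release(self): log.append("release")

def to_cpu(batch):
    log.append("to_cpu")
    return batch

def make_spillable(batch):
    log.append("spillable")

def to_gpu(cols):
    log.append("to_gpu")
    return cols

def hybrid_project(batch, cpu_exprs, sem):
    cpu_cols = to_cpu(batch)        # 1. pull needed input columns to host
    make_spillable(batch)           # 2. let the original GPU data spill
    sem.release()                   # 3. give up the GPU while on the CPU
    results = [e(cpu_cols) for e in cpu_exprs]  # 4. run CPU-only exprs
    sem.acquire()                   # 5. take the GPU back
    return to_gpu(results)          # 6. copy results down to the device

out = hybrid_project({"FOO": ["a1", "b2"]},
                     [lambda cols: [s[-1] for s in cols["FOO"]]],
                     GpuSemaphore())
```

The key design point is step 3: because the input was made spillable first, other tasks can use the GPU (and its memory) while this task is stuck doing row-based CPU work.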
For example, let's say we have an expression like

```
regexp_extract(FOO, 'SOMETHING THAT DOES NOT WORK ON GPU') = 'BAR'
```

We know that the `regexp_extract` cannot be done on the GPU, and copying its result back to the GPU afterwards would mean at least copying the offsets back (an int per row, but it could be a lot larger). We also know that copying a boolean back is going to be much cheaper than copying a string, so we should probably put the equality check on the CPU too, even though we could do it on the GPU.