NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Support short-circuit evaluation for expensive expression like rlike #10613

Open winningsix opened 7 months ago

winningsix commented 7 months ago

Is your feature request related to a problem? Please describe.
Rlike is a very expensive cuDF operation compared to other operators such as compare. One option, mentioned in https://github.com/NVIDIA/spark-rapids/issues/10600, is to replace some cases with a cheaper expression. Another option is to introduce short-circuit evaluation that skips the regexp for some rows by evaluating the cheaper predicates first.

Describe the solution you'd like

record_name RLIKE '(.*[0-9]{2})' AND date BETWEEN '20201111' AND '20201112' AND length(store_names) > 15

For the condition above, we could optimize it by:

  1. Evaluating the other, cheaper conditions before the regexp
  2. Supporting a null-masking filter on the first two conditions and evaluating only the null-masked data with the regexp (see the sketch after this list)
  3. (nice to have) Reordering the date and length conditions on the fly based on their selectivity
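
To make the null-masking idea concrete, here is a minimal CPU-side sketch in Scala. It only illustrates the evaluation order, not the plugin's actual GPU implementation; the `Row` fields, the `ShortCircuitSketch` object, and the use of `Option` to model nullable strings are hypothetical stand-ins for the columns in the example query.

```scala
import scala.util.matching.Regex

// Hypothetical row type standing in for the columns in the example query.
case class Row(recordName: String, date: String, storeNames: String)

object ShortCircuitSketch {
  // Pattern from the example predicate; RLIKE matches anywhere in the string.
  val pattern: Regex = "(.*[0-9]{2})".r

  def filter(rows: Seq[Row]): Seq[Row] = {
    // 1. Evaluate the cheap predicates first (BETWEEN and length).
    val cheapMask: Seq[Boolean] = rows.map { r =>
      r.date >= "20201111" && r.date <= "20201112" && r.storeNames.length > 15
    }

    // 2. Null-mask the string column so the expensive regex only sees rows
    //    that survived the cheap predicates. On the GPU this would be a
    //    copy-like operation that replaces filtered-out strings with nulls.
    val masked: Seq[Option[String]] = rows.zip(cheapMask).map {
      case (r, true)  => Some(r.recordName)
      case (_, false) => None
    }

    // 3. Run the regex only on the non-null entries; nulls evaluate to false.
    val finalMask: Seq[Boolean] = masked.map {
      case Some(s) => pattern.pattern.matcher(s).find()
      case None    => false
    }

    rows.zip(finalMask).collect { case (r, true) => r }
  }
}
```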
revans2 commented 7 months ago

On the GPU the problems typically show up around thread divergence and non-coalesced memory access patterns.

I am not 100% sure about this, so we should run some experiments and see, but I don't think it is a clear win every time. If the strings are long, then the memory access pattern will always be bad, and the time the kernel takes to run is likely dominated by the longest string processed by a warp (e.g., if one string in a warp is much longer than the other 31, the whole warp waits on it). If the strings are short, then the memory access pattern is likely good, and even though we might still have thread divergence it is decently fast.

Replacing string values with nulls requires a memory copy, and copy_if_else for strings is not known to be that great. So we might need some kind of heuristic to decide when it matters. I think in really bad cases today this could be a big win, especially if we can free up some warps early to process more data and there are enough input strings to make that a win.
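
As a rough way to picture such a heuristic: compare the cost of copying the string column (to null-mask it) plus running the regex on the surviving rows against the cost of running the regex on every row. The sketch below is purely illustrative; the constants, the cost model, and the `MaskHeuristic` object are made up, and real numbers would have to come from the experiments mentioned above.

```scala
// Hypothetical heuristic: is null-masking before the regex worth the copy?
object MaskHeuristic {
  // Relative per-character costs, invented for illustration only.
  val regexCostPerChar: Double = 8.0 // regex evaluation is expensive
  val copyCostPerChar: Double  = 1.0 // copy_if_else for strings is not free

  /** avgStrLen: average string length; cheapSelectivity: fraction of rows
   *  that pass the cheap predicates and still need the regex. */
  def shouldMaskFirst(avgStrLen: Double, cheapSelectivity: Double): Boolean = {
    val regexEverything = regexCostPerChar * avgStrLen
    val maskThenRegex =
      copyCostPerChar * avgStrLen +
        regexCostPerChar * avgStrLen * cheapSelectivity
    maskThenRegex < regexEverything
  }
}
```

With these invented numbers, masking wins whenever fewer than about 7 in 8 rows survive the cheap predicates; a real heuristic would also have to account for string length distribution and how early warps can be freed up.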