NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] Implement lore framework to support all operators. #10987

Closed liurenjie1024 closed 1 day ago

liurenjie1024 commented 3 weeks ago

Is your feature request related to a problem? Please describe.
We want to implement a lore framework to support all operators.

Describe the solution you'd like
We need a way for the user to tell us the operator id at runtime; we call it lore_id. The lore_id should be deterministic given the same Spark configuration, Spark SQL query, and input data. Then, in the second run, we dump the operator's input data and metadata (e.g. plan information) so that we can replay it locally. Ideally, we will also dump Nsight traces, utilizing the work here: https://github.com/NVIDIA/spark-rapids/pull/10870

Describe alternatives you've considered
No.

Additional context
No.
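One way to get an id that is stable across runs is to number the operators by a fixed traversal of the physical plan. The sketch below is only an illustration of that idea, not the plugin's actual scheme: `PlanNode`, `assign_lore_ids`, and the operator names in the sample plan are hypothetical, and it assumes the plan tree itself is identical between runs (same configuration, same query, same input data).

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PlanNode:
    """Hypothetical stand-in for a physical plan operator (not a spark-rapids class)."""
    name: str
    children: List["PlanNode"] = field(default_factory=list)


def assign_lore_ids(root: PlanNode) -> Dict[int, str]:
    """Number operators in pre-order, starting at 1.

    The numbering depends only on the shape of the tree, so the same
    plan always yields the same ids.
    """
    ids: Dict[int, str] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        ids[len(ids) + 1] = node.name
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(node.children))
    return ids


# Illustrative plan: an aggregate over a filter over a scan.
plan = PlanNode("GpuAggregateExec", [
    PlanNode("GpuFilterExec", [PlanNode("GpuScanExec")]),
])
lore_ids = assign_lore_ids(plan)
```

In the first run the user would look up the id of the operator of interest (here the aggregate gets id 1), and in the second run pass that id back so only that operator's input is dumped.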

binmahone commented 3 weeks ago

With https://github.com/NVIDIA/spark-rapids/pull/10999, we can start to use LORE at customer sites for simple cases like GpuAggregateExec. I can think of these remaining issues to address:

  1. The target Exec must be a GpuExec, and it must have a child that is also a GpuExec.
  2. Only UnaryLike GpuExec is supported now (Join is not supported yet).
  3. The target Exec's RDD partitions must map 1:1 to the child Exec's RDD partitions, e.g. GpuCoalesceExec (CoalescedRDD) is not supported now.
  4. It cannot deal with cases where the input has no columns (e.g. select count(*)).
  5. It can only dump to the executor's local disk (/tmp/lore/).
  6. Test cases are required.

The list might be incomplete.
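Items 1 to 4 above can be read as preconditions on the target operator. The following is a minimal sketch of such a check over a toy operator description; `Op`, its fields, and `lore_blockers` are hypothetical names made up for illustration and do not correspond to spark-rapids classes or to how the plugin actually validates a target.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Op:
    """Hypothetical operator description (not a spark-rapids class)."""
    name: str
    is_gpu: bool
    output_columns: int = 1
    # True when each partition of this operator reads exactly one
    # partition of its child (false for a coalesce-style operator).
    one_to_one_partitions: bool = True
    children: List["Op"] = field(default_factory=list)


def lore_blockers(target: Op) -> List[str]:
    """Return the reasons the target cannot be dumped; empty if it can."""
    blockers: List[str] = []
    if not target.is_gpu:
        blockers.append("target is not a GPU operator")
    if len(target.children) != 1:
        # Covers both leaf operators and joins.
        blockers.append("target is not unary")
        return blockers
    child = target.children[0]
    if not child.is_gpu:
        blockers.append("child is not a GPU operator")
    if not target.one_to_one_partitions:
        blockers.append("partitions are not 1:1 with the child")
    if child.output_columns == 0:
        blockers.append("input has no columns")
    return blockers


scan = Op("scan", is_gpu=True, output_columns=3)
agg = Op("aggregate", is_gpu=True, children=[scan])
join = Op("join", is_gpu=True, children=[scan, scan])
coalesce = Op("coalesce", is_gpu=True, one_to_one_partitions=False, children=[scan])
```

With these sample operators, `lore_blockers(agg)` is empty, while the join and the coalesce each report one blocker.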

liurenjie1024 commented 3 weeks ago

> target Exec must be GpuExec, target Exec must have a child and it must be GpuExec

I think we only need to care about GpuExec?