NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] provide typical aggregation patterns for different spark version/flavor #3437

Open sperlingxx opened 3 years ago

sperlingxx commented 3 years ago

Is your feature request related to a problem? Please describe. We refactored hashAggReplaceMode in #3368, extending its ability to express complicated aggregation patterns. However, the change made these patterns harder to understand. As @abellina suggested, it would be nice if we could list all typical aggregation patterns for the different Spark versions/flavors, along with descriptions and illustrations.

sameerz commented 2 years ago

This should be resolved when https://github.com/NVIDIA/spark-rapids/issues/3194 is resolved. If #3194 does not get resolved in a timely fashion, we should come back to this and address it.

abellina commented 2 years ago

> This should be resolved when #3194 is resolved. If #3194 does not get resolved in a timely fashion, we should come back to this and address it.

This issue is orthogonal to #3194. The patterns that @sperlingxx is talking about here only denote what an aggregate exec will look like in the tests. For example, Databricks may use the Complete mode in some cases where Apache Spark plans the aggregate differently. If we want to test that the GPU aggregate can consume and produce output compatible with the CPU, we need to be able to address each flavor of the hash aggregate plans. For example (stolen from one of the tests @sperlingxx had):

_replace_modes_single_distinct = [
    # Spark: CPU -> CPU -> GPU(PartialMerge) -> GPU(Partial)
    # Databricks runtime: CPU(Final and Complete) -> GPU(PartialMerge)
    'partial|partialMerge',
    # Spark: GPU(Final) -> GPU(PartialMerge&Partial) -> CPU(PartialMerge) -> CPU(Partial)
    # Databricks runtime: GPU(Final&Complete) -> CPU(PartialMerge)
    'final|partialMerge&partial|final&complete',
]

So in this case, we want to keep on the GPU:

And the rest of the aggregate executes on the CPU. That is useful because the tests can then show that the results stay compatible when part of the plan has to execute on the CPU due to some operation we don't support yet.
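The GPU/CPU placement written in the comments above can be reproduced with a small sketch. It assumes that `|` separates alternatives, that `&` joins modes appearing together in one aggregate exec, and that an exec goes to the GPU only when its set of modes equals one alternative exactly. These semantics are inferred from the comments, not taken from the plugin's implementation, and the helper names and the Spark plan shape for the first pattern are assumptions.

```python
def parse_replace_mode(pattern):
    """Turn 'a|b&c' into [{'a'}, {'b', 'c'}] (assumed semantics, not the plugin's code)."""
    return [frozenset(mode.strip().lower() for mode in alt.split("&"))
            for alt in pattern.split("|")]


def placement(plan, pattern):
    """Label each aggregate stage 'GPU' or 'CPU'.

    plan is a list of mode sets, in the same order as the comments
    (the last stage of the aggregate comes first).
    """
    alternatives = parse_replace_mode(pattern)
    labels = []
    for stage in plan:
        modes = frozenset(m.lower() for m in stage)
        # Assumption: a stage is replaced only on an exact match of its mode set.
        labels.append("GPU" if modes in alternatives else "CPU")
    return labels


# Plan shapes read off the comments in _replace_modes_single_distinct.
SPARK = [{"final"}, {"partialMerge", "partial"}, {"partialMerge"}, {"partial"}]
DATABRICKS = [{"final", "complete"}, {"partialMerge"}]

for pattern in ("partial|partialMerge",
                "final|partialMerge&partial|final&complete"):
    print(pattern)
    print("  Spark:     ", placement(SPARK, pattern))
    print("  Databricks:", placement(DATABRICKS, pattern))
```

Under these assumptions the first pattern keeps only the Partial and PartialMerge stages on the GPU, and the second keeps the Final, PartialMerge&Partial, and Final&Complete stages there, which agrees with the comments.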

The patterns are a bit convoluted here, and you have to go through the comments to understand what's going on. The proposal is to at least associate each pattern with a flavor of Spark. Ideally, we can find some common patterns that can be prebaked and documented, so we don't have to read a bunch of comments each time.
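One possible shape for such prebaked patterns is a small registry that stores each pattern next to its per-flavor plan description, so the documentation lives in data and not in comments. This is only a sketch: `AggPattern`, the pattern names, and the helper functions are hypothetical, and only the pattern strings and plan descriptions are taken from the test snippet quoted above.

```python
from collections import namedtuple

# Hypothetical container: a descriptive name, the replace-mode string,
# and the expected plan for each Spark flavor.
AggPattern = namedtuple("AggPattern", ["name", "replace_mode", "plans"])

SINGLE_DISTINCT_PATTERNS = [
    AggPattern(
        name="gpu_early_stages",  # hypothetical name
        replace_mode="partial|partialMerge",
        plans={
            "Spark": "CPU -> CPU -> GPU(PartialMerge) -> GPU(Partial)",
            "Databricks runtime": "CPU(Final and Complete) -> GPU(PartialMerge)",
        },
    ),
    AggPattern(
        name="gpu_late_stages",  # hypothetical name
        replace_mode="final|partialMerge&partial|final&complete",
        plans={
            "Spark": "GPU(Final) -> GPU(PartialMerge&Partial) -> "
                     "CPU(PartialMerge) -> CPU(Partial)",
            "Databricks runtime": "GPU(Final&Complete) -> CPU(PartialMerge)",
        },
    ),
]


def replace_modes(patterns):
    """Return the plain pattern strings, e.g. to parametrize a test."""
    return [p.replace_mode for p in patterns]


def describe(pattern):
    """Render one pattern with its expected plan per flavor."""
    lines = ["{}: {}".format(pattern.name, pattern.replace_mode)]
    for flavor, plan in pattern.plans.items():
        lines.append("  {}: {}".format(flavor, plan))
    return "\n".join(lines)


for p in SINGLE_DISTINCT_PATTERNS:
    print(describe(p))
```

A test could then take its parameters from `replace_modes(SINGLE_DISTINCT_PATTERNS)` and use `name` as a readable test id, while `describe` generates the documentation.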