NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0
792 stars 230 forks source link

[FEA] Support collect_limited_list in windowing #10930

Open kuhushukla opened 4 months ago

kuhushukla commented 4 months ago

Is your feature request related to a problem? Please describe. Windowing on GPU should support

(42) Window
Input [4]: [id#885, a#1188, b#1189, c#1190]
Arguments: [collect_limited_list(true, 0, 0, 2) windowspecdefinition(a#1188, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS abc#2918], [a#1188]

CC: @mythrocks Also as part of this work it would be important to qualtify any performance improvements needed.

kuhushukla commented 4 months ago

Not sure if this is a udf or not?

mythrocks commented 4 months ago

@kuhushukla brought this to my attention this morning. Sorry I didn't respond here sooner.

collect_limited_list(true, 0, 0, 2)...

Not sure if this is a udf or not?

This isn't a standard SQL window aggregation, as far as I know. My guess is that this is a UDAF from Apache Datafu, per https://github.com/apache/datafu/pull/24.

From poking around, it appears that collect_limited_list(col, N) functions like collect_list(col), but limits the output list-row to a maximum of N elements.

On the face of it, I don't foresee this being difficult to implement. (It won't be deterministic.) I would, however, need to peruse the detailed definition of the function's behaviour.

kuhushukla commented 4 months ago

Thank you Mithun!