NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0
803 stars 234 forks source link

[FEA] Support full outer and build-side-matches-join-side conditional nested loop joins #3269

Open jlowe opened 3 years ago

jlowe commented 3 years ago

3111 implemented support for some conditional nested loop joins but not all of them. The ones remaining are:

Supporting these types of joins requires tracking the rows in the build table that match for each streaming batch and then performing a post-processing step based on either the rows that matched or the complement (i.e.: unmatched rows). For example, FullOuter requires emitting all unmatched rows with null rows for the stream batch, and LeftSemi requires emitting only the rows that matched after all stream batches have been seen.

Salonijain27 commented 3 years ago

@jlowe will be adding feature requests in cuDF for 21.12

jlowe commented 3 years ago

See #3300 for a general discussion of the batched full join algorithm which can be applied with slight modifications to joins where the build side matches the join side. For left outer with build left, we track the gather maps on the left side, emitting output batches as normal, but after the right stream side finishes, we track which rows were never generated in the left gather maps and emit null-join entries for those. For left semi with build left, we track the gather maps for the left side, emitting nothing as we stream the right side, and only at the end we concatenate all gather maps, remove duplicates, and emit those rows. For left anti with build left, we do the same as left semi with build left but instead emit the complement of rows.