apache / datafusion-ballista

Apache DataFusion Ballista Distributed Query Engine
https://datafusion.apache.org/ballista
Apache License 2.0

Left/full outer join incorrect for CollectLeft / broadcast #1055

Open Dandandan opened 2 months ago

Dandandan commented 2 months ago

Describe the bug

See the discussion here: https://github.com/apache/datafusion/issues/12454

The "broadcast join" (CollectLeft) produces incorrect results for join types that emit unmatched left rows, such as left and full outer joins.

To Reproduce

Run a left or full outer join in broadcast (CollectLeft) mode on more than one node.
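To illustrate the failure mode, here is a toy Python model, not Ballista or DataFusion code. It assumes each node receives a full copy of the left side and joins it against its own right partition without knowing what the other nodes matched. The function names and sample rows are hypothetical.

```python
# Toy model of a broadcast (CollectLeft) left outer join.
# Assumption: every node holds a full copy of the left side and sees
# only its own partition of the right side.

def left_join_partition(left_rows, right_partition):
    """Left outer join of the whole left side against one right partition."""
    out = []
    for lk, lv in left_rows:
        matches = [(lk, lv, rv) for rk, rv in right_partition if rk == lk]
        # A left row with no match in THIS partition is padded with None,
        # even if another partition holds a match for it.
        out.extend(matches if matches else [(lk, lv, None)])
    return out

def broadcast_left_join(left_rows, right_partitions):
    """Concatenate the per-partition results, as independent nodes would."""
    out = []
    for partition in right_partitions:
        out.extend(left_join_partition(left_rows, partition))
    return out

left = [(1, "a"), (2, "b"), (3, "c")]
right_partitions = [[(1, "x")], [(2, "y")]]  # two nodes, one row each

naive = broadcast_left_join(left, right_partitions)
for row in naive:
    print(row)
```

In this model the left row with key 1 appears once matched and once padded with `None`, and the left row with key 3, which matches nothing, appears twice. A correct left outer join over the same data would return three rows.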

Expected behavior

Additional context

Dandandan commented 1 week ago

There is a proposal in DataFusion to add a hook that supports sharing the join state: https://github.com/apache/datafusion/pull/12523

We tested this at Coralogix, and it works very well for us.
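For readers unfamiliar with the idea, here is a minimal Python sketch of what sharing the join state buys. It is a toy model under stated assumptions and does not reflect the API proposed in the DataFusion PR: a single `matched` set stands in for state shared across partitions, and all names are hypothetical.

```python
# Toy model: partitions record matches in shared state, and unmatched
# left rows are emitted once, after every partition has finished.

def join_partition(left_rows, right_partition, matched):
    """Emit matched rows for one partition and record matched left indices."""
    out = []
    for i, (lk, lv) in enumerate(left_rows):
        for rk, rv in right_partition:
            if rk == lk:
                matched.add(i)  # visible to all partitions
                out.append((lk, lv, rv))
    return out

def left_join_with_shared_state(left_rows, right_partitions):
    matched = set()  # hypothetical stand-in for the shared join state
    out = []
    for partition in right_partitions:
        out.extend(join_partition(left_rows, partition, matched))
    # Only now is it known which left rows matched nowhere.
    out.extend(
        (lk, lv, None)
        for i, (lk, lv) in enumerate(left_rows)
        if i not in matched
    )
    return out

left = [(1, "a"), (2, "b"), (3, "c")]
right_partitions = [[(1, "x")], [(2, "y")]]

result = left_join_with_shared_state(left, right_partitions)
print(result)
```

The key design point in this sketch is the ordering: padded rows are produced only after all partitions have reported their matches, which in a distributed setting requires some form of coordination between nodes.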

Dandandan commented 1 week ago

CollectLeft could also be disabled for these join types, although that would likely hurt performance quite a bit.
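A rough sketch of what such a guard might look like, in Python for brevity. The enum and function names are hypothetical and do not correspond to Ballista's planner. The set of unsafe join types lists only the two named in the issue title; other join types may be affected as well.

```python
from enum import Enum

class JoinType(Enum):
    INNER = "inner"
    LEFT = "left"
    RIGHT = "right"
    FULL = "full"

class PartitionMode(Enum):
    COLLECT_LEFT = "collect_left"
    PARTITIONED = "partitioned"

# Assumption: only the join types from the issue title are listed here.
UNSAFE_FOR_COLLECT_LEFT = {JoinType.LEFT, JoinType.FULL}

def choose_partition_mode(join_type, preferred):
    """Fall back to a partitioned join when broadcasting would be incorrect."""
    if preferred is PartitionMode.COLLECT_LEFT and join_type in UNSAFE_FOR_COLLECT_LEFT:
        return PartitionMode.PARTITIONED
    return preferred

print(choose_partition_mode(JoinType.LEFT, PartitionMode.COLLECT_LEFT))
print(choose_partition_mode(JoinType.INNER, PartitionMode.COLLECT_LEFT))
```

The fallback trades a broadcast of the small side for a repartition of both inputs, which is where the performance cost would come from.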

milenkovicm commented 1 week ago

Should we take this once it gets merged in DataFusion?