Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0
2.09k stars 141 forks source link

Speed up large scale anti-joins #2536

Closed jaychia closed 1 day ago

jaychia commented 2 months ago

Is your feature request related to a problem? Please describe.

For large-scale anti-joins, we can speed it up by not performing an expensive repartition on both sides. Since the LHS is very heavy (keys and values), but the RHS is lightweight (keys only), it can be much cheaper to not perform any repartitioning and just do an all-to-all join by performing a local anti-join for each LHS partition, for each RHS partition.

kevinzwang commented 1 week ago

I believe this is resolved by https://github.com/Eventual-Inc/Daft/pull/2621