Eventual-Inc / Daft

Distributed DataFrame for Python designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

Optimizer: eliminate nested union/concat #2425

Open universalmind303 opened 1 week ago

universalmind303 commented 1 week ago

Is your feature request related to a problem? Please describe.

If a plan contains nested unions/concats, we can flatten them into a single operation.

Example:

df.concat(df.concat(df.concat(df))).explain(True)

which ends up looking like this.

flowchart TD
    A[concat] --> B[3]
    A --> C[concat]
    C --> D[2]
    C --> E[1]

But a more efficient representation would be:

flowchart TD
    A[concat] --> B[3]
    A --> D[2]
    A --> E[1]

Describe the solution you'd like

Inefficient queries such as the one above should be optimized automatically by flattening the nested concats as described.
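To make the proposed rewrite concrete, here is a minimal sketch of the flattening rule on a toy plan tree. `Source`, `Concat`, and `flatten_concat` are hypothetical names made up for illustration; they are not Daft's logical plan types or optimizer API, and the sketch assumes a concat node that can already hold any number of inputs.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Source:
    """Leaf of the toy plan (hypothetical): stands in for one input."""
    name: str


@dataclass(frozen=True)
class Concat:
    """Toy concat node (hypothetical) that accepts any number of inputs."""
    inputs: Tuple["Plan", ...]


Plan = Union[Source, Concat]


def flatten_concat(plan: Plan) -> Plan:
    """Rewrite nested Concat nodes into a single Concat, preserving input order."""
    if isinstance(plan, Source):
        return plan
    flat = []
    for child in plan.inputs:
        child = flatten_concat(child)  # flatten bottom-up
        if isinstance(child, Concat):
            flat.extend(child.inputs)  # splice the child's inputs into the parent
        else:
            flat.append(child)
    return Concat(tuple(flat))


# The nested plan from the first diagram: concat(3, concat(2, 1)).
nested = Concat((Source("3"), Concat((Source("2"), Source("1")))))
flattened = flatten_concat(nested)
```

Here `flattened` is a single `Concat` whose inputs are `3`, `2`, `1` in their original order, which corresponds to the second diagram. Preserving order matters because concat output order follows input order.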

Describe alternatives you've considered

None

Additional context

polars - https://github.com/pola-rs/polars/issues/7855
datafusion - https://github.com/apache/datafusion/issues/7481

samster25 commented 4 days ago

Good callout! One thing that might be a bit messy is that we haven't built support for any n-ary ops yet, so n-way concat will be the first.
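As a rough illustration of that gap, the sketch below collapses a chain of two-input concat nodes into one n-way node. Every name here (`Scan`, `BinaryConcat`, `NaryConcat`, `collect_inputs`, `to_nary`) is hypothetical and only models the idea; none of it reflects how Daft's plan nodes are actually defined.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Scan:
    """Hypothetical leaf node standing in for one input."""
    name: str


@dataclass(frozen=True)
class BinaryConcat:
    """Hypothetical two-input concat, the only shape a binary-only planner can express."""
    left: "Node"
    right: "Node"


@dataclass(frozen=True)
class NaryConcat:
    """Hypothetical n-way concat holding all inputs in one node."""
    inputs: tuple


Node = Union[Scan, BinaryConcat, NaryConcat]


def collect_inputs(node: Node) -> list:
    """Return the non-concat inputs under nested BinaryConcat nodes, left to right."""
    if isinstance(node, BinaryConcat):
        return collect_inputs(node.left) + collect_inputs(node.right)
    return [node]


def to_nary(node: Node) -> Node:
    """Replace a tree of BinaryConcat nodes with a single NaryConcat."""
    if isinstance(node, BinaryConcat):
        return NaryConcat(tuple(collect_inputs(node)))
    return node


# Three nested binary concats over four inputs.
chain = BinaryConcat(
    Scan("a"),
    BinaryConcat(Scan("b"), BinaryConcat(Scan("c"), Scan("d"))),
)
nary = to_nary(chain)
```

The rewrite itself is small; the messier part is likely everything downstream that assumes a node has at most two children.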