apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.16k stars 1.16k forks source link

Support arbitrary expressions in `LIMIT` clause #9821

Open alamb opened 6 months ago

alamb commented 6 months ago

Follow on to https://github.com/apache/arrow-datafusion/issues/9506

The idea is to support arbitrary expressions that can be consolidated to a constant in the LIMIT clause. For example

❯ select * from (values (1)) LIMIT 10*100;
Error during planning: Unsupported operator for LIMIT clause

This query should be able to run (and return the single value)

❯ select * from (values (1)) LIMIT 10+100;
+---------+
| column1 |
+---------+
| 1       |
+---------+

https://github.com/apache/arrow-datafusion/pull/9790 adds support for basic +/- but the general purpose solution that would handle any expr that can be consolidated to a constant would be better

As suggested by @jonahgao this might look like change the Limit logical plan to support arbitrary expressions?

pub struct Limit {
    pub skip: Expr,
    pub fetch: Option<Expr>,
    pub input: Arc<LogicalPlan>,
}

The SimplifyExpressions rule can automatically optimize them into constants. Some optimization rules such as PushDownLimit only run when the limit expression is a constant. We may need to add a cast for the limit expression when planning, only checking if it is a constant of type u64.

When creating the LimitExec physical plan, convert the limit expression into PhysicalExpr and evaluate it.

_Originally posted by @jonahgao in https://github.com/apache/arrow-datafusion/pull/9790#discussion_r1539358701_

Lordworms commented 6 months ago

I can do this one since I have been doing a related limit pushdown feature.

Lordworms commented 6 months ago

wait until https://github.com/apache/arrow-datafusion/pull/9815#issue-2209595815 merged

alamb commented 1 week ago

This one is probably ready to work on now

alamb commented 1 week ago

While re-reading this I think we should start with implementhing limits that can be evaluated to a constant by the time the physical plan is created (aka don't change the physical execution plans)

It is not clear to me what LIMIT 100 + x means

The key usecases are:

jonahgao commented 6 days ago

While re-reading this I think we should start with implementhing limits that can be evaluated to a constant by the time the physical plan is created (aka don't change the physical execution plans)

I will switch my focus to working on it.