apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
5.99k stars 1.13k forks source link

Proposal: introduced typed expressions, separate AST and IR #12604

Open findepi opened 3 days ago

findepi commented 3 days ago

Currently DataFusion expressions (Expr) do not know their type. The type can be calculated via <Expr as ExprSchemable::get_type function.

This has drawbacks unfortunately

I believe the current design of expressions come from the fact they are used during initial AST (abstract syntax tree) processing where types aren't known yet. The pair (AST, AST → LogicalPlan conversion function) is an implementation of a language. DF implements its own variant of SQL (https://github.com/apache/datafusion/issues/12357) and systems building on top of DF may or may not use it (e.g. Ballista probably uses it and Comet wants to be closely aligned with Spark SQL's semantics).

Proposal plan

findepi commented 3 days ago

cc @alamb @notfilippo @jayzhan211 @comphead

jayzhan211 commented 3 days ago

I think select_to_plan is actually datafusion's internal AST layer that convert sqlparser AST to LogicalPlan, can you explain more what the difference between them is?

I don't quite understand why explicitly carrying the type in Expr would improve performance. If get_type is slow, wouldn't we face the same cost of determining the type when creating an Expr? If the computing the type is expensive, we could also consider caching the result. We also have a nullable property for Expr, if it makes sense not to explicit carrying nullable for Expr, wouldn't the same logic apply to type as well.

comphead commented 2 days ago

Thanks @jayzhan211 @findepi. Youguys are right, currently we using expr.get_type(schema) to get the exact data type, which is recursive and potentially can have some perfomance hit for deep nested types, probably insignificant. Carrying the datatype with Expr itself also has some benefits in runtime type eval.

What I'm trying to say is to make this large change it might be a very good reason for doing it. I think we need to list the exact benefits of having this refactoring

findepi commented 2 days ago

why explicitly carrying the type in Expr would improve performance. If get_type is slow, wouldn't we face the same cost of determining the type when creating an Expr?

correct, unless get_type is called more than once. optimizer rules may check types. DF doesn't have as many optimizer rules yet, but I am basing this on observation and profiling of Trino optimizer.

If the computing the type is expensive, we could also consider caching the result.

The type will be needed sooner or later. No need to have implicit cache for something that is needed and can be "cached" explicitly. Explicitness helps understand the contract -- if something is a field it implies it won't change.

Also, can Expr be used against two different DFSchemas? Currently it can (and may produce different types), but i assume this capability is incidental more than intentional.

We also have a nullable property for Expr, if it makes sense not to explicit carrying nullable for Expr, wouldn't the same logic apply to type as well.

That's a good question. There are also other type- or value- related properties (aka traits), like range of possible values (intervals).

Not sure yet how these should be handled.

to make this large change it might be a very good reason for doing it. I think we need to list the exact benefits of having this refactoring

@comphead that's a very good call. my main motivation isn't performance, it's the design & reliability. With DF being "LLVM for data processing", explicit is superior to implicit, especially that DF user could have different type derivation rules. Eg different coercions between types. To be even more concrete with example, Trino and SQL Server has different rules for coercing decimals of different precision and scale.

alamb commented 19 hours ago

let Expr carry the type explicitly (that would be "logical type"

I think this is the

correct, unless get_type is called more than once. optimizer rules may check types. DF doesn't have as many optimizer rules yet, but I am basing this on observation and profiling of Trino optimi

FWIW I don't think i have seen get_type appear in the planning profiling traces (this may be because the overhead is mostly dominated by copies so the type calculations are overshadowed.

There are some times when the type of an intermedidate expression is needed (like the simplifier) but it doesn't seem to have been a bottleneck so far

to make this large change it might be a very good reason for doing it. I think we need to list the exact benefits of having this refactoring

I agree with @comphead -- I think this propsoal would be stronger if it was in the context of

  1. I am trying to do X but the current way types are represented makes it not possible
  2. I am trying to do Y but profiling shows a large amount of time being spent recomputing unecessary types

In other words, if we were designing a system from scratch this proposal makes sense to me, but I think at this point we already have a system that seems to work reasonably well for our existing users so just changing the design to be better without some practical improvement is a hard one for me

Eg different coercions between types. To be even more concrete with example, Trino and SQL Server has different rules for coercing decimals of different precision and scale.

I think making an API for user-defined coercion rules in DataFusion would be really nice (not just to support the usecase, but I think forcing the internal implementaiton into a thoughtful API would make the code easier to reason about and better strutured)