Open findepi opened 2 months ago
cc @alamb @notfilippo @jayzhan211 @comphead
I think select_to_plan
is actually datafusion's internal AST layer that convert sqlparser AST to LogicalPlan
, can you explain more what the difference between them is?
I don't quite understand why explicitly carrying the type in Expr
would improve performance. If get_type
is slow, wouldn't we face the same cost of determining the type when creating an Expr
? If the computing the type is expensive, we could also consider caching the result. We also have a nullable
property for Expr
, if it makes sense not to explicit carrying nullable
for Expr
, wouldn't the same logic apply to type
as well.
Thanks @jayzhan211 @findepi. Youguys are right, currently we using expr.get_type(schema)
to get the exact data type, which is recursive and potentially can have some perfomance hit for deep nested types, probably insignificant. Carrying the datatype with Expr itself also has some benefits in runtime type eval.
What I'm trying to say is to make this large change it might be a very good reason for doing it. I think we need to list the exact benefits of having this refactoring
why explicitly carrying the type in
Expr
would improve performance. Ifget_type
is slow, wouldn't we face the same cost of determining the type when creating anExpr
?
correct, unless get_type
is called more than once. optimizer rules may check types.
DF doesn't have as many optimizer rules yet, but I am basing this on observation and profiling of Trino optimizer.
If the computing the type is expensive, we could also consider caching the result.
The type will be needed sooner or later. No need to have implicit cache for something that is needed and can be "cached" explicitly. Explicitness helps understand the contract -- if something is a field it implies it won't change.
Also, can Expr be used against two different DFSchemas? Currently it can (and may produce different types), but i assume this capability is incidental more than intentional.
We also have a nullable property for Expr, if it makes sense not to explicit carrying nullable for Expr, wouldn't the same logic apply to type as well.
That's a good question. There are also other type- or value- related properties (aka traits), like range of possible values (intervals).
Not sure yet how these should be handled.
to make this large change it might be a very good reason for doing it. I think we need to list the exact benefits of having this refactoring
@comphead that's a very good call. my main motivation isn't performance, it's the design & reliability. With DF being "LLVM for data processing", explicit is superior to implicit, especially that DF user could have different type derivation rules. Eg different coercions between types. To be even more concrete with example, Trino and SQL Server has different rules for coercing decimals of different precision and scale.
let Expr carry the type explicitly (that would be "logical type"
I think this is the
correct, unless get_type is called more than once. optimizer rules may check types. DF doesn't have as many optimizer rules yet, but I am basing this on observation and profiling of Trino optimi
FWIW I don't think i have seen get_type appear in the planning profiling traces (this may be because the overhead is mostly dominated by copies so the type calculations are overshadowed.
There are some times when the type of an intermedidate expression is needed (like the simplifier) but it doesn't seem to have been a bottleneck so far
to make this large change it might be a very good reason for doing it. I think we need to list the exact benefits of having this refactoring
I agree with @comphead -- I think this propsoal would be stronger if it was in the context of
X
but the current way types are represented makes it not possibleY
but profiling shows a large amount of time being spent recomputing unecessary typesIn other words, if we were designing a system from scratch this proposal makes sense to me, but I think at this point we already have a system that seems to work reasonably well for our existing users so just changing the design to be better without some practical improvement is a hard one for me
Eg different coercions between types. To be even more concrete with example, Trino and SQL Server has different rules for coercing decimals of different precision and scale.
I think making an API for user-defined coercion rules in DataFusion would be really nice (not just to support the usecase, but I think forcing the internal implementaiton into a thoughtful API would make the code easier to reason about and better strutured)
Currently DataFusion expressions (
Expr
) do not know their type. The type can be calculated via<Expr as ExprSchemable::get_type
function.This has drawbacks unfortunately
I believe the current design of expressions come from the fact they are used during initial AST (abstract syntax tree) processing where types aren't known yet. The pair
(AST, AST → LogicalPlan conversion function)
is an implementation of a language. DF implements its own variant of SQL (https://github.com/apache/datafusion/issues/12357) and systems building on top of DF may or may not use it (e.g. Ballista probably uses it and Comet wants to be closely aligned with Spark SQL's semantics).Proposal plan
Expr
carry the type explicitly (that would be "logical type" from https://github.com/apache/datafusion/issues/11513)