Open findepi opened 1 month ago
cc @alamb, @andygrove, @jayzhan211, @ozankabak, @notfilippo, @comphead, @sadboy, @milevin, @wizardxz
This certainly sounds ambitious (but good!)
Another specific change towards this goal that might be worth considering (and that came up in @andygrove's CMU talk about Comet) would be user defined coercion rules
The coercion / type resolution is pretty specific today and can't be easily extended
Thanks @alamb for your feedback! For https://github.com/apache/datafusion/issues/12644 I started to prototype a new type system that DataFusion could adopt, with the assumption that all types should be first-class citizens (so a `trait Type` rather than an `enum Type`). This obviously leads to type providers having to negotiate the coercion rules, but also makes creating core optimizer rules harder -- it would mean the optimizer doesn't know actual types at all! I am pretty sure this is a viable approach, but also a pretty complicated one. And it doesn't solve all the problems on its own, since at some point we need to "physicalize" the types into some common denominator. This remaining piece is exactly what this issue is about: building a solid common denominator.
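To make the `enum` vs `trait` trade-off concrete, here is a minimal sketch. All names (`ClosedType`, `OpenType`, `StorageType`, `UuidType`, `physicalize`) are hypothetical and are not taken from the prototype or from DataFusion; the point is only that a closed enum lets core code match on every type, while an open trait hides the concrete type and leaves the "physicalized" storage type as the common denominator.

```rust
use std::fmt::Debug;

/// Closed design: every type is a variant, so core code can match exhaustively.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ClosedType {
    Int32,
    Utf8,
}

/// With a closed enum an optimizer rule can answer this without asking anyone.
fn is_numeric(t: ClosedType) -> bool {
    match t {
        ClosedType::Int32 => true,
        ClosedType::Utf8 => false,
    }
}

/// Hypothetical common denominator that execution understands
/// (an assumption for this sketch, not Arrow's `DataType`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum StorageType {
    Int32,
    FixedSizeBinary(usize),
}

/// Open design: anyone can add a type; core code only sees this interface.
trait OpenType: Debug {
    fn name(&self) -> &str;
    /// "Physicalize": report the storage representation of this type.
    fn storage(&self) -> StorageType;
}

#[derive(Debug)]
struct Int32Type;

impl OpenType for Int32Type {
    fn name(&self) -> &str {
        "int32"
    }
    fn storage(&self) -> StorageType {
        StorageType::Int32
    }
}

/// A user-provided extension type that core code has never heard of.
#[derive(Debug)]
struct UuidType;

impl OpenType for UuidType {
    fn name(&self) -> &str {
        "uuid"
    }
    fn storage(&self) -> StorageType {
        StorageType::FixedSizeBinary(16)
    }
}

/// Strip the open types down to the common denominator.
fn physicalize(schema: &[Box<dyn OpenType>]) -> Vec<StorageType> {
    schema.iter().map(|t| t.storage()).collect()
}
```

Calling `physicalize` on a schema holding `Int32Type` and `UuidType` yields only storage types; anything like `is_numeric` would have to become a method that each type provider answers, which is the "optimizer doesn't know actual types" cost mentioned above.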
Let's put the exact mechanics (Tasks) aside, would be awesome to have agreement on the two things first:
- do we want DataFusion to be a reliable building block for building query engines?
Well, I don't think there is / will be much debate about this goal as it is the same as the current goal, in my mind (or maybe I don't understand if it is meant to be different than the current goal)
- if yes, do we want to introduce an architecture supporting that goal?
I look at it a little differently, which is "what features / extension points are missing to allow more building with DataFusion" -- like in my mind the current architecture already supports this goal, though it has areas that could be improved (e.g. extensible coercion, user defined types, etc)
I think #12622 is important if we want a nice function signature and coercion rule.
Thank you @alamb @jayzhan211 for your comments!
- do we want DataFusion to be a reliable building block for building query engines?
Well, I don't think there is / will be much debate about this goal as it is the same as the current goal, in my mind (or maybe I don't understand if it is meant to be different than the current goal)
I do think this is the current goal. However "an extensible query engine" != "LLVM for query engines". The current official documentation uses the former term. The latter is a bit more ambitious, requiring more predictable semantics.
I look at it a little differently, which is "what features / extension points are missing to allow more building with DataFusion" -- like in my mind the current architecture already supports this goal,
Not fully. For example, the DF coercion rules are everywhere: the function signatures are consulted at various phases of planning. Thus if someone wants to build a system without the "everything is coercible to string" behavior, the current architecture doesn't allow that. It is probably easy to fix (conceptually) by defining the layers. The frontend layer (SQL parsing, analysis, creation of the initial plan) is responsible for function resolution and coercions. Beyond that layer we do not ask the same questions again.
(e.g. extensible coercion, user defined types, etc)
I think https://github.com/apache/datafusion/issues/12622 is important if we want a nice function signature and coercion rule.
yes to both! plus https://github.com/apache/datafusion/issues/12644
but even if we do all of that, we still run on top of Arrow for execution (which is not a bad thing), so we need a layer at which the (simplified) Arrow types are the type system (simplified to allow https://github.com/apache/datafusion/issues/12720). The physical plans could be the only such layer, but that would miss an opportunity to reuse optimizations; they are too physical already. Thus we need a logical plan layer where engines can integrate.
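As a rough illustration of what "simplified Arrow types" could mean, the sketch below collapses several physical encodings of the same values into one type. The `Encoding` and `SimplifiedType` enums are hypothetical stand-ins written for this comment (they only borrow a few names from Arrow's encodings and are not the real arrow `DataType`), and the mapping itself is an assumption, not the design from the linked issues.

```rust
/// Hypothetical mirror of a few Arrow encodings; not the real arrow `DataType`.
#[derive(Debug, Clone, PartialEq, Eq)]
enum Encoding {
    Int32,
    Utf8,
    LargeUtf8,
    Utf8View,
    /// Dictionary-encoded values of the inner encoding.
    Dictionary(Box<Encoding>),
}

/// Hypothetical simplified type: what the values are, not how they are laid out.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SimplifiedType {
    Int32,
    String,
}

/// Forget the layout and keep only the simplified type.
fn simplify(e: &Encoding) -> SimplifiedType {
    match e {
        Encoding::Int32 => SimplifiedType::Int32,
        Encoding::Utf8 | Encoding::LargeUtf8 | Encoding::Utf8View => SimplifiedType::String,
        // A dictionary is only a different layout for its values.
        Encoding::Dictionary(values) => simplify(values),
    }
}

/// Two columns are interchangeable at this layer if they simplify to the same type.
fn same_simplified_type(a: &Encoding, b: &Encoding) -> bool {
    simplify(a) == simplify(b)
}
```

Under this assumption a plan that only mentions `SimplifiedType::String` stays valid whether a source produces `Utf8`, `Utf8View` or a dictionary, which is the kind of freedom runtime-adaptive execution would need.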
TL;DR yes, I believe the goal here is in line with the project goal. Let's have agreement that we want to invest in code structure / layers making DataFusion reuse reliable and with predictable semantics.
However "an extensible query engine" != "LLVM for query engines". The current official documentation uses the former term. The latter is a bit more ambitious, requiring more predictable semantics.
I think in my mind the goal is to be an extensible query engine and an LLVM for data intensive systems. Perhaps that is subtly different 🤔
Thus if someone wants to build a system without the "everything is coercible to string" behavior, the current architecture doesn't allow that.
Pedantically I would probably phrase this as "the current code doesn't allow that" -- I think that by adding a suitable API it would be straightforward to allow user defined coercion rules. There may be different opinions on if that would be an architectural change, but I would say it isn't.
let's have agreement that we want to invest in code structure / layers making DataFusion reuse reliable and with predictable semantics.
It seems to me like there is broad agreement that:
Maybe we can treat these as three different features
At least one good next step is probably to file a ticket describing what "user defined coercion rules" would look like
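For discussion purposes, a minimal sketch of what "user defined coercion rules" might look like as an extension point. Everything here is hypothetical (`CoercionRules`, `SimpleType`, `PermissiveRules`, `StrictRules`, `resolve_comparison`); none of it is an existing DataFusion API. It only shows the shape: the frontend asks a pluggable object for the common type of a comparison, so an engine that does not want "everything is coercible to string" can supply its own rules.

```rust
/// Hypothetical, tiny type universe used only for this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SimpleType {
    Int32,
    Int64,
    Utf8,
}

/// Hypothetical extension point: the frontend consults this during analysis.
trait CoercionRules {
    /// Common type to compare `lhs` with `rhs`, or `None` if they are not comparable.
    fn comparison_type(&self, lhs: SimpleType, rhs: SimpleType) -> Option<SimpleType>;
}

/// Permissive dialect: when nothing else fits, compare as strings.
struct PermissiveRules;

/// Strict dialect: integer widening only, no implicit cast to string.
struct StrictRules;

impl CoercionRules for PermissiveRules {
    fn comparison_type(&self, lhs: SimpleType, rhs: SimpleType) -> Option<SimpleType> {
        use SimpleType::*;
        match (lhs, rhs) {
            (a, b) if a == b => Some(a),
            (Int32, Int64) | (Int64, Int32) => Some(Int64),
            // Fallback of this dialect: everything is coercible to string.
            _ => Some(Utf8),
        }
    }
}

impl CoercionRules for StrictRules {
    fn comparison_type(&self, lhs: SimpleType, rhs: SimpleType) -> Option<SimpleType> {
        use SimpleType::*;
        match (lhs, rhs) {
            (a, b) if a == b => Some(a),
            (Int32, Int64) | (Int64, Int32) => Some(Int64),
            _ => None,
        }
    }
}

/// What a frontend would do with the answer: plan a cast or report an error.
fn resolve_comparison(
    rules: &dyn CoercionRules,
    lhs: SimpleType,
    rhs: SimpleType,
) -> Result<SimpleType, String> {
    rules
        .comparison_type(lhs, rhs)
        .ok_or_else(|| format!("cannot compare {:?} with {:?}", lhs, rhs))
}
```

With `PermissiveRules`, comparing `Int32` with `Utf8` resolves to `Utf8`; with `StrictRules` the same comparison is rejected during analysis, and nothing after the frontend has to ask the question again.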
I don't want this ticket to be about flexible (aka user defined) coercion rules. I believe they are a necessary ingredient for https://github.com/apache/datafusion/issues/12644, but they can also be tracked separately. My point is that for things like Extension Types (https://github.com/apache/datafusion/issues/12644) we need a lower layer which is stripped of those extension types. I don't think it should be just the physical layer, since that would cost us missed optimization opportunities. So, with this issue my goal is not to take away any functionality, or forbid adding any functionality. Quite the contrary: to allow more functionality to be added, with fewer breaking changes for downstream consumers, we need some form of "contracts" for the various APIs we provide. The highest-level layer should provide all the bells & whistles. The lowest layer is Arrow vectors and kernels. Let's incrementally define the middle layer, as it will make reasoning about the code easier.
I hope there is no disagreement that more code structure, simpler code flow and well defined contracts are a good thing.
The https://github.com/apache/datafusion/pull/13028 change is a great supporting example for this initiative. We have a language feature we want to expose in DataFusion (https://github.com/apache/datafusion/issues/9821), and due to how logical plan construction works today, this mostly-syntax-sugar feature affects logical plan structure. It has no practical impact on logical plan expressibility, so from the LP user's perspective it's a breaking change with no added value. If we had a clear frontend vs core separation -- as is being proposed here -- downstream LP consumers would not be affected by such changes.
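A small sketch of the frontend vs core split, under stated assumptions: it does not model the feature from the linked PR, but uses wildcard expansion, which the issue description mentions as something the frontend handles. `FrontendExpr`, `CoreExpr` and `lower_projection` are hypothetical names, not DataFusion types.

```rust
/// Hypothetical frontend expression: may contain syntax sugar.
#[derive(Debug, Clone, PartialEq)]
enum FrontendExpr {
    Column(String),
    /// `SELECT *`: sugar that only makes sense next to a query language.
    Wildcard,
}

/// Hypothetical core expression: no sugar, so new frontend features do not
/// add variants that downstream consumers of the plan must handle.
#[derive(Debug, Clone, PartialEq)]
enum CoreExpr {
    Column(String),
}

/// Lower a frontend projection to core expressions, expanding wildcards
/// against the columns of the input relation.
fn lower_projection(exprs: &[FrontendExpr], input_columns: &[&str]) -> Vec<CoreExpr> {
    let mut lowered = Vec::new();
    for expr in exprs {
        match expr {
            FrontendExpr::Column(name) => lowered.push(CoreExpr::Column(name.clone())),
            FrontendExpr::Wildcard => lowered.extend(
                input_columns
                    .iter()
                    .map(|c| CoreExpr::Column(c.to_string())),
            ),
        }
    }
    lowered
}
```

Adding another sugar variant to `FrontendExpr` would change `lower_projection` only; code that consumes `CoreExpr` keeps compiling unchanged, which is the property downstream LP consumers care about.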
from https://datafusion.apache.org/
ibid.
The two passages indicate the dual nature of DataFusion.
First, it's a complete query engine, with which users can interact e.g. using datafusion-cli (or `dfdb` proposed in https://github.com/apache/datafusion/issues/11979) or DataFusion's SQL and DataFrames frontends.
Second is what @alamb often refers to as DataFusion being "LLVM for query engines", a building block for other applications, where components are re-usable.
(See also https://datafusion.apache.org/user-guide/faq.html#how-does-datafusion-compare-with-xyz, https://datafusion.apache.org/contributor-guide/architecture.html and https://docs.rs/datafusion/latest/datafusion/index.html#architecture)
A query engine may implement a dialect of SQL that is identical to DataFusion SQL, or a different one. DataFusion doesn't need to know the details of the query engine being implemented (it is extensible rather than being a union of all the query languages). DataFusion needs to allow expressing different query languages, providing a reliable and dialect-agnostic foundation for applications building on top of DataFusion.
A query engine and a query language have certain attributes
Internally, such a query engine transforms user queries (according to syntax, scoping, typing rules) into relational algebra operations, optimizes and executes them. Sounds simple, but in reality this is super complex and this is where DataFusion really shines.
This epic issue is a collection of tasks important for achieving this goal. Its description should be expected to be a living document.
At a high level, it aims at separation of concerns. The two roles DataFusion has -- being an implementation of DataFusion SQL (a particular SQL dialect) and being a reusable building block -- should be clearly separated so that dialect-specific behavior isn't implicitly inherited by components that need to be dialect-agnostic.
Goals and Overview
As a result DataFusion should have the following layers
Frontend
DataFusion frontend includes DataFusion SQL - DataFusion's implementation of SQL. DataFusion frontend also includes the DataFrame API.
It provides the following functionality
- […] `sqlparser` crate that DataFusion SQL uses
- […] `DataType`) but that should change in https://github.com/apache/datafusion/issues/11513)
DataFusion main
DataFusion "main" or "core" represents a dialect- and syntax-agnostic execution query engine library, for building query engines. It serves as an API for all query engine implementors that decide to build on top of DataFusion.
It provides the following functionality
- […] `DataType` (https://github.com/apache/datafusion/issues/11513)
- […] `UnwrapCastInComparison` for input `CAST(a_number AS varchar(4)) > '1234'` should check whether a_number can safely be represented in 4 characters (would the cast fail?). It clearly can't fail for e.g. the int8 type, and clearly can fail for the int32 type
- […] `a = b` types of `a` and `b` match exactly)
- […] `IS NULL` and `IS UNDEFINED` or have a Wildcard expression, since those things are handled by the frontend layer.
DataFusion execution
It provides the following functionality
- […] `ExecutionPlan`, `PhysicalExpr`)
- […] `DataType` (https://github.com/apache/datafusion/issues/11513)
- […] `DataType` directly is also an option, but would prevent runtime-adaptive execution, see https://github.com/apache/datafusion/issues/12720, so simplified types would be strongly preferred, https://github.com/apache/datafusion/issues/11513
Tasks
(Tasks to be added here once they are discovered and defined.)
- […] `sqlparser` dependency from all crates except `datafusion-sql` and `datafusion-cli` (it is OK to use for tests)
- […] `optimizer` crate
- […] `SessionState` into "frontend SessionState" and "core SessionState": the layers build on each other, so every layer is concerned with runtime-relevant attributes such as RuntimeEnv, but only the frontend needs to know the function repertoire, for example (https://github.com/apache/datafusion/issues/12550 seems related)
- […] `datafusion/core` into a frontend part and a reusable part. The public crate name is `datafusion`, which is perfect for a ready-to-consume frontend component, so maybe we could introduce core as a `datafusion-core` crate
- […] `UnwrapCastInComparison` example above) -- this is clearly vague and needs further specification
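To make the `UnwrapCastInComparison` example above a bit less vague, here is a sketch of the check it asks for: can every value of an integer type be written in at most N characters? It assumes that a cast to `varchar(N)` fails (rather than truncates) when the text is longer than N, and that integers are rendered in plain decimal with a leading minus sign. `IntType`, `max_text_len` and `cast_to_varchar_cannot_fail` are hypothetical names, not DataFusion functions.

```rust
/// Hypothetical stand-in for the signed integer types under discussion.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum IntType {
    Int8,
    Int16,
    Int32,
    Int64,
}

impl IntType {
    /// Smallest and largest value of the type, widened to i64.
    fn bounds(self) -> (i64, i64) {
        match self {
            IntType::Int8 => (i8::MIN as i64, i8::MAX as i64),
            IntType::Int16 => (i16::MIN as i64, i16::MAX as i64),
            IntType::Int32 => (i32::MIN as i64, i32::MAX as i64),
            IntType::Int64 => (i64::MIN, i64::MAX),
        }
    }
}

/// Length of the longest decimal rendering of any value of the type.
/// The extremes are the longest, so checking the two bounds is enough.
fn max_text_len(t: IntType) -> usize {
    let (min, max) = t.bounds();
    min.to_string().len().max(max.to_string().len())
}

/// True if `CAST(x AS varchar(varchar_len))` cannot fail for any `x` of type `t`,
/// under the assumption that an over-long value is an error.
fn cast_to_varchar_cannot_fail(t: IntType, varchar_len: usize) -> bool {
    max_text_len(t) <= varchar_len
}
```

For int8 the longest text is `-128` (4 characters), so the cast to `varchar(4)` is safe; for int32 the longest text has 11 characters, so it can fail, and a rewrite that drops the cast would change the query's behavior.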