[Epic] Make DataFusion a reliable foundation for building query engines

findepi commented 1 month ago

DataFusion is an extensible query engine written in Rust that uses Apache Arrow as its in-memory format. DataFusion’s target users are developers building fast and feature rich database and analytic systems, customized to particular workloads. See use cases for examples.

“Out of the box,” DataFusion offers SQL and Dataframe APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community. Python Bindings are also available.

ibid.

DataFusion features a full query planner, a columnar, streaming, multi-threaded, vectorized execution engine, and partitioned data sources. You can customize DataFusion at almost all points including additional data sources, query languages, functions, custom operators and more. See the Architecture section for more details.

The two passages indicate dual nature of DataFusion

First, it's a complete query engine, with which users can interact e.g. using datafusion-cli (or dfdb proposed in https://github.com/apache/datafusion/issues/11979) or DataFusion's SQL and DataFrames frontends.

Second is what @alamb often refers to as DataFusion being "LLVM for query engines", a building block for other applications, where components are re-usable.

A query engine may implement a dialect of SQL that is identical with DataFusion SQL, or different. DataFusion doesn't need to know the details of the query engine being implemented (it is extensible rather than being union of all the query languages). DataFusion needs to allow expressing different query languages, providing a reliable and dialect-agnostic foundation for applications building on top of DataFusion.

A query engine and a query language have certain attributes

language: syntax
language: scope resolution rules
language: type system, including coercion rules
language: function repertoire
engine: data sources
engine: access control
engine: observability

Internally, such a query engine transforms user queries (according to syntax, scoping, typing rules) into relational algebra operations, optimizes and executes them. Sounds simple, but in reality this is super complex and this is where DataFusion really shines.

This epic issue is a collection of tasks important for achieving this goal. Its description should be expected to be a living document.

On the high level, it aims at separation of concerns. The two roles DataFusion has -- an implementation of DataFusion SQL, a particular SQL dialect along and being reusable building block -- they should be clearly separated so that dialect-specific behavior isn't implicitly inherited by components needed to be dialect-agnostic.

Goals and Overview

As a result DataFusion should have the following layers

Frontend

DataFusion frontend includes DataFusion SQL - DataFusion's implementation of SQL. DataFusion frontend also includes the DataFrame API.

It provides the following functionality

SQL frontend, with the syntax supported being a subset of syntax supported by the sqlparser crate that DataFusion SQL uses
the type system (currently Arrow data types (DataType) but that should change in https://github.com/apache/datafusion/issues/11513)
a repertoire of builtin functions
coercion rules required to make the language useufl (i.e. the coercion rules that exists today)
performs function resolution (taking function repertoire, their signatures and coercion rules into account)
the frontend remains extensible e.g. by adding more functions or providing table sources.
it may support extending the type system (https://github.com/apache/datafusion/issues/12644)

DataFusion main

DataFusion "main" or "core" represents a dialect- and syntax-agnostic execution query engine library, for building query engines. It serves as an API for all query engine implementors that decide to build on top of DataFusion.

It provides the following functionality

type system: a simplified version of Arrow DataType (https://github.com/apache/datafusion/issues/11513)
- it is currently assume that allowing extension types (https://github.com/apache/datafusion/issues/12644), or totally different data types, is not necessary at this layer. The applications building on top of DataFusion will have to translate their type systems to that of DataFusion main
- it should support annotating the DataFusion types with additional attributes or traits, so that custom optimizer rules can reason about them, or DataFusion type system gets reacher
  - for example, for a language with a constrained varchar type, a correct implementation of UnwrapCastInComparison for input CAST(a_number AS varchar(4)) > '1234' should check whether a_number can safely be represented in 4 characters (would the cast fail?). it clearly can't fail for eg int8 type, and clearly can fail for int32 type
plan IR (intermediate representation) with strict typing rules and no implicit coercions (e.g. for a = b types of a and b match exactly)
does not perform type coercions
does not use function signatures (the formal signatures of functions inserted into the plan must match exactly)
does not have a function repertoire (doesn't need it, since function resolution already happened), but may provide implementations of functions that are reusable by the Frontend layer, or by applications building on top of DataFusion
has simple (stripped down) expression language. For example is doesn't have to discern between IS NULL and IS UNDEFINED or have a Wildcard expression, since those things are handled by the frontend layer.
optimizers which take the the IR plan and produce an equivalent better IR plan
since it doesn't do scope resolution, it doesn't need a concept of aliases, but needs to preserve output field names externally assigned (e.g. by the frontend layer)
- see also https://github.com/apache/datafusion/issues/6543
- see also https://github.com/apache/datafusion/issues/8379
- see also https://github.com/apache/datafusion/issues/1468
the DataFusion main remains extensible e.g. by adding custom optimizer rules, or custom plan nodes

DataFusion execution

It provides the following functionality

physical planning (ExecutionPlan, PhysicalExpr)
type system: a simplified version of Arrow DataType (https://github.com/apache/datafusion/issues/11513)
- using Arrow DataType directly is also an option, but would prevent runtime-adaptive execution, see https://github.com/apache/datafusion/issues/12720, so simplified types would be strongly preferred, https://github.com/apache/datafusion/issues/11513
has no coercion rules
executes queries and produces results

Tasks

(Tasks to be added here once they are discovered and defined.)

[ ] as part of https://github.com/apache/datafusion/issues/12357, clearly document project goals (Work “out of the box” + Customizable everything)
[ ] introduce new simple DataFusion types: https://github.com/apache/datafusion/issues/11513
[ ] allow functions to accept types according to new simple DataFusion types (relates to https://github.com/apache/datafusion/issues/12635, but that issue goes far beyond that)
[ ] remove sqlparser dependency from all crates except datafusion-sql and datafusion-cli (it is OK to use for tests)
[ ] move analyzer out of optimizer and into the SQL frontend - analyzer plays supportive role towards a particular SQL dialect implementation and as such is part of the SQL frontend. It should not live inside the optimizer crate
[ ] separate implicitly typed expressions from explicitly typed expressions. The DataFrame API creates expressions directly and by its very frontendish nature they are not strictly typed, i.e. coercions are necessary. The IR layer needs expressions that have no implicit typing rules. The best approach would be to separate those expression hierarchies into two separate trees see also https://github.com/apache/datafusion/issues/12604.
[ ] split SessionState into "frontend SessionState" and "core SessionState": the layers build on each other, so every layer is concerned with runtime-relevant attributes such as RuntimeEnv, but only the frontend needs to know the function repertoire for example (https://github.com/apache/datafusion/issues/12550 seems related)
[ ] triage existing issues, cataloging them into frontend and core/main
[ ] split datafusion/core into frontend part and reusable part. the public crate name is datafusion which is perfect from a ready to consume frontend component, so maybe we could introduce core as datafusion-core crate
[ ] extend DataFusion core type system to better support building on top of it (see UnwrapCastInComparison example above) -- this is clearly vague and needs further specification
... surely much more ...

findepi commented 1 month ago

cc @alamb, @andygrove, @jayzhan211, @ozankabak, @notfilippo, @comphead, @sadboy, @milevin, @wizardxz

alamb commented 1 month ago

This certainly sounds ambitious (but good!)

Another specific change towards this goal that might be worth considering (and that came in @andygrove 's CMU talk about comet) would be user defined coercion rules

The coerercion / type resolution is pretty specific today and can't be easily extended

findepi commented 1 month ago

Thanks @alamb for your feedback! For https://github.com/apache/datafusion/issues/12644 i started to prototype a new type system that DataFusion could adopt, with assumptions that all types should be first-class citizens (so a trait Type rather than a enum Type). This obviously leads to type providers having to negotiate the coercion rules, but also makes creating core optimizer rules harder -- it would mean the optimizer doesn't know actual types at all! I am pretty sure this is viable approach, but also pretty complicated. And it doesn't solve all the problems on its own, since at some point we need for "physicalize" the types into some common denominator. This remaining thing being exactly what this issue is about. Building a solid common denominator.

findepi commented 1 month ago

Let's put the exact mechanics (Tasks) aside, would be awesome to have agreement on the two things first:

do we want DataFusion to be a reliable building block for building query engines?
if yes, do we want to introduce an architecture supporting that goal?

alamb commented 1 month ago

do we want DataFusion to be a reliable building block for building query engines?

Well, I don't think there is / will be much debate about this goal as it is the same as the current goal, in my mind (or maybe I don't understand if it is meant to be different than the current goal)

if yes, do we want to introduce an architecture supporting that goal?

I look at it a little differently, which is "what features / extension points are missing to allow more building with DataFusion" -- like in my mind the current architure already supports this goal, though it has areas that could be improved (e.g. etendibel coercion, user defined types, etc)

jayzhan211 commented 1 month ago

I think #12622 is important if we want a nice function signature and coercion rule.

findepi commented 1 month ago

Thank you @alamb @jayzhan211 for your comments!

do we want DataFusion to be a reliable building block for building query engines?

Well, I don't think there is / will be much debate about this goal as it is the same as the current goal, in my mind (or maybe I don't understand if it is meant to be different than the current goal)

I do think this is the current goal. However "an extensible query engine" != "LLVM for query engines". The current official documentation uses the former term. The latter is a bit more ambitious, requiring more predictable semantics.

I look at it a little differently, which is "what features / extension points are missing to allow more building with DataFusion" -- like in my mind the current architure already supports this goal,

Not fully. For example, the DF coercion rules are anywhere. The function signatures are consulted at various phases of the planning. Thus is someone wants to build a system without "everything is coercible to string" behavior, current architecture doesn't allow that. It is probably easy to fix (conceptually) by defining the layers. The frontend layer (SQL parsing, analysis, creation of the initial plan) is responsible for function resolution and coercions. Beyond that layer we do not repeat same questions.

(e.g. etendibel coercion, user defined types, etc)

I think https://github.com/apache/datafusion/issues/12622 is important if we want a nice function signature and coercion rule.

yes to both! plus https://github.com/apache/datafusion/issues/12644

but even if we do all of that, we still run on top of Arrow for execution (which is not a bad thing), so we need a layer at which the (simplified) arrow types is the type system (simplified to allow https://github.com/apache/datafusion/issues/12720). The physical plans could be the only such layer, but it's a missed optimizations reuse opportunity, they are too physical already. Thus we need a logical plan layer where engines can integrate.

TL;DR yes, i believe the goal here is inline with project goal. let's have agreement that we want to invest in code structure / layers making DataFusion reuse reliable and with predictable semantics.

alamb commented 1 month ago

However "an extensible query engine" != "LLVM for query engines". The current official documentation uses the former term. The latter is a bit more ambitious, requiring more predictable semantics.

I think in my mind the goal is to be an extensible query engine and an LLVM for data intensive systems. Perhaps that is subtlety different 🤔

Thus is someone wants to build a system without "everything is coercible to string" behavior, current architecture doesn't allow that.

Pedantically I would probably phrase this as "the current code doesn't allow that" -- I think that by adding a suitable API it would be straightforward to allow user defined coercion rules. There may be different opinions on if that would be an architectural change, but I would say it isn't.

alamb commented 1 month ago

let's have agreement that we want to invest in code structure / layers making DataFusion reuse reliable and with predictable semantics.

It seems to me like there is broad agreement that:

More flexible (aka user defined) coercion rules would be good.
Support extension types https://github.com/apache/datafusion/issues/12644 is good
Finding a way to support runtime DataType adaptivity https://github.com/apache/datafusion/issues/12720 would be good

Maybe we can treat these as three different features

At least one good next step is probably file a ticket describing what "user defined coercion rules" would look like

findepi commented 1 month ago

I don't want this ticket to be about flexible (aka user defined) coercion rules. I believe this is necessary ingredient for https://github.com/apache/datafusion/issues/12644, but can also be tracked separately. My point is that for things like Extension Types (https://github.com/apache/datafusion/issues/12644) we need a lower layer which is stripped of those extension types. I don't think it's just physical layer, it would be missed optimizations opportunities cost. So, with this issue my goal is not to take away any functionality, or forbid adding any functionality. Quite contrary, to allow more functionality to be added, with less breaking changes for downstream consumers, we need some form of "contracts." for various APIs we provide. Highest level layer should provide all the bells & whistles. Lowest layer is Arrow vectors and kernels. Let's incrementally define the middle layer, as it will make reasoning about the code easier.

I hope there is no disagreement that more code structure, simpler code flow and well defined contracts is a good thing.

findepi commented 1 month ago

https://github.com/apache/datafusion/pull/13028 change is a great supporting example for this initiative. We have a language feature we want to expose in DataFusion (https://github.com/apache/datafusion/issues/9821), and due to how logical plan construction works today, this mostly-syntax-sugar feature affects logical plan structure, but it has no practical impact on logical plan expressibility, so from LP user perspective, it's a breaking change with no added value. If we had clear frontend vs core separation -- as being proposed here -- downstream LP consumers would be not be affected by such changes.

apache / datafusion