apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.37k stars 1.2k forks source link

[Epic] Make DataFusion a reliable foundation for building query engines #12723

Open findepi opened 1 month ago

findepi commented 1 month ago

from https://datafusion.apache.org/

DataFusion is an extensible query engine written in Rust that uses Apache Arrow as its in-memory format. DataFusion’s target users are developers building fast and feature rich database and analytic systems, customized to particular workloads. See use cases for examples.

“Out of the box,” DataFusion offers SQL and Dataframe APIs, excellent performance, built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community. Python Bindings are also available.

ibid.

DataFusion features a full query planner, a columnar, streaming, multi-threaded, vectorized execution engine, and partitioned data sources. You can customize DataFusion at almost all points including additional data sources, query languages, functions, custom operators and more. See the Architecture section for more details.

The two passages indicate dual nature of DataFusion

First, it's a complete query engine, with which users can interact e.g. using datafusion-cli (or dfdb proposed in https://github.com/apache/datafusion/issues/11979) or DataFusion's SQL and DataFrames frontends.

Second is what @alamb often refers to as DataFusion being "LLVM for query engines", a building block for other applications, where components are re-usable.

(See also https://datafusion.apache.org/user-guide/faq.html#how-does-datafusion-compare-with-xyz, https://datafusion.apache.org/contributor-guide/architecture.html and https://docs.rs/datafusion/latest/datafusion/index.html#architecture)

A query engine may implement a dialect of SQL that is identical with DataFusion SQL, or different. DataFusion doesn't need to know the details of the query engine being implemented (it is extensible rather than being union of all the query languages). DataFusion needs to allow expressing different query languages, providing a reliable and dialect-agnostic foundation for applications building on top of DataFusion.

A query engine and a query language have certain attributes

Internally, such a query engine transforms user queries (according to syntax, scoping, typing rules) into relational algebra operations, optimizes and executes them. Sounds simple, but in reality this is super complex and this is where DataFusion really shines.

This epic issue is a collection of tasks important for achieving this goal. Its description should be expected to be a living document.

On the high level, it aims at separation of concerns. The two roles DataFusion has -- an implementation of DataFusion SQL, a particular SQL dialect along and being reusable building block -- they should be clearly separated so that dialect-specific behavior isn't implicitly inherited by components needed to be dialect-agnostic.

Goals and Overview

As a result DataFusion should have the following layers

Frontend

DataFusion frontend includes DataFusion SQL - DataFusion's implementation of SQL. DataFusion frontend also includes the DataFrame API.

It provides the following functionality

DataFusion main

DataFusion "main" or "core" represents a dialect- and syntax-agnostic execution query engine library, for building query engines. It serves as an API for all query engine implementors that decide to build on top of DataFusion.

It provides the following functionality

DataFusion execution

It provides the following functionality

Tasks

(Tasks to be added here once they are discovered and defined.)

findepi commented 1 month ago

cc @alamb, @andygrove, @jayzhan211, @ozankabak, @notfilippo, @comphead, @sadboy, @milevin, @wizardxz

alamb commented 1 month ago

This certainly sounds ambitious (but good!)

Another specific change towards this goal that might be worth considering (and that came in @andygrove 's CMU talk about comet) would be user defined coercion rules

The coerercion / type resolution is pretty specific today and can't be easily extended

findepi commented 1 month ago

Thanks @alamb for your feedback! For https://github.com/apache/datafusion/issues/12644 i started to prototype a new type system that DataFusion could adopt, with assumptions that all types should be first-class citizens (so a trait Type rather than a enum Type). This obviously leads to type providers having to negotiate the coercion rules, but also makes creating core optimizer rules harder -- it would mean the optimizer doesn't know actual types at all! I am pretty sure this is viable approach, but also pretty complicated. And it doesn't solve all the problems on its own, since at some point we need for "physicalize" the types into some common denominator. This remaining thing being exactly what this issue is about. Building a solid common denominator.

findepi commented 1 month ago

Let's put the exact mechanics (Tasks) aside, would be awesome to have agreement on the two things first:

alamb commented 1 month ago
  • do we want DataFusion to be a reliable building block for building query engines?

Well, I don't think there is / will be much debate about this goal as it is the same as the current goal, in my mind (or maybe I don't understand if it is meant to be different than the current goal)

  • if yes, do we want to introduce an architecture supporting that goal?

I look at it a little differently, which is "what features / extension points are missing to allow more building with DataFusion" -- like in my mind the current architure already supports this goal, though it has areas that could be improved (e.g. etendibel coercion, user defined types, etc)

jayzhan211 commented 1 month ago

I think #12622 is important if we want a nice function signature and coercion rule.

findepi commented 1 month ago

Thank you @alamb @jayzhan211 for your comments!

  • do we want DataFusion to be a reliable building block for building query engines?

Well, I don't think there is / will be much debate about this goal as it is the same as the current goal, in my mind (or maybe I don't understand if it is meant to be different than the current goal)

I do think this is the current goal. However "an extensible query engine" != "LLVM for query engines". The current official documentation uses the former term. The latter is a bit more ambitious, requiring more predictable semantics.

I look at it a little differently, which is "what features / extension points are missing to allow more building with DataFusion" -- like in my mind the current architure already supports this goal,

Not fully. For example, the DF coercion rules are anywhere. The function signatures are consulted at various phases of the planning. Thus is someone wants to build a system without "everything is coercible to string" behavior, current architecture doesn't allow that. It is probably easy to fix (conceptually) by defining the layers. The frontend layer (SQL parsing, analysis, creation of the initial plan) is responsible for function resolution and coercions. Beyond that layer we do not repeat same questions.

(e.g. etendibel coercion, user defined types, etc)

I think https://github.com/apache/datafusion/issues/12622 is important if we want a nice function signature and coercion rule.

yes to both! plus https://github.com/apache/datafusion/issues/12644

but even if we do all of that, we still run on top of Arrow for execution (which is not a bad thing), so we need a layer at which the (simplified) arrow types is the type system (simplified to allow https://github.com/apache/datafusion/issues/12720). The physical plans could be the only such layer, but it's a missed optimizations reuse opportunity, they are too physical already. Thus we need a logical plan layer where engines can integrate.

TL;DR yes, i believe the goal here is inline with project goal. let's have agreement that we want to invest in code structure / layers making DataFusion reuse reliable and with predictable semantics.

alamb commented 1 month ago

However "an extensible query engine" != "LLVM for query engines". The current official documentation uses the former term. The latter is a bit more ambitious, requiring more predictable semantics.

I think in my mind the goal is to be an extensible query engine and an LLVM for data intensive systems. Perhaps that is subtlety different 🤔

Thus is someone wants to build a system without "everything is coercible to string" behavior, current architecture doesn't allow that.

Pedantically I would probably phrase this as "the current code doesn't allow that" -- I think that by adding a suitable API it would be straightforward to allow user defined coercion rules. There may be different opinions on if that would be an architectural change, but I would say it isn't.

alamb commented 1 month ago

let's have agreement that we want to invest in code structure / layers making DataFusion reuse reliable and with predictable semantics.

It seems to me like there is broad agreement that:

  1. More flexible (aka user defined) coercion rules would be good.
  2. Support extension types https://github.com/apache/datafusion/issues/12644 is good
  3. Finding a way to support runtime DataType adaptivity https://github.com/apache/datafusion/issues/12720 would be good

Maybe we can treat these as three different features

At least one good next step is probably file a ticket describing what "user defined coercion rules" would look like

findepi commented 1 month ago

I don't want this ticket to be about flexible (aka user defined) coercion rules. I believe this is necessary ingredient for https://github.com/apache/datafusion/issues/12644, but can also be tracked separately. My point is that for things like Extension Types (https://github.com/apache/datafusion/issues/12644) we need a lower layer which is stripped of those extension types. I don't think it's just physical layer, it would be missed optimizations opportunities cost. So, with this issue my goal is not to take away any functionality, or forbid adding any functionality. Quite contrary, to allow more functionality to be added, with less breaking changes for downstream consumers, we need some form of "contracts." for various APIs we provide. Highest level layer should provide all the bells & whistles. Lowest layer is Arrow vectors and kernels. Let's incrementally define the middle layer, as it will make reasoning about the code easier.

I hope there is no disagreement that more code structure, simpler code flow and well defined contracts is a good thing.

findepi commented 1 month ago

https://github.com/apache/datafusion/pull/13028 change is a great supporting example for this initiative. We have a language feature we want to expose in DataFusion (https://github.com/apache/datafusion/issues/9821), and due to how logical plan construction works today, this mostly-syntax-sugar feature affects logical plan structure, but it has no practical impact on logical plan expressibility, so from LP user perspective, it's a breaking change with no added value. If we had clear frontend vs core separation -- as being proposed here -- downstream LP consumers would be not be affected by such changes.