[EPIC] JIT support for `DataFusion`

alamb commented 2 years ago

Summary TLDR: The key focus of this work is to speed up fundamentally row oriented operations like hash table lookup or comparisons (e.g. #2427)

Background

DataFusion, like many Arrow systems, is a classic "vectorized computation engine" which works quite well for many common operations. The following paper, gives a good treatment on the various tradeoffs between vectorized and JIT's compilation of query plans: https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de

As mentioned in the paper, there are some fundamentally "row oriented" operations in a database that are not typically amenable to vectorization. The "classics" are: Hash table updates in Joins and Hash Aggregates, as well as comparing tuples in sort.

Another example can be found in these slides from this presentation

@yjshen added initial support for JIT'ing in https://github.com/apache/arrow-datafusion/pull/1849 and it currently lives in https://github.com/apache/arrow-datafusion/tree/master/datafusion/jit. He also added partial support for aggregates in https://github.com/apache/arrow-datafusion/pull/2375

This ticket aims to be a central location for tracking the status of JIT compiling expressions for anyone who wants to contribute to this effort

Describe the solution you'd like

alamb commented 2 years ago

Actually, this is basically the same as https://github.com/apache/arrow-datafusion/issues/1861 so closing in favor of that ticket

leoluan2009 commented 6 months ago

Hi @alamb , do we need to support expression JIT for performance like ClickHouse: https://clickhouse.com/blog/clickhouse-just-in-time-compiler-jit

alamb commented 6 months ago

Hi @leoluan2009

In my opinion, I don't think DataFusion needs JIT to get good performance.

In general, I find the paper "Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask" to explain the tradeoffs well

DataFusion is a vectorized engine and we haven't found areas where JIt would be compelling compared to vectorized code. The only area I can really think of would be to implement type specialized comparisons for sorting (to avoid the RowFormat) but we would need to have a pretty compelling benchmark showing improvements to justify I think

faucct commented 6 months ago

Though the paper that you have mentioned admits that JIT-compilation is beneficial for OLTP workloads:

Besides OLAP performance, other factors also play an important role. Compilation-based engines have advantages in < OLTP as they can create fast stored procedures

If DataFusion would have JIT, then it could be useful for building Online ML Feature Store engines.

alamb commented 6 months ago

If DataFusion would have JIT, then it could be useful for building Online ML Feature Store engines.

FWIW there is no reason you couldn't JIT compile arbitrary expressions and run them as UDFs. The API is now basically complete: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html

It would also be possible to do the same with aggregates / windows / etc

faucct commented 6 months ago

I think that compiling SQL-expressions to UDFs by hand would kinda kill the whole point of the framework, but it seems like most of the framework would be irrelevant for the in-memory transformation of Online Features, so I guess it would be easier to build the same thing from scratch, though using the same ideas, like for example SQL and Arrow format for intermediate data representation.

alamb commented 6 months ago

FIW there is a lot more to SQL evaluation than just the expression evaluation, so that might be a reason to use DataFusion even if you had to implement your own expressions 🤔

apache / datafusion

[EPIC] JIT support for `DataFusion` #2703