Closed alamb closed 2 years ago
Actually, this is basically the same as https://github.com/apache/arrow-datafusion/issues/1861 so closing in favor of that ticket
Hi @alamb , do we need to support expression JIT for performance like ClickHouse: https://clickhouse.com/blog/clickhouse-just-in-time-compiler-jit
Hi @leoluan2009
In my opinion, I don't think DataFusion needs JIT to get good performance.
In general, I find the paper "Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask" to explain the tradeoffs well
DataFusion is a vectorized engine and we haven't found areas where JIt would be compelling compared to vectorized code. The only area I can really think of would be to implement type specialized comparisons for sorting (to avoid the RowFormat) but we would need to have a pretty compelling benchmark showing improvements to justify I think
Though the paper that you have mentioned admits that JIT-compilation is beneficial for OLTP workloads:
Besides OLAP performance, other factors also play an important role. Compilation-based engines have advantages in < OLTP as they can create fast stored procedures
If DataFusion would have JIT, then it could be useful for building Online ML Feature Store engines.
If DataFusion would have JIT, then it could be useful for building Online ML Feature Store engines.
FWIW there is no reason you couldn't JIT compile arbitrary expressions and run them as UDFs. The API is now basically complete: https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.ScalarUDFImpl.html
It would also be possible to do the same with aggregates / windows / etc
I think that compiling SQL-expressions to UDFs by hand would kinda kill the whole point of the framework, but it seems like most of the framework would be irrelevant for the in-memory transformation of Online Features, so I guess it would be easier to build the same thing from scratch, though using the same ideas, like for example SQL and Arrow format for intermediate data representation.
FIW there is a lot more to SQL evaluation than just the expression evaluation, so that might be a reason to use DataFusion even if you had to implement your own expressions š¤
Summary TLDR: The key focus of this work is to speed up fundamentally row oriented operations like hash table lookup or comparisons (e.g. #2427)
Background
DataFusion, like many Arrow systems, is a classic "vectorized computation engine" which works quite well for many common operations. The following paper, gives a good treatment on the various tradeoffs between vectorized and JIT's compilation of query plans: https://db.in.tum.de/~kersten/vectorization_vs_compilation.pdf?lang=de
As mentioned in the paper, there are some fundamentally "row oriented" operations in a database that are not typically amenable to vectorization. The "classics" are: Hash table updates in Joins and Hash Aggregates, as well as comparing tuples in sort.
Another example can be found in these slides from this presentation
@yjshen added initial support for JIT'ing in https://github.com/apache/arrow-datafusion/pull/1849 and it currently lives in https://github.com/apache/arrow-datafusion/tree/master/datafusion/jit. He also added partial support for aggregates in https://github.com/apache/arrow-datafusion/pull/2375
This ticket aims to be a central location for tracking the status of JIT compiling expressions for anyone who wants to contribute to this effort
Describe the solution you'd like