dask-contrib / dask-sql

Distributed SQL Engine in Python using Dask
https://dask-sql.readthedocs.io/
MIT License

Optimization: Go from general logical plans to dask-specific plans? #183

Open nils-braun opened 3 years ago

nils-braun commented 3 years ago

This issue is both a feature proposal and a question.

Currently, the process from a SQL query (string) to a computed dataframe is as follows (see the sketch after the list):

  1. Use Calcite to parse the SQL and create a (non-optimized) relational algebra tree
  2. Optimize with Calcite using rule-based optimizations, resulting in a plan of Logical* nodes. These relational plans are quite generic: they know nothing about Dask, distributed processing, or the data
  3. Transform each node of the relational algebra, one by one, into Dask API calls
  4. Let Dask do another round of optimizations on the task graph and execute it
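As an illustration of these four steps, here is a minimal sketch using dask-sql's public API (the table and query are made up for the example). Steps 1-3 happen inside `Context.sql()`, which returns a lazy Dask dataframe; step 4 happens on `.compute()`:

```python
import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

c = Context()

# Register a small (made-up) table with the context.
df = dd.from_pandas(
    pd.DataFrame({"a": [1, 2, 1], "b": [4.0, 5.0, 6.0]}),
    npartitions=2,
)
c.create_table("my_table", df)

# Steps 1-3: Calcite parses the SQL, produces and optimizes the
# Logical* plan, and dask-sql converts it into Dask API calls.
# The result is still a lazy Dask dataframe.
result = c.sql("SELECT a, SUM(b) AS total FROM my_table GROUP BY a")

# Step 4: Dask runs its own graph optimizations and executes.
print(result.compute())
```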

While this works quite well and is, e.g., also the approach used by BlazingSQL, there are at least two problems with it:

One possibility to solve these problems (and thereby maybe boost the speed) would be to bring distributed operations such as "Shuffle" into Calcite, let it decide when to shuffle, and then simply copy its decisions into Dask. The resulting optimized plan would then look similar to, e.g., Spark's output. I was hoping that we could "copy" the procedure from other Apache Calcite projects and simply execute their optimized physical plan, but so far I have not been able to find a good project to borrow from.
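To make the idea concrete, here is a purely hypothetical sketch (all names are invented; this is neither dask-sql's nor Calcite's actual API) of what converting such a dask-aware physical node could look like on the Python side. The point is that the optimizer, not the converter, would have decided where the shuffle goes:

```python
import dask.dataframe as dd


class DaskShufflePlugin:
    """Hypothetical converter for a dask-aware physical 'DaskShuffle' node.

    In this sketch, Calcite has already decided *that* and *on which
    keys* the data must be repartitioned; the converter only copies
    that decision into a Dask call.
    """

    class_name = "DaskShuffle"  # invented name of the physical rel node

    def convert(self, rel, df: dd.DataFrame) -> dd.DataFrame:
        keys = rel.shuffle_keys  # assumed attribute carrying the decision
        return df.shuffle(on=keys)
```

Whether such a physical plan would be built inside Calcite's own planner or assembled on the Python side after optimization is exactly the open design question of this issue.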

nils-braun commented 3 years ago

I am actively looking for people with Calcite (or any similar project) experience :-)