dbt-labs / dbt-core

dbt enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications.
https://getdbt.com
Apache License 2.0

Support Dask as an Adapter #1860

Closed talebzeghmi closed 4 years ago

talebzeghmi commented 5 years ago

Describe the feature

Support Dask just as Spark is supported.

Who will this benefit?

This will benefit real-time / web-request use cases where milliseconds matter. The same isomorphic machine-learning model transform, written in SQL, could then run both in bulk (against the data lake) and in the sub-millisecond-latency web-request context.

drewbanin commented 5 years ago

Hey @talebzeghmi - cool idea! I've heard many good things about Dask, but I haven't used it before personally. This would be a great opportunity to get some experience with it for sure.

The Spark plugin for dbt is really a SparkSQL plugin -- dbt handles templating SQL and executing it against a remote Spark cluster. All of dbt's existing adapter plugins are for databases (e.g. BigQuery, Snowflake, Redshift, Presto, etc). Does Dask provide any sort of SQL interface for working with dataframes? Based on my (admittedly very incomplete) understanding, I don't quite see where Dask would fit into the dbt picture.
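To make the templating idea concrete, here is a rough sketch of what dbt does with a model (the model text and the toy `ref()` resolution are illustrative only, not dbt's actual internals):

```python
# Rough sketch of dbt's templating step (illustrative model text, toy ref()):
from jinja2 import Template

# A dbt model is essentially a templated SELECT statement.
model_sql = "select id, amount from {{ ref('raw_orders') }} where amount > 0"

# dbt resolves ref() to the concrete relation name on the target database;
# here a trivial resolver stands in for dbt's dependency graph.
rendered = Template(model_sql).render(ref=lambda name: name)
print(rendered)
# select id, amount from raw_orders where amount > 0
```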

I'm super happy to discuss / brainstorm around this issue even if it's not something we'd be able to pick up imminently -- keen to hear what you have in mind & your opinions on the broader Python data space here if you'd care to share them!

Thanks for taking the time :)

talebzeghmi commented 5 years ago

Does dbt do simple SQL translation, or does it have an executor-engine plugin model that executes the query plan it produces? An example would be how Hive allows HQL to be executed by MapReduce, Tez, or Spark.

drewbanin commented 5 years ago

dbt builds SQL statements and sends them to a database to be executed. dbt itself does not interpret user-provided SQL!
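A minimal sketch of that flow, using stdlib sqlite3 as a stand-in for a real warehouse (table names are illustrative):

```python
# Minimal sketch of "build a SQL statement, send it to the database":
import sqlite3

compiled = "create table stg_orders as select id, amount from raw_orders where amount > 0"

conn = sqlite3.connect(":memory:")
conn.execute("create table raw_orders (id integer, amount integer)")
conn.execute(compiled)  # the database, not dbt, interprets the SQL
```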

drewbanin commented 4 years ago

closing this one. Let me know if you have any further thoughts here @talebzeghmi - happy to re-open and discuss :)

nils-braun commented 3 years ago

Hi @drewbanin and @talebzeghmi! Congratulations on this well-done and well-documented package! I am the main developer of dask-sql, a relatively new extension of Dask that adds SQL capabilities. dask-sql is still a very young project and certainly cannot yet be compared with the very mature SparkSQL and the like, but we are happy to collaborate as much as possible. @rajagurunath brought up a possible dbt adapter in the referenced issue.
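For context, the basic dask-sql flow looks roughly like this (a minimal sketch with illustrative data; API details may still change while the project is young):

```python
# Register a Dask dataframe with a dask-sql Context and query it with SQL.
import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

c = Context()
df = dd.from_pandas(pd.DataFrame({"id": [1, 2, 3], "amount": [10, 0, 25]}), npartitions=1)
c.create_table("orders", df)  # table metadata lives in memory in the Context

result = c.sql("SELECT id, amount FROM orders WHERE amount > 0")
print(result.compute())  # result is a lazy Dask dataframe; compute() materializes it
```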

I have only had a quick look so far, so excuse my naive question: if dbt issues a CREATE TABLE SQL command (in one way or another), does it rely on the data actually being stored, e.g. on disk? All meta information about tables in dask-sql currently lives in memory - a restart of the dask-sql server will remove all tables again. Is this a possible showstopper for implementing an adapter?

The dask-sql server implementation speaks the Presto wire protocol, so apart from some possible SQL incompatibilities (we do not cover the full SQL standard yet, especially some table-description commands), I do not see a reason why such an adapter cannot be implemented.
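That means any standard Presto client should be able to talk to it; for example (a sketch assuming the server's default local address, with pyhive as one of several possible Presto clients):

```python
# Query a running dask-sql server over the Presto wire protocol.
from pyhive import presto

conn = presto.connect(host="localhost", port=8080)  # assumed default dask-sql-server address
cur = conn.cursor()
cur.execute("SELECT 1 + 1")
print(cur.fetchall())
```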

jtcohen6 commented 3 years ago

@nils-braun That sounds really neat!

That sounds like a serious complication, though perhaps not a showstopper, so long as the meta information about tables persists between sessions/connections. How often do you expect a dask-sql server to be restarted? How would end users, seeking to benefit from the transformed datasets produced by a dbt project, expect to access or query those tables?

nils-braun commented 3 years ago

Hi @jtcohen6

Thanks for your answer! The idea is that the dask-sql server is restarted about as often as, say, a Hive or Presto server (and the Dask cluster itself), so in general not very often. One naive question: what happens if the created tables are lost? I assume dbt will simply recreate them, at the cost of additional processing time?

jtcohen6 commented 3 years ago

That makes sense. (Ultimately, I imagine it would be valuable to back up those tables in some kind of metastore, along the lines of Hive or Presto). In the meantime, you've got the right idea: dbt takes the opinionated view that all transformed data objects (models) should be idempotent, such that it can rebuild them from raw data sources, at any time and for any reason, with no data lost—just the cost of the computation.
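A toy sketch of that drop-and-recreate pattern (sqlite3 as a stand-in, names illustrative):

```python
# Idempotent rebuild: a lost model table can always be recreated from raw
# sources, at the cost of recomputation only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table raw_orders (id integer, amount integer)")

def build_model(conn):
    conn.execute("drop table if exists stg_orders")
    conn.execute("create table stg_orders as select * from raw_orders where amount > 0")

build_model(conn)  # initial build
build_model(conn)  # e.g. after a restart lost the table: same result, just recomputed
```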

nils-braun commented 3 years ago

That is nice to hear @jtcohen6 - thanks for the clarification. I think in this case it makes sense to continue our work (or currently, @rajagurunath's work) on integrating dbt and dask-sql. Maybe it will be of use to some users! Thanks for your help so far - we might come back with more questions :-)

srggrs commented 1 year ago

Hi guys, since dbt has adopted pure Python models, do you think there is room for a pure Python implementation for Dask and dbt?
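For illustration, a hypothetical Dask-backed model might look like this (dbt's Python models really do use the `model(dbt, session)` contract, but the Dask support itself is speculative):

```python
# Speculative sketch of a dbt Python model returning a Dask dataframe --
# dbt has no Dask adapter today; only the model(dbt, session) signature is real.
def model(dbt, session):
    orders = dbt.ref("raw_orders")          # would yield a Dask dataframe here
    cleaned = orders[orders["amount"] > 0]  # lazy, partition-wise Dask operations
    return cleaned                          # an adapter would persist/register the result
```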