Hey @talebzeghmi - cool idea! I've heard many good things about Dask, but I haven't used it before personally. This would be a great opportunity to get some experience with it for sure.
The Spark plugin for dbt is really a SparkSQL plugin -- dbt handles templating SQL and executing it against a remote Spark cluster. All of dbt's existing adapter plugins are databases (e.g. BigQuery, Snowflake, Redshift, Presto, etc.). Does Dask provide any sort of SQL interface for working with dataframes? Based on my (admittedly very incomplete) understanding, I don't quite see where Dask would fit into the dbt picture.
I'm super happy to discuss / brainstorm around this issue even if it's not something we'd be able to pick up imminently -- keen to hear what you have in mind & your opinions on the broader Python data space here if you'd care to share them!
Thanks for taking the time :)
Does dbt do simple SQL translation, or does it have an executor-engine plugin model to execute the query plan it produces? An example would be how Hive allows HQL to be executed by MapReduce, Tez, or Spark.
dbt builds SQL statements and sends them to a database to be executed. dbt itself does not interpret user-provided SQL!
Closing this one. Let me know if you have any further thoughts here @talebzeghmi - happy to re-open and discuss :)
Hi @drewbanin and @talebzeghmi! Congratulations on this well-done and well-documented package! I am the main developer of dask-sql, a relatively new extension of Dask that adds SQL capabilities. dask-sql is still a very young project and certainly cannot be compared with the very mature SparkSQL and the like yet, but we are happy to extend the collaboration as much as possible. @rajagurunath brought up a possible adapter in dbt in the referenced issue.
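For concreteness, the core of dask-sql is a small Python API that registers Dask dataframes as in-memory tables and runs SQL against them. Here is a rough sketch (the table name and data are just illustrative):

```python
import pandas as pd
import dask.dataframe as dd
from dask_sql import Context

# Build a small Dask dataframe (in practice this might come from parquet on a data lake).
df = dd.from_pandas(
    pd.DataFrame({"status": ["completed", "completed", "returned"], "amount": [10.0, 5.0, 3.0]}),
    npartitions=1,
)

# Register it under a table name; the table metadata lives in the Context, in memory.
c = Context()
c.create_table("orders", df)

# Run SQL against the registered table; the result is a lazy Dask dataframe.
result = c.sql("SELECT status, SUM(amount) AS total FROM orders GROUP BY status")
print(result.compute())
```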
I had only a quick look so far, so excuse my naive question: if dbt issues a CREATE TABLE SQL command (in one way or another), does it rely on the data actually being stored, e.g. on disk? All meta information on the tables in dask-sql lives in memory so far - a possible restart of the dask-sql server will remove all tables again. Is this a possible showstopper for implementing an adapter?
The dask-sql server implementation speaks the presto wire protocol, so apart from some possible SQL incompatibilities (we do not cover the full SQL standard so far, especially not some table description commands), I do not see a reason why such an adapter cannot be implemented.
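To make the wire-protocol point concrete, a client should in principle be able to talk to a running dask-sql server the same way it would talk to Presto. A rough sketch using the PyHive presto client (the host, port, and query are assumptions about your deployment, not something dbt does today):

```python
from pyhive import presto

# Assuming a dask-sql server is running locally and speaking the presto wire protocol.
conn = presto.connect(host="localhost", port=8080)
cursor = conn.cursor()

# Any SQL that dask-sql understands can be sent over this connection.
cursor.execute("SELECT 1 + 1")
print(cursor.fetchall())
```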
@nils-braun That sounds really neat!
That sounds like a serious complication, though perhaps not a showstopper, so long as the meta information about tables persists between sessions/connections. How often do you expect a dask-sql server to be restarted? How would end users, seeking to benefit from the transformed datasets produced by a dbt project, expect to access or query those tables?
Hi @jtcohen6
Thanks for your answer! The idea is that the dask-sql server is restarted about as often as something like a Hive or Presto server - or the Dask cluster itself. So, in general, not very often. One naive question: what happens if the created tables are lost again? I assume dbt will probably recreate them, at the cost of additional processing time?
That makes sense. (Ultimately, I imagine it would be valuable to back up those tables in some kind of metastore, along the lines of Hive or Presto). In the meantime, you've got the right idea: dbt takes the opinionated view that all transformed data objects (models) should be idempotent, such that it can rebuild them from raw data sources, at any time and for any reason, with no data lost—just the cost of the computation.
That is nice to hear @jtcohen6 - thanks for the clarification. I think in this case it makes sense to continue our work (or currently, @rajagurunath's work) on the integration of dbt and dask-sql. Maybe it will be of use to some users! Thanks for your help so far - we might come back with more questions :-)
Hi guys, now that dbt has adopted pure Python models, do you think there is room for a pure Python integration between Dask and dbt?
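To sketch what that could look like: dbt Python models are defined as a `model(dbt, session)` function that returns a dataframe, so a hypothetical Dask-backed adapter might let you write something like the following. Nothing here exists today - `dbt.ref` returning a Dask dataframe, and the model/column names, are purely assumptions for illustration:

```python
import dask.dataframe as dd

def model(dbt, session):
    # Hypothetical: a Dask adapter would hand back Dask dataframes from ref().
    orders = dbt.ref("stg_orders")        # assumed to be a dask.dataframe
    customers = dbt.ref("stg_customers")  # assumed to be a dask.dataframe

    # Plain dataframe logic, executed lazily on the Dask cluster.
    joined = orders.merge(customers, on="customer_id", how="left")
    summary = joined.groupby("customer_id").order_total.sum().reset_index()

    # The adapter would be responsible for materializing the returned dataframe.
    return summary
```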
Describe the feature
Support Dask just as Spark is supported.
Who will this benefit?
This will benefit realtime / web-request use cases where milliseconds matter. The same isomorphic machine-learning transform, expressed in SQL, could then run both in bulk (against the data lake) and in web-request contexts with sub-millisecond latency.
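As a rough illustration of the "isomorphic" idea (the function and column names are made up): because Dask dataframes mirror the pandas API, the same transform could in principle run in bulk on a Dask cluster and on a single in-memory pandas frame inside a web request:

```python
import pandas as pd
import dask.dataframe as dd

def featurize(df):
    # Identical transform logic for both engines (columns are illustrative).
    out = df.copy()
    out["amount_norm"] = (out["amount"].clip(lower=0) + 1) ** 0.5
    return out

# Bulk path: run across a large dataset with Dask (lazy, distributed).
bulk = dd.from_pandas(pd.DataFrame({"amount": [12.5, 3.0, 7.25]}), npartitions=1)
bulk_features = featurize(bulk).compute()

# Request path: run on a single small pandas frame with low latency (in-process).
request = pd.DataFrame({"amount": [12.5]})
request_features = featurize(request)
```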