dask-contrib / dask-sql

Distributed SQL Engine in Python using Dask
https://dask-sql.readthedocs.io/
MIT License

Integration of DBT and dask_sql #200

Open rajagurunath opened 3 years ago

rajagurunath commented 3 years ago

dbt (data build tool) enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications. Refer to the GitHub repo and the product website.

So far, dbt is well adopted and works in modern data warehouses and frameworks (SQL engines) like Spark SQL and Presto. Through custom adapters, any database engine can implement or extend the adapters to use all the features of dbt. Refer to the "Building a new adapter" guidelines <- I am following this now.

I feel integrating dbt and dask-sql opens up a lot of potential.

Since dask-sql already speaks the Presto protocol, I tried using the dbt-presto adapter but was not able to integrate it completely. The adapter needs to implement some macros responsible for generating SQL that is compliant with the target database/SQL engine, like here. These macros implement information schema and other database-related functionality, I believe, which does not work with dask-sql. I will investigate those issues more. (It is a very early stage to decide anything on my side, like whether to implement a new adapter for dask-sql or to make use of the Presto adapter somehow.)
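
For context on the "already speaks the Presto protocol" point: dask-sql ships a Presto-compatible server, so a Presto client (such as the prestodb library that dbt-presto uses underneath) can talk to it directly. A minimal sketch, assuming the run_server entry point described in the dask-sql docs and its default host/port:

    from dask_sql import Context, run_server

    c = Context()
    # Register tables on the context here, then expose them over the
    # Presto wire protocol (HTTP, port 8080 by default).
    run_server(context=c)

    # On the client side, the same presto-python-client that dbt-presto
    # uses could then connect, e.g.:
    #   import prestodb
    #   conn = prestodb.dbapi.connect(host="localhost", port=8080, user="dbt")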

(This feels similar to https://github.com/nils-braun/dask-sql/issues/61, but a lot of other things are involved here, like SQL compatibility, and not only the Presto protocol (FastAPI server-side) issue; that's why I created this separately.)

I am very new to dbt, but I will try to integrate these two frameworks! I will request your help if I get stuck anywhere 😊

What do you think about this integration? Will it be helpful? And what are your thoughts on a new adapter for dask-sql?

nils-braun commented 3 years ago

Hi @rajagurunath, I really like this idea. I am also not an expert in dbt, but I have seen it popping up in many discussions around data engineering in recent months - so it should definitely be on our agenda. Having had a quick look into the dbt documentation, I have seen a few things worth discussing.

Maybe, if you have already tested the macros with dbt-presto, you can summarize your findings here? That would be cool!

nils-braun commented 3 years ago

And last but not least, I would be very interested in the opinion of @goodwanghan, because I think fugue-sql and dbt share some common goals. Han, have you worked with dbt already? How does it relate to fugue-sql?

rajagurunath commented 3 years ago

I have not completed my exploration of dbt + Presto (dask_sql), but here are a few findings so far (apologies if something is wrong):

DB Connection, TRANSACTION, and Execution:

  1. dbt-presto ultimately hands the query over to the prestodb presto-client library, where each execution is treated as a transaction (every query starts with START TRANSACTION, for example), so we may need to implement the BEGIN, ROLLBACK, and COMMIT keywords as well. (At least I saw that BEGIN and START TRANSACTION are needed while running dbt run and dbt test, respectively.) dask_sql does not support those commands as of now, I guess. We can control this behavior to some extent in connections.py of the dbt adapter (currently I commented out/removed those lines); a sketch follows below.
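
To make that last point concrete, here is a minimal sketch of how a hypothetical dbt adapter for dask-sql could neutralize the transaction handling, following the dbt-core SQLConnectionManager convention (the DaskSQLConnectionManager name and the no-op choices are my assumptions, not an existing adapter):

    from dbt.adapters.sql import SQLConnectionManager

    class DaskSQLConnectionManager(SQLConnectionManager):  # hypothetical
        TYPE = "dask_sql"

        def begin(self):
            # dask-sql has no BEGIN / START TRANSACTION support, so do not
            # send a transaction statement to the server at all.
            pass

        def commit(self):
            # Likewise, skip COMMIT: every dask-sql statement stands alone.
            pass

Only the transaction-related overrides are shown; a real adapter would still need to implement open, cancel, exception handling, and so on.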

SQL compatibility:

  1. It seems from the above links that adapters.sql must implement the following two functions; all other macros are optional:

    1. get_columns_in_relation (so far no issue with this query, though I am not sure whether it will work in real scenarios): describe table '{{ relation.identifier }}' # relation.identifier gets replaced with the table name

    2. list_relations_without_caching

      select table_catalog as database,
             table_name as name,
             table_schema as schema,
             case when table_type = 'BASE TABLE' then 'table'
                  when table_type = 'VIEW' then 'view'
                  else table_type
             end as table_type
      from {{ relation.schema }}.information_schema

    As a temporary fix, I created an information_schema table and registered it in the context (a sketch of this workaround follows after this list).

  2. As you mentioned, implementing different schema creation features is definitely good to have.

  3. As you mentioned above (create table), there are a lot of macros under materializations (like table/view/incremental tables/snapshots, etc., and the ML model probably also needs some form of special materialization), but these macros are optional too, I guess.
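
Here is a minimal sketch of the temporary information_schema workaround from point 1, assuming a plain dask-sql Context; the rows and the registered table name are illustrative:

    import pandas as pd
    from dask_sql import Context

    c = Context()

    # Hand-built metadata table, one row per table that dbt should see;
    # the columns mirror the query above.
    info_schema = pd.DataFrame(
        {
            "table_catalog": ["dask_sql"],
            "table_schema": ["main"],
            "table_name": ["my_table"],
            "table_type": ["BASE TABLE"],
        }
    )

    # Register it so the dbt-generated query has something to select from.
    c.create_table("information_schema", info_schema)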

Mainly points 1 and 2 hindered the execution of the dbt-compiled SQL commands in dask-sql (even after commenting out the transaction logic, etc.). In a lot of places we can surely try to substitute most of the Presto functions/commands with macros and write a suitable ConnectionManager class in connections.py (but I have not tried this fully yet), and I have not been able to make this combination work as of now.

I will post an update here when I have more information!

Please let me know if any information provided above is wrong.

Thanks

marckeelingiv commented 2 years ago

Really looking forward to this integration!

srggrs commented 1 year ago

re-posting my question from https://github.com/dbt-labs/dbt-core/issues/1860

Hi guys, since dbt adopted pure Python models, do you think there is room for a pure Python implementation for Dask and dbt?
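
For reference, a dbt Python model is just a model(dbt, session) function that returns a dataframe, so a Dask-backed version might look like the sketch below. The idea that dbt.ref() hands back a dask.dataframe.DataFrame is my assumption; no such adapter exists yet, and the model and column names are made up:

    # models/daily_amount.py in a dbt project
    def model(dbt, session):
        # With a hypothetical Dask-backed adapter, dbt.ref() could plausibly
        # return a lazy dask.dataframe.DataFrame for the upstream model.
        orders = dbt.ref("orders")

        # A pure-Python, distributed transformation.
        daily = orders.groupby("order_date").amount.sum().reset_index()
        return daily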