dask-contrib / dask-sql

Distributed SQL Engine in Python using Dask
https://dask-sql.readthedocs.io/
MIT License

Integration of DBT and dask_sql #200

Open rajagurunath opened 3 years ago

rajagurunath commented 3 years ago

dbt (data build tool) enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications. Refer to the GitHub repo and the product website.

So far, dbt is well adopted and works in modern data warehouses and frameworks (SQL engines) like Spark SQL and Presto. Through custom adapters, any database engine can implement or extend the adapters to use all the features of dbt. Refer to the "Building a new adapter" guidelines <- I am following this now.

I feel integrating dbt and dask-sql opens up a lot of potential.

Since dask-sql already speaks the Presto protocol, I tried using the dbt-presto adapter but was not able to integrate it completely. The adapter needs to implement some macros responsible for generating SQL that is compliant with the target database/SQL engine, like here. These macros implement information schema and other database-related functionality, I believe, which does not work with dask-sql. I will investigate those issues more. (It is a very early stage to decide anything on my side, like whether to implement a new adapter for dask-sql or to make use of the Presto adapter somehow.)
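
For context on the "already speaks the Presto protocol" point: dask-sql ships a Presto-compatible server, so a Presto client (such as the prestodb library that dbt-presto uses underneath) can talk to it directly. A minimal sketch, assuming the run_server entry point described in the dask-sql docs and its default host/port:

    from dask_sql import Context, run_server

    c = Context()
    # Register tables on the context here, then expose them over the
    # Presto wire protocol (HTTP, port 8080 by default).
    run_server(context=c)

    # On the client side, the same presto-python-client that dbt-presto
    # uses could then connect, e.g.:
    #   import prestodb
    #   conn = prestodb.dbapi.connect(host="localhost", port=8080, user="dbt")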

(This feels similar to https://github.com/nils-braun/dask-sql/issues/61, but a lot of other things are involved here, like SQL compatibility, and not only the Presto protocol (FastAPI server-side) issue; that's why I created this separately.)

I am very new to dbt, but I will try to integrate these two frameworks! I will request your help if I get stuck anywhere 😊

What do you think about this integration? Will it be helpful? And what are your thoughts on a new adapter for dask-sql?

nils-braun commented 3 years ago

Hi @rajagurunath, I really like this idea. I am also not an expert in dbt, but I have seen it popping up in many discussions around data engineering in recent months - so it should definitely be on our agenda. Having had a quick look into the dbt documentation, I have seen a few things worth discussing.

Maybe, if you have already tested the macros with dbt-presto, you can summarize your findings here? That would be cool!

nils-braun commented 3 years ago

And last but not least, I would be very interested in the opinion of @goodwanghan, because I think fugue-sql and dbt share some common goals. Han, have you worked with dbt already? How does it relate to fugue-sql?

rajagurunath commented 3 years ago

I have not completed my exploration of dbt + Presto (dask_sql), but here are a few findings so far (apologies if something is wrong):

DB Connection, TRANSACTION, and Execution:

  1. dbt-presto ultimately hands the query over to the prestodb presto-client library, where each execution is treated as a transaction (every query starts with START TRANSACTION, for example), so we may need to implement the BEGIN, ROLLBACK, and COMMIT keywords as well. (At least I saw that BEGIN and START TRANSACTION are needed while running dbt run and dbt test, respectively.) dask_sql does not support those commands as of now, I guess. We can control this behavior to some extent in connections.py of the dbt adapter (currently I commented out/removed those lines); a sketch follows below.
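
To make that last point concrete, here is a minimal sketch of how a hypothetical dbt adapter for dask-sql could neutralize the transaction handling, following the dbt-core SQLConnectionManager convention (the DaskSQLConnectionManager name and the no-op choices are my assumptions, not an existing adapter):

    from dbt.adapters.sql import SQLConnectionManager

    class DaskSQLConnectionManager(SQLConnectionManager):  # hypothetical
        TYPE = "dask_sql"

        def begin(self):
            # dask-sql has no BEGIN / START TRANSACTION support, so do not
            # send a transaction statement to the server at all.
            pass

        def commit(self):
            # Likewise, skip COMMIT: every dask-sql statement stands alone.
            pass

Only the transaction-related overrides are shown; a real adapter would still need to implement open, cancel, exception handling, and so on.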

SQL compatibility:

  1. It seems from the above links that adapters.sql must implement the following two functions; all other macros are optional:

    1. get_columns_in_relation (so far no issue with this query, though I am not sure whether it will work in real scenarios): describe table '{{ relation.identifier }}' # relation.identifier gets replaced with the table name

    2. list_relations_without_caching

      select table_catalog as database,
             table_name as name,
             table_schema as schema,
             case when table_type = 'BASE TABLE' then 'table'
                  when table_type = 'VIEW' then 'view'
                  else table_type
             end as table_type
      from {{ relation.schema }}.information_schema

    As a temporary fix, I created an information_schema table and registered it in the context (a sketch of this workaround follows after this list).

  2. As you mentioned, implementing different schema creation features is definitely good to have.

  3. As you mentioned above (create table), there are a lot of macros under materializations (like table/view/incremental tables/snapshots, etc., and the ML model probably also needs some form of special materialization), but these macros are optional too, I guess.
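
Here is a minimal sketch of the temporary information_schema workaround from point 1, assuming a plain dask-sql Context; the rows and the registered table name are illustrative:

    import pandas as pd
    from dask_sql import Context

    c = Context()

    # Hand-built metadata table, one row per table that dbt should see;
    # the columns mirror the query above.
    info_schema = pd.DataFrame(
        {
            "table_catalog": ["dask_sql"],
            "table_schema": ["main"],
            "table_name": ["my_table"],
            "table_type": ["BASE TABLE"],
        }
    )

    # Register it so the dbt-generated query has something to select from.
    c.create_table("information_schema", info_schema)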

Mainly points 1 and 2 hindered the execution of the dbt-compiled SQL commands in dask-sql (even after commenting out the transaction logic, etc.). In a lot of places we can surely try to substitute most of the Presto functions/commands with macros and write a suitable ConnectionManager class in connections.py (but I have not tried this fully yet), and I have not been able to make this combination work as of now.

I will post an update here when I have more information!

Please let me know if any information provided above is wrong.

Thanks

marckeelingiv commented 2 years ago

Really looking forward to this integration!

srggrs commented 1 year ago

re-posting my question from https://github.com/dbt-labs/dbt-core/issues/1860

Hi guys, since dbt adopted pure Python models, do you think there is room for a pure Python implementation for Dask and dbt?
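
For reference, a dbt Python model is just a model(dbt, session) function that returns a dataframe, so a Dask-backed version might look like the sketch below. The idea that dbt.ref() hands back a dask.dataframe.DataFrame is my assumption; no such adapter exists yet, and the model and column names are made up:

    # models/daily_amount.py in a dbt project
    def model(dbt, session):
        # With a hypothetical Dask-backed adapter, dbt.ref() could plausibly
        # return a lazy dask.dataframe.DataFrame for the upstream model.
        orders = dbt.ref("orders")

        # A pure-Python, distributed transformation.
        daily = orders.groupby("order_date").amount.sum().reset_index()
        return daily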