rajagurunath opened this issue 3 years ago
Hi @rajagurunath
I really like this idea. I am also not an expert in dbt, but I have seen it popping up in many discussions around data engineering in the last months - so it should definitely be on our agenda. Having a quick look into the dbt documentation, I noticed the following things:
One is create schema (currently there is just a single schema, but that is something we should fix in the long run anyway, see https://github.com/nils-braun/dask-sql/issues/203) and the other is create table: dask-sql does not have a robust persistent metadata store where we actually create the tables. All we do is add a reference to an in-memory dictionary. I have no idea if that is enough for dbt (because it will be gone whenever the dask-sql server is restarted). Maybe, if you have already tested the macros with dbt-presto, you can summarize your findings here? That would be cool!
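The concern above about dask-sql's in-memory dictionary can be sketched in a few lines. This is a hypothetical stand-in (not dask-sql's actual Context class) just to illustrate why a purely in-memory catalog is fragile for dbt, whose materializations assume relations survive between runs:

```python
# Hypothetical stand-in for dask-sql's Context, which keeps registered
# tables in a plain in-memory dict rather than a persistent metadata store.

class InMemoryCatalog:
    def __init__(self):
        self.tables = {}

    def create_table(self, name, data):
        # Only a reference is stored; nothing is written to disk.
        self.tables[name] = data

    def table_exists(self, name):
        return name in self.tables

catalog = InMemoryCatalog()
catalog.create_table("orders", [{"id": 1}])
assert catalog.table_exists("orders")

# "Restarting" the server means constructing a fresh catalog: every
# previously created relation is gone, which would break dbt features
# such as incremental models that rely on finding existing tables.
catalog = InMemoryCatalog()
assert not catalog.table_exists("orders")
```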
And last but not least I would be very interested in the opinion of @goodwanghan - because I think fugue-sql and dbt share some common goals. Han, have you worked with dbt already? How does it relate to fugue-sql?
I have not yet completed my exploration of dbt + presto (dask_sql), but here are a few findings so far (apologies if anything is wrong):
1. DB Connection, TRANSACTION, and Execution:
2. SQL compatibility:
It seems from the above links that adapters.sql must implement the following two functions; all other macros are optional:
get_columns_in_relation (so far no issue with this query; not sure whether it will work in real scenarios)
describe table '{{ relation.identifier }}'
-- relation.identifier gets replaced with the table name
list_relations_without_caching
select table_catalog as database,
table_name as name,
table_schema as schema,
case when table_type = 'BASE TABLE' then 'table'
when table_type = 'VIEW' then 'view'
else table_type
end as table_type
from {{ relation.schema }}.information_schema
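The query above maps information_schema rows into the relation shape dbt consumes (database, name, schema, table_type). As a sanity check, the same CASE mapping can be expressed in plain Python; the column names come from the query, while the sample rows and function names below are illustrative, not dbt internals:

```python
# Mirror of the CASE WHEN expression from the list_relations_without_caching
# query above. The sample rows are hypothetical.

def relation_type(table_type):
    """Translate information_schema table_type into dbt's relation type."""
    mapping = {"BASE TABLE": "table", "VIEW": "view"}
    return mapping.get(table_type, table_type)

information_schema_rows = [
    {"table_catalog": "dask_sql", "table_name": "orders",
     "table_schema": "root", "table_type": "BASE TABLE"},
    {"table_catalog": "dask_sql", "table_name": "daily_orders",
     "table_schema": "root", "table_type": "VIEW"},
]

relations = [
    {"database": row["table_catalog"], "name": row["table_name"],
     "schema": row["table_schema"],
     "table_type": relation_type(row["table_type"])}
    for row in information_schema_rows
]
assert relations[0]["table_type"] == "table"
assert relations[1]["table_type"] == "view"
```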
I temporarily created the information_schema table and registered it in the context as a temporary fix.
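The workaround above (hand-building an information_schema table and registering it in the context) can be sketched in pure Python. The helper below is hypothetical and works on a plain dict of registered tables, standing in for what dask-sql's Context tracks internally; in practice the resulting rows would be turned into a dataframe and registered via the context:

```python
# Hypothetical helper: derive information_schema-style rows from a dict of
# registered tables. Catalog and schema names are illustrative defaults.

def build_information_schema(registered_tables, catalog="dask_sql", schema="root"):
    return [
        {"table_catalog": catalog, "table_schema": schema,
         "table_name": name, "table_type": "BASE TABLE"}
        for name in registered_tables
    ]

rows = build_information_schema({"orders": object(), "customers": object()})
assert {r["table_name"] for r in rows} == {"orders", "customers"}
```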
As you mentioned, implementing different schema creation features is definitely good to have.
As you mentioned above (create table), there are a lot of macros under materializations (like table/view/incremental tables/snapshots, etc., and the ML model also needs some form of special materialization, I guess), but these macros appear to be optional as well.
Mainly points 1 & 2 hindered the execution of dbt-compiled SQL commands in dask-sql (even after commenting out the transaction logic, etc.). In a lot of places we can surely try to substitute most of the Presto functions/commands with macros and write a suitable ConnectionManager class in connections.py, but I have not tried this fully yet and have not been able to make this combination work as of now.
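To make the ConnectionManager idea concrete, here is a minimal sketch, under the assumption that dask-sql has no transactions, so BEGIN/COMMIT can simply be no-ops (which is what commenting out dbt's transaction logic effectively does). The class name and methods are illustrative; a real adapter would subclass dbt's SQLConnectionManager and actually send SQL to the dask-sql Presto endpoint:

```python
# Hypothetical skeleton, not real dbt internals: shows where the
# transaction no-ops would live in a dask-sql ConnectionManager.

class DaskSQLConnectionManagerSketch:
    TYPE = "dask_sql"

    def __init__(self):
        self.executed = []  # records statements instead of running them

    def begin(self):
        # dask-sql has no transactions; treat BEGIN as a no-op.
        pass

    def commit(self):
        # Likewise, COMMIT is a no-op.
        pass

    def execute(self, sql):
        # A real implementation would submit this over the Presto
        # protocol to the dask-sql server; here we just record it.
        self.executed.append(sql)
        return "OK"

mgr = DaskSQLConnectionManagerSketch()
mgr.begin()
mgr.execute("select 1")
mgr.commit()
assert mgr.executed == ["select 1"]
```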
I will update here when I have more information!
Please let me know if any of the information above is wrong.
Thanks
Really looking forward to this integration!
re-posting my question from https://github.com/dbt-labs/dbt-core/issues/1860
Hi guys, since dbt adopted pure Python models, do you think there is room for a pure Python implementation for Dask and dbt?
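For context, dbt's Python models are plain functions of the form model(dbt, session) that return a dataframe. Below is a hedged sketch of what such a model targeting Dask might look like; the Dask wiring does not exist in this thread, so a stub dbt object and list-of-dict "frames" stand in just to show the shape:

```python
# Sketch of a dbt-style Python model. In a hypothetical dbt-dask adapter,
# dbt.ref(...) would return a Dask dataframe; here it returns plain rows.

def model(dbt, session):
    orders = dbt.ref("orders")  # upstream model as a "dataframe"
    return [row for row in orders if row["amount"] > 0]

class StubDbt:
    """Minimal stand-in for the `dbt` object a Python model receives."""
    def ref(self, name):
        return [{"id": 1, "amount": 10}, {"id": 2, "amount": -5}]

result = model(StubDbt(), session=None)
assert result == [{"id": 1, "amount": 10}]
```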
dbt (data build tool) enables data analysts and engineers to transform their data using the same practices that software engineers use to build applications (refer: GitHub and the product website).
So far, dbt is well adopted and works in modern data warehouses and frameworks (SQL engines) like Spark SQL and Presto through custom adapters; any database engine can implement or extend the adapters to use all features of dbt (refer: the "Building a new adapter" guidelines <- I am following this now).
I feel integrating dbt and dask-sql opens up a lot of potential because:
dask-sql allows adding custom functions without performance overhead (something JVM-based frameworks like dbt-spark are still trying to catch up with)
it integrates seamlessly with other databases (e.g. Hive), filesystems, etc., so people can use multiple databases or filesystems in the same dbt session (not sure how people are doing this now)
it supports implementing machine learning models in plain SQL.
Since dask-sql already speaks the Presto protocol, I tried using the dbt-presto adapter but was not able to integrate it completely. The adapter needs to implement some macros that are responsible for generating SQL compliant with the target database/SQL engine, like here. These macros implement information-schema and other database-related functionality, I believe, which is not working with dask-sql. I will investigate those issues further. (It is very early to decide anything from my side, like whether to implement a new adapter for dask-sql or somehow make use of the presto adapter.)
(This feels similar to https://github.com/nils-braun/dask-sql/issues/61, but a lot of other things are involved here, like SQL compatibility etc., not only the Presto protocol (FastAPI server-side) issue; that's why I created a separate issue.)
I am very new to dbt, but I will try to integrate these two frameworks! Requesting your help if I get stuck anywhere 😊
What do you think about this integration? Will it be helpful? And what are your thoughts on a new adapter for dask-sql?