Closed mvdwielen closed 1 year ago
Thanks for reaching out @mvdwielen !
The dbt-spark adapter has a per-model config property named `location_root`
that specifies the directory in which a model's data is stored.
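For reference, `location_root` is set per model; a minimal sketch (the mount path and upstream model name below are illustrative):

```sql
{{
  config(
    materialized = 'table',
    location_root = '/mnt/adls/analytics'  -- illustrative mount path
  )
}}

select * from {{ ref('stg_orders') }}  -- hypothetical upstream model
```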
Here's an article by a Microsoft Solutions Architect that gives a demo of how to use Azure Data Lake Services (ADLS) + Databricks + dbt:
Wanna give it a look and see if the location_root
config yields the benefits you are hoping for?
Thanks @dbeatty10, I am aware of that article, but it doesn't cover the challenge I am facing. `location_root` adds a LOCATION clause to the CREATE TABLE statement, making the table an unmanaged table: when the table is dropped, only the metastore entry is removed and the physical data is not deleted from the ADLS account.
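Schematically (all names and paths below are placeholders), the difference is:

```sql
-- With location_root, dbt emits a table with an explicit LOCATION,
-- i.e. an unmanaged (external) table: DROP TABLE removes only the
-- metastore entry and leaves the files in ADLS.
CREATE TABLE my_schema.my_model
USING DELTA
LOCATION 'abfss://data@myaccount.dfs.core.windows.net/my_model'
AS SELECT ...;

-- Requested instead: a location on the schema itself, so tables
-- created in it without their own LOCATION remain managed and their
-- files are deleted when they are dropped.
CREATE SCHEMA my_schema
LOCATION 'abfss://data@myaccount.dfs.core.windows.net/my_schema';
```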
What I would like to accomplish is to provide a location to the CREATE DATABASE / SCHEMA statement that dbt executes on Databricks. Providing the location at the database / schema level ensures that:
Is this something that could be considered as a potential feature?
Thanks for that clarification @mvdwielen 👍
To be clear, are you asking for MANAGED LOCATION to be added to the create schema DDL, or just LOCATION (or both)?
i.e., which of these two options are you proposing?
```sql
CREATE SCHEMA IF NOT EXISTS {{ schema_name }}
MANAGED LOCATION {{ location_path }}
```

```sql
CREATE SCHEMA IF NOT EXISTS {{ schema_name }}
LOCATION {{ schema_directory }}
```
Thanks @dbeatty10 for your understanding; it would be great if both were supported.
MANAGED LOCATION applies specifically when Unity Catalog is enabled in Databricks, which is not always the case.
LOCATION can be used regardless of whether Unity Catalog is enabled.
Personally I require the latter capability, but it would be great if both could be supported.
I'm thinking MANAGED LOCATION
should be in dbt-databricks then (and I can feel you share the feeling @mvdwielen ;))
Here I'm wondering whether external tables should be a separate materialization rather than an option on tables.
I haven't used Spark or Databricks enough recently to know if you usually see managed tables primarily used as sources (first upstream stage) or throughout a DAG.
@ueshin do you have plans / ideas around this?
cc @andrefurlan-db
@Fleid I am not sure I understand what you mean, but I would expect a config option to be added to the `dbt_project.yml` file where you can define something like:

```yaml
models:
  +schema: schema_name
  +location: path_to_location
```

The value of `location` would then be used in the create_schema macro, which would execute the SQL:

```sql
CREATE SCHEMA schema_name
LOCATION path_to_location
```
I will use table, view and incremental materializations throughout the DAG, and Databricks will store the physical data of these materializations at the location specified on the schema in which these database objects are created.
Hope this helps
Hey @mvdwielen, thanks for the additional info.
I need to take the time to think about this. My initial reaction is to push back against anything that decorates schemas, as schemas are not a thing in dbt, yet. But I want to stay open. This stays on my to-do list, but I need some time to digest.
This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.
Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.
Is this your first time submitting a feature request?
Describe the feature
The spark__create_schema(relation) macro should provide the option to supply a location, so that the schema (and the tables within it) can be stored at an external location such as an ADLS storage account. This way the schema and the tables within it remain managed, so their physical data is deleted as well when the schema and/or tables are dropped.
Describe alternatives you've considered
Overriding the spark__create_schema(relation) macro to implement this logic within my own dbt project.
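As a sketch, such an override might look like the following. This is only illustrative: the `schema_location` var is a hypothetical way to supply the path, not an existing dbt-spark option, and the macro body is a simplified version of the adapter's create-schema statement.

```sql
{% macro spark__create_schema(relation) -%}
  {#- Hypothetical: read a schema-level location from a project var. -#}
  {%- set schema_location = var('schema_location', none) -%}
  {%- call statement('create_schema') -%}
    create schema if not exists {{ relation }}
    {%- if schema_location is not none %}
    location '{{ schema_location }}'
    {%- endif %}
  {%- endcall -%}
{% endmacro %}
```

Placing this macro in the project's `macros/` directory would shadow the adapter's built-in implementation via dbt's dispatch mechanism.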
Who will this benefit?
All dbt / databricks / spark users who want to create managed tables stored on an external location instead of the default local storage location of the warehouse.
Are you interested in contributing this feature?
yes
Anything else?
No response