dbt-labs / dbt-spark

dbt-spark contains all of the code enabling dbt to work with Apache Spark and Databricks
https://getdbt.com
Apache License 2.0

[ADAP-344] macro create schema should support a location #663

Closed mvdwielen closed 1 year ago

mvdwielen commented 1 year ago

Is this your first time submitting a feature request?

Describe the feature

The spark__create_schema(relation) macro should offer the option to provide a location, so that the data of the schema (and of the tables within the schema) can be stored in an external location such as an ADLS storage account. This way the schema and its tables remain managed, meaning the physical data of the schema and tables is deleted as well when the schema and/or tables are dropped.

Describe alternatives you've considered

Overriding the macro spark__create_schema(relation) to implement this logic within my own dbt project
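
A minimal sketch of what such an override could look like, placed in the project's macros/ directory (the schema_location var is a placeholder of my own, not an existing dbt-spark config):

{% macro spark__create_schema(relation) -%}
    {#- Hypothetical: read the target location from a project var. -#}
    {%- set location = var('schema_location', none) -%}
    {%- call statement('create_schema') -%}
        create schema if not exists {{ relation }}
        {%- if location is not none %}
        location '{{ location }}'
        {%- endif %}
    {%- endcall -%}
{% endmacro %}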

Who will this benefit?

All dbt / Databricks / Spark users who want to create managed tables stored in an external location instead of the default storage location of the warehouse.

Are you interested in contributing this feature?

yes

Anything else?

No response

dbeatty10 commented 1 year ago

Thanks for reaching out @mvdwielen !

The dbt-spark adapter has a config property (on a per-model basis) named location_root that specifies the directory in which to store a model's data.
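
For example, in a model file (the ADLS path below is just a placeholder):

{{ config(
    materialized='table',
    location_root='abfss://container@myaccount.dfs.core.windows.net/dbt'
) }}

select 1 as id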

Here's an article by a Microsoft Solutions Architect that demonstrates how to use Azure Data Lake Storage (ADLS) + Databricks + dbt:

Wanna give it a look and see if the location_root config yields the benefits you are hoping for?

mvdwielen commented 1 year ago

Thanks @dbeatty10, I am aware of that article, but it doesn't cover the challenge I am facing. location_root adds a LOCATION to the create table statement, making the table an unmanaged (external) table, which means the physical data won't be deleted from the ADLS account when the table is dropped; the table will only be removed from the metastore.
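
To illustrate the difference (paths below are placeholders):

-- External (unmanaged) table, which is what a per-table LOCATION
-- (i.e. location_root) produces:
CREATE TABLE my_schema.my_table
    USING DELTA
    LOCATION 'abfss://data@myaccount.dfs.core.windows.net/my_table';

-- Removes only the metastore entry; the files stay in ADLS:
DROP TABLE my_schema.my_table;

-- Managed table (no per-table LOCATION); its data lives under the
-- schema's location, and DROP TABLE deletes the files as well:
CREATE TABLE my_schema.my_managed_table (id INT);
DROP TABLE my_schema.my_managed_table;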

What I would like to accomplish is to provide a location to the create database / schema statement which dbt executes on Databricks. Providing the location at the database / schema level ensures that tables created within the schema remain managed, with their physical data stored at that location and deleted when the schema and/or tables are dropped.

Is this something which could be considered as a potential feature?

dbeatty10 commented 1 year ago

Thanks for that clarification @mvdwielen 👍

To be clear, are you asking for MANAGED LOCATION to be added to the create schema DDL or just LOCATION (or both)?

i.e., which of these two options are you proposing?

Option 1:

CREATE SCHEMA IF NOT EXISTS {{ schema_name }}
    MANAGED LOCATION '{{ location_path }}'

Option 2:

CREATE SCHEMA IF NOT EXISTS {{ schema_name }}
    LOCATION '{{ schema_directory }}'

mvdwielen commented 1 year ago

Thanks for your understanding @dbeatty10; it would be great if both were supported.

'MANAGED LOCATION' applies specifically when Unity Catalog is enabled in Databricks, which is not always the case.

'LOCATION' can be used regardless of whether Unity Catalog is enabled.

Personally I need the latter capability, but it would be great if both could be supported.

Fleid commented 1 year ago

I'm thinking MANAGED LOCATION should be in dbt-databricks then (and I can feel you share the feeling @mvdwielen ;))

Here I'm wondering whether external managed tables should be another materialization, rather than an option on tables. I haven't used Spark or Databricks enough recently to know whether managed tables are primarily used as sources (the first upstream stage) or throughout a DAG.

@ueshin do you have plans / ideas around this?

ueshin commented 1 year ago

cc @andrefurlan-db

mvdwielen commented 1 year ago

@Fleid I am not sure I understand what you mean, but I would expect a config option to be added to the dbt_project.yml file where you can define something like:

models:
  +schema: schema_name
  location: path_to_location

The value of location should then be used in the create_schema macro, which would then execute the SQL:

CREATE SCHEMA schema_name LOCATION 'path_to_location'

https://learn.microsoft.com/en-us/azure/databricks/sql/language-manual/sql-ref-syntax-ddl-create-schema
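
Per that syntax reference, with a made-up ADLS path the statement would look like:

CREATE SCHEMA IF NOT EXISTS my_schema
    LOCATION 'abfss://data@myaccount.dfs.core.windows.net/schemas/my_schema'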

I will use table, view and incremental materializations throughout the DAG, and Databricks will store the physical data of these materializations at the location specified for the schema in which these database objects are created.

Hope this helps

Fleid commented 1 year ago

Hey @mvdwielen, thanks for the additional info.

I need to take the time to think about this. My initial reaction is to push back against anything that decorates schemas, as schemas are not a first-class concept in dbt yet. But I want to stay open. This stays on my to-do list, but I need some time to digest.

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.