dlt-hub / dlt

data load tool (dlt) is an open source Python library that makes data loading easy 🛠️
https://dlthub.com/docs
Apache License 2.0

Support for Databricks external locations beyond staging #1979

Open phillem15 opened 1 month ago

phillem15 commented 1 month ago

Feature description

It would be great to allow the use of external locations beyond the typical staging layer. Right now, external locations are supported for staging, but when the tables are created in Databricks Unity Catalog, all of the source's tables end up as managed tables. In a lakehouse architecture we would like the flexibility to use external locations for various use cases. This could be set at the source or resource level.

Are you a dlt user?

Yes, I run dlt in production.

Use case

In a hypothetical medallion architecture, we would have landing, bronze, silver, and gold layers. The staging layer could be set up to use an external location, specifically the landing layer. When data moves from staging to the final destination, I want the destination table to use the bronze layer. Here is an example flow:

1 - Staging: abfss://<container>@<storageaccount>.dfs.core.windows.net/landing/<source>/<table>, which in Unity Catalog corresponds to <catalog>.<source_staging>.<table>

2 - Destination (not possible as of now): abfss://<container>@<storageaccount>.dfs.core.windows.net/bronze/<source>/<table>, which in Unity Catalog corresponds to <catalog>.<source>.<table>

Proposed solution

To start, it would be great to see this at the pipeline level. Here is what is currently in the documentation:

pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='databricks',
    staging=dlt.destinations.filesystem('abfss://<container>@<storageaccount>.dfs.core.windows.net/<path>'), # add this to activate the staging location
    dataset_name='player_data'
)
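
For context, a minimal sketch of a run against that documented setup (the inline rows and the players table name are just placeholders standing in for a real source): files go through the abfss staging path, and the tables that end up in Unity Catalog are managed.

import dlt

# same pipeline as in the documented snippet above
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='databricks',
    staging=dlt.destinations.filesystem('abfss://<container>@<storageaccount>.dfs.core.windows.net/<path>'),
    dataset_name='player_data'
)

# placeholder rows standing in for a real source; the files are written to the
# abfss staging location first, then loaded into Databricks, where the
# resulting Unity Catalog tables are created as managed tables
info = pipeline.run([{'player': 'magnus', 'rating': 2830}], table_name='players')
print(info)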

But ideally we would also want the ability to specify an external location for the destination data:

pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='databricks',
    staging=dlt.destinations.filesystem('abfss://<container>@<storageaccount>.dfs.core.windows.net/<path>'), # add this to activate the staging location
    external_loc=dlt.destinations.filesystem('abfss://<container>@<storageaccount>.dfs.core.windows.net/<different_path>'), # proposed: external location for the final destination tables
    dataset_name='player_data'
)

Related issues

No response

rudolfix commented 4 weeks ago

@phillem15 we plan to support external tables, starting with Snowflake and Databricks. The idea for the feature is to add a location hint to resources: if it is set to external, the staging filesystem location will be used as the external location for that table.
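
As a rough sketch of what such a resource-level hint could look like (the location hint name and the external value are assumptions for illustration, not an existing dlt API):

import dlt

# an ordinary dlt resource; write_disposition is a real hint that works today
@dlt.resource(write_disposition='append')
def players():
    yield {'id': 1, 'name': 'magnus'}

# apply_hints() exists today for hints such as table_name or write_disposition;
# the commented line below shows the planned location hint and is not available
# in any released dlt version yet
# players.apply_hints(location='external')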

My question is: why do you need an external_location separately from the staging_location in the pipeline? Do you want to load both internal and external tables in the same pipeline run, and do you need separate buckets to hold them?

phillem15 commented 4 weeks ago

@rudolfix technically in my example we are using a single bucket, but I have different paths for my staging vs. raw vs. clean data. In this case, the path for the staging data would be <bucket_url>/landing/, but I want the raw data to go to <bucket_url>/bronze/. I see your point: we could technically reuse the staging filesystem configuration, but the ability to change the path is what I'm after.
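
To make this concrete, here is a minimal sketch of the setup I have in mind, with the bucket and path names as placeholders; only the staging half is possible today:

import dlt

bucket = 'abfss://<container>@<storageaccount>.dfs.core.windows.net'

pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='databricks',
    # supported today: staged files land under the /landing/ prefix of the bucket
    staging=dlt.destinations.filesystem(f'{bucket}/landing'),
    dataset_name='player_data'
)

# the ask: have the final Unity Catalog tables created as external tables under
# f'{bucket}/bronze' instead of as managed tables; there is currently no
# pipeline or resource level option for this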