Open phillem15 opened 1 month ago
@phillem15 we plan to support external tables starting from snowflake and databricks. the idea for the feature is to add location
hint to resources that if set to external
will set the location of staging
filesystem as external location for that table.
my question is why do you need external_location
separately from staging_location
in the pipeline? you want to load both internal and external tables via the same pipeline run? and you need separate buckets to hold them?
@rudolfix technically in my example, we are using a single bucket, but I have different paths for my staging vs. raw vs. clean data. So in this case, the path for the staging data would be <bucket_url>/landing/
but I want the raw data to go to <bucket_url>/bronze/
. I see your point, we technically could re-use the staging filesystem configuration, but the ability to change the path is what I'm after.
Feature description
It would be great to allow for use of external locations outside of the typical staging layer. Right now, external locations are supported for staging, but upon table creation in Databricks Unity Catalog, all the source tables are "Managed". In a lakehouse architecture we would like the flexibility to use external locations for various use cases. This could be set at the source or resource level.
Are you a dlt user?
Yes, I run dlt in production.
Use case
In a hypothetical medallion architecture, we would have a landing, bronze, silver, and gold layer. We could have the staging layer set up to use external locations, specifically the landing layer. When we move from staging to the final destination, I want the destination table to use the bronze layer. Here is an example flow:
1 - Staging
abfss://<container>.<storageaccount>.dfs.core.windows.net/landing/<source>/<table>
in Unity catalog this would correlate to<catalog>.<source_staging>.<table>
2 - Destination (not possible as of now)
abfss://<container>.<storageaccount>.dfs.core.windows.net/bronze/<source>/<table>
in Unity catalog this would correlate to<catalog>.<source>.<table>
Proposed solution
To start it would be great to see this at the pipeline level. Here is what is currently in the documentation:
But ideally we would also want the ability to specify an external location for the destination data:
Related issues
No response