"mart-erialized" external tables?

dataders commented 3 years ago

Describe the feature

I feel weird even saying this out loud, but after hearing this this package is @jtcohen6's "house in the country", I realized my schnapsidee just might belong after all. So here goes.

What if in the external config dict, there was a materialized key that defaults to false, but if true would stage the external table temporarily only to call the following, then drop the external table?

SELECT * INTO source_table from temp_external_table

In this way, ext_full_refresh would drop the materialized table and reload it, something we'd do on a daily basis before calling dbt run in our production pipeline.

The benefit of extending (read: perverting) the elegance of stage_external_sources is that it ensures that downstream tables can be run and queried quickly while still keeping a link to the external source that can be called again when needed.

However, brain is split because:

we have a real use case for this, I think (more below); but
it feels heretical to cast dbt into the realms of Extract and Load; and
these aren't external tables even anymore, right?!?

Describe alternatives you've considered

We're also considering either:

snapshooting (sp?) our external table sources and build our whole DAG of these source snapshots,
having another macro to call after stage_external_sources that could accomplish the same via something like: rename, select * into.

Additional context

Both Azure Synapse and Azure SQL support the ideal of external tables in the more familiar sense of flat files in blob or s3 (though unfortunately, with different syntax... ). But, Azure SQL also supports cross database queries (a.k.a. "vertically partitioned elastic queries" a.k.a. external data sources of type RDBMS) which are basically the Azure equivalent of the classic SQL Server idea of "linked servers". Enter the idea of using dbt to build data marts.

We're thinking a lot about the idea of marts that are downstream of our core data warehouse for the following reasons:

we can't throw all our consumers at our DW, because the lowest tier of Synapse only offers a maximum of four concurrent queries,
using marts allows us analytics engineers to be agile without stepping on toes of the DW team that is still working to migrate our 10-year-old data platform to the cloud,
traditionally we've always had domain-specific BI teams that own marts for their domain-specific reporting,
our data isn't very large (only a few tables with 10M+ rows),
our data velocity is daily at the fastest, and
some of these marts can probably be ephemeral and need only exists for a few days each month.

Who will this benefit?

This would obvs benefit our team, and also Azure stack folks. But it looks like Redshift also has a concept of cross-database queries?

And perhaps anyone else with a handful of the bullet points above?

Are you interested in contributing this feature?

@chaerinlee1 and I have toying around this the idea with #73 but it's still a work in progress. cc: @nandanhegde15

github-actions[bot] commented 1 year ago

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

github-actions[bot] commented 1 year ago

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

dbt-labs / dbt-external-tables