Closed: awoehrl closed this issue 9 months ago
Thanks @awoehrl! I agree with the need for this functionality - not just for the benefits in BigQuery, but for general data development across arbitrary warehouses. For example, if a developer's work requires adding two new activities and building a dataset derived from those two activities plus a third, existing activity, the ideal dev workflow should leverage dbt's defer functionality: build the new activities, and reference the production version of the third activity when building the dataset. The current implementation requires the full stream to be built before the dataset can be built in that workflow.
Supporting this functionality would also enable environment-specific builds - where, using the same example workflow from above, datasets could query the full stream in production, but query individual activity tables in development.
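As a rough sketch of what an environment-specific build could look like inside a dataset model (hypothetical - the relation names `activity_stream` and `bought_item` are made up, and dbt's built-in `target` variable is used for the environment check):

```jinja
{% if target.name == 'prod' %}
    {# in production, query the full stream filtered to the activity #}
    select * from {{ ref('activity_stream') }} where activity = 'bought_item'
{% else %}
    {# in development, query the individual activity table directly #}
    select * from {{ ref('bought_item') }}
{% endif %}
```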
This implementation will require a handful of updates and additions to the package.
One open question I have is around how to indicate if the model is the full stream or an individual activity, as that has an impact on which cluster/partition keys to use. I'm currently leaning towards adding a check to see if the model name is registered as a stream in the project - if yes, assume the model is a stream; if no, assume the model is an individual activity. The code implementation will be verbose, but it preserves the current API. The package assumes that the model name of the stream exists in the project variables, so this assumption seems reasonable. I'm just not sure if the cluster keys macro would be used anywhere other than in an activity or stream model, which would break this assumption.
I think it makes a lot of sense to use the model names from registered streams for the check!
Yeah I think the logic would look something like the following:
```jinja
{% if model.name in var("dbt_aql").get("streams").keys() %}
    {# return activity stream cluster/partition columns #}
{% else %}
    {# return single table cluster/partition columns #}
{% endif %}
```
I'll take a stab at this in the next few days!
Scoping out the specifics for implementing this support:
- [x] if the model name matches that of a stream registered in the `dbt_project.yml`, return the appropriate keys for partitioning a stream
- [x] if the model name does not match any stream registered in the `dbt_project.yml`, return the appropriate keys for partitioning a single activity
- [x] the macro interface shouldn't change from its current state, and ideally shouldn't change when used in an activity vs. a stream model
One challenge is selecting the correct `customer` column alias, which depends on the stream. The stream is defined in the config for each activity model, but the macro gets rendered before the config is available in the environment, so the config can't be used. One alternative is to explicitly pass the stream as an input parameter to the macro. However, that would violate the requirement that the macro interface not change between the activity and stream models.
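For reference, the explicit-parameter alternative would look roughly like this (hypothetical macro name and signature - the point is that every call site would need the stream name threaded through, which is what breaks the interface requirement):

```jinja
{# hypothetical signature: the caller passes the stream name explicitly #}
{% macro cluster_keys(stream=none) %}
    {% if stream is not none %}
        {# look up the customer column alias for this stream from project vars #}
    {% else %}
        {# fall back to the model-name check against registered streams #}
    {% endif %}
{% endmacro %}
```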
- `skips_stream` is defined at the stream level - i.e. each stream in a project can have a different skip strategy
- `skips_stream` config parameter
- `dataset` macro

Hey @awoehrl - I'm merging an implementation of this feature in a moment (see #43), then I'm going to bundle it up with your feature and release version 0.2.0 of the package. I'll let you know once that's done - feel free to pull down the branch from that PR to give it a try if you're interested!
At least for BigQuery, it's better to join directly on the individual activity tables rather than joining on an activity stream filtered to the activity. This should lower query cost and improve performance.
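To illustrate the two query shapes being compared (hypothetical relation and column names, loosely following the activity schema spec):

```sql
-- querying the full stream, filtered to one activity: scans the
-- whole stream table unless its partition/cluster keys help
select activity_id, customer, ts, feature_json
from {{ ref('activity_stream') }}
where activity = 'bought_item'

-- querying the individual activity table directly: avoids scanning
-- the full stream and can use that table's own cluster/partition keys
select activity_id, customer, ts, feature_json
from {{ ref('bought_item') }}
```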