bcodell / dbt-activity-schema

A dbt package with a POC implementation of an interface to query activity streams that adhere to the Activity Schema 2.0 spec.
Apache License 2.0

feature - join on activities rather than activity stream (BigQuery) #31

Closed by awoehrl 9 months ago

awoehrl commented 9 months ago

At least for BigQuery, it's better to join directly on the individual activity tables rather than on an activity stream filtered to a single activity. This should lower query cost and improve performance.
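Roughly, the difference in query shape (dataset, table, and activity names here are illustrative):

```sql
-- current behavior: every activity is read from the full stream,
-- filtered to the relevant activity at query time
select s.customer, s.ts, a.ts as visited_page_at
from analytics.activity_stream as s
left join analytics.activity_stream as a
    on a.customer = s.customer
    and a.activity = 'visited_page'
where s.activity = 'signed_up';

-- proposed: join the per-activity tables directly, so BigQuery
-- only scans the bytes for the activities involved
select s.customer, s.ts, a.ts as visited_page_at
from analytics.activity__signed_up as s
left join analytics.activity__visited_page as a
    on a.customer = s.customer;
```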

bcodell commented 9 months ago

Thanks @awoehrl! I agree with the need for this functionality - not just for the benefits in BigQuery, but for data development across arbitrary warehouses generally. For example, suppose a developer's work requires adding two new activities and building a dataset derived from those two activities plus a third, existing activity. The ideal dev workflow would leverage dbt's defer functionality: build the new activities, then reference the production version of the third activity when building the dataset. The current implementation requires the full stream to be built before the dataset can be built in that workflow.
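With per-activity tables, that workflow could look roughly like this (model names are hypothetical; `--defer` and `--state` are standard dbt flags):

```bash
# build only the two new activity models and the dataset, deferring
# unbuilt references (like the existing third activity) to prod artifacts
dbt build --select activity__new_a activity__new_b my_dataset \
    --defer --state path/to/prod/artifacts
```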

Supporting this functionality would also enable environment-specific builds - where, using the same example workflow from above, datasets could query the full stream in production, but query individual activity tables in development.
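As a loose sketch of how the environment switch might work (the helper macro and its arguments are invented for illustration; `target.name` and `ref` are standard dbt):

```jinja
{# hypothetical helper: choose the relation a dataset joins against #}
{% macro join_source(activity_model, stream_model) %}
    {% if target.name == 'prod' %}
        {{ return(ref(stream_model)) }}     {# full stream in production #}
    {% else %}
        {{ return(ref(activity_model)) }}   {# per-activity table in dev #}
    {% endif %}
{% endmacro %}
```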

This implementation will require a number of updates and additions - I'll scope out the specifics in a follow-up comment below.

One open question: how should a model indicate whether it's the full stream or an individual activity? That distinction determines which cluster/partition keys to use. I'm currently leaning towards checking whether the model name is registered as a stream in the project - if yes, assume the model is a stream; if no, assume it's an individual activity. The implementation will be verbose, but it preserves the current API. Since the package already assumes that the stream's model name exists in the project variables, this seems like a reasonable approach. I'm just not sure whether the cluster keys macro would ever be used outside an activity or stream model, which would break the assumption.
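For context, that check would key off the package's project-variable registration, which presumably looks something like this (the stream name and any nesting beyond `streams` are assumptions):

```yaml
# dbt_project.yml - shape inferred from var("dbt_aql").get("streams")
vars:
  dbt_aql:
    streams:
      customer_stream:
        # stream-level configuration omitted
```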

awoehrl commented 9 months ago

I think it makes a lot of sense to use the model names from registered streams for the check!

bcodell commented 9 months ago

Yeah, I think the logic would look something like the following (sketched as dbt Jinja; the two branches are placeholders for whichever columns the macro should return):

```jinja
{% if model.name in var("dbt_aql").get("streams", {}).keys() %}
    {# registered as a stream: return the activity stream cluster/partition columns #}
{% else %}
    {# otherwise: return the single activity table cluster/partition columns #}
{% endif %}
```

I'll take a stab at this in the next few days!

bcodell commented 9 months ago

Scoping out the specifics for implementing this support:

General cluster/partition key support (also referenced in #36) - see the sketch at the end of this comment

Assumptions:

Requirements:

Open Questions:

Expanded config definitions (model- and project-level)

Assumptions:

Requirements:

Refactored dataset macro to support one table per activity

Assumptions:

Requirements:
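As a rough illustration of the cluster/partition key split above - BigQuery syntax, column names (`ts`, `activity`, `customer`) from the Activity Schema spec, everything else assumed. The two `config` calls below belong in two separate model files:

```sql
-- in a stream model: cluster on activity first, so per-activity
-- filters can prune blocks
{{ config(
    materialized='incremental',
    partition_by={'field': 'ts', 'data_type': 'timestamp'},
    cluster_by=['activity', 'customer']
) }}

-- in a single-activity model: the activity column is constant,
-- so it drops out of the cluster keys
{{ config(
    materialized='incremental',
    partition_by={'field': 'ts', 'data_type': 'timestamp'},
    cluster_by=['customer']
) }}
```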

bcodell commented 9 months ago

Hey @awoehrl - I'm merging an implementation of this feature in a moment (see #43), and then I'll bundle it up with your feature and release version 0.2.0 of the package. I'll let you know once that's done - feel free to pull down the branch from that PR to give it a try if you're interested!