bcodell / dbt-activity-schema

A dbt package with a POC implementation of an interface to query activity streams that adhere to the Activity Schema 2.0 spec.
Apache License 2.0

feature - join on activities rather than activity stream (BigQuery) #31

Closed by awoehrl 9 months ago

awoehrl commented 9 months ago

At least for BigQuery, it's better to join directly on the individual activity tables rather than on an activity stream filtered to a single activity. This should lower query cost and improve performance.
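Roughly, the difference in query shape (dataset, table, and activity names here are illustrative):

```sql
-- current behavior: every activity is read from the full stream,
-- filtered to the relevant activity at query time
select s.customer, s.ts, a.ts as visited_page_at
from analytics.activity_stream as s
left join analytics.activity_stream as a
    on a.customer = s.customer
    and a.activity = 'visited_page'
where s.activity = 'signed_up';

-- proposed: join the per-activity tables directly, so BigQuery
-- only scans the bytes for the activities involved
select s.customer, s.ts, a.ts as visited_page_at
from analytics.activity__signed_up as s
left join analytics.activity__visited_page as a
    on a.customer = s.customer;
```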

bcodell commented 9 months ago

Thanks @awoehrl! I agree with the need for this functionality - not just for the benefits in BigQuery, but for data development across arbitrary warehouses generally. For example, suppose a developer's work requires adding two new activities and building a dataset derived from those two activities plus a third, existing activity. The ideal dev workflow would leverage dbt's defer functionality: build the new activities, then reference the production version of the third activity when building the dataset. The current implementation requires the full stream to be built before the dataset can be built in that workflow.
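With per-activity tables, that workflow could look roughly like this (model names are hypothetical; `--defer` and `--state` are standard dbt flags):

```bash
# build only the two new activity models and the dataset, deferring
# unbuilt references (like the existing third activity) to prod artifacts
dbt build --select activity__new_a activity__new_b my_dataset \
    --defer --state path/to/prod/artifacts
```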

Supporting this functionality would also enable environment-specific builds - where, using the same example workflow from above, datasets could query the full stream in production, but query individual activity tables in development.
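As a loose sketch of how the environment switch might work (the helper macro and its arguments are invented for illustration; `target.name` and `ref` are standard dbt):

```jinja
{# hypothetical helper: choose the relation a dataset joins against #}
{% macro join_source(activity_model, stream_model) %}
    {% if target.name == 'prod' %}
        {{ return(ref(stream_model)) }}     {# full stream in production #}
    {% else %}
        {{ return(ref(activity_model)) }}   {# per-activity table in dev #}
    {% endif %}
{% endmacro %}
```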

This implementation will require a number of updates and additions - I'll scope out the specifics in a follow-up comment below.

One open question: how should a model indicate whether it's the full stream or an individual activity? That distinction determines which cluster/partition keys to use. I'm currently leaning towards checking whether the model name is registered as a stream in the project - if yes, assume the model is a stream; if no, assume it's an individual activity. The implementation will be verbose, but it preserves the current API. Since the package already assumes that the stream's model name exists in the project variables, this seems like a reasonable approach. I'm just not sure whether the cluster keys macro would ever be used outside an activity or stream model, which would break the assumption.
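For context, that check would key off the package's project-variable registration, which presumably looks something like this (the stream name and any nesting beyond `streams` are assumptions):

```yaml
# dbt_project.yml - shape inferred from var("dbt_aql").get("streams")
vars:
  dbt_aql:
    streams:
      customer_stream:
        # stream-level configuration omitted
```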

awoehrl commented 9 months ago

I think it makes a lot of sense to use the model names from registered streams for the check!

bcodell commented 9 months ago

Yeah, I think the logic would look something like the following (sketched as dbt Jinja; the two branches are placeholders for whichever columns the macro should return):

```jinja
{% if model.name in var("dbt_aql").get("streams", {}).keys() %}
    {# registered as a stream: return the activity stream cluster/partition columns #}
{% else %}
    {# otherwise: return the single activity table cluster/partition columns #}
{% endif %}
```

I'll take a stab at this in the next few days!

bcodell commented 9 months ago

Scoping out the specifics for implementing this support:

General cluster/partition key support (also referenced in #36) - see the sketch at the end of this comment

Assumptions:

Requirements:

Open Questions:

Expanded config definitions (model- and project-level)

Assumptions:

Requirements:

Refactored dataset macro to support one table per activity

Assumptions:

Requirements:
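As a rough illustration of the cluster/partition key split above - BigQuery syntax, column names (`ts`, `activity`, `customer`) from the Activity Schema spec, everything else assumed. The two `config` calls below belong in two separate model files:

```sql
-- in a stream model: cluster on activity first, so per-activity
-- filters can prune blocks
{{ config(
    materialized='incremental',
    partition_by={'field': 'ts', 'data_type': 'timestamp'},
    cluster_by=['activity', 'customer']
) }}

-- in a single-activity model: the activity column is constant,
-- so it drops out of the cluster keys
{{ config(
    materialized='incremental',
    partition_by={'field': 'ts', 'data_type': 'timestamp'},
    cluster_by=['customer']
) }}
```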

bcodell commented 9 months ago

Hey @awoehrl - I'm merging an implementation of this feature in a moment (see #43), and then I'll bundle it up with your feature and release version 0.2.0 of the package. I'll let you know once that's done - feel free to pull down the branch from that PR to give it a try if you're interested!