bcodell / dbt-activity-schema

A dbt package with a POC implementation of an interface to query activity streams that adhere to the Activity Schema 2.0 spec.
Apache License 2.0

feature - generate automatic tests for activities #35

Closed awoehrl closed 10 months ago

awoehrl commented 11 months ago

The activity_id should never be null and should always be unique. It would be great to have a test ready the moment a new activity table is created, so that the first dbt build run validates the activity_id. I'm not sure whether generating a test this way is possible via macros, though.

bcodell commented 11 months ago

Great suggestion! I've been thinking about the same topic, but I'm not aware of a way to define a test once and cascade it to many models. One approach I'm considering is to build dedicated materializations for activities and streams. This would provide a host of benefits: automated incremental handling, easier retrieval/identification from the dbt graph object, defining an activity once and building it many times for each stream that should include it, etc. It could also run unique/not null tests that fail before the table creation/update statement is committed to the database, so that corrupt data doesn't end up in the model. However, it would stretch the materialization concept beyond its intended use, and it would require manually keeping up with the functionality of dbt's table and incremental materializations as new dbt core versions are released, as I don't think materializations can be subclassed. That maintenance requirement is enough to prevent me from moving forward with this solution.

Open to alternative ideas if you have them!

awoehrl commented 10 months ago

A custom materialization would most probably work but could be a pain to maintain. I know Snowplow dbt models moved away from their custom materialization as soon as dbt extended their incremental merge strategy options.

I have tried to find more information on how to integrate this kind of generic test so that it is added to all activity models, but the only ways I see - apart from a custom materialization - are a yaml file declaration or some kind of post hook. I'm not sure how a post hook could output the result, though, and it wouldn't be part of the normal dbt workflow.

awoehrl commented 10 months ago

In the end, it would be best practice to have yaml files for the activities even if these are of limited use because of the repeated columns. So I guess adding the necessary tests could be an addition to the package documentation and that may solve it already.
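As a rough sketch of what such a documented yaml declaration might look like (the model name signed_up is hypothetical and only for illustration; the activity_id column and the unique/not null requirement are the only parts taken from this thread):

```yaml
version: 2

models:
  # Hypothetical activity model name, for illustration only.
  - name: signed_up
    columns:
      - name: activity_id
        tests:
          # activity_id must be unique and must never be null
          - unique
          - not_null
```

With a file like this placed alongside the model, dbt build would run both tests right after the model is built. Newer dbt versions also accept data_tests in place of the tests key.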

bcodell commented 10 months ago

All good points. I think in the near term, the best bet may be to build a code generation macro called generate_activity_yml that produces model yaml with tests defined. Then users could call it as a run-operation:

dbt run-operation generate_activity_yml --args '{activity_name: activity_name}'

From there it's pretty straightforward to pipe the command line output into a file. I'll reserve the conversation around custom materializations for a separate issue as it's a bigger discussion.

Does this proposal seem like a reasonable near-term solution?