elementary-data / elementary

The dbt-native data observability solution for data & analytics engineers. Monitor your data pipelines in minutes. Available as self-hosted or cloud service with premium features.
https://www.elementary-data.com/
Apache License 2.0

Understanding how elementary resolves dbt models with CI jobs #1474

Open wusanny opened 5 months ago

wusanny commented 5 months ago

Describe the bug
We are uncertain whether this is a bug or expected behaviour; the current behaviour is confusing, so we are raising this issue for clarification.

We have observed that the first time we ran a CI job after installing the elementary dbt package, the elementary models were built into a temporary PR schema instead of their own custom elementary schema as specified in dbt_project.yml.

For context, in dbt Cloud, CI jobs materialize models in a temporary schema unique to the PR, which is dropped once the PR is merged or closed (docs for reference here). We expect the Elementary models to still be written into the schema defined in dbt_project.yml; Elementary should override this and write its models into its own schema, NOT the temporary PR schema.
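For reference, a custom schema like this is typically configured in dbt_project.yml (a minimal sketch; the schema name elementary_new is taken from the logs below, and the exact config in the affected project may differ):

```yml
models:
  elementary:
    # Write all elementary models to a dedicated custom schema.
    # With dbt's default generate_schema_name macro, this resolves to
    # <target_schema>_elementary_new in each environment.
    +schema: elementary_new
```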

After that PR has been merged and a production job has run, Elementary models from all subsequent CI jobs are written into the expected schema.

To Reproduce

Prerequisite:

Steps to reproduce the behavior:

  1. In dbt Cloud
  2. Install elementary dbt package, following the instructions here
  3. In dbt Cloud's IDE > Commit & sync > Create pull-request
  4. This will kick off the CI job created in step 1
  5. After the run is complete > open up the run page > click on the last step 'Invoke dbt build --select state:modified+' > Debug Logs > Download full debug logs
  6. Search for any of the elementary models, eg, dbt_invocations and we can see that it is built into the temporary PR schema - create or replace table DEVELOPMENT.dbt_cloud_pr_537847_8_elementary_new.dbt_invocations - instead of DEVELOPMENT.dbt_sanny_elementary_new.dbt_invocations.
  7. Merge the PR and run the main production job
  8. Download the debug logs for the main job and we can see that elementary models are built into the correct elementary schema
  9. Create a new pull request that will trigger another CI job
  10. Download the debug logs for that new CI job > elementary models are now built into the correct elementary schema
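The schema names seen in step 6 are consistent with dbt's built-in generate_schema_name macro, which appends a model's custom schema to the target schema. The macro itself is Jinja; the following is an illustrative Python translation of that resolution logic, not code from either project:

```python
from typing import Optional


def generate_schema_name(custom_schema_name: Optional[str], target_schema: str) -> str:
    """Mimic dbt's default generate_schema_name macro: use the target
    schema alone when no custom schema is configured, otherwise
    concatenate <target_schema>_<custom_schema>."""
    if custom_schema_name is None:
        return target_schema
    return f"{target_schema}_{custom_schema_name.strip()}"


# Production run: the target schema is the usual dev/prod schema.
print(generate_schema_name("elementary_new", "dbt_sanny"))
# -> dbt_sanny_elementary_new

# CI run: dbt Cloud swaps in the temporary PR schema as the target schema,
# so elementary models land in the PR-prefixed schema seen in the logs.
print(generate_schema_name("elementary_new", "dbt_cloud_pr_537847_8"))
# -> dbt_cloud_pr_537847_8_elementary_new
```

This matches both schema names observed in step 6, which suggests elementary's models are resolved by the default macro rather than being pinned to a fixed schema.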

Expected behavior
All elementary models should be built into the custom elementary schema, regardless of whether it is the initial CI run or not. Note that prior to the merge of that first PR, any metadata inserted into the temporary PR schema disappears when the PR is merged and the temporary PR schema is dropped (the default behaviour for dbt's CI jobs). The client's data is then lost and cannot be recovered.


Additional context
Debug logs for reference:
1. Debug log run 261771270 (1st CI run).log
2. Debug log run 261772372 (after merge, main prod run).log
3. Debug log run 261772795 (2nd CI run).log

This is the observed behaviour across multiple tests:

haritamar commented 2 months ago

Hi @wusanny ! Thanks for opening this detailed issue and apologies for the large delay here.

As of this time we don't have official support for dbt's CI jobs feature. Generally speaking, it makes a lot of sense for the Elementary schema to persist, but I see some challenges with doing it correctly (for example, different branches may use different Elementary package versions with conflicting table schemas). It would likely need further research.

I'll tag this as an enhancement; it definitely makes sense to reconsider and adapt the behavior here.

Something I'd definitely be open to, though, is adding a flag that forces the elementary schema to be fixed on CI runs. If you'd like to contribute this, that would be great (and we can later consider whether that flag should become the default).
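One possible shape for such a flag is a generate_schema_name override in the user's project. This is purely a sketch, not part of the elementary package: the var names force_elementary_schema_on_ci and elementary_fixed_schema, and the dbt_cloud_pr_ prefix check, are all assumptions for illustration.

```sql
{% macro generate_schema_name(custom_schema_name, node) -%}
    {%- set default_schema = target.schema -%}
    {#- Hypothetical flag: pin elementary models to a fixed schema on CI runs,
        where dbt Cloud sets the target schema to a dbt_cloud_pr_* name. -#}
    {%- if var('force_elementary_schema_on_ci', false)
          and node.package_name == 'elementary'
          and default_schema.startswith('dbt_cloud_pr_') -%}
        {{ var('elementary_fixed_schema') }}
    {%- elif custom_schema_name is none -%}
        {{ default_schema }}
    {%- else -%}
        {{ default_schema }}_{{ custom_schema_name | trim }}
    {%- endif -%}
{%- endmacro %}
```

The elif/else branches reproduce dbt's default behaviour so non-elementary models are unaffected; a built-in version of this flag inside the package would presumably follow the same pattern.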