dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

support DBT source tests #18641

Open johannkm opened 11 months ago

johannkm commented 11 months ago

Discussed in https://github.com/dagster-io/dagster/discussions/18629

Originally posted by **geoHeil** December 11, 2023

How should asset checks be handled for source objects in DBT? In dagster they only show up as a grey box, and the tests do not seem to be imported like for normal DBT models.

johannkm commented 11 months ago

Related to https://github.com/dagster-io/dagster/issues/18424

jakobkolb commented 8 months ago

We are currently looking into using dagster-dbt for data quality monitoring in a data mesh setting. Being able to run tests on dbt source tables as asset checks in dagster would be really nice!

milicevica23 commented 3 months ago

Hi @johannkm, I would like to take this over and follow up on this topic. Do you know if somebody has already done some work or investigation here?

dpeng817 commented 3 months ago

@milicevica23 johann and I have both done some initial investigation work here.

The approach will require making some changes to the underlying dagster-dbt decorator, but it's certainly tenable. All of the pieces/abstractions are in place. We aren't actively working on this project, but if you want to put up a PR I would certainly review it!

milicevica23 commented 3 months ago

Sure, I would like to do so. Let me write down my thoughts on this in the next few days and discuss the implementation here. Until then, I will onboard myself a bit on the dagster project.

milicevica23 commented 3 months ago

Hi @dpeng817,

As I understand it, there are two main use cases:

Situation 1: Using a dbt source as an external table, e.g. in BigQuery, or through a CSV-reading function in DuckDB. What matters here is that dagster is not responsible for the state of the source asset, so there is no asset responsible for bringing in or creating it.

Situation 2: Using a dbt source for a table that has already been created and filled in the DWH, e.g. by a COPY INTO statement or a Spark job, and then using this state for further dbt models. What matters here is that dagster manages this asset, and the asset is not part of the dbt project itself (our use case).

Observation 1: In both scenarios tests make sense, because we want to check what data comes into the DWH and we want to leverage dbt tests for this job. We also want some sort of overview, and to be able to stop processing before the data goes into further processing.

Observation 2: Creating source assets from the dbt manifest makes sense only in the first situation, because we need an object to which we can assign the asset keys. In the second situation, we just want to map the asset keys between the non-dbt asset and the dbt source asset checks (defined as tests on the dbt source definition).

What I did till now: As I said earlier, we are mainly interested in the second situation. I made a very naive working version locally but stumbled on some problems and questions along the way.

As I understand it, we iterate here through all the selected nodes (models, seeds, snapshots), and we create the asset checks for each selected one here. So we don't pick up the source tests, because they are attached to an object which is not an asset in the end.

So for our situation 2 it was enough to add the same asset check creation code for all parents that are source assets. This code should go somewhere here, where we iterate through the parent IDs and ask whether the parent_id is a dbt source resource.
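A minimal sketch of that parent walk, written against a plain manifest dictionary rather than the actual dagster-dbt internals. It assumes the usual `manifest.json` layout (`nodes`, `parent_map`, and test nodes listing their dependencies under `depends_on.nodes`); the function name `source_tests_for_selected` and the trimmed sample manifest are hypothetical.

```python
from typing import Dict, List, Set


def source_tests_for_selected(manifest: dict, selected_ids: List[str]) -> Dict[str, List[str]]:
    """Map each dbt source that is a parent of a selected node to the tests defined on it."""
    # Collect every parent of the selected nodes that is a dbt source.
    source_parents: Set[str] = set()
    for unique_id in selected_ids:
        for parent_id in manifest["parent_map"].get(unique_id, []):
            if parent_id.startswith("source."):
                source_parents.add(parent_id)

    # Attach each test node to the source parents it depends on.
    tests_by_source: Dict[str, List[str]] = {sid: [] for sid in sorted(source_parents)}
    for node_id, node in manifest["nodes"].items():
        if node.get("resource_type") != "test":
            continue
        for dep_id in node.get("depends_on", {}).get("nodes", []):
            if dep_id in tests_by_source:
                tests_by_source[dep_id].append(node_id)
    return tests_by_source


# Hypothetical, heavily trimmed manifest for illustration only.
manifest = {
    "parent_map": {"model.demo.stg_orders": ["source.demo.raw.orders"]},
    "nodes": {
        "model.demo.stg_orders": {"resource_type": "model"},
        "test.demo.source_not_null_raw_orders_id": {
            "resource_type": "test",
            "depends_on": {"nodes": ["source.demo.raw.orders"]},
        },
        "test.demo.not_null_stg_orders_id": {
            "resource_type": "test",
            "depends_on": {"nodes": ["model.demo.stg_orders"]},
        },
    },
}

result = source_tests_for_selected(manifest, ["model.demo.stg_orders"])
print(result)
```

Only the test that depends on the source is returned; the test on the model itself is left to the existing model-level check creation.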

The first challenge with this approach is that we create asset checks with asset keys which are not part of the multi-asset being created, and the validation check for the multi-asset throws an error here. Is there a way to relax this assumption, and would you consider that an acceptable relaxation?

The second potential problem/bug is that if two different asset decorators generate the same source asset checks, it will break dagster because there will be a duplicate key. With a single decorator there is no problem, because they are deduplicated here.
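The collision can be illustrated without dagster. This is only a sketch, assuming each check is identified by the pair (asset key, check name), which simplifies how dagster keys its checks; `merge_check_ids` is a hypothetical helper and not existing dagster code.

```python
from typing import Iterable, List, Tuple

# (asset key, check name); a simplified stand-in for a real check key.
CheckId = Tuple[str, str]


def merge_check_ids(*groups: Iterable[CheckId]) -> List[CheckId]:
    """Merge check identifiers from several decorators, keeping the first of each duplicate."""
    seen = set()
    merged: List[CheckId] = []
    for group in groups:
        for check_id in group:
            if check_id not in seen:
                seen.add(check_id)
                merged.append(check_id)
    return merged


# Two hypothetical decorator selections that share the same source parent.
decorator_a = [("raw/orders", "not_null_id"), ("stg_orders", "unique_id")]
decorator_b = [("raw/orders", "not_null_id"), ("stg_customers", "unique_id")]

merged = merge_check_ids(decorator_a, decorator_b)
print(merged)
```

Without a cross-decorator step like this, the shared source check would be defined twice, which is the duplicate key described above.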

Nevertheless, I am very interested in the first situation, too, and would like to give it a try and implement it because I think in the dbt-duckdb scenario, it makes a lot of sense. I also think that these features can be hidden behind some dbt_translator flags and be enabled per use case

I will soon try to update the code and open a draft PR so that we can see this in practice. Happy to hear your findings and feedback!

Not related, but what I still have to figure out:

  1. Can we execute asset checks if they are connected to the source assets?
  2. What is the situation with partitions and asset checks?
  3. How is testing currently done, and how can I recreate those situations?

milicevica23 commented 3 months ago

Hi @dpeng817, @johannkm, here I added a minimal example of what would be enough for our use case (situation 2): https://github.com/milicevica23/dagster/tree/feat/add-dbt-source-asset-checks. As written above, the two expected problems with validation of asset keys are present.

You can comment on the pull request

dpeng817 commented 3 months ago

Hey @milicevica23, really awesome to see you digging into this problem!

Some comments regarding the approach as I see it: I'm not sure that situation 1 and 2 are materially different as far as dagster is concerned. DBT sources are always rendered by dagster as "upstreams" of the models which utilize them. I think the main problem is, we don't actually munge those upstream sources into a "tangible" source asset that we can use for creating dbt tests.

So we need to figure out the right way to munge an upstream source into a "tangible" source asset. But we don't want to do this as part of the @dbt_asset multi asset creation, because that's an actual executable asset.

I think the easiest path forward is to use the tooling here: https://docs.dagster.io/integrations/dbt/reference#defining-a-dbt-source-as-a-dagster-asset to force users to define a dbt source as a source asset; then provide some utility function which can create CheckSpec objects directly from the manifest, from a provided list of asset keys. This way, the user can create their own python-based asset check to run the dbt source tests on their provided asset keys.
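A rough sketch of what such a utility could look like. To stay runnable without dagster installed it returns a small stand-in dataclass; in a real implementation each record would presumably become a `dagster.AssetCheckSpec(name=..., asset=...)`. The function name `source_check_specs`, the `SourceCheck` dataclass, and the assumption that a source's asset key is `(source_name, table_name)` are illustrative and not part of dagster-dbt.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

AssetKeyPath = Tuple[str, ...]


@dataclass(frozen=True)
class SourceCheck:
    """Stand-in for dagster.AssetCheckSpec: a check name plus the asset it targets."""
    name: str
    asset_key: AssetKeyPath
    dbt_test_id: str


def source_check_specs(manifest: dict, asset_keys: Sequence[AssetKeyPath]) -> List[SourceCheck]:
    """Build one check record per dbt test attached to a source listed in asset_keys."""
    wanted = set(asset_keys)
    # Assumed key mapping: (source_name, table_name) for every source in the manifest.
    key_by_source = {
        source_id: (props["source_name"], props["name"])
        for source_id, props in manifest["sources"].items()
    }
    specs: List[SourceCheck] = []
    for node_id, node in manifest["nodes"].items():
        if node.get("resource_type") != "test":
            continue
        for dep_id in node.get("depends_on", {}).get("nodes", []):
            key = key_by_source.get(dep_id)
            if key in wanted:
                specs.append(SourceCheck(name=node["name"], asset_key=key, dbt_test_id=node_id))
    return specs


# Hypothetical, heavily trimmed manifest for illustration only.
manifest = {
    "sources": {
        "source.demo.raw.orders": {"source_name": "raw", "name": "orders"},
        "source.demo.raw.customers": {"source_name": "raw", "name": "customers"},
    },
    "nodes": {
        "test.demo.source_not_null_raw_orders_id": {
            "resource_type": "test",
            "name": "source_not_null_raw_orders_id",
            "depends_on": {"nodes": ["source.demo.raw.orders"]},
        },
        "test.demo.source_unique_raw_customers_id": {
            "resource_type": "test",
            "name": "source_unique_raw_customers_id",
            "depends_on": {"nodes": ["source.demo.raw.customers"]},
        },
    },
}

specs = source_check_specs(manifest, [("raw", "orders")])
print(specs)
```

A user-defined asset check could then run the matching dbt tests itself, for example by selecting the source with dbt's `source:` selector, and report the outcome against the returned asset key.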

Let me know if that all makes sense or if there are any questions about that approach.

Other answers to your questions:

Regarding execution of asset checks if they are connected

milicevica23 commented 3 months ago

@dpeng817 thank you for your time and answer. I see, so we would instead provide some utilities around already-created assets and extract the extra information for the source assets if needed. Let me take a look at how this can be implemented and iterate on this idea.

MinchoG commented 2 weeks ago

Hi, is there any update on this to support it with some doc examples?