dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0

Ability to have Auto-materialize wait until all in-progress parents have completed #16555

Open clintonmonk opened 1 year ago

clintonmonk commented 1 year ago

Dagster version

Dagster Cloud 1.4.13

What's the issue?

In our pipeline, we have assets that depend upon multiple source assets. These assets are configured to auto-materialize when any of their source assets are updated. We have an observation job that observes ~130 @observable_source_asset assets hourly.

We have seen instances where extra (duplicate) auto-materialization runs are created when multiple source assets are updated. We believe this is because the source assets are observed independently, a few seconds or minutes apart, which allows Dagster to trigger the downstream auto-materialization policy multiple times.

The duplicate runs incur unnecessary compute costs and require a data engineer to get involved (terminating the extra run while making sure the remaining run completes correctly).

What did you expect to happen?

We expect the downstream asset's auto-materialization policy to not kick in until all observations in the current run are complete. This would prevent multiple auto-materialization runs from being created.

How to reproduce?

Here is an example:

Given:
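A minimal sketch of this setup, keeping the asset names from this issue; the function bodies, data versions, and the hourly observation job wiring (not shown) are illustrative assumptions:

from dagster import (
    AutoMaterializePolicy,
    DataVersion,
    asset,
    observable_source_asset,
)

# Two observable source assets, observed hourly by an observation job (not shown).
@observable_source_asset
def S1() -> DataVersion:
    # Illustrative: report the data version published by the upstream provider.
    return DataVersion("s1-version")

@observable_source_asset
def S2() -> DataVersion:
    # Illustrative: report the data version published by the upstream provider.
    return DataVersion("s2-version")

# A1 depends on both source assets and auto-materializes eagerly whenever
# either parent is observed with a new data version.
@asset(deps=[S1, S2], auto_materialize_policy=AutoMaterializePolicy.eager())
def A1():
    ...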

Steps:

  1. S1 and S2 are updated at the same time (let's say they are produced by the same data provider).
  2. Observation job runs.
  3. S1 is observed first. It has a new data version, so Dagster triggers an auto-materialization run for A1.
  4. S2 is observed next (1-2 minutes later, in the same observation job). It has a new data version, so Dagster triggers another auto-materialization run for A1.
  5. A1 now has two concurrent auto-materialization runs.

In the situation above, A1 was materialized twice when it only needed to be materialized once. The second run is the preferred one, since it shows that the latest data version of A1 is based on the latest data versions of both S1 and S2.

Deployment type

Dagster Cloud

Deployment details

No response

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

OwenKephart commented 1 year ago

Hi @clintonmonk -- your diagnosis looks correct, and it sounds like the more general behavior of "it should be possible to have auto-materialize skip materializing an asset if one or more of its parents are still in progress" would work here. This is something we've considered in the past, and would fit nicely into the new AutoMaterializeRule system, i.e. something like:

my_policy = AutoMaterializePolicy.with_rules(AutoMaterializeRule.skip_on_parent_in_progress())

This gets a bit trickier specifically with observable source assets, due to some internal aspects of how we track in progress runs, but that general shape seems to be what would be most helpful here.
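For reference, this is roughly how such a rule would be attached once it exists. skip_on_parent_in_progress() is the hypothetical rule proposed above and is not part of the current API; the surrounding policy and asset wiring follow existing AutoMaterializePolicy/AutoMaterializeRule usage:

from dagster import AutoMaterializePolicy, AutoMaterializeRule, asset

# Hypothetical rule proposed in this issue; it does not exist in Dagster yet.
proposed_rule = AutoMaterializeRule.skip_on_parent_in_progress()

# Existing pattern: start from the eager policy and layer on an extra rule.
my_policy = AutoMaterializePolicy.eager().with_rules(proposed_rule)

@asset(deps=["S1", "S2"], auto_materialize_policy=my_policy)
def A1():
    ...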

clintonmonk commented 1 year ago

Thanks! I wasn't aware of the AutoMaterializeRule system yet, but after taking a look, it does look like a good framework for solving this problem.

Just to confirm I'm correctly hearing what you're suggesting, using my original example with source assets S1 and S2 and a downstream data asset A1 that depends on them:

  1. We add a new, yet-to-be-defined AutoMaterializeRule (e.g. AutoMaterializeRule.skip_on_parent_in_progress()) to the auto-materialization policy for asset A1.
  2. When the observation job runs, the yet-to-be-observed source assets S1 and S2 are marked as "in progress" somehow.
  3. S1 is observed first. Its status is moved from "in progress" to "observed".
  4. The auto-materialization policy does nothing, as there is another in-progress parent (S2).
  5. S2 is observed next. Its status is moved from "in progress" to "observed".
  6. Now that none of A1's parents are "in progress", it is auto-materialized.

If that's correct, I think that would work well for us!

OwenKephart commented 1 year ago

Your description of the proposed solution is completely accurate :)

I've removed the "bug" label as this is more of a feature request.

PascalStehling commented 3 months ago

+1