Open clintonmonk opened 1 year ago
Hi @clintonmonk -- your diagnosis looks correct, and it sounds like the more general behavior of "it should be possible to have auto-materialize skip materializing an asset if one or more of its parents are still in progress" would work here. This is something we've considered in the past, and would fit nicely into the new AutoMaterializeRule
system, i.e. something like:
my_policy = AutoMaterializePolicy.with_rules(AutoMaterializeRule.skip_on_parent_in_progress())
This gets a bit trickier specifically with observable source assets, due to some internal aspects of how we track in progress runs, but that general shape seems to be what would be most helpful here.
Thanks! I wasn't aware of the AutoMaterializeRule
system yet, but after taking a look, it does look like a good framework for solving this problem.
Just to confirm I'm correctly hearing what you're suggesting, using my original example with source assets S1 and S2 and a downstream data asset A1 that depends on them:
AutoMaterializeRule
rule (e.g. AutoMaterializeRule.skip_on_parent_in_progress()
) to the auto-materialization policy for asset A1.If that's correct, I think that would work well for us!
Your description of the proposed solution is completely accurate :)
I've removed the "bug" label as this is more of a feature request.
+1
Dagster version
Dagster Cloud 1.4.13
What's the issue?
In our pipeline, we have assets that depend upon multiple source assets. These assets are configured to auto-materialize when any of their source assets are updated. We have an observation job that observes ~130
@observable_source_asset
assets hourly.We have seen instances where extra (duplicate) auto-materialization runs are created when multiple source assets are updated. We believe this is because the source assets are observed independently a few seconds/minutes apart from each other, allowing Dagster to trigger the auto-materialization policy multiple times.
The duplicate runs incur unnecessary compute costs and require a data engineer to get involved (terminate the extra run while ensuring the other run is functioning correctly).
What did you expect to happen?
We expect the downstream asset's auto-materialization policy to not kick in until all observations in the current run are complete. This would prevent multiple auto-materialization runs from being created.
How to reproduce?
Here is an example:
Given:
@observable_source_asset
.dagster.define_asset_job
anddagster.load_assets_from_package_module
) that observes S1 and S2.Steps:
In the situation above, A1 was materialized twice when it really only needed to be materialized once. The second run is the preferred run, as that one will show that the latest data version of A1 is based off of the latest data versions of S1 and S2.
Deployment type
Dagster Cloud
Deployment details
No response
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.