Open clintonmonk opened 1 year ago
cc @clairelin135 this seems relevant to some of the stuff you've been working on recently
hey! is there any movement on this? i saw a reference to it in the documentation but it doesn't seem to have been implemented yet:
https://docs.dagster.io/_apidocs/assets#auto-materialize-and-freshness-policies
" AutoMaterializeRule.skip_on_all_parents_not_updated() "
It would be a neat way to avoid unnecessary scheduled materializations of assets.
@thomasXschneider skip_on_not_all_parents_updated is implemented. There's a usage example in these docs as well.
@garethbrickman Thanks, I'm aware of that. However, that's not the behavior being asked for here.
skip on all parents not updated: skip if none of the parents have been updated since the last materialization of the asset.
skip on not all parents updated: skip if any of the parents have not been updated since the last materialization of the asset.
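To make the difference concrete, here is a minimal sketch of the two conditions in plain Python. It does not use the Dagster API: the function names simply mirror the rule names above, and the `parent_updated` mapping (parent name to "updated since the asset's last materialization") is a hypothetical stand-in for whatever the rule evaluator actually tracks.

```python
from typing import Mapping


def skip_on_all_parents_not_updated(parent_updated: Mapping[str, bool]) -> bool:
    """Requested behavior: skip only when no parent has been updated."""
    return not any(parent_updated.values())


def skip_on_not_all_parents_updated(parent_updated: Mapping[str, bool]) -> bool:
    """Existing behavior as described above: skip when any parent has not been updated."""
    return not all(parent_updated.values())


# Hypothetical case: one of two parents updated since the last materialization.
partial = {"raw_a": True, "raw_b": False}
print(skip_on_all_parents_not_updated(partial))  # False: raw_a was updated
print(skip_on_not_all_parents_updated(partial))  # True: raw_b was not updated
```

The two conditions only disagree when some, but not all, parents have been updated, which is exactly the case this issue cares about.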
What's the use case?
Our data pipeline ingests data from multiple data sources. Each data source has its own DAG of assets. The first assets ("raw") represent the data from the data source. They are materialized on a schedule that is chosen based on how often the data source updates. Downstream assets in the DAG ("cleaned", etc) are auto-materialized when their parent assets are materialized. We define data versions and code versions (docs) for each of the assets.
We have found that, in these DAGs, we only want to auto-materialize when there is a data change (the data version of the parents has changed), not when there is a code change (the code version of the parents or current asset has changed). This is because we share code libraries in the downstream assets (e.g. a shared cleaning library). We manually materialize assets when a code change requires re-materialization. Otherwise, we wait for the next scheduled run.
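As a rough illustration of the decision described above, the sketch below materializes only when a parent's data version has changed and ignores code version changes. It is plain Python, not Dagster code: `Versions`, `should_auto_materialize`, and the version strings are hypothetical names made up for this example, and treating a previously unseen parent as changed is an assumption.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Versions:
    """Hypothetical snapshot of one parent asset's versions."""
    data_version: str
    code_version: str


def should_auto_materialize(
    seen: Mapping[str, Versions], current: Mapping[str, Versions]
) -> bool:
    """Return True only if some parent's data version differs from the one
    recorded at the asset's last materialization. Code versions are ignored."""
    for name, now in current.items():
        before = seen.get(name)
        # Assumption: a parent with no recorded snapshot counts as changed.
        if before is None or before.data_version != now.data_version:
            return True
    return False


seen = {"raw": Versions(data_version="d1", code_version="c1")}
code_only = {"raw": Versions(data_version="d1", code_version="c2")}
new_data = {"raw": Versions(data_version="d2", code_version="c2")}

print(should_auto_materialize(seen, code_only))  # False: only the code version changed
print(should_auto_materialize(seen, new_data))   # True: the data version changed
```

This only looks at the parents; a code version change on the asset itself would be handled by a manual materialization, as described above.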
We currently support this behavior inside the op (specifically, inside the @multi_asset) by using the OpExecutionContext to check if the parents' data versions have changed since the last time the assets were materialized. If they have not changed, the op is skipped (docs). The downside to this approach is that the op still runs, which affects the materialization history of the assets. It would be nice if we could configure the appropriate auto-materialization policy so that the op would not run at all.
Ideas of implementation
This could be supported with a new or updated AutoMaterializeRule (docs). Ideas:
- AutoMaterializeRule.skip_on_no_parent_data_update (name TBD). This rule would skip if none of the parents have new data versions since the last time the asset was materialized. Users could add this to their eager() policy.
- AutoMaterializeRule.materialize_on_parent_data_updated (name TBD). This rule would be nearly identical to the existing AutoMaterializeRule.materialize_on_parent_updated except that it would only materialize if the parents' data versions were updated. Users would replace materialize_on_parent_updated with this rule in their eager() policy.
- Update AutoMaterializeRule.materialize_on_parent_updated to accept an optional parameter for this use case. For example, AutoMaterializeRule.materialize_on_parent_updated(ignore_code_version_updates=True) would ignore code version updates so that materialization would only happen for data version updates. Users would replace materialize_on_parent_updated with this rule in their eager() policy.
Additional information
No response
Message from the maintainers
Impacted by this issue? Give it a 👍! We factor engagement into prioritization.