dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0
11.6k stars, 1.46k forks

Skip Materialization When Parent Data Versions Have Not Changed #16872

Open clintonmonk opened 1 year ago

clintonmonk commented 1 year ago

What's the use case?

Our data pipeline ingests data from multiple data sources. Each data source has its own DAG of assets. The first assets ("raw") represent the data from the data source. They are materialized on a schedule that is chosen based on how often the data source updates. Downstream assets in the DAG ("cleaned", etc) are auto-materialized when their parent assets are materialized. We define data versions and code versions (docs) for each of the assets.

We have found that, in these DAGs, we only want to auto-materialize when there is a data change (the data version of the parents has changed), not when there is a code change (the code version of the parents or current asset has changed). This is because we share code libraries in the downstream assets (e.g. a shared cleaning library). We manually materialize assets when a code change requires re-materialization. Otherwise, we wait for the next scheduled run.

We currently support this behavior inside the op (specifically, inside the @multi_asset) by using the OpExecutionContext to check whether the parents' data versions have changed since the last time the assets were materialized. If they have not changed, the op skips materialization (docs). The downside of this approach is that the op still runs, which affects the materialization history of the assets. It would be nice if we could configure an auto-materialization policy so that the op would not run at all.
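As a minimal sketch of that check (plain Python, not the actual op code from this pipeline): the function name `parent_data_changed` and the representation of data versions as a dict from parent asset key to version string are assumptions made for illustration. In a real op the versions would come from the execution context rather than from hand-written dicts.

```python
from typing import Mapping, Optional


def parent_data_changed(
    current: Mapping[str, str],
    at_last_materialization: Optional[Mapping[str, str]],
) -> bool:
    """Return True if any parent's data version differs from the one recorded
    when the downstream asset was last materialized.

    Both mappings go from parent asset key to data version string
    (a hypothetical representation chosen for this sketch).
    """
    # Never materialized before: there is nothing to compare against, so run.
    if at_last_materialization is None:
        return True
    # An added, removed, or changed parent version all count as a data change.
    return dict(current) != dict(at_last_materialization)


previous = {"raw_orders": "v1", "raw_customers": "v7"}
unchanged = {"raw_orders": "v1", "raw_customers": "v7"}
updated = {"raw_orders": "v2", "raw_customers": "v7"}

print(parent_data_changed(unchanged, previous))  # False: the op would skip
print(parent_data_changed(updated, previous))    # True: the op would run
```

Note that this comparison looks only at data versions, so a code version change on a parent never triggers a run here, which is the behavior requested in this issue.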

Ideas of implementation

This could be supported with a new or updated AutoMaterializeRule (docs).

Ideas:

  1. A new AutoMaterializeRule.skip_on_no_parent_data_update (name TBD). This rule would skip if none of the parents have new data versions since the last time the asset was materialized. Users could add this to their eager() policy.
  2. A new AutoMaterializeRule.materialize_on_parent_data_updated (name TBD). This rule would be nearly identical to the existing AutoMaterializeRule.materialize_on_parent_updated except that it would only materialize if the parents' data versions were updated. Users would replace materialize_on_parent_updated with this rule in their eager() policy.
  3. Update existing AutoMaterializeRule.materialize_on_parent_updated to accept an optional parameter to work with this use case. For example, AutoMaterializeRule.materialize_on_parent_updated(ignore_code_version_updates=True) would ignore code version updates so that materialization would only happen for data version updates. Users would replace materialize_on_parent_updated with this rule in their eager() policy.
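To make the intended semantics of idea 3 concrete, here is a toy model in plain Python. It is not Dagster's API: `ParentUpdate` and `should_materialize` are hypothetical names, and the assumption that a code-only change on a parent currently triggers materialization follows the behavior described in this issue.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ParentUpdate:
    """Hypothetical record of what changed on one parent since the last run."""

    asset_key: str
    data_version_changed: bool
    code_version_changed: bool


def should_materialize(
    updates: Iterable[ParentUpdate],
    ignore_code_version_updates: bool = False,
) -> bool:
    """Toy model of idea 3: optionally ignore updates that are code-only."""
    for update in updates:
        # A data version change always triggers materialization.
        if update.data_version_changed:
            return True
        # A code-only change triggers materialization unless it is ignored.
        if update.code_version_changed and not ignore_code_version_updates:
            return True
    return False


code_only = [ParentUpdate("cleaned_orders", False, True)]
new_data = [ParentUpdate("cleaned_orders", True, False)]

print(should_materialize(code_only))                                    # True
print(should_materialize(code_only, ignore_code_version_updates=True))  # False
print(should_materialize(new_data, ignore_code_version_updates=True))   # True
```

Idea 2 would behave like the `ignore_code_version_updates=True` case under a different rule name, and idea 1 would express the same condition as a skip rule instead of a materialize rule.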

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

jamiedemaria commented 1 year ago

cc @clairelin135 this seems relevant to some of the stuff you've been working on recently

thomasXschneider commented 4 months ago

Hey! Is there any movement on this? I saw a reference to it in the documentation, but it doesn't seem to have been implemented yet:

https://docs.dagster.io/_apidocs/assets#auto-materialize-and-freshness-policies

" AutoMaterializeRule.skip_on_all_parents_not_updated() "

It would be a neat way to avoid unnecessary scheduled materializations of assets.

garethbrickman commented 4 months ago

@thomasXschneider skip_on_not_all_parents_updated is implemented. There's a usage example in these docs as well.

thomasXschneider commented 3 months ago

@garethbrickman Thanks, I'm aware of that. However, that's not the behavior I'm looking for.

skip_on_all_parents_not_updated: skip if none of the parents have been updated since the last materialization of the asset.

skip_on_not_all_parents_updated: skip if any of the parents has not been updated since the last materialization of the asset.
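The difference between the two rules comes down to `any` versus `all`. The sketch below models each rule as a plain Python function over a mapping from parent asset key to "updated since the last materialization"; the functions and the example asset keys are illustrative assumptions, not Dagster's implementation.

```python
from typing import Mapping


def skip_on_all_parents_not_updated(parent_updated: Mapping[str, bool]) -> bool:
    """Requested behavior: skip only when no parent has been updated."""
    return not any(parent_updated.values())


def skip_on_not_all_parents_updated(parent_updated: Mapping[str, bool]) -> bool:
    """Existing rule as described above: skip when at least one parent is stale."""
    return not all(parent_updated.values())


# One of two parents has been updated since the last materialization.
one_of_two = {"raw_a": True, "raw_b": False}

print(skip_on_all_parents_not_updated(one_of_two))  # False: would materialize
print(skip_on_not_all_parents_updated(one_of_two))  # True: would skip
```

The two rules agree when every parent has been updated or when none has; they differ only in the mixed case shown here.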