dagster-io / dagster

An orchestration platform for the development, production, and observation of data assets.
https://dagster.io
Apache License 2.0
11.24k stars, 1.42k forks

Enhanced Failure Management: Mid-Run Asset Restart or Forking with State Preservation #15922

Open jadenhpark opened 1 year ago

jadenhpark commented 1 year ago

What's the use case?

We manage complex pipelines in which multiple threads of assets run in parallel; take threads A and B as an example. We run into trouble when an asset in thread A fails to materialize early in the run while thread B is still executing. With the current Dagster UI, our options are limited:

  1. Wait for thread B to complete: This isn't ideal as it defeats the purpose of having a parallelized DAG.
  2. Launch a separate run for thread A: This creates two distinct runs, which complicates management, especially when the threads merge downstream. The manual coordination required undermines the benefits of using a DAG.

Ideas of implementation

  1. Asset Restart During Run: Introduce a feature allowing users to restart a failed asset in the middle of a run. This ensures that if an asset in a parallel thread fails, it can be immediately addressed without waiting for other threads or initiating a new run.
  2. Fork & Preserve State: Provide a mechanism to fork a new run from an existing run while preserving the in-progress state of the other threads. This would let us handle a failure in one thread without disturbing or restarting the other threads.
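
To make idea 2 more concrete, here is a rough sketch in plain Python of how a forked run might decide which assets to execute. It does not use any Dagster API. The graph, the asset names (`a1`, `a2`, `b1`, `b2`, `merged`), the status strings, and the `assets_to_rerun` helper are all hypothetical and exist only for illustration. The assumption is that assets which succeeded or are still in progress in the parent run are left untouched, and everything else, plus anything downstream of it, is executed in the fork.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical toy graph: each asset maps to the set of its upstream assets.
# Thread A is a1 -> a2, thread B is b1 -> b2, and both merge into "merged".
GRAPH = {
    "a1": set(),
    "a2": {"a1"},
    "b1": set(),
    "b2": {"b1"},
    "merged": {"a2", "b2"},
}

# Assumption: these parent-run statuses count as "state to preserve".
PRESERVED = {"success", "in_progress"}


def assets_to_rerun(graph, statuses):
    """Return the assets a forked run would execute, in dependency order.

    An asset is preserved only if its parent-run status is in PRESERVED
    and none of its upstream assets is being re-executed.
    """
    rerun = []
    rerun_set = set()
    # static_order() yields every asset after all of its upstreams.
    for asset in TopologicalSorter(graph).static_order():
        preserved = statuses.get(asset) in PRESERVED
        upstream_rerun = any(dep in rerun_set for dep in graph[asset])
        if not preserved or upstream_rerun:
            rerun.append(asset)
            rerun_set.add(asset)
    return rerun


# a1 failed early; a2 and merged never started; thread B is still going.
parent_statuses = {"a1": "failure", "b1": "success", "b2": "in_progress"}
print(assets_to_rerun(GRAPH, parent_statuses))  # ['a1', 'a2', 'merged']
```

In this sketch the fork picks up thread A and the downstream merge while thread B keeps its state from the parent run. A real implementation would also have to wait for `b2` to finish before running `merged`, which the sketch leaves out.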

Additional information

No response

Message from the maintainers

Impacted by this issue? Give it a 👍! We factor engagement into prioritization.

smackesey commented 1 year ago

cc @alangenfeld

sryza commented 11 months ago

"This ensures that if an asset in a parallel thread fails, it can be immediately addressed without waiting for other threads or initiating a new run."

Hey @jadenhpark - what do you see as the downside of initiating a new run?