Enhanced Failure Management: Mid-Run Asset Restart or Forking with State Preservation

jadenhpark commented 1 year ago

What's the use case?

We manage complex pipelines in which multiple threads of assets run in parallel; take threads A and B as an example. We run into trouble when an asset in thread A fails to materialize early in the run. With the current Dagster UI, our options are limited:

Wait for thread B to complete: This isn't ideal as it defeats the purpose of having a parallelized DAG.
Launch a separate run for thread A: This creates two distinct runs, which complicates management, especially when the threads merge downstream. The manual coordination required undermines the intended benefits of using a DAG.

Ideas of implementation

Asset Restart During Run: Introduce a feature allowing users to restart a failed asset in the middle of a run. This ensures that if an asset in a parallel thread fails, it can be immediately addressed without waiting for other threads or initiating a new run.
Fork & Preserve State: Enable a mechanism to fork a new run from an existing run while preserving the ongoing state of other threads. This would allow us to manage failures in one thread without disturbing or restarting the entirety of other threads.
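
To make the "Fork & Preserve State" idea concrete, here is a minimal sketch in plain Python of how a forked run might decide what to re-execute. It is not Dagster's API: the `deps` and `status` dictionaries, the status strings, and the `steps_to_rerun` helper are all hypothetical and exist only for illustration. The assumption is that a fork re-executes each failed asset plus everything downstream of it, and leaves every other asset (succeeded or still running) untouched.

```python
from collections import deque


def steps_to_rerun(deps, status):
    """Return the failed steps plus every step downstream of them.

    deps:   hypothetical mapping of step -> list of upstream steps.
    status: hypothetical mapping of step -> "success", "running",
            "failed", or "pending".
    """
    # Invert the dependency map so we can walk from a step to its dependents.
    downstream = {step: [] for step in deps}
    for step, upstream in deps.items():
        for parent in upstream:
            downstream[parent].append(step)

    # Breadth-first walk starting from every failed step.
    rerun = set()
    queue = deque(step for step, state in status.items() if state == "failed")
    while queue:
        step = queue.popleft()
        if step in rerun:
            continue
        rerun.add(step)
        queue.extend(downstream[step])
    return rerun


# Two parallel threads (a1 -> a2 and b1 -> b2) that merge into `merged`.
deps = {"a1": [], "a2": ["a1"], "b1": [], "b2": ["b1"], "merged": ["a2", "b2"]}
status = {
    "a1": "failed",
    "a2": "pending",
    "b1": "success",
    "b2": "running",
    "merged": "pending",
}

rerun = steps_to_rerun(deps, status)
preserved = set(deps) - rerun  # state the fork would leave untouched

print(sorted(rerun))
print(sorted(preserved))
```

In this toy graph the fork would re-execute thread A and the merge point, while thread B keeps its completed and in-progress state. How a real implementation would hand the still-running `b2` over to the forked run is exactly the open question this request raises.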

Additional information

No response

smackesey commented 1 year ago

cc @alangenfeld

sryza commented 11 months ago

> This ensures that if an asset in a parallel thread fails, it can be immediately addressed without waiting for other threads or initiating a new run.

Hey @jadenhpark - what do you see as the downside of initiating a new run?
