[BUG] Schedules: fail fast when a reference workflow can't be read

Describe the bug

If an incident causes the offloaded workflow closure to be deleted from the blob store while an active launch plan schedule still references it, the scheduler loops endlessly trying to create an execution of the workflow. Each attempt to read the workflow at CreateExecution time fails with:

Failed to get workflow with id resource_type:WORKFLOW project:\"flytesnacks\" domain:\"development\" name:\"productionizing.lp_schedules.positive_wf\" version:\"v0.3.231\"  with err Unable to read WorkflowClosure from location

Expected behavior

Fail fast for unrecoverable schedule scenarios

Additional context to reproduce

No response

Screenshots

No response

Are you sure this issue hasn't been raised already?

[X] Yes

Have you read the Code of Conduct?

[X] Yes


Hi there @katrogan! I'm Dosu, a bot here to help you with Flyte related issues, answer your questions, and guide you if you'd like to contribute. While we're waiting for a human maintainer, feel free to ask me anything you need. I'm here to make your experience smoother!

Thank you for bringing this issue to our attention. As far as I can tell, Flyte currently handles a reference workflow that can't be read by transitioning the workflow to a failing or failed state. This happens in the handleRunningWorkflow and handleFailureNode functions of the workflowExecutor struct. If an error occurs while executing the workflow's start node or failure node, these functions return a failing status with an ExecutionError describing the error. The error is then propagated up to the HandleFlyteWorkflow function, which transitions the workflow to the failing or failed phase and records the event. You can find more details in the flytepropeller/pkg/controller/workflow/executor.go file.

However, to address the specific issue you're facing, where the scheduler endlessly retries creating the workflow when a reference workflow can't be read, we could potentially modify Flyte to fail fast. One possible approach is to use the NewFailFastLaunchPlanExecutor() function from the launchplan package, which creates a launch plan executor that fails fast when it encounters an error. This function is used in the TestWorkflowExecutor_HandleFlyteWorkflow_Failing and BenchmarkWorkflowExecutor functions to create such an executor. You can find more details in the flytepropeller/pkg/controller/workflow/executor_test.go file.

Regarding the Flyte scheduler, it handles errors differently depending on the type of error. For instance, if a schedule is active but its launch plan is not, the scheduler keeps calling admin at each future scheduled time, so the launch plan problem can be fixed in the meantime. If the execution for a scheduled time fails, the goroutine handling that time retries up to 30 times before aborting the run. You can find more details in the flyteadmin/scheduler/doc.go file.

I hope this information helps. If you have any further questions or need more clarification, please let me know.

