flyteorg / flyte

Scalable and flexible workflow orchestration platform that seamlessly unifies data, ML and analytics stacks.
https://flyte.org
Apache License 2.0

[Core feature] Handle job failure to fail faster and improve UX #5177

Open ysysys3074 opened 5 months ago

ysysys3074 commented 5 months ago

Motivation: Why do you think this is important?

For TF/PyTorch/MPI/Ray jobs, when they fail for common reasons, e.g. "InvalidImageName" or "ImagePullBackOff" (due to a bad image), the job stays queued for a long time. We currently only fetch the condition status from the Job custom resource, and these pod-level failures are not propagated to the Job's status right away, so the job cannot fail quickly.
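For context, a minimal sketch (not existing Flyte code; the helper name and the list of "fatal" reasons are assumptions for illustration) of where this information already lives in Kubernetes: the waiting reason is reported on the pod's container statuses even while the parent Job/CRD condition stays unchanged.

package podcheck

import (
	corev1 "k8s.io/api/core/v1"
)

// fatalImageReasons are container waiting reasons that will not recover on their own.
// Which reasons to treat as fatal is an assumption here, not Flyte policy.
var fatalImageReasons = map[string]bool{
	"InvalidImageName": true,
	"ImagePullBackOff": true,
	"ErrImagePull":     true,
}

// podHasFatalImageError returns the offending reason if any container of the pod
// is stuck in an unrecoverable image-related waiting state.
func podHasFatalImageError(pod *corev1.Pod) (string, bool) {
	statuses := make([]corev1.ContainerStatus, 0, len(pod.Status.InitContainerStatuses)+len(pod.Status.ContainerStatuses))
	statuses = append(statuses, pod.Status.InitContainerStatuses...)
	statuses = append(statuses, pod.Status.ContainerStatuses...)
	for _, cs := range statuses {
		if cs.State.Waiting != nil && fatalImageReasons[cs.State.Waiting.Reason] {
			return cs.State.Waiting.Reason, true
		}
	}
	return "", false
}

A check like this would have to live wherever the plugin (or propeller) can see the child pods, which is exactly the gap described below.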

Goal: What should the final outcome look like, ideally?

For common issues such as "InvalidImageName" or "ImagePullBackOff" (due to a bad image), the job should fail fast instead of queuing for a very long time. We can brainstorm how to resolve this at the Flyte layer (preferred) or at the training-operator layer.

Describe alternatives you've considered

Alternatively, we could surface Kubernetes events better in the UI so users have an idea of what is going on under the hood; if a pod runs into one of these issues, we could find a way to fail the job fast based on those events.
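A rough sketch of the event-surfacing idea, assuming client-go is available on the backend; listPodEvents is a hypothetical helper, not an existing Flyte function:

package podevents

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// listPodEvents fetches the events recorded for a pod, which is where messages like
// "Failed to pull image ..." show up, so they could be exposed to the UI or used to
// decide that the job is doomed.
func listPodEvents(ctx context.Context, clientset kubernetes.Interface, namespace, podName string) error {
	events, err := clientset.CoreV1().Events(namespace).List(ctx, metav1.ListOptions{
		FieldSelector: fmt.Sprintf("involvedObject.name=%s,involvedObject.kind=Pod", podName),
	})
	if err != nil {
		return err
	}
	for _, e := range events.Items {
		// e.Type is "Normal" or "Warning"; Warning events carry the pull failures.
		fmt.Printf("%s %s: %s\n", e.Type, e.Reason, e.Message)
	}
	return nil
}

Surfacing these events would still require both a backend change (to expose them) and a frontend change (to render them), as noted further down in the thread.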

Propose: Link/Inline OR Additional context

No response

Are you sure this issue hasn't been raised already?

Have you read the Code of Conduct?

dosubot[bot] commented 5 months ago

Hey @ysysys3074, good to see you back in the realm of workflow orchestration challenges! Hope this message finds you well.

Thank you for bringing up this issue. I agree that improving the handling of job failures due to common issues like "invalidImageName" or "imagePullBackOff" can significantly enhance the user experience.

To address this, you can modify the error handling mechanism in the workflow execution flow, specifically where the workflow execution is initiated and monitored. This involves checking for these specific errors right after the workflow execution request is made and immediately failing the job if such errors are detected.

Here's a conceptual code snippet that demonstrates how you might implement this:

// After initiating workflow execution
execInfo, execErr := workflowExecutor.Execute(ctx, workflowengineInterfaces.ExecutionData{
    // Execution data parameters
})

// Check if the execution encountered specific errors like "invalidImageName" or "imagePullBackOff"
if execErr != nil {
    if strings.Contains(execErr.Error(), "invalidImageName") || strings.Contains(execErr.Error(), "imagePullBackOff") {
        // Log the error
        logger.Errorf(ctx, "Execution failed due to image-related error: %v", execErr)

        // Increment a metric for tracking these failures, if desired
        m.systemMetrics.PropellerFailures.Inc()

        // Immediately fail the execution by creating an execution model with the error
        createExecModelInput.Error = execErr
        executionModel, err := transformers.CreateExecutionModel(createExecModelInput)
        if err != nil {
            logger.Errorf(ctx, "Failed to create execution model for failed execution due to: %v", err)
            return nil, nil, err
        }

        // Optionally, directly update the execution status in the database to a failed state
        // This step depends on how your execution statuses are managed within Flyte's data model

        // Stop further processing
        return ctx, executionModel, nil
    }
}

// Continue with normal execution flow if no specific errors were detected

This approach ensures that if an execution fails due to specific image-related errors, it is quickly marked as failed, thus improving the user experience by preventing extended queuing times for doomed jobs. Note that the actual implementation may vary based on the complete structure of your execution_manager.go file and how executions are managed within your Flyte deployment.

Regarding your alternative solution of enhancing the UI to better surface Kubernetes events, this could indeed provide users with more insight into what's happening under the hood. However, it's important to note that this would likely involve changes to both the backend (to expose the necessary event data) and the frontend (to display this data in a user-friendly manner).

I hope this helps! If you have any further questions or need more clarification, feel free to ask.

teocns commented 5 months ago
[screenshot]

Facing exactly this. IMO the issue here is mainly the UI not reporting an appropriate, informative status (i.e. PENDING). I'd opt for the lifecycle behavior to remain consistent and predictable; should specific error cases be treated any differently than other scheduling failures?

eapolinario commented 5 months ago

The underlying issue has to do with how propeller interacts with each plugin. In each case (i.e. TF/PyTorch/MPI/Ray) the pods are managed by the CRDs; however, failures in the pods are not reflected back in the CRD status, so propeller is blind to them.
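One possible direction, sketched only as an assumption (the function, the label selector, and the list of reasons below are illustrative, and the real pod labels differ per operator, e.g. Kubeflow training operator vs. KubeRay), would be for the plugin or propeller to look past the CRD and inspect the pods it owns directly:

package jobcheck

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// checkJobPodsForFatalImageError lists the pods belonging to a training/Ray job and
// returns a human-readable reason if any of them is stuck on an unrecoverable image error.
func checkJobPodsForFatalImageError(ctx context.Context, clientset kubernetes.Interface, namespace, jobName string) (string, bool, error) {
	pods, err := clientset.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
		// Illustrative selector: the real label key depends on the operator that owns the pods.
		LabelSelector: fmt.Sprintf("job-name=%s", jobName),
	})
	if err != nil {
		return "", false, err
	}
	for _, pod := range pods.Items {
		for _, cs := range pod.Status.ContainerStatuses {
			if cs.State.Waiting == nil {
				continue
			}
			switch cs.State.Waiting.Reason {
			case "InvalidImageName", "ImagePullBackOff", "ErrImagePull":
				return fmt.Sprintf("pod %s: %s", pod.Name, cs.State.Waiting.Reason), true, nil
			}
		}
	}
	return "", false, nil
}

Whether such a check belongs in propeller, in each plugin, or upstream in the operators is exactly the open design question in this issue.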

For example, KubeRay (which is what the Ray plugin uses) has already closed a similar GitHub issue as working as designed.