Is there a way to detect when a workflow is in a failed state but still shows as "Running"?

dhiaayachi commented 2 months ago

Is your feature request related to a problem? Please describe.

If I do this https://github.com/temporalio/samples-typescript/compare/main...bijeebuss:samples-typescript:main and then start the workflow, it will show up as "Running" but it's actually in a sort of failed state. It also does this in other cases, like when you forget to export the workflow.

Things I tried

Try/catch: it does not catch the error.
Custom logger: all I see in the logs is "workflow started" and "workflow completed", and the meta has no way to detect that it actually failed.
Setting retry: { maximumAttempts: 1 } when starting the workflow: this does not cause the workflow to enter an actual "Failed" state.

Describe the solution you'd like

A way to detect when a workflow enters a state like this. Maybe it already exists, but I can't find anything.

dhiaayachi commented 2 months ago

Detecting Workflow Execution Failures

This issue describes a Workflow Execution that shows as "Running" in the Temporal Web UI but cannot make progress. This typically happens when a Workflow Task fails and is retried, for example after a non-deterministic code change, a missing export, or an error that your code does not catch.

Here's a breakdown of the problem and potential solutions:

Understanding Workflow Execution States

Workflow Executions in Temporal can be in one of two states: Open or Closed.

Open: The Workflow Execution is able to make progress and is either actively running or waiting for something. Running is the only open status, which is why a stuck execution still shows as "Running".
Closed: This state signifies that the Workflow Execution cannot make further progress due to one of the following reasons:
- Cancelled: The Workflow Execution successfully handled a cancellation request.
- Completed: The Workflow Execution has finished successfully.
- Continued-As-New: The Workflow Execution Continued-As-New.
- Failed: The Workflow Execution returned an error and failed.
- Terminated: The Workflow Execution was terminated.
- Timed Out: The Workflow Execution reached a timeout limit.
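The open/closed split above can be sketched as a small helper. This is a minimal illustration rather than SDK code: the upper-case status strings and the `openExecutions` helper are assumptions made for the example, so check the spelling your client actually returns.

```typescript
// Status names as upper-case strings. The exact spelling is an assumption
// for this sketch and may differ from what your client returns.
type StatusName =
  | 'RUNNING'
  | 'COMPLETED'
  | 'FAILED'
  | 'CANCELLED'
  | 'TERMINATED'
  | 'CONTINUED_AS_NEW'
  | 'TIMED_OUT';

// Running is the only open status; every other status is closed.
function isOpen(status: StatusName): boolean {
  return status === 'RUNNING';
}

// Hypothetical helper: keep only the executions a dashboard would list as "Running".
function openExecutions<T extends { status: StatusName }>(executions: T[]): T[] {
  return executions.filter((execution) => isOpen(execution.status));
}
```

Note that a stuck execution passes `isOpen`, so the status alone cannot tell a healthy workflow from one whose Workflow Task keeps failing.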

Reasons for Workflow Execution Failures

The scenarios you described are common causes of a Workflow Execution that is stuck while still showing as "Running". In each case it is the Workflow Task that fails, not the Workflow Execution, and a failed Workflow Task is retried rather than closing the execution:

Non-deterministic Code Changes: If you change Workflow code in a way that breaks the determinism rules, replaying an existing execution fails the Workflow Task, and the execution stays "Running" while the task is retried.
Missing Exports: If you forget to export a Workflow function, the Worker cannot find the Workflow type, so the Workflow Task fails and the execution stays "Running".
Uncaught Errors: In the TypeScript SDK, an error thrown from Workflow code that is not a Temporal failure (such as ApplicationFailure) fails the Workflow Task rather than the Workflow Execution, so the execution keeps showing as "Running".

Detecting Workflow Failures

To detect these failures, you can leverage the following methods:

Review the Workflow Execution History: The Event History of a Workflow Execution is a valuable resource for debugging and identifying issues. You can use the Temporal CLI (temporal workflow show) or the Temporal Web UI to review the Event History and locate errors or unexpected events.
Enable Worker Logging: Configure logging in your Worker. The Temporal TypeScript SDK offers a Workflow Context logger, which is essential for debugging. Monitor Worker logs for errors or unexpected events that indicate a Workflow Execution failure.
Use Test Servers and Replay Tests: The TypeScript SDK's testing package provides a test environment that supports time-skipping, and a Worker can replay a recorded Event History against your current Workflow code. You can use these tools to reproduce failures and to check that your Workflow code is deterministic and handles errors correctly.
Inspect Workflow Task Failures: In the Temporal Web UI, look for Workflow Task Failures. These failures are usually due to non-deterministic changes, missing exports, or uncaught errors. You can often identify the specific problem by examining the details of the failure.
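The history check described above could be automated along these lines. This is a sketch under stated assumptions: `HistoryEventLike` is a hypothetical, trimmed-down shape, and the two attribute field names are assumed to match the protobuf-style fields of an Event History you have already fetched or exported. Verify them against your own data before relying on this.

```typescript
// Hypothetical minimal shape for one history event. The attribute field
// names are an assumption about how the fetched or exported history looks.
interface HistoryEventLike {
  eventId: number;
  workflowTaskFailedEventAttributes?: { failure?: { message?: string | null } | null } | null;
  workflowTaskCompletedEventAttributes?: object | null;
}

interface UnresolvedFailure {
  eventId: number;
  message: string;
}

// Returns the most recent Workflow Task failure that is not followed by a
// completed Workflow Task, or undefined if there is no such failure.
function findUnresolvedTaskFailure(events: HistoryEventLike[]): UnresolvedFailure | undefined {
  // Walk the history from newest to oldest without mutating the input.
  for (const event of [...events].reverse()) {
    if (event.workflowTaskCompletedEventAttributes != null) {
      // A later Workflow Task completed, so any earlier failure was recovered.
      return undefined;
    }
    if (event.workflowTaskFailedEventAttributes != null) {
      const message = event.workflowTaskFailedEventAttributes.failure?.message ?? '(no message)';
      return { eventId: event.eventId, message };
    }
  }
  return undefined;
}
```

If the function returns a failure for an execution that still shows as "Running", the returned message is usually the quickest pointer to the cause, for example a missing export or a non-determinism error.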

Solution for the Specific Issue

Since the Workflow Execution stays "Running" instead of entering a Failed state, the most likely explanation is that a Workflow Task is failing and being retried. The error is either raised outside your try/catch (for example while the Worker loads the Workflow code) or is not a Temporal failure, so it never fails the execution itself.

Here's how you can debug further:

Check the Worker Logs: Review the Worker logs for any error messages related to the Workflow or Activities involved.
Analyze the Workflow Execution History: Inspect the Event History for unexpected events, such as WorkflowTaskFailed with a specific cause.
Increase Logging Verbosity: Configure the Temporal TypeScript SDK logger to a higher level, such as DEBUG or TRACE, to see more detailed information about the Workflow Execution.
Identify the Error Source: Use the failure message and stack trace attached to the WorkflowTaskFailed event to find where the error is thrown.
Implement Error Handling: try...catch only helps for errors thrown inside your Workflow function. For those, catch the error, log it, and rethrow it as an ApplicationFailure so that the Workflow Execution transitions to Failed. Errors such as a missing export happen before your code runs, so they have to be fixed in the Worker code and redeployed.

Additional Tips

Use a Dedicated Task Queue: When developing and testing new versions of your Workflow, use a dedicated Task Queue to avoid interfering with other workflows in production.
Monitor Workflow Executions: Utilize the Temporal Web UI or CLI to monitor the status of your Workflows, enabling you to catch errors quickly.
Review Temporal Documentation: The Temporal documentation is a rich resource. Consult it for detailed information on Workflows, Activities, error handling, testing, and other essential topics.
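The monitoring tip above could be turned into a periodic check. The sketch below assumes you have already collected, for each execution, its status name and the attempt counter of its pending Workflow Task (my understanding is that the describe response carries such a counter, but treat that as an assumption). `DescriptionLike`, `findStuckWorkflows`, and the threshold are hypothetical names chosen for the example.

```typescript
// Hypothetical, trimmed-down view of one described Workflow Execution.
interface DescriptionLike {
  workflowId: string;
  statusName: string; // e.g. 'RUNNING'
  // Attempt counter of the pending Workflow Task, if there is one (assumed field).
  pendingWorkflowTaskAttempt?: number;
}

// Returns the IDs of executions that are still Running but whose pending
// Workflow Task has been attempted more than maxAttempts times.
function findStuckWorkflows(descriptions: DescriptionLike[], maxAttempts = 3): string[] {
  return descriptions
    .filter(
      (description) =>
        description.statusName === 'RUNNING' &&
        (description.pendingWorkflowTaskAttempt ?? 0) > maxAttempts,
    )
    .map((description) => description.workflowId);
}
```

A check like this only flags candidates. The Event History of each flagged execution still has to be inspected to see why its Workflow Task keeps failing.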

I hope this explanation helps you identify and address the issue. Please feel free to provide additional information or code snippets if you need further assistance.

dhiaayachi commented 2 months ago

Thanks for reporting this issue.

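Workflow code can't detect failures that happen outside its own logic, such as a missing export, because those fail the Workflow Task rather than the Workflow Execution, and the task is retried while the execution stays "Running". In the TypeScript SDK, workflowInfo() returns metadata such as the Workflow ID and Run ID, but not the execution status or the Event History, so the status has to be checked from outside the Workflow (Web UI, CLI, or a client). For errors thrown inside your own Workflow logic, you can catch them and rethrow an ApplicationFailure so that the execution moves to "Failed":

import { ApplicationFailure, TemporalFailure, log, workflowInfo } from '@temporalio/workflow';

export async function myWorkflow(name: string): Promise<void> {
  try {
    // Your workflow logic here
  } catch (err) {
    // Let Temporal failures (cancellation, Activity failures, ...) propagate unchanged.
    if (err instanceof TemporalFailure) throw err;
    // workflowInfo() exposes metadata only; it has no status or history field.
    const { workflowId, runId } = workflowInfo();
    log.error('Workflow failed', { workflowId, runId, error: String(err) });
    // An ApplicationFailure fails the Workflow Execution. Any other error only
    // fails the Workflow Task, which is then retried while the status stays "Running".
    throw ApplicationFailure.nonRetryable(`Workflow failed: ${String(err)}`);
  }
}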

The Temporal CLI command temporal workflow show can also help, since it prints the Event History, including any WorkflowTaskFailed events.
