Metric for Complete Workflow/Activity Failure

dhiaayachi commented 3 weeks ago

Is your feature request related to a problem? Please describe.

I'd like to be able to clearly understand how many Workflows suffered a complete failure after exhausting all retries. (See Additional Context section).

Describe the solution you'd like

A metric representing the failure of a workflow/activity after any and all retries have been exhausted.

Describe alternatives you've considered

N/A

Additional context

Sometimes our Temporal service goes down, and during the outage various metrics (temporal_workflow_failed, temporal_activity_execution_failure, etc.) report failures. Because there are client-side retries, it would be good to know when the final failure has happened: all client-side retries have been exhausted, and the Workflow or Activity has permanently failed and won't be re-attempted.

If my feature request doesn't make sense, here is our larger scenario for context: we want to assess the impact of the service outage by knowing how many Activities or Workflows permanently failed (i.e. all forms of retries were exhausted while the service was down, and they never got to run again). How can we do this?

dhiaayachi commented 1 week ago

Is your feature request related to a problem? Please describe.

The current metrics (temporal_workflow_failed, temporal_activity_execution_failure, etc.) don't provide a clear understanding of how many Workflows and Activities "actually, permanently failed" after exhausting all retries. This is especially problematic when a Temporal service goes down. During an outage, these metrics might show failures, but don't indicate if they were truly final failures or simply retries due to the outage.

Describe the solution you'd like

A dedicated metric is needed to represent the failure of a workflow/activity after any and all retries have been exhausted. This metric would provide a more accurate assessment of the actual impact of a service outage.

Describe alternatives you've considered

N/A

Additional context

The need for this metric stems from the desire to assess the impact of service outages. Understanding how many activities or workflows "actually, permanently failed" (i.e. all retries exhausted during an outage) is crucial to determining the severity of the outage and its impact on the application.

To achieve this, we can analyze the Event History of each Workflow Execution and count occurrences of the WorkflowExecutionFailed event whose attributes (the Failure details together with the retry state) indicate that the failure occurred after all retries were exhausted. This would give a more accurate count of permanent failures, because failed attempts that were later retried are excluded.
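The Event History approach could be sketched roughly as follows. This is a minimal illustration, not an official API: it assumes each history has already been exported as a list of JSON-style dicts with `eventType`, `workflowExecutionFailedEventAttributes`, and `retryState` keys, and that any retry state other than "in progress" means no further retry will happen. The helper names are hypothetical, and the exact key and enum spellings may differ depending on how the history was exported.

```python
from typing import Iterable, Mapping, Sequence

# Assumption: both spellings of the event type may appear in exported JSON.
FAILED_EVENT_TYPES = {
    "WorkflowExecutionFailed",
    "EVENT_TYPE_WORKFLOW_EXECUTION_FAILED",
}

# Assumption: a failed run that the server is going to retry carries this state.
RETRY_IN_PROGRESS = "RETRY_STATE_IN_PROGRESS"


def is_permanent_failure(history: Iterable[Mapping]) -> bool:
    """Return True if the history ends in a failure that will not be retried."""
    for event in history:
        if event.get("eventType") not in FAILED_EVENT_TYPES:
            continue
        attrs = event.get("workflowExecutionFailedEventAttributes", {})
        retry_state = attrs.get("retryState")
        # Only count the failure when a retry state is present and it is
        # something other than "a retry is in progress".
        if retry_state is not None and retry_state != RETRY_IN_PROGRESS:
            return True
    return False


def count_permanent_failures(histories: Sequence[Iterable[Mapping]]) -> int:
    """Count the histories whose Workflow Execution failed for good."""
    return sum(1 for history in histories if is_permanent_failure(history))
```

Scanning every history is expensive on a busy namespace, so this is better suited to a one-off post-outage analysis than to continuous monitoring.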

Here's a breakdown of the potential solutions:

Server-side metrics: Introduce a new server-side metric specifically designed to capture the final failure count after all retries have been exhausted. This metric would need to be implemented within the Temporal service and would be emitted along with other metrics.
Client-side logic: Add logic to the client SDK to track retries for each Workflow Execution and Activity Execution. When a failure occurs after exhausting all retries, a signal can be sent to a designated monitoring service or a dedicated Temporal Workflow to record this final failure count.
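As a rough illustration of the client-side option, the sketch below wraps an operation in a retry loop and increments a separate counter only when the last allowed attempt fails. Everything here is hypothetical: `run_with_retries`, the in-process `metrics` counter, and the counter names are stand-ins for whatever retry mechanism and metrics client an application actually uses, and backoff between attempts is omitted for brevity.

```python
from collections import Counter
from typing import Callable, TypeVar

T = TypeVar("T")

# Hypothetical in-process counters; a real setup would use its metrics client.
metrics: Counter = Counter()


def run_with_retries(operation: Callable[[], T], max_attempts: int) -> T:
    """Run `operation`, retrying on any exception up to `max_attempts` times.

    Every failed attempt is counted, but "final_failure" is incremented only
    when the last allowed attempt fails and the error is re-raised.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            metrics["attempt_failed"] += 1
            if attempt == max_attempts:
                # No retries left: this is the permanent failure to report.
                metrics["final_failure"] += 1
                raise
    raise AssertionError("unreachable: the loop always returns or raises")
```

Keeping the two counters separate is the point of the design: during an outage `attempt_failed` can spike while `final_failure` stays at zero, which is exactly the distinction the request asks for.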

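While both solutions have their merits, the server-side approach offers a more comprehensive and accurate picture, because the server records the final outcome of every execution regardless of which SDK or Worker ran it.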

To gain a deeper understanding of these concepts, you can refer to the following resources:

Event History: The Event History serves as a detailed record of a Workflow Execution's lifecycle.
Workflow Execution Failed Event: This Event is triggered when a Workflow Execution fails.
Retry Policies: Learn how Retry Policies work and influence Workflow and Activity retries.
Application Failure: Understand the different types of failures in Temporal and how they affect Workflow Executions.

By implementing this feature, we can enhance our ability to measure the true impact of service outages and ensure the stability and resilience of our Temporal applications.

dhiaayachi commented 1 week ago

Thanks for reporting this feature request.

There's currently no direct metric available to represent the failure of a workflow/activity after exhausting all retries.

Here's a possible workaround:

Add a custom Search Attribute to represent the final failure state: After registering a custom Search Attribute (for example, final_failure), you can call upsert_search_attributes from Workflow code to set it on the Workflow Execution when a Workflow or Activity has exhausted all retries.
Use Visibility to Query based on the Search Attribute: Use Visibility to query Workflow Executions based on final_failure to assess the impact of service outages.
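For the query step, a small helper like the one below could build the Visibility filter for a given outage window. It is only a sketch under assumptions: `final_failure_query` is a hypothetical helper, `final_failure` is assumed to be registered as a Bool custom Search Attribute, and the resulting string is meant to be passed to whatever list-workflows call or CLI command you already use.

```python
from datetime import datetime, timezone


def final_failure_query(
    start: datetime,
    end: datetime,
    attribute: str = "final_failure",  # assumed custom Bool Search Attribute
) -> str:
    """Build a Visibility list filter for executions flagged as finally failed
    that closed between `start` and `end` (both timezone-aware)."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware")
    if end < start:
        raise ValueError("end must not be earlier than start")
    # Normalize to UTC so the timestamps in the filter are unambiguous.
    start_s = start.astimezone(timezone.utc).isoformat()
    end_s = end.astimezone(timezone.utc).isoformat()
    return f'{attribute} = true AND CloseTime BETWEEN "{start_s}" AND "{end_s}"'


outage_start = datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc)
outage_end = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(final_failure_query(outage_start, outage_end))
```

Counting the executions returned for the outage window then gives the impact figure the original request is after.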

Let us know if you have any other questions or feedback.

dhiaayachi commented 1 week ago

Thanks for the feature request!

Currently, Temporal does not have a metric that captures the number of workflow/activity failures after retries are exhausted.

However, you can use the existing temporal_workflow_failed and temporal_activity_execution_failure metrics to approximate this behavior.

If your metrics expose labels such as ExecutionStatus or attempt, filter on them to focus on the final attempts, that is, those where attempt is greater than or equal to the maximum allowed by your Retry Policy.
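The "final attempt" check can be written down as a tiny predicate. This is a sketch with a hypothetical helper name; it assumes attempts are numbered from 1 and that a maximum of 0 in the Retry Policy means unlimited attempts. It is only an approximation of permanent failure, since an execution can also stop retrying because of a non-retryable error or a timeout.

```python
def is_final_attempt(attempt: int, maximum_attempts: int) -> bool:
    """Return True if `attempt` is the last one the Retry Policy allows.

    Assumptions: attempts are 1-based, and maximum_attempts == 0 means
    "unlimited", so no attempt is ever final on attempt count alone.
    """
    if attempt < 1:
        raise ValueError("attempt numbers start at 1")
    if maximum_attempts < 0:
        raise ValueError("maximum_attempts must not be negative")
    if maximum_attempts == 0:
        return False
    return attempt >= maximum_attempts
```

With a policy of three maximum attempts, `is_final_attempt(3, 3)` is true while `is_final_attempt(2, 3)` is false, so only failures on the third attempt would be counted.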

You can also combine these metrics with other metrics like temporal_workflow_task_failed to understand the context of the failures.

Let us know if this approach works for you. We're open to feedback and suggestions.
