Open dhiaayachi opened 3 weeks ago
The current metrics (temporal_workflow_failed, temporal_activity_execution_failure, etc.) don't provide a clear understanding of how many Workflows and Activities "actually, permanently failed" after exhausting all retries. This is especially problematic when a Temporal service goes down. During an outage, these metrics might show failures, but don't indicate if they were truly final failures or simply retries due to the outage.
A dedicated metric is needed to represent the failure of a workflow/activity after any and all retries have been exhausted. This metric would provide a more accurate assessment of the actual impact of a service outage.
N/A
The need for this metric stems from the desire to assess the impact of service outages. Understanding how many activities or workflows "actually, permanently failed" (i.e. all retries exhausted during an outage) is crucial to determining the severity of the outage and its impact on the application.
To achieve this, we can analyze the Event History of each Workflow Execution and count occurrences of the WorkflowExecutionFailed event, where the Failure field indicates the failure occurred after exhausting all retries. This approach would provide a more accurate count of actual failures, as opposed to simple retries.
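As a rough illustration of the Event History approach, here is a minimal sketch that counts final failures in a history that has already been exported and parsed into Python dicts. The dict shape (an "eventType" key and a "retryState" key), the "InProgress" value, and the function name count_final_failures are assumptions made for this example; the actual field names and enum spellings depend on how the history was exported, so adjust them to match your data.

```python
from collections import Counter

# Assumed event shape (hypothetical): {"eventType": str, "retryState": str}.
# Event types treated as candidate "final failure" records.
TERMINAL_EVENT_TYPES = {"WorkflowExecutionFailed", "ActivityTaskFailed"}

# Assumed retry-state value meaning "another attempt is still coming".
STILL_RETRYING = {"InProgress"}


def count_final_failures(events):
    """Count failure events that will not be retried, grouped by event type."""
    counts = Counter()
    for event in events:
        event_type = event.get("eventType")
        if event_type not in TERMINAL_EVENT_TYPES:
            continue  # not a failure event
        if event.get("retryState") in STILL_RETRYING:
            continue  # failed, but a retry is still pending
        counts[event_type] += 1
    return counts


if __name__ == "__main__":
    sample = [
        {"eventType": "WorkflowExecutionStarted"},
        {"eventType": "ActivityTaskFailed", "retryState": "MaximumAttemptsReached"},
        {"eventType": "WorkflowExecutionFailed", "retryState": "InProgress"},
        {"eventType": "WorkflowExecutionFailed", "retryState": "RetryPolicyNotSet"},
    ]
    print(dict(count_final_failures(sample)))
```

Summing these counts over the Workflow Executions that were open during the outage window would give an estimate of the permanent failures.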
Here's a breakdown of the potential solutions:
Server-side metrics: Introduce a new server-side metric specifically designed to capture the final failure count after all retries have been exhausted. This metric would need to be implemented within the Temporal service and would be emitted along with other metrics.
Client-side logic: Add logic to the client SDK to track retries for each Workflow Execution and Activity Execution. When a failure occurs after exhausting all retries, a signal can be sent to a designated monitoring service or a dedicated Temporal Workflow to record this final failure count.
While both solutions have their merits, the server-side approach offers a more comprehensive and accurate picture.
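For comparison, the client-side option could look something like the sketch below. It is plain Python rather than Temporal SDK code: run_with_retries, FailureCounters, and the fixed max_attempts policy are all hypothetical names chosen for illustration. The point is only to show the distinction between counting every failed attempt and counting the one failure that remains after retries are exhausted.

```python
from dataclasses import dataclass


@dataclass
class FailureCounters:
    attempt_failures: int = 0  # every failed attempt, including retried ones
    final_failures: int = 0    # failures left after all retries were used up


def run_with_retries(operation, max_attempts, counters):
    """Call operation up to max_attempts times, recording both kinds of failure."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        try:
            return operation()
        except Exception as exc:  # broad on purpose for this sketch
            counters.attempt_failures += 1
            last_error = exc
    # Only reached when every attempt failed: this is the "final" failure.
    counters.final_failures += 1
    raise last_error


if __name__ == "__main__":
    counters = FailureCounters()

    def always_down():
        raise ConnectionError("service unavailable")

    try:
        run_with_retries(always_down, max_attempts=3, counters=counters)
    except ConnectionError:
        pass
    print(counters)
```

In a real deployment the final_failures increment would be replaced by whatever reporting mechanism is chosen, such as a custom metric or a signal to a monitoring Workflow.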
To gain a deeper understanding of these concepts, you can refer to the following resources:
By implementing this feature, we can enhance our ability to measure the true impact of service outages and ensure the stability and resilience of our Temporal applications.
Thanks for reporting this feature request.
There's currently no direct metric available to represent the failure of a workflow/activity after exhausting all retries.
Here's a possible workaround:
Use upsert_search_attributes to update a Workflow Execution with a custom Search Attribute called final_failure when a Workflow or Activity has exhausted all retries.
Filter Workflow Executions on final_failure to assess the impact of service outages.
Let us know if you have any other questions or feedback.
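To make the second step of the workaround more concrete, here is a small sketch that builds a visibility list filter for Workflow Executions flagged with final_failure that closed during an outage window. It assumes final_failure was registered as a boolean custom Search Attribute and that filtering on CloseTime with BETWEEN is available on your visibility store; final_failure_query is a hypothetical helper name, not part of any SDK.

```python
from datetime import datetime, timezone


def final_failure_query(outage_start, outage_end, attribute="final_failure"):
    """Build a list filter string for executions flagged as finally failed.

    Both datetimes must be timezone-aware; they are converted to UTC.
    """
    if outage_start.tzinfo is None or outage_end.tzinfo is None:
        raise ValueError("outage_start and outage_end must be timezone-aware")
    if outage_end < outage_start:
        raise ValueError("outage_end must not be before outage_start")
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    start = outage_start.astimezone(timezone.utc).strftime(fmt)
    end = outage_end.astimezone(timezone.utc).strftime(fmt)
    return f"{attribute} = true AND CloseTime BETWEEN '{start}' AND '{end}'"


if __name__ == "__main__":
    query = final_failure_query(
        datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc),
        datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc),
    )
    print(query)
```

The resulting string could then be passed to whichever tool you use to list or count Workflow Executions.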
Thanks for the feature request!
Currently, Temporal does not have a metric that captures the number of workflow/activity failures after retries are exhausted.
However, you can use the existing temporal_workflow_failed and temporal_activity_execution_failure metrics to approximate this behavior.
Filter these metrics using labels like ExecutionStatus or attempt to focus on the final attempts, especially those where the attempt is greater than or equal to the maximum allowed by your retry policy.
You can also combine these metrics with other metrics like temporal_workflow_task_failed to understand the context of the failures.
Let us know if this approach works for you. We're open to feedback and suggestions.
Is your feature request related to a problem? Please describe.
I'd like to be able to clearly understand how many Workflows suffered a complete failure after exhausting all retries. (See Additional Context section).
Describe the solution you'd like
A metric representing the failure of a workflow/activity after any and all retries have been exhausted.
Describe alternatives you've considered
N/A?
Additional context
Sometimes our Temporal service goes down, and during the outage, various metrics show "failures" (temporal_workflow_failed, temporal_activity_execution_failure, etc.). There are client-side retries, so it would be good to know when there's been the "final" failure after all client-side retries have been exhausted and the WF / Activity has "actually" failed for real and won't be re-attempted.
If my feature request doesn't make sense, then let me present our larger scenario for context: Quite reasonably, we want to "assess impact" of the service outage by knowing how many activities or workflows "actually, permanently failed" (i.e. all forms of retries were exhausted while the service was down and they never got to run again). How can we do this?