dhiaayachi opened this issue 2 months ago
This issue addresses the performance impact of losing cached histories when a worker restarts. Large histories result in increased latency when these workflows are replayed on a new worker.
When a worker shuts down, whether gracefully or due to a crash, all of its cached histories are invalidated. Before any subsequent workflow task can be processed, the full history must be refetched and replayed on another worker, which adds significant latency. The problem is particularly acute for workflows with large histories.
The ideal solution is to enable recovery of cached workflows on other workers when a worker shuts down, so that their histories are re-cached before any new tasks are generated for them.
The user's suggestion to "require preemptive cache allocation" highlights the need for a mechanism that prioritizes caching workflows with large histories, making them less vulnerable to worker restarts.

This proposed solution addresses a significant pain point for users whose workflows carry large histories. By enabling efficient recovery of cached workflows during worker restarts, it would noticeably improve the performance and resilience of Temporal workflows.
Thank you for your feature request. We understand the importance of maintaining workflow cache consistency during worker restarts, especially for large histories.
Currently, Temporal doesn't have a built-in feature to preemptively move cached workflows to other workers. However, the workarounds discussed in the replies below, chiefly increasing the worker's workflow cache size and relying on sticky execution, can soften the impact.
We appreciate your suggestion and will consider it for future development.
Thank you for reporting this issue. This is a known limitation with the current implementation of the Temporal worker, where restarting workers invalidates the cached histories. We are working on improving the caching behavior and will look into options for preemptive caching in the future.
In the meantime, you can try the following workarounds:

- Increase the `workflowCacheSize` limit to minimize the impact of cache invalidation (see the sketch below).
- Use the `temporal task-queue describe` command to monitor the backlog of tasks in your task queues. If the backlog grows too large, it indicates that you need to increase the number of workers to keep up with demand.

This is a valuable feature request and we appreciate you bringing it to our attention. We will keep you updated on any progress made in this area.
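If you're on the Go SDK, here's a minimal sketch of the cache-size part of this workaround, assuming the `go.temporal.io/sdk` client and worker packages; the task queue name and the value 20000 are placeholders. Note that `workflowCacheSize` is the Java SDK's name for this knob; in the Go SDK the workflow cache is process-wide and is sized with `worker.SetStickyWorkflowCacheSize`.

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Raise the process-wide sticky workflow cache before any worker starts
	// polling, so more workflow histories stay in memory between workflow tasks.
	// The value is illustrative only; size it against your memory budget.
	worker.SetStickyWorkflowCacheSize(20000)

	// Connects to localhost:7233 with default options.
	c, err := client.Dial(client.Options{})
	if err != nil {
		log.Fatalln("unable to create Temporal client:", err)
	}
	defer c.Close()

	w := worker.New(c, "example-task-queue", worker.Options{})
	// w.RegisterWorkflow(YourWorkflow) // workflow registration elided

	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited:", err)
	}
}
```

The backlog-monitoring half of the workaround stays on the CLI with `temporal task-queue describe`, as noted above.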
Thank you for reporting this issue.
This is a feature request. While this specific feature doesn't exist yet, the workarounds described elsewhere in this thread, such as raising the workflow cache size and relying on sticky execution, can reduce the impact in the meantime.
We appreciate your feedback, and we'll consider adding this feature in a future release.
Thank you for reporting this issue. It's a known limitation that restarting a worker invalidates all of its cached histories, which leads to increased latency.

Currently, there's no mechanism to preemptively cache workflows on other workers before a worker restarts. However, you can try using the Sticky Execution feature for workflows with large histories. It helps by keeping each workflow on the same worker for the majority of its execution, reducing how often the full history has to be replayed.
You can read more about Sticky Execution here: https://docs.temporal.io/workers#sticky-execution
We're working on solutions to improve the performance and reliability of cached workflows, and we appreciate your feedback.
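If you want to tune this from the Go SDK, here's a rough sketch that sets an explicit sticky schedule-to-start timeout on a worker; the package name, helper name, and the 10-second value are assumptions for the example, not recommendations.

```go
package workersetup

import (
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

// NewStickyTunedWorker (hypothetical helper) builds a worker with an explicit
// sticky schedule-to-start timeout. Sticky execution is enabled by default;
// this timeout bounds how long a workflow task may wait on this worker's
// host-specific (sticky) queue before the service reschedules it on the
// regular task queue, where another worker must replay the full history.
func NewStickyTunedWorker(c client.Client, taskQueue string) worker.Worker {
	return worker.New(c, taskQueue, worker.Options{
		StickyScheduleToStartTimeout: 10 * time.Second, // illustrative value
	})
}
```

A longer timeout keeps tasks bound to the worker that already holds the history in cache, at the cost of slower fallback when that worker is actually gone.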
Thank you for the feature request!
It's understandable that having to restart a worker and lose cached histories can cause increased latency. While we don't currently have a feature to directly address worker cache handover, you can explore a few workarounds:
- Increase the `workflowCacheSize` and `maxWorkflowThreadCount` limits in your `WorkerFactoryOptions` (Java SDK) or in the `worker` package (Go SDK). By increasing these limits, you'll be able to cache more workflows, potentially reducing the impact of a worker restart (see the sketch below).

We will consider your feature request and keep you updated on any future developments in this area.
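On the Go SDK side, here's a sketch of the closest worker-level setting; the package and helper names and the value 512 are hypothetical. The cache itself is process-wide in the Go SDK and is sized with `worker.SetStickyWorkflowCacheSize` (as in the earlier sketch) rather than per worker.

```go
package workersetup

import (
	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

// NewTunedWorker (hypothetical helper) shows the Go-side knob that loosely
// corresponds to the Java SDK's maxWorkflowThreadCount. Go SDK workflows run
// on goroutines rather than dedicated threads, so there is no direct
// equivalent; this option caps how many workflow tasks execute concurrently
// on this worker.
func NewTunedWorker(c client.Client, taskQueue string) worker.Worker {
	return worker.New(c, taskQueue, worker.Options{
		MaxConcurrentWorkflowTaskExecutionSize: 512, // illustrative value
	})
}
```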
**Is your feature request related to a problem? Please describe.**
When histories are large, a worker restart invalidates all the cached histories. This leads to increased latency for all the workflows that were cached on that worker.

**Describe the solution you'd like**
When a worker gracefully shuts down (or even crashes), the workflows that are cached on it are recovered on other workers so they are cached again before any tasks are generated for them.

**Additional context**
This is how a user requested it: