dhiaayachi / temporal

Temporal service
https://docs.temporal.io
MIT License

Worker cache handover #199

Open dhiaayachi opened 2 months ago

dhiaayachi commented 2 months ago

Is your feature request related to a problem? Please describe.

When histories are large, a worker restart invalidates all the cached histories. This leads to increased latency for all the workflows that were cached on that worker.

Describe the solution you'd like

When a worker gracefully shuts down (or even crashes), the workflows cached on it should be recovered and cached on other workers before any new tasks are generated for them.

Additional context

This is how a user requested it:

Moving workflows that have a high history size from one worker to another when you have to kill the worker (to update, for example) can be painful. It would be great if there was a way to mark workflows as "require preemptive cache allocation" or something that ensures that the cache is coordinated when killing the worker (or some similar strategy to make killing such a worker less painful)

dhiaayachi commented 1 month ago

Recovering Cached Workflows During Worker Restart

This issue addresses the performance impact of losing cached histories when a worker restarts. Large histories result in increased latency when these workflows are replayed on a new worker.

The Problem

When a worker shuts down, whether gracefully or due to a crash, all of its cached histories are invalidated. Subsequent tasks for these workflows then require a full history replay on another worker, leading to significant latency. This is particularly acute for workflows with large histories.

The Proposed Solution

The ideal solution is to enable the recovery of cached workflows on other workers during a shutdown. This involves the following:

  1. Graceful Shutdown: When a worker is about to shut down, it informs the Temporal service about the workflows it is currently caching.
  2. Workflow Transfer: The Temporal service then identifies other workers and transfers the cached workflows to them, before the shutting-down worker exits completely (see the sketch after this list).
  3. Preemptive Caching: An alternative is a mechanism for developers to mark workflows as requiring "preemptive cache allocation." This would proactively cache these workflows on multiple workers, minimizing the impact of a single worker shutdown.
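
To make this flow concrete, below is a minimal sketch in Go of what the worker side of steps 1 and 2 could look like. This is purely hypothetical: CachedWorkflow, HandoverClient, and ReportCachedWorkflows do not exist in the Temporal SDK or service today and are named here only for illustration.

```go
package handover

import "context"

// CachedWorkflow identifies a workflow whose history is held in a worker's
// sticky cache. Hypothetical type: nothing in the SDK exposes this today.
type CachedWorkflow struct {
	Namespace   string
	WorkflowID  string
	RunID       string
	HistorySize int64 // bytes; would let the service prioritize large histories
}

// HandoverClient is a hypothetical service API a worker would call during
// graceful shutdown (step 1 above).
type HandoverClient interface {
	// ReportCachedWorkflows tells the service which workflows this worker is
	// caching, so the service can warm them up on other workers (step 2)
	// before new tasks are generated for them.
	ReportCachedWorkflows(ctx context.Context, workerIdentity string, cached []CachedWorkflow) error
}

// reportCacheOnShutdown would run as part of the worker's stop sequence,
// bounded by the shutdown grace period.
func reportCacheOnShutdown(ctx context.Context, hc HandoverClient, identity string, cached []CachedWorkflow) error {
	return hc.ReportCachedWorkflows(ctx, identity, cached)
}
```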

Additional Context

The user suggestion of "require preemptive cache allocation" highlights the need for a mechanism to prioritize the caching of workflows with large histories, making them less vulnerable to worker restarts.

Summary

This proposed solution addresses a significant pain point for users with workflows containing large histories. By enabling efficient recovery of cached workflows during worker restarts, it can significantly improve the performance and resilience of Temporal workflows.

dhiaayachi commented 1 month ago

Thank you for your feature request. We understand the importance of maintaining workflow cache consistency during worker restarts, especially for large histories.

Currently, Temporal doesn't have a built-in feature to preemptively move cached workflows to other workers. However, you can consider the workarounds outlined in the comments below.

We appreciate your suggestion and will consider it for future development.

dhiaayachi commented 1 month ago

Thank you for reporting this issue. This is a known limitation of the current Temporal worker implementation: restarting a worker invalidates its cached histories. We are working on improving the caching behavior and will look into options for preemptive caching in the future.

In the meantime, you can try the following workaround:

  1. Use a smaller workflowCacheSize so that less cached state is lost when a worker restarts (see the Go sketch after this list).
  2. Implement a mechanism to track the workflows that were cached on a specific worker. This will help you quickly identify those workflows and re-warm their state after the worker restarts.
  3. Use the temporal task-queue describe command to monitor the backlog of tasks in your task queues. A growing backlog indicates that you need more workers to keep up with demand.
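
As a concrete illustration of item 1: workflowCacheSize is the Java SDK option name; in the Go SDK the equivalent is a process-wide setting. A minimal sketch, where the task queue name and cache size are illustrative assumptions:

```go
package main

import (
	"log"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	// Cap the process-wide sticky workflow cache so that a restart
	// invalidates less state. Must be called before any worker starts.
	worker.SetStickyWorkflowCacheSize(512)

	c, err := client.Dial(client.Options{}) // connects to localhost:7233 by default
	if err != nil {
		log.Fatalln("unable to create client", err)
	}
	defer c.Close()

	w := worker.New(c, "example-task-queue", worker.Options{})
	// ... register workflows and activities here, then:
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited", err)
	}
}
```

For item 3, the corresponding CLI invocation would be: temporal task-queue describe --task-queue example-task-queue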

This is a valuable feature request and we appreciate you bringing it to our attention. We will keep you updated on any progress made in this area.

dhiaayachi commented 1 month ago

Thank you for reporting this issue.

This is a feature request. While this specific feature doesn't exist yet, you can use the workarounds described elsewhere in this thread.

We appreciate your feedback, and we'll consider adding this feature in a future release.

dhiaayachi commented 1 month ago

Thank you for reporting this issue. It's a known limitation: when a worker restarts, all of its cached histories are invalidated, which leads to increased latency.

Currently, there's no mechanism to preemptively cache workflows on other workers before a worker restarts. However, you can tune the Sticky Execution feature for workflows with large histories. Sticky Execution keeps a workflow on the same worker for the majority of its execution, reducing how often its full history has to be replayed.

You can read more about Sticky Execution here: https://docs.temporal.io/workers#sticky-execution
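
If you are on the Go SDK, here is a minimal sketch of tuning this behavior. Sticky execution is enabled by default; the task queue name and the timeout value are illustrative assumptions to adjust for your workload:

```go
package main

import (
	"log"
	"time"

	"go.temporal.io/sdk/client"
	"go.temporal.io/sdk/worker"
)

func main() {
	c, err := client.Dial(client.Options{}) // localhost:7233 by default
	if err != nil {
		log.Fatalln("unable to create client", err)
	}
	defer c.Close()

	// This tunes how long the service waits for the "sticky" worker to pick
	// up a workflow task before falling back to the regular task queue,
	// which forces a full history replay on whichever worker takes it next.
	w := worker.New(c, "example-task-queue", worker.Options{
		StickyScheduleToStartTimeout: 10 * time.Second,
	})

	// ... register workflows and activities here, then:
	if err := w.Run(worker.InterruptCh()); err != nil {
		log.Fatalln("worker exited", err)
	}
}
```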

We're working on solutions to improve the performance and reliability of cached workflows, and we appreciate your feedback.

dhiaayachi commented 1 month ago

Thank you for the feature request!

It's understandable that having to restart a worker and lose cached histories can cause increased latency. While we don't currently have a feature to directly address worker cache handover, you can explore the workarounds suggested earlier in this thread.

We will consider your feature request and keep you updated on any future developments in this area.