dhiaayachi opened this issue 2 months ago
Thank you for reporting this issue. I understand that during rolling restarts, there can be a delay in ringpop updates propagating to the History service, Frontend service, and Matching service, leading to tasks being stuck in the database.
This is a known issue, and we are working on a solution. In the meantime, you can consider the following workaround:
- matchingService.taskQueuePollInterval: controls how often the Matching service polls the task queue. Increasing this interval might help to mitigate the issue.
- matchingService.taskQueueMaxPollInterval: controls the maximum time the Matching service will wait for a task before giving up. Increasing this interval might also help.

You can find more information about these parameters in the Temporal Server Configuration documentation.
We are tracking this issue and will provide updates as they become available. Please let us know if you have any other questions.
Thank you for reporting this issue.
This issue appears to be related to potential delays in ringpop ownership updates during rolling restarts, leading to tasks being stuck in the database.
We have a known issue where the task queue manager does not react to ringpop ownership changes; it is being tracked here: [Link to issue]
To mitigate this issue, we recommend considering a workaround where the matching service subscribes to ringpop events and proactively unloads the task queue manager if it detects a loss of ownership. This way, the matching service can ensure it's always processing tasks from the correct partition.
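To make that workaround concrete, here is a minimal Go sketch of the idea: keep a registry of the task queue partitions a matching host has loaded, re-resolve ownership on every membership change, and unload anything that has moved. The names (PartitionRegistry, Ring, OnMembershipChange, MembershipEvent) are hypothetical and do not correspond to Temporal's internal APIs; this only illustrates the intended behavior.

```go
package main

import (
	"fmt"
	"sync"
)

// MembershipEvent is a simplified stand-in for a ringpop membership-change
// notification; the real server's membership resolver exposes richer events,
// so treat this shape as an illustration, not Temporal's API.
type MembershipEvent struct {
	HostsAdded   []string
	HostsRemoved []string
}

// Ring answers "which matching host owns this task queue partition?".
// In the server this would be backed by the ringpop hash ring; here it is
// a hypothetical interface.
type Ring interface {
	Owner(partitionKey string) string
}

// PartitionRegistry tracks the task queue partitions this matching host has
// loaded. Unloading is assumed to stop local pollers so that the new owner
// can load the partition, and its pending tasks, from the database.
type PartitionRegistry struct {
	mu       sync.Mutex
	selfAddr string
	loaded   map[string]bool // partition key -> loaded
}

func NewPartitionRegistry(selfAddr string) *PartitionRegistry {
	return &PartitionRegistry{selfAddr: selfAddr, loaded: map[string]bool{}}
}

func (r *PartitionRegistry) Load(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.loaded[key] = true
}

// OnMembershipChange is the core of the workaround: on every ringpop event,
// re-resolve ownership for each loaded partition and proactively unload the
// ones this host no longer owns, instead of waiting for a failed RPC or a
// poll timeout to reveal the move.
func (r *PartitionRegistry) OnMembershipChange(ring Ring, _ MembershipEvent) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for key := range r.loaded {
		if owner := ring.Owner(key); owner != r.selfAddr {
			fmt.Printf("lost ownership of %q to %s, unloading\n", key, owner)
			delete(r.loaded, key)
		}
	}
}

// staticRing is a fixed owner table used only for the demo below.
type staticRing map[string]string

func (s staticRing) Owner(key string) string { return s[key] }

func main() {
	reg := NewPartitionRegistry("matching-old:7235")
	reg.Load("default/my-task-queue/1")

	// After a rolling restart the ring resolves the partition to the new pod,
	// so the event handler unloads it on this (old) host.
	ring := staticRing{"default/my-task-queue/1": "matching-new:7235"}
	reg.OnMembershipChange(ring, MembershipEvent{HostsAdded: []string{"matching-new:7235"}})
}
```

In a real implementation the unload path would also need to cancel any outstanding long polls, so waiting workers reconnect and get routed to the new owner rather than staying parked on the old pod.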
Let me know if you have any more questions.
Thank you for reporting this issue. This sounds like a complex problem related to the interaction between Ringpop and the Temporal services during a rolling restart.
To better understand the situation, could you please share the relevant service logs and any details about how the rolling restart was performed?
With this information, we can analyze the logs and potentially identify the root cause of the issue.
In the meantime, you might consider the workaround discussed above: have the Matching service proactively unload a task queue manager once it detects it has lost ownership of the partition.
Let me know if you have any other questions or can provide the requested details.
Thank you for reporting this issue.
This seems to be a known issue related to the delay in ringpop updates during rolling restarts.
The current workaround is to use the --ringpop.gossip.interval=10s flag to configure a shorter gossip interval, which reduces the delay in membership updates.
You can find more information about ringpop configuration in the Temporal documentation.
We are actively working on improving the handling of ringpop updates during rolling restarts, and we will provide further updates as they become available.
Thank you for reporting this issue!
It seems you've identified a potential bottleneck in task processing during rolling restarts: the task queue manager doesn't get updated quickly enough after ringpop ownership changes. Tasks can then get stuck in the database, since workers are still long-polling the old matching pod while History pushes new tasks to the new owner, which has no pollers attached yet.
Your proposed solution of having the matching service subscribe to ringpop events and proactively unload the task queue manager if it detects lost ownership seems like a promising approach. This proactive approach would ensure that the correct matching service is handling tasks promptly.
To help us investigate and understand the issue further, could you please provide more details about your deployment and the rolling restart sequence you observed?
This information will help us identify potential root causes and validate proposed solutions.
During a rolling restart, ringpop ownership of a task queue partition may move, but the ringpop update can take some time to propagate to the History, Frontend, and Matching services. Long polls may still be waiting on the old matching pod while History is already pushing tasks to the new matching pod owner. Those tasks sit in the database because no one is polling the new matching pod, and the old matching pod doesn't know there are new tasks in the DB.
The Matching service needs to subscribe to ringpop events and proactively unload the task queue manager when it detects a loss of ownership.
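For illustration, here is a small Go sketch of the ownership check such an event handler would depend on: resolve the partition's owner against the current membership view with a consistent hash ring and compare it to the local host's identity. The ring below is a toy stand-in, not ringpop or Temporal's actual hash ring, and the partition key and addresses are made up.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashRing is a toy consistent-hash ring used only to illustrate the
// ownership check; it is not ringpop and not Temporal's real membership ring.
type hashRing struct {
	hashes []uint32
	owners map[uint32]string
}

func newHashRing(members []string) *hashRing {
	r := &hashRing{owners: map[uint32]string{}}
	for _, m := range members {
		h := hash32(m)
		r.hashes = append(r.hashes, h)
		r.owners[h] = m
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// Owner maps a partition key to the first member at or after its hash,
// wrapping around the ring.
func (r *hashRing) Owner(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0
	}
	return r.owners[r.hashes[i]]
}

func main() {
	partition := "default/my-task-queue/1" // hypothetical partition key
	self := "matching-old:7235"            // this host's advertised address

	// Two membership views: before and after a rolling restart replaces
	// matching-old with matching-new.
	before := newHashRing([]string{"matching-old:7235", "matching-b:7235"})
	after := newHashRing([]string{"matching-new:7235", "matching-b:7235"})

	// The check the matching service would run on each membership event:
	// if the resolved owner is no longer this host, unload the partition's
	// task queue manager instead of waiting for pollers or History to notice.
	fmt.Println("owned before restart:", before.Owner(partition) == self)
	fmt.Println("owned after restart: ", after.Owner(partition) == self)
}
```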