dhiaayachi opened this issue 2 months ago
Thank you for reporting this issue. I understand that during rolling restarts, there can be a delay in ringpop updates propagating to the History service, Frontend service, and Matching service, leading to tasks being stuck in the database.
This is a known issue, and we are working on a solution. In the meantime, you can consider the following workaround:
- matchingService.taskQueuePollInterval: controls how often the Matching service polls the task queue. Increasing this interval might help to mitigate the issue.
- matchingService.taskQueueMaxPollInterval: controls the maximum time the Matching service will wait for a task before giving up. Increasing this interval might also help.

You can find more information about these parameters in the Temporal Server Configuration documentation.
We are tracking this issue and will provide updates as they become available. Please let us know if you have any other questions.
Thank you for reporting this issue.
This issue appears to be related to potential delays in ringpop ownership updates during rolling restarts, leading to tasks being stuck in the database.
We have a known issue where the task queue manager does not react to ringpop ownership changes; it is being tracked here: [Link to issue]
To mitigate this issue, we recommend considering a workaround where the matching service subscribes to ringpop events and proactively unloads the task queue manager if it detects a loss of ownership. This way, the matching service can ensure it's always processing tasks from the correct partition.
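To make that workaround concrete, here is a minimal Go sketch of the idea: keep a registry of the task queue partitions a matching host has loaded, re-resolve ownership on every membership change, and unload anything that has moved. The names (PartitionRegistry, Ring, OnMembershipChange, MembershipEvent) are hypothetical and do not correspond to Temporal's internal APIs; this only illustrates the intended behavior.

```go
package main

import (
	"fmt"
	"sync"
)

// MembershipEvent is a simplified stand-in for a ringpop membership-change
// notification; the real server's membership resolver exposes richer events,
// so treat this shape as an illustration, not Temporal's API.
type MembershipEvent struct {
	HostsAdded   []string
	HostsRemoved []string
}

// Ring answers "which matching host owns this task queue partition?".
// In the server this would be backed by the ringpop hash ring; here it is
// a hypothetical interface.
type Ring interface {
	Owner(partitionKey string) string
}

// PartitionRegistry tracks the task queue partitions this matching host has
// loaded. Unloading is assumed to stop local pollers so that the new owner
// can load the partition, and its pending tasks, from the database.
type PartitionRegistry struct {
	mu       sync.Mutex
	selfAddr string
	loaded   map[string]bool // partition key -> loaded
}

func NewPartitionRegistry(selfAddr string) *PartitionRegistry {
	return &PartitionRegistry{selfAddr: selfAddr, loaded: map[string]bool{}}
}

func (r *PartitionRegistry) Load(key string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.loaded[key] = true
}

// OnMembershipChange is the core of the workaround: on every ringpop event,
// re-resolve ownership for each loaded partition and proactively unload the
// ones this host no longer owns, instead of waiting for a failed RPC or a
// poll timeout to reveal the move.
func (r *PartitionRegistry) OnMembershipChange(ring Ring, _ MembershipEvent) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for key := range r.loaded {
		if owner := ring.Owner(key); owner != r.selfAddr {
			fmt.Printf("lost ownership of %q to %s, unloading\n", key, owner)
			delete(r.loaded, key)
		}
	}
}

// staticRing is a fixed owner table used only for the demo below.
type staticRing map[string]string

func (s staticRing) Owner(key string) string { return s[key] }

func main() {
	reg := NewPartitionRegistry("matching-old:7235")
	reg.Load("default/my-task-queue/1")

	// After a rolling restart the ring resolves the partition to the new pod,
	// so the event handler unloads it on this (old) host.
	ring := staticRing{"default/my-task-queue/1": "matching-new:7235"}
	reg.OnMembershipChange(ring, MembershipEvent{HostsAdded: []string{"matching-new:7235"}})
}
```

In a real implementation the unload path would also need to cancel any outstanding long polls, so waiting workers reconnect and get routed to the new owner rather than staying parked on the old pod.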
Let me know if you have any more questions.
Thank you for reporting this issue. This sounds like a complex problem related to the interaction between Ringpop and the Temporal services during a rolling restart.
To better understand the situation, could you please share the relevant service logs and any details about how the rolling restart was performed?
With this information, we can analyze the logs and potentially identify the root cause of the issue.
In the meantime, you might consider the workaround discussed above: have the Matching service proactively unload a task queue manager once it detects it has lost ownership of the partition.
Let me know if you have any other questions or can provide the requested details.
Thank you for reporting this issue.
This seems to be a known issue related to the delay in ringpop updates during rolling restarts.
The current workaround is to use the --ringpop.gossip.interval=10s flag to configure a shorter gossip interval, which reduces the delay in membership updates.
You can find more information about ringpop configuration in the Temporal documentation.
We are actively working on improving the handling of ringpop updates during rolling restarts, and we will provide further updates as they become available.
Thank you for reporting this issue!
It seems you've identified a potential bottleneck in task processing during rolling restarts: the task queue manager doesn't get updated quickly enough after ringpop ownership changes. Tasks can then get stuck in the database, since workers are still long-polling the old matching pod while History pushes new tasks to the new owner, which has no pollers attached yet.
Your proposed solution of having the matching service subscribe to ringpop events and proactively unload the task queue manager if it detects lost ownership seems like a promising approach. This proactive approach would ensure that the correct matching service is handling tasks promptly.
To help us investigate and understand the issue further, could you please provide more details about your deployment and the rolling restart sequence you observed?
This information will help us identify potential root causes and validate proposed solutions.
During a rolling restart, ringpop ownership of a task queue partition may move, but the ringpop update can take some time to propagate to the History, Frontend, and Matching services. Long polls may still be waiting on the old matching pod while History is already pushing tasks to the new matching pod owner. Those tasks sit in the database because no one is polling the new matching pod, and the old matching pod doesn't know there are new tasks in the DB.
The Matching service needs to subscribe to ringpop events and proactively unload the task queue manager when it detects a loss of ownership.
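For illustration, here is a small Go sketch of the ownership check such an event handler would depend on: resolve the partition's owner against the current membership view with a consistent hash ring and compare it to the local host's identity. The ring below is a toy stand-in, not ringpop or Temporal's actual hash ring, and the partition key and addresses are made up.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// hashRing is a toy consistent-hash ring used only to illustrate the
// ownership check; it is not ringpop and not Temporal's real membership ring.
type hashRing struct {
	hashes []uint32
	owners map[uint32]string
}

func newHashRing(members []string) *hashRing {
	r := &hashRing{owners: map[uint32]string{}}
	for _, m := range members {
		h := hash32(m)
		r.hashes = append(r.hashes, h)
		r.owners[h] = m
	}
	sort.Slice(r.hashes, func(i, j int) bool { return r.hashes[i] < r.hashes[j] })
	return r
}

func hash32(s string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(s))
	return h.Sum32()
}

// Owner maps a partition key to the first member at or after its hash,
// wrapping around the ring.
func (r *hashRing) Owner(key string) string {
	h := hash32(key)
	i := sort.Search(len(r.hashes), func(i int) bool { return r.hashes[i] >= h })
	if i == len(r.hashes) {
		i = 0
	}
	return r.owners[r.hashes[i]]
}

func main() {
	partition := "default/my-task-queue/1" // hypothetical partition key
	self := "matching-old:7235"            // this host's advertised address

	// Two membership views: before and after a rolling restart replaces
	// matching-old with matching-new.
	before := newHashRing([]string{"matching-old:7235", "matching-b:7235"})
	after := newHashRing([]string{"matching-new:7235", "matching-b:7235"})

	// The check the matching service would run on each membership event:
	// if the resolved owner is no longer this host, unload the partition's
	// task queue manager instead of waiting for pollers or History to notice.
	fmt.Println("owned before restart:", before.Owner(partition) == self)
	fmt.Println("owned after restart: ", after.Owner(partition) == self)
}
```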