Azure / azure-sdk-for-js

This repository is for active development of the Azure SDK for JavaScript (NodeJS & Browser). For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/javascript/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-js.

[azure-service-bus] High occurrences of AMQP link churn when upgrading from 7.4.0 to 7.9.3 #28023

Open hfaivre opened 9 months ago

hfaivre commented 9 months ago

Describe the bug Since our recent upgrade of the Azure Service Bus SDK from 7.4.0 to 7.9.3 in our staging environment, we're seeing the following error:

ServiceBusError: Failed to complete the message as the AMQP link with which the message was received is no longer alive.

at translateServiceBusError (/home/dog/node_modules/@azure/service-bus/src/serviceBusError.ts:175:12)
at settleMessageOperation (/home/dog/node_modules/@azure/service-bus/src/receivers/receiverCommon.ts:287:13)
at Object.operation (/home/dog/node_modules/@azure/service-bus/src/receivers/receiverCommon.ts:251:14)
at Object.retry (/home/dog/node_modules/@azure/core-amqp/src/retry.ts:213:36)
at retry (/home/dog/node_modules/@azure/service-bus/src/receivers/receiverCommon.ts:248:10)
at tracingClient.withSpan.Object.assign.spanLinks.spanLinks (/home/dog/node_modules/@azure/service-bus/src/receivers/receiverCommon.ts:106:7)
at /home/dog/node_modules/@azure/core-tracing/src/tracingClient.ts:71:25
at Object.withContext (/home/dog/node_modules/@azure/core-tracing/src/instrumenter.ts:50:14)
at withContext (/home/dog/node_modules/@azure/core-tracing/src/tracingClient.ts:91:30)
at Object.withSpan (/home/dog/node_modules/@azure/core-tracing/src/tracingClient.ts:70:28)

With the following error object:

{
    error {
        code: SessionLockLost
        name: ServiceBusError
        retryable: false
    }
    level: error
    timestamp: 2023-12-06T12:11:16.721Z
}

To Reproduce Unfortunately, we don't have any deterministic steps to reproduce this issue. We see it occur sporadically when trying to complete a message after our worker has successfully processed it from the queue.
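
For context, our worker follows roughly the pattern sketched below (simplified; the connection string, queue name, batch size, wait time, and processing step are placeholders rather than our exact code):

import { ServiceBusClient, isServiceBusError } from "@azure/service-bus";
import type { ServiceBusReceivedMessage } from "@azure/service-bus";

// Placeholder for our actual worker logic.
async function processMessage(message: ServiceBusReceivedMessage): Promise<void> {
  // ... business logic ...
}

const client = new ServiceBusClient(process.env.SERVICE_BUS_CONNECTION_STRING ?? "");
const receiver = client.createReceiver("<queue-name-redacted>");

try {
  // Pull a batch of messages (example values).
  const messages = await receiver.receiveMessages(10, { maxWaitTimeInMs: 5000 });

  for (const message of messages) {
    await processMessage(message);
    try {
      // This is the call that intermittently fails with "Failed to complete the
      // message as the AMQP link with which the message was received is no longer alive."
      await receiver.completeMessage(message);
    } catch (err) {
      if (isServiceBusError(err) && !err.retryable) {
        // Non-retryable (e.g. SessionLockLost): the message will be redelivered
        // once its lock expires, so we only log it here.
        console.error(`Failed to settle message ${message.messageId}:`, err);
      } else {
        throw err;
      }
    }
  }
} finally {
  await receiver.close();
  await client.close();
}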

I've added relevant logs for a specific message ID where this is happening.

Renewing of the lock:

2023-12-06T12:11:17.099Z : azure:service-bus:receiver:verbose [connection-1|batching:<queue-name-redacted>] message with id '8d1d5c59-ff8a-4b2e-8051-d847de7705fb' is locked until Wed Dec 06 2023 12:11:44 GMT+0000 (GMT).
2023-12-06T12:11:17.099Z : azure:service-bus:receiver:verbose [connection-1|batching:<queue-name-redacted>] Total autolockrenew duration for message with id '8d1d5c59-ff8a-4b2e-8051-d847de7705fb' is: Wed Dec 06 2023 12:16:16 GMT+0000 (GMT)
2023-12-06T12:11:17.099Z : azure:service-bus:receiver:verbose Locked until utc : 1701864704959
2023-12-06T12:11:17.099Z : azure:service-bus:receiver:verbose Current time is : 1701864676167
2023-12-06T12:11:17.099Z : azure:service-bus:receiver:verbose Remaining time is : 28792
2023-12-06T12:11:17.099Z : azure:service-bus:receiver:verbose Renew after : 18792
2023-12-06T12:11:17.099Z : azure:service-bus:receiver:verbose [connection-1|batching:<queue-name-redacted>] Sleeping for 18792 milliseconds while renewing the lock for message with id '8d1d5c59-ff8a-4b2e-8051-d847de7705fb'

Settlement attempt:

2023-12-06T12:11:17.102Z : azure:service-bus:receiver:verbose [connection-1] Completing the message with id '8d1d5c59-ff8a-4b2e-8051-d847de7705fb'.
2023-12-06T12:11:17.102Z : azure:core-amqp:verbose [connection-1] Attempt number for 'settlement': 1.
2023-12-06T12:11:17.102Z : azure:service-bus:receiver:verbose [connection-1|batching:<queue-name-redacted>] is open? false
2023-12-06T12:11:17.102Z : azure:service-bus:receiver:warning [connection-1] An error occurred when settling a message with id '8d1d5c59-ff8a-4b2e-8051-d847de7705fb' : ServiceBusError: Failed to complete the message as the AMQP link with which the message was received is no longer alive.

Expected behavior No AMQP churn errors.

Additional context We did not have any occurrence of this AMQP link churn before upgrading from 7.4.0 to 7.9.3.

From the stack trace, it seems like the error is being thrown at the core-tracing tracing client level, so I've tried downgrading the SDK from 7.9.3 to 7.6.0, the version after which the core-tracing dependency was upgraded. We're not seeing any errors anymore with 7.6.0.

jeremymeng commented 9 months ago

@hfaivre Thanks a lot for reporting this issue! The tracing code is unlikely to be related; it only appears in the stack because it wraps the other operations.

The error suggests that the receiver is not open: https://github.com/Azure/azure-sdk-for-js/blob/17de1a2b7f3ad61f34ff62876eced7d077c10d4b/sdk/servicebus/service-bus/src/receivers/receiverCommon.ts#L287

Since v7.6.0, we have added changes that close the underlying AMQP links to prevent code from being blocked indefinitely, for example PR #25369 and PR #25355. Do you see any logs about timing out before the SessionLockLost error?
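
In case it helps with capturing those logs, the SDK's verbose logging can be enabled either by setting the AZURE_LOG_LEVEL environment variable or programmatically through @azure/logger (a minimal sketch):

import { setLogLevel } from "@azure/logger";

// Equivalent to running the process with AZURE_LOG_LEVEL=verbose.
// The azure:service-bus:receiver:* channels include the drain and link-closure messages.
setLogLevel("verbose");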

It's probably too much to ask, but if possible, could you please try other versions after 7.6.0 in your staging environment? It would greatly help us pinpoint the change that caused this.

Thanks again!

hfaivre commented 9 months ago

Hey @jeremymeng, the version being used in our staging environment when seeing these errors is actually the latest published version, v7.9.3. We would like to use this version in prod, but we are concerned about the errors we're seeing in staging and would like to understand/resolve them before we proceed with the upgrade in our prod env.

We are NOT seeing these errors in staging when using v7.6.0.

I can see some timeout logs occurring during a draining operation:

2023-12-06T12:11:16.098Z : 2023-12-06T12:11:15.713Z rhea-promise:receiver [connection-1] receiver '<queue-name-redacted>' got event: 'settled'. Re-emitting the translated context.
2023-12-06T12:11:16.098Z : 2023-12-06T12:11:15.713Z rhea-promise:translate [connection-1] Translating the context for event: 'settled'.
2023-12-06T12:11:16.098Z : azure:service-bus:receiver:verbose [connection-1|batching:<queue-name-redacted>] Delivery with id 1113, remote_settled: true, remote_state: c { value: [ [length]: 0 ] } has been received.
2023-12-06T12:11:16.098Z : azure:service-bus:receiver:verbose [connection-1|batching:<queue-name-redacted>] Found the delivery with id 1113 in the map and cleared the timer.
2023-12-06T12:11:16.098Z : azure:service-bus:receiver:verbose [connection-1|batching:<queue-name-redacted>] Successfully deleted the delivery with id 1113 from the map. true
2023-12-06T12:11:16.099Z : azure:core-amqp:verbose [connection-1] Success for 'settlement', after attempt number: 1.
2023-12-06T12:11:16.099Z : azure:service-bus:receiver:verbose [connection-1|r:<queue-name-redacted>] Batching, waited for 1000 milliseconds after receiving the first message.
2023-12-06T12:11:16.099Z : azure:service-bus:receiver:verbose [connection-1|r:<queue-name-redacted>] Draining leftover credits(7), waiting for event_drained event, or timing out after 200 milliseconds...
2023-12-06T12:11:16.099Z : 2023-12-06T12:11:15.964Z rhea:frames [connection-1]:1 -> flow#13 {"next_incoming_id":1125,"incoming_window":2045,"outgoing_window":4294967295,"delivery_count":6813,"link_credit":7,"drain":true}
2023-12-06T12:11:16.340Z : Error pushing message to local queue: Failed to complete the message as the AMQP link with which the message was received is no longer alive.
2023-12-06T12:11:16.424Z : Error pushing message to local queue: Failed to complete the message as the AMQP link with which the message was received is no longer alive.
2023-12-06T12:11:16.721Z : Error pushing message to local queue: Failed to complete the message as the AMQP link with which the message was received is no longer alive.
2023-12-06T12:11:17.098Z : azure:service-bus:receiver:warning [connection-1|r:<queue-name-redacted>] Time out after 200 milliseconds when draining credits. Closing receiver...

I hope this helps, let me know if you need anything else!

jeremymeng commented 9 months ago

@hfaivre Thanks for the log! It appears that the receiver was indeed closed because of the credit-draining timeout. I wonder whether the 200 ms limit we are using is too aggressive when network latency is high. Is it possible for you to try the higher value below to see whether it helps?

diff --git a/node_modules/@azure/service-bus/dist/index.js b/node_modules/@azure/service-bus/dist/index.js
index 9cf8a84..90a57a5 100644
--- a/node_modules/@azure/service-bus/dist/index.js
+++ b/node_modules/@azure/service-bus/dist/index.js
@@ -141,7 +141,7 @@ const packageJsonInfo = {
  * will wait while draining credits before returning.
  * @internal
  */
-const receiveDrainTimeoutInMs = 200;
+const receiveDrainTimeoutInMs = 1000;
 /**
  * @internal
  */

hfaivre commented 8 months ago

Hi @jeremymeng ,

Sorry for the long delay. We've patched the SDK with the increased drain timeout in our staging environment: even though we do see a decrease in the AMQP churn logs, we still see errors happening (about a quarter of the volume we saw with the 200 ms drain timeout).

jeremymeng commented 8 months ago

@hfaivre The reason we added the credit-draining timeout is that sometimes the Azure Service Bus service takes a long time (up to 7 seconds in our internal testing) to respond to the drain request, or doesn't respond at all, which causes the SDK code to get stuck indefinitely.

Is there any way we can figure out why the drain timeout is being reached, even at 1000 ms?

We don't have a complete understanding yet of why that happens. If you use a much bigger value like 10 seconds and the timeout still happens, it would be better to open a support ticket with details about when the timeout occurs and which namespace and queue/topic are involved, so that the service team can investigate further.

What would be the impact of increasing the timeout even more?

If the service response always comes back quickly, then it doesn't matter. If not, one consequence is that the longer the wait, the more often message locks may need to be renewed.
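
For reference, how long the SDK keeps renewing a message's lock in the background is controlled by the maxAutoLockRenewalDurationInMs receiver option (a minimal sketch with example values):

import { ServiceBusClient } from "@azure/service-bus";

const client = new ServiceBusClient("<connection-string>");

// The SDK renews each received message's lock in the background until this
// budget runs out (the default is 5 minutes). A longer drain wait simply means
// more of these renewals may happen before a message gets settled.
const receiver = client.createReceiver("<queue-name>", {
  maxAutoLockRenewalDurationInMs: 5 * 60 * 1000,
});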

Would you mind explaining the mechanism of the credit drain?

The receiveMessages() API takes a max message count, let's say n, and a max wait time. Internally we add n credits to the link, telling the service that we are ready to receive up to n messages. However, within the max wait time, the service may not have that many messages. Once the max wait time is reached, we drain the leftover credits so they don't interfere with the next receiveMessages() call, then we return the messages received so far.

When draining times out, the rhea link (rhea is our underlying AMQP dependency) would be in a state that the SDK cannot recover from, so we have to close the link.
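
From the caller's point of view, the flow described above looks roughly like this (a minimal sketch; the message count and wait time are example values):

import { ServiceBusClient } from "@azure/service-bus";

const client = new ServiceBusClient("<connection-string>");
const receiver = client.createReceiver("<queue-name>");

// Adds 10 credits to the AMQP link, i.e. tells the service we can take up to 10 messages.
const received = await receiver.receiveMessages(10, { maxWaitTimeInMs: 5000 });
// If fewer than 10 messages arrive within maxWaitTimeInMs, the SDK drains the
// leftover credits before resolving so they don't spill into the next
// receiveMessages() call. If that drain request itself times out, the link is
// closed, which is what later surfaces as "the AMQP link ... is no longer alive"
// when settling a message received on the old link.
console.log(`Received ${received.length} message(s)`);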

hfaivre commented 8 months ago

Thanks for those details, @jeremymeng.

We're not seeing any more AMQP churn errors in our staging environment when setting the drainTimeout to 2000 ms. That being said, given that we're not dealing with the same volumes in our production environment, we'd like the ability to change that timeout easily. Would it be possible for the next release of the SDK to allow us to set the drainTimeout on the fly rather than having a hardcoded value?

jeremymeng commented 8 months ago

Would it be possible for the next release of the SDK to allow us to set the drainTimeout on the fly rather than having a hardcoded value?

It doesn't feel right to me to expose an option to configure a behavior that is internal to the SDK. I will discuss with my team and investigate a better approach. It's unlikely that the next release will address this, though.