Azure / azure-sdk-for-js

This repository is for active development of the Azure SDK for JavaScript (NodeJS & Browser). For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/javascript/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-js.
MIT License

Publishing to event hub is very slow once every 75 minutes #26073

Closed joost-lambregts closed 1 year ago

joost-lambregts commented 1 year ago

Describe the bug We have an Azure Function that consumes from an event hub using an Event Hub trigger and pushes those events to another event hub using the @azure/event-hubs library. Once every 70-80 minutes, publishing takes far longer than usual; sometimes it eventually completes, sometimes it eventually times out. The interval at which the issue occurs is very regular: there are always between 70 and 80 minutes between incidents.

We sometimes encounter these error messages, with the first one being more frequent: The request with message_id "f83e606d-e826-a246-a5a4-7f5df85779ce" to "$cbs" endpoint timed out. Please try again later. The task timed out waiting to acquire a lock for negotiateClaim - bb96d88e-651e-b349-ad3d-21ce38a95b7a

To Reproduce Steps to reproduce the behavior:

  1. Create an Azure Function hosted on an App Service Plan with 'always on' enabled. Make sure it is under decent load.
  2. Make sure you cache and reuse the EventHubProducerClient instead of creating a new one each time.
  3. Have the function publish a message to event hub in every function execution
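Step 3 can be sketched as follows. This is a minimal illustration of the standard createBatch/sendBatch publish flow in @azure/event-hubs; `publishEvent` is a hypothetical helper, and `client` is assumed to be a cached EventHubProducerClient like the one shown later in this issue:

```javascript
// Hypothetical helper: publish a single event using a (cached)
// EventHubProducerClient. createBatch/sendBatch is the usual
// @azure/event-hubs publish flow.
async function publishEvent (client, body) {
  const batch = await client.createBatch()
  // tryAdd returns false if the event does not fit in the batch.
  if (!batch.tryAdd({ body })) {
    throw new Error('Event too large for a single batch')
  }
  await client.sendBatch(batch)
  return batch.count
}
```

Calling this once per function execution, against a client that is reused across executions, reproduces the publishing pattern described above.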

Expected behavior Publishing to event hub is consistently fast

Additional context The Azure Function runs on an App Service Plan with 'always on' enabled. This means that the NodeJS environment keeps running between function executions. It gets triggered about 500 times per minute.

We cache the EventhubProducerClient like this, and never close it:

const { EventHubProducerClient } = require('@azure/event-hubs')

// Module-level cache so clients survive across function executions
// (the process stays warm because 'always on' is enabled).
const eventhubClientPool = {}

function getEventhubClient (eventHubConnectionString, logcontext) {
  logDebug(logcontext, 'Eventhub client requested')
  let eventhubClient = eventhubClientPool[eventHubConnectionString]
  if (!eventhubClient) {
    eventhubClient = new EventHubProducerClient(eventHubConnectionString)
    eventhubClientPool[eventHubConnectionString] = eventhubClient
    logDebug(logcontext, `New eventhub producer created for eventhub: ${eventHubConnectionString}`)
  } else {
    logDebug(logcontext, 'Eventhub producer reused.')
  }
  return eventhubClient
}

We have another Azure Function that uses a very similar approach. This function also sometimes experiences timeouts when publishing to Event Hub, but there it is much more sporadic. The main difference I can see between the two functions is that the latter uses version 5.8.0 of the @azure/event-hubs package.

We have the exact same function deployed to our development, acceptance, and production environments and are experiencing the same issue in all three. This makes it less likely that the issue is caused by a problem in Azure infrastructure. To be sure, we also created a support request with Azure support. I will update this issue if any relevant information turns up there.

deyaaeldeen commented 1 year ago

Hi @joost-lambregts Thanks for filing this issue! It sounds like it is the same problem tracked in https://github.com/Azure/azure-sdk-for-js/issues/25928. Did you notice anything in the logs around the time when the timeout errors occurred?

joost-lambregts commented 1 year ago

Hi @deyaaeldeen, we sometimes see the following error messages, with the first being more frequent: The request with message_id "f83e606d-e826-a246-a5a4-7f5df85779ce" to "$cbs" endpoint timed out. Please try again later. The task timed out waiting to acquire a lock for negotiateClaim - bb96d88e-651e-b349-ad3d-21ce38a95b7a

On our sandbox environment we have some more logging, and we see the following messages: azure:core-amqp:verbose MessagingError: The request with message_id "f2d68b46-9f4d-ad45-a589-95b7c05f8e05" to "$cbs" endpoint timed out. Please try again later.

2023-05-23T11:14:52.844Z rhea:message decoding section: Typed { type: TypeDesc { name: 'Map8', typecode: 193, width: 1, category: 3, create: [Function (anonymous)] { typecode: 193 } }, value: [ Typed { type: [TypeDesc], value: 'status-code' }, Typed { type: [TypeDesc], value: 401 }, Typed { type: [TypeDesc], value: 'status-description' }, Typed { type: [TypeDesc], value: "ExpiredToken: The token is expired. Expiration time: '2023-05-22 22:28:22Z'" }, Typed { type: [TypeDesc], value: 'error-condition' }, Typed { type: [TypeDesc], value: 'com.microsoft:auth-failed' } ], descriptor: Typed { type: TypeDesc { name: 'SmallUlong', typecode: 83, width: 1, category: 1, read: [Function: read], write: [Function: write], create: [Function] }, value: 116 } } of type: Typed { type: TypeDesc { name: 'SmallUlong', typecode: 83, width: 1, category: 1, read: [Function: read], write: [Function: write], create: [Function (anonymous)] { typecode: 83 } }, value: 116 }

azure:core-amqp:verbose MessagingError: ExpiredToken: The token is expired. Expiration time: '2023-05-21 12:08:06Z' azure:event-hubs:verbose [connection-1] Sender evh-der-rt-derived-telemetry: an error occurred while renewing the token: undefined

session_error: [Function (anonymous)],

onError: [Function (anonymous)],

Another noteworthy detail is that the ExpiredToken messages consistently have an expiration time that is multiple days in the past.

joost-lambregts commented 1 year ago

Here are all the log messages with severity > 1 from one of the times the problem occurred.

timestamp                   message
2023-05-26 13:49:19.5742475 azure:core-amqp:verbose MessagingError: ExpiredToken: The token is expired. Expiration time: '2023-05-25 11:15:14Z'
2023-05-26 13:49:19.5746997 azure:core-amqp:verbose MessagingError: ExpiredToken: The token is expired. Expiration time: '2023-05-25 11:15:14Z'
2023-05-26 13:49:33.4778284 2023-05-26T13:49:33.465Z rhea:message decoding section: Typed { type: TypeDesc { name: 'Map8', typecode: 193, width: 1, category: 3, create: [Function (anonymous)] { typecode: 193 } }, value: [ Typed { type: [TypeDesc], value: 'status-code' }, Typed { type: [TypeDesc], value: 401 }, Typed { type: [TypeDesc], value: 'status-description' }, Typed { type: [TypeDesc], value: "ExpiredToken: The token is expired. Expiration time: '2023-05-21 23:58:30Z'" }, Typed { type: [TypeDesc], value: 'error-condition' }, Typed { type: [TypeDesc], value: 'com.microsoft:auth-failed' } ], descriptor: Typed { type: TypeDesc { name: 'SmallUlong', typecode: 83, width: 1, category: 1, read: [Function: read], write: [Function: write], create: [Function] }, value: 116 } } of type: Typed { type: TypeDesc { name: 'SmallUlong', typecode: 83, width: 1, category: 1, read: [Function: read], write: [Function: write], create: [Function (anonymous)] { typecode: 83 } }, value: 116 }
2023-05-26 13:49:33.4779736 azure:core-amqp:warning MessagingError: ExpiredToken: The token is expired. Expiration time: '2023-05-21 23:58:30Z'
2023-05-26 13:49:33.4780042 azure:core-amqp:verbose MessagingError: ExpiredToken: The token is expired. Expiration time: '2023-05-21 23:58:30Z'
2023-05-26 13:49:33.4783933 azure:core-amqp:warning [connection-19] An error occurred while negotiating the cbs claim: MessagingError: ExpiredToken: The token is expired. Expiration time: '2023-05-21 23:58:30Z'
2023-05-26 13:49:33.4784225 azure:core-amqp:verbose MessagingError: ExpiredToken: The token is expired. Expiration time: '2023-05-21 23:58:30Z'
2023-05-26 13:49:33.4787772 azure:event-hubs:verbose [connection-1] Sender evh-der-rt-derived-telemetry: an error occurred while renewing the token: undefined
2023-05-26 13:49:41.7846610 azure:core-amqp:warning [connection-19] An error occurred while negotiating the cbs claim: MessagingError: The request with message_id "5c5a50b5-a493-334e-a1cd-537b6dd6157a" to "$cbs" endpoint timed out. Please try again later.
2023-05-26 13:49:41.7847397 azure:core-amqp:verbose MessagingError: The request with message_id "5c5a50b5-a493-334e-a1cd-537b6dd6157a" to "$cbs" endpoint timed out. Please try again later.
2023-05-26 13:49:41.7849215 azure:event-hubs:warning [connection-1] Sender evh-der-rt-derived-telemetry: an error occurred while being created: MessagingError: The request with message_id "5c5a50b5-a493-334e-a1cd-537b6dd6157a" to "$cbs" endpoint timed out. Please try again later.
2023-05-26 13:49:41.7849692 azure:event-hubs:verbose MessagingError: The request with message_id "5c5a50b5-a493-334e-a1cd-537b6dd6157a" to "$cbs" endpoint timed out. Please try again later.
2023-05-26 13:49:41.7851484 azure:core-amqp:verbose [connection-19] Error occurred for 'senderLink' in attempt number 3: MessagingError: The request with message_id "5c5a50b5-a493-334e-a1cd-537b6dd6157a" to "$cbs" endpoint timed out. Please try again later.
2023-05-26 13:49:41.7854641 code: 'OperationTimeoutError'
2023-05-26 13:49:53.5344416 2023-05-26T13:49:53.522Z rhea:message decoding section: Typed { type: TypeDesc { name: 'Map8', typecode: 193, width: 1, category: 3, create: [Function (anonymous)] { typecode: 193 } }, value: [ Typed { type: [TypeDesc], value: 'status-code' }, Typed { type: [TypeDesc], value: 401 }, Typed { type: [TypeDesc], value: 'status-description' }, Typed { type: [TypeDesc], value: "ExpiredToken: The token is expired. Expiration time: '2023-05-22 21:20:50Z'" }, Typed { type: [TypeDesc], value: 'error-condition' }, Typed { type: [TypeDesc], value: 'com.microsoft:auth-failed' } ], descriptor: Typed { type: TypeDesc { name: 'SmallUlong', typecode: 83, width: 1, category: 1, read: [Function: read], write: [Function: write], create: [Function] }, value: 116 } } of type: Typed { type: TypeDesc { name: 'SmallUlong', typecode: 83, width: 1, category: 1, read: [Function: read], write: [Function: write], create: [Function (anonymous)] { typecode: 83 } }, value: 116 }
2023-05-26 13:49:53.5345813 azure:core-amqp:warning MessagingError: ExpiredToken: The token is expired. Expiration time: '2023-05-22 21:20:50Z'
2023-05-26 13:49:53.5346114 azure:core-amqp:verbose MessagingError: ExpiredToken: The token is expired. Expiration time: '2023-05-22 21:20:50Z'
2023-05-26 13:49:53.5411176 azure:core-amqp:warning [connection-19] An error occurred while negotiating the cbs claim: MessagingError: ExpiredToken: The token is expired. Expiration time: '2023-05-22 21:20:50Z'
2023-05-26 13:49:53.5411466 azure:core-amqp:verbose MessagingError: ExpiredToken: The token is expired. Expiration time: '2023-05-22 21:20:50Z'
2023-05-26 13:49:53.5415484 azure:event-hubs:verbose [connection-1] Sender evh-der-rt-derived-telemetry: an error occurred while renewing the token: undefined
2023-05-26 13:50:09.8421223 2023-05-26T13:50:09.834Z rhea:frames [connection-19]:0 <- detach#16 {"closed":true,"error":{"condition":"amqp:connection:forced","description":"The connection was inactive for more than the allowed 300000 milliseconds and is closed by container 'LinkTracker'. TrackingId:0c1502492d0f45fda3f1135e2cf16bc4_G19, SystemTracker:gateway5, Timestamp:2023-05-26T13:50:09"}}
2023-05-26 13:50:09.8421559 2023-05-26T13:50:09.834Z rhea:events [connection-19] Link got event: sender_error
2023-05-26 13:50:09.8421869 2023-05-26T13:50:09.834Z rhea-promise:sender [connection-19] sender 'a5dff1a1-f541-b14b-8749-806ee6d27323' got event: 'sender_error'. Re-emitting the translated context.
2023-05-26 13:50:09.8422188 2023-05-26T13:50:09.834Z rhea-promise:translate [connection-19] Translating the context for event: 'sender_error'.
2023-05-26 13:50:09.8422487 azure:core-amqp:verbose [connection-19] An error occurred on the cbs sender link.. MessagingError: The connection was inactive for more than the allowed 300000 milliseconds and is closed by container 'LinkTracker'. TrackingId:0c1502492d0f45fda3f1135e2cf16bc4_G19, SystemTracker:gateway5, Timestamp:2023-05-26T13:50:09
2023-05-26 13:50:09.8428084 code: 'ConnectionForcedError'
2023-05-26 13:50:09.8532167 2023-05-26T13:50:09.836Z rhea:frames [connection-19]:0 <- detach#16 {"handle":1,"closed":true,"error":{"condition":"amqp:connection:forced","description":"The connection was inactive for more than the allowed 300000 milliseconds and is closed by container 'LinkTracker'. TrackingId:0c1502492d0f45fda3f1135e2cf16bc4_G19, SystemTracker:gateway5, Timestamp:2023-05-26T13:50:09"}}
2023-05-26 13:50:09.8535603 2023-05-26T13:50:09.836Z rhea:events [connection-19] Link got event: receiver_error
2023-05-26 13:50:09.8536177 2023-05-26T13:50:09.836Z rhea-promise:receiver [connection-19] receiver 'cbs-b90d3815-0a35-9b46-aa41-416fdccd1c07' got event: 'receiver_error'. Re-emitting the translated context.
2023-05-26 13:50:09.8536512 2023-05-26T13:50:09.837Z rhea-promise:translate [connection-19] Translating the context for event: 'receiver_error'.
2023-05-26 13:50:09.8536811 azure:core-amqp:verbose [connection-19] An error occurred on the cbs receiver link.. MessagingError: The connection was inactive for more than the allowed 300000 milliseconds and is closed by container 'LinkTracker'. TrackingId:0c1502492d0f45fda3f1135e2cf16bc4_G19, SystemTracker:gateway5, Timestamp:2023-05-26T13:50:09
2023-05-26 13:50:09.8624239 code: 'ConnectionForcedError'
2023-05-26 13:50:09.8789197 2023-05-26T13:50:09.840Z rhea:frames [connection-19]:0 <- close#18 {"error":{"condition":"amqp:connection:forced","description":"The connection was inactive for more than the allowed 300000 milliseconds and is closed by container 'LinkTracker'. TrackingId:0c1502492d0f45fda3f1135e2cf16bc4_G19, SystemTracker:gateway5, Timestamp:2023-05-26T13:50:09"}}
2023-05-26 13:50:09.8809215 2023-05-26T13:50:09.843Z rhea:events [connection-19] Connection got event: connection_error
2023-05-26 13:50:09.8809740 2023-05-26T13:50:09.843Z rhea-promise:connection [connection-19] connection 'connection-19' got event: 'connection_error'. Re-emitting the translated context.
2023-05-26 13:50:09.8810094 2023-05-26T13:50:09.843Z rhea-promise:translate [connection-19] Translating the context for event: 'connection_error'.
2023-05-26 13:50:09.8815013 azure:event-hubs:verbose [connection-19] Accompanying error on the context.connection: c {
2023-05-26 13:50:09.8818141 azure:event-hubs:verbose [connection-19] Accompanying error on the context: Error: read ECONNRESET
2023-05-26 13:50:09.8820579 2023-05-26T13:50:09.862Z rhea-promise:error [connection-19] The sender 'evh-der-rt-derived-telemetry' on amqp session 'local-1_remote-1_connection-19' is open ? -> false
2023-05-26 13:50:09.8821126 2023-05-26T13:50:09.863Z rhea-promise:error [connection-19] The amqp session 'local-1_remote-1_connection-19' is open ? -> false
2023-05-26 13:50:09.8822039 2023-05-26T13:50:09.869Z rhea-promise:error [connection-19] The connection is open ? -> false
2023-05-26 13:50:09.8822343 2023-05-26T13:50:09.870Z rhea-promise:eventhandler [connection-20] rhea-promise 'connection' object is listening for events: [ 'connection_open', 'connection_close', 'connection_error', 'protocol_error', 'error', 'disconnected', 'settled', 'sender_error', 'sender_close', 'receiver_error', 'receiver_close', 'session_error', 'session_close' ] emitted by rhea's 'connection' object.
2023-05-26 13:50:11.8741472 2023-05-26T13:50:11.872Z rhea-promise:eventhandler [connection-20] rhea-promise 'session' object 'local-0_remote-undefined_connection-20' is listening for events: [ 'session_open', 'session_error', 'session_close', 'settled', 'sender_error', 'sender_close', 'receiver_error', 'receiver_close' ] emitted by rhea's 'session' object.
2023-05-26 13:50:11.8835213 2023-05-26T13:50:11.882Z rhea-promise:eventhandler [connection-20] rhea-promise 'sender' object is listening for events: [ 'sendable', 'sender_open', 'sender_draining', 'sender_flow', 'sender_error', 'sender_close', 'accepted', 'released', 'rejected', 'modified', 'settled' ] emitted by rhea's 'sender' object.
2023-05-26 13:50:11.8835489 2023-05-26T13:50:11.882Z rhea-promise:eventhandler [connection-20] ListenerCount for event 'sender_error' on rhea's 'sender' object is: 1.
2023-05-26 13:50:11.8921804 2023-05-26T13:50:11.889Z rhea-promise:session [connection-20] Added event handler for event 'session_error' on rhea-promise 'session: local-0_remote-0_connection-20', while creating the 'receiver'.
2023-05-26 13:50:11.8922090 2023-05-26T13:50:11.890Z rhea-promise:eventhandler [connection-20] rhea-promise 'receiver' object is listening for events: [ 'message', 'receiver_open', 'receiver_drained', 'receiver_flow', 'receiver_error', 'receiver_close', 'settled' ] emitted by rhea's 'receiver' object.
2023-05-26 13:50:11.8922355 2023-05-26T13:50:11.890Z rhea-promise:eventhandler [connection-20] ListenerCount for event 'receiver_error' on rhea's 'receiver' object is: 1.
2023-05-26 13:50:11.9086746 WARN-APETEL-0001: Failed to send messages to eventhub: The request with message_id "ec358292-7af2-6547-8b44-03d1305192fd" to "$cbs" endpoint timed out. Please try again later.
2023-05-26 13:50:11.9176769 azure:core-amqp:warning [connection-20] An error occurred while negotiating the cbs claim: MessagingError: The request with message_id "ec358292-7af2-6547-8b44-03d1305192fd" to "$cbs" endpoint timed out. Please try again later.
2023-05-26 13:50:11.9177052 azure:core-amqp:verbose MessagingError: The request with message_id "ec358292-7af2-6547-8b44-03d1305192fd" to "$cbs" endpoint timed out. Please try again later.
2023-05-26 13:50:11.9178794 azure:event-hubs:warning [connection-1] Sender evh-der-rt-derived-telemetry: an error occurred while being created: MessagingError: The request with message_id "ec358292-7af2-6547-8b44-03d1305192fd" to "$cbs" endpoint timed out. Please try again later.
2023-05-26 13:50:11.9179075 azure:event-hubs:verbose MessagingError: The request with message_id "ec358292-7af2-6547-8b44-03d1305192fd" to "$cbs" endpoint timed out. Please try again later.
2023-05-26 13:50:11.9180831 azure:core-amqp:verbose [connection-19] Error occurred for 'senderLink' in attempt number 4: MessagingError: The request with message_id "ec358292-7af2-6547-8b44-03d1305192fd" to "$cbs" endpoint timed out. Please try again later.
2023-05-26 13:50:11.9182573 code: 'OperationTimeoutError'
2023-05-26 13:50:11.9183139 azure:event-hubs:warning [connection-1] Sender evh-der-rt-derived-telemetry: an error occurred while creating: MessagingError: The request with message_id "ec358292-7af2-6547-8b44-03d1305192fd" to "$cbs" endpoint timed out. Please try again later.
2023-05-26 13:50:11.9183442 azure:event-hubs:verbose MessagingError: The request with message_id "ec358292-7af2-6547-8b44-03d1305192fd" to "$cbs" endpoint timed out. Please try again later.
2023-05-26 13:50:12.6995356 onError: [Function (anonymous)],
2023-05-26 13:50:12.6997070 onSessionError: [Function (anonymous)],
2023-05-26 13:50:12.7075567 2023-05-26T13:50:12.602Z rhea-promise:eventhandler [connection-20] rhea-promise 'session' object 'local-1_remote-undefined_connection-20' is listening for events: [ 'session_open', 'session_error', 'session_close', 'settled', 'sender_error', 'sender_close', 'receiver_error', 'receiver_close' ] emitted by rhea's 'session' object.
2023-05-26 13:50:12.7207127 2023-05-26T13:50:12.620Z rhea-promise:session [connection-20] Added event handler for event 'session_error' on rhea-promise 'session: local-1_remote-1_connection-20', while creating the sender.
2023-05-26 13:50:12.7207688 2023-05-26T13:50:12.620Z rhea-promise:eventhandler [connection-20] rhea-promise 'sender' object is listening for events: [ 'sendable', 'sender_open', 'sender_draining', 'sender_flow', 'sender_error', 'sender_close', 'accepted', 'released', 'rejected', 'modified', 'settled' ] emitted by rhea's 'sender' object.
2023-05-26 13:50:12.7207963 2023-05-26T13:50:12.620Z rhea-promise:eventhandler [connection-20] ListenerCount for event 'sender_error' on rhea's 'sender' object is: 1.
2023-05-26 13:50:12.7273368 onError: [Function (anonymous)],
2023-05-26 13:50:12.7274125 onSessionError: [Function (anonymous)],
2023-05-26 13:50:12.7278239 sender_error: [Function (anonymous)],
2023-05-26 13:50:12.7281014 session_error: [Function (anonymous)],
2023-05-26 13:50:12.7368536 closed_with_non_fatal_error: false,
2023-05-26 13:50:12.7371633 onError: [Function (anonymous)],
2023-05-26 13:50:12.7372221 onSessionError: [Function (anonymous)],
2023-05-26 13:50:12.7381912 sender_error: [Function (anonymous)],
2023-05-26 13:50:12.7385618 onError: [Function (anonymous)],
2023-05-26 13:50:12.7386161 onSessionError: [Function (anonymous)],
2023-05-26 13:50:15.5463526 2023-05-26T13:50:15.487Z rhea:message decoding section: Typed { type: TypeDesc { name: 'Map8', typecode: 193, width: 1, category: 3, create: [Function (anonymous)] { typecode: 193 } }, value: [ Typed { type: [TypeDesc], value: 'status-code' }, Typed { type: [TypeDesc], value: 401 }, Typed { type: [TypeDesc], value: 'status-description' }, Typed { type: [TypeDesc], value: "ExpiredToken: The token is expired. Expiration time: '2023-05-20 15:56:34Z'" }, Typed { type: [TypeDesc], value: 'error-condition' }, Typed { type: [TypeDesc], value: 'com.microsoft:auth-failed' } ], descriptor: Typed { type: TypeDesc { name: 'SmallUlong', typecode: 83, width: 1, category: 1, read: [Function: read], write: [Function: write], create: [Function] }, value: 116 } } of type: Typed { type: TypeDesc { name: 'SmallUlong', typecode: 83, width: 1, category: 1, read: [Function: read], write: [Function: write], create: [Function (anonymous)] { typecode: 83 } }, value: 116 }
2023-05-26 13:50:15.5465068 azure:core-amqp:warning MessagingError: ExpiredToken: The token is expired. Expiration time: '2023-05-20 15:56:34Z'
2023-05-26 13:50:15.5465356 azure:core-amqp:verbose MessagingError: ExpiredToken: The token is expired. Expiration time: '2023-05-20 15:56:34Z'
2023-05-26 13:50:15.5469109 azure:core-amqp:warning [connection-20] An error occurred while negotiating the cbs claim: MessagingError: ExpiredToken: The token is expired. Expiration time: '2023-05-20 15:56:34Z'
2023-05-26 13:50:15.5469388 azure:core-amqp:verbose MessagingError: ExpiredToken: The token is expired. Expiration time: '2023-05-20 15:56:34Z'
2023-05-26 13:50:15.5473638 azure:event-hubs:verbose [connection-1] Sender evh-der-rt-derived-telemetry: an error occurred while renewing the token: undefined
jeremymeng commented 1 year ago

We just released @azure/event-hubs v5.11.1. Please let us know if it hasn't addressed the issue that you are seeing.

joost-lambregts commented 1 year ago

Hi @jeremymeng, thanks for the response. We recently refactored our code to use output bindings instead of the @azure/event-hubs SDK, which also resolved our problem. This issue can be closed as far as I'm concerned.

jeremymeng commented 1 year ago

Thanks @joost-lambregts for circling back!