Azure / azure-sdk-for-go

This repository is for active development of the Azure SDK for Go. For consumers of the SDK we recommend visiting our public developer docs at:
https://docs.microsoft.com/azure/developer/go/

[Event Hub] Event Hub consumer stuck on retry loop #23282

Closed: tinenbruno closed this issue 3 months ago

tinenbruno commented 3 months ago

Bug Report

From time to time we experience a retry loop on the Event Hub connection from a consumer. It gets stuck in this retry loop and requires a manual restart of the service before it can reconnect to the Event Hub.

Example of logs:

Jul 29, 2024 20:23:54.115 [azeh.Consumer] (c:1,l:u_UeM,p:1) Retry attempt 0 returned retryable error: write tcp 10.1.1.142:35090->40.125.103.251:5671: write: broken pipe
Jul 29, 2024 20:23:54.115 [azeh.Consumer] (c:1,l:u_UeM,p:1) Retry attempt 1 sleeping for 4.497875431s
Jul 29, 2024 20:23:54.616 [azeh.Consumer] (c:1,l:u_UeM,p:1) Retry attempt 0 returned retryable error: write tcp 10.1.1.142:35090->40.125.103.251:5671: write: broken pipe

So it never actually gets to finish the retry attempts.


That's one of the difficulties here: we are unable to easily reproduce this error, as it depends on a transient network error between the cluster and Event Hubs.

The service runs inside AKS and connects directly to this Event Hub instance inside the same service group. From time to time we also notice connection instabilities in a Strimzi deployment running inside the same cluster, which manages some MongoDB Kafka producer connectors.

For the consumer we use a context with a 60s timeout, and it is set to wait for a single event before processing it.
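
Roughly, the consumer loop looks like the following minimal sketch; the connection-string client, partition "0", and `processEvent` are placeholders, while the 60s deadline and single-event receive match the description above:

```go
package example

import (
	"context"
	"errors"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs"
)

// consume mirrors the setup described above: a 60s deadline per receive call,
// waiting for a single event before processing it.
func consume(connectionString, eventHubName string, processEvent func(*azeventhubs.ReceivedEventData)) error {
	consumerClient, err := azeventhubs.NewConsumerClientFromConnectionString(
		connectionString, eventHubName, azeventhubs.DefaultConsumerGroup, nil)
	if err != nil {
		return err
	}
	defer consumerClient.Close(context.Background())

	// Hypothetical: a single partition "0"; the real service covers all partitions.
	partitionClient, err := consumerClient.NewPartitionClient("0", nil)
	if err != nil {
		return err
	}
	defer partitionClient.Close(context.Background())

	for {
		// 60s deadline per receive call, waiting for a single event.
		ctx, cancel := context.WithTimeout(context.Background(), 60*time.Second)
		events, err := partitionClient.ReceiveEvents(ctx, 1, nil)
		cancel()

		if err != nil && !errors.Is(err, context.DeadlineExceeded) {
			return err // non-timeout errors bubble up to the caller
		}

		for _, event := range events {
			processEvent(event)
		}
	}
}
```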

github-actions[bot] commented 3 months ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

richardpark-msft commented 3 months ago

I'll take a look - the retries definitely should not keep resetting!

richardpark-msft commented 3 months ago

@tinenbruno, are there any other logs with this? There should be something showing that we're attempting to recreate the connection, for instance.
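
For reference, the SDK's diagnostic output (the azeh.* lines quoted in this issue) can be captured with a log listener. A minimal sketch, assuming the azcore/log listener API and the azeventhubs log event constants (EventConn, EventAuth, EventConsumer, EventProducer):

```go
package example

import (
	"fmt"

	azlog "github.com/Azure/azure-sdk-for-go/sdk/azcore/log"
	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs"
)

// enableEventHubsLogging routes the SDK's diagnostic output to stdout.
func enableEventHubsLogging() {
	azlog.SetListener(func(event azlog.Event, msg string) {
		fmt.Printf("[%s] %s\n", event, msg)
	})

	// Limit output to the Event Hubs log classes.
	azlog.SetEvents(
		azeventhubs.EventConn,
		azeventhubs.EventAuth,
		azeventhubs.EventConsumer,
		azeventhubs.EventProducer,
	)
}
```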

tinenbruno commented 3 months ago

Taking a closer look we also have the following logs:

  1. When this error starts to happen, we see the following from the producer side:

    Jul 18, 2024 04:32:33.005 [azeh.Producer] Error in RPCLink, stopping response router: read tcp 10.1.1.88:54334->40.125.103.251:5671: i/o timeout
    Jul 18, 2024 04:32:33.005 [azeh.Producer] Response router has shut down
  2. On the connection/auth side we have:

    Jul 18, 2024 04:32:49.516 [azeh.Conn] (c:1,l:whrw1,p:2) Error when cleaning up old link: read tcp 10.1.1.88:54328->40.125.103.251:5671: i/o timeout
    Jul 18, 2024 04:32:49.516 [azeh.Conn] Error closing old client: read tcp 10.1.1.88:54328->40.125.103.251:5671: i/o timeout
    Jul 18, 2024 04:32:49.516 [azeh.Conn] Creating a new client (rev:1)
    Jul 18, 2024 04:32:49.516 [azeh.Conn] Creating new client, current rev: 1
    Jul 18, 2024 04:32:49.546 [azeh.Conn] Client created, new rev: 2, took 29ms
    Jul 18, 2024 04:32:49.546 [azeh.Conn] Creating link for partition ID '2'
    Jul 18, 2024 04:32:49.546 [azeh.Auth] (mongo.payment.payments/ConsumerGroups/orderconsumergroup/Partitions/2) refreshing claim
    Jul 18, 2024 04:32:49.546 [azeh.Auth] (mongo.payment.payments/ConsumerGroups/orderconsumergroup/Partitions/2) negotiate claim, token expires on 2024-07-18T04:47:49Z
    Jul 18, 2024 04:32:49.549 [azeh.Auth] negotiating claim for audience amqps://rs-event-hub-prod.servicebus.windows.net/mongo.payment.payments/ConsumerGroups/orderconsumergroup/Partitions/2 with token type servicebus.windows.net:sastoken and expiry of 1721284370
    Jul 18, 2024 04:32:49.552 [azeh.Auth] Response router has shut down
    Jul 18, 2024 04:32:49.553 [azeh.Auth] (mongo.payment.payments/ConsumerGroups/orderconsumergroup/Partitions/2) next refresh in 4m59.992437794s
    Jul 18, 2024 04:32:49.555 [azeh.Consumer] Enabling prefetch with 300 credits
    Jul 18, 2024 04:32:49.555 [azeh.Consumer] Creating receiver:   source:mongo.payment.payments/ConsumerGroups/orderconsumergroup/Partitions/2   instanceID: ea14d8c3-86bc-40ee-79ee-b226315fa602   owner level: 824637597136   offset: amqp.annotation.x-opt-sequence-number > '138202'   manual: false   prefetch: 0
    Jul 18, 2024 04:32:49.616 [azeh.Conn] (c:2,l:b_at8,p:2): Succesfully created link for partition ID '2'
    Jul 18, 2024 04:32:50.294 [azeh.Conn] (c:1,l:4c_a5,p:0) Error when cleaning up old link: read tcp 10.1.1.88:54328->40.125.103.251:5671: i/o timeout
    Jul 18, 2024 04:32:50.294 [azeh.Conn] Skipping connection recovery, already recovered: 2 vs 1. Links will still be recovered.
    Jul 18, 2024 04:32:50.294 [azeh.Conn] Creating link for partition ID '0'
    Jul 18, 2024 04:32:50.294 [azeh.Auth] (mongo.payment.payments/ConsumerGroups/orderconsumergroup/Partitions/0) refreshing claim
    Jul 18, 2024 04:32:50.294 [azeh.Auth] (mongo.payment.payments/ConsumerGroups/orderconsumergroup/Partitions/0) negotiate claim, token expires on 2024-07-18T04:47:50Z
    Jul 18, 2024 04:32:50.297 [azeh.Auth] negotiating claim for audience amqps://rs-event-hub-prod.servicebus.windows.net/mongo.payment.payments/ConsumerGroups/orderconsumergroup/Partitions/0 with token type servicebus.windows.net:sastoken and expiry of 1721284370
    Jul 18, 2024 04:32:50.298 [azeh.Auth] Response router has shut down
    Jul 18, 2024 04:32:50.299 [azeh.Auth] (mongo.payment.payments/ConsumerGroups/orderconsumergroup/Partitions/0) next refresh in 4m59.994437996s
    Jul 18, 2024 04:32:50.300 [azeh.Consumer] Enabling prefetch with 300 credits
    Jul 18, 2024 04:32:50.300 [azeh.Consumer] Creating receiver:   source:mongo.payment.payments/ConsumerGroups/orderconsumergroup/Partitions/0   instanceID: ea14d8c3-86bc-40ee-79ee-b226315fa602   owner level: 824637597136   offset: amqp.annotation.x-opt-sequence-number > '138290'   manual: false   prefetch: 0
    Jul 18, 2024 04:32:50.304 [azeh.Conn] (c:2,l:9RRmG,p:0): Succesfully created link for partition ID '0'
    Jul 18, 2024 04:32:50.467 [azeh.Conn] (c:1,l:OL8Xx,p:1) Error when cleaning up old link: read tcp 10.1.1.88:54328->40.125.103.251:5671: i/o timeout
    Jul 18, 2024 04:32:50.467 [azeh.Conn] Skipping connection recovery, already recovered: 2 vs 1. Links will still be recovered.
    Jul 18, 2024 04:32:50.467 [azeh.Conn] Creating link for partition ID '1'
    Jul 18, 2024 04:32:50.467 [azeh.Auth] (mongo.payment.payments/ConsumerGroups/orderconsumergroup/Partitions/1) refreshing claim
    Jul 18, 2024 04:32:50.467 [azeh.Auth] (mongo.payment.payments/ConsumerGroups/orderconsumergroup/Partitions/1) negotiate claim, token expires on 2024-07-18T04:47:50Z
    Jul 18, 2024 04:32:50.470 [azeh.Auth] negotiating claim for audience amqps://rs-event-hub-prod.servicebus.windows.net/mongo.payment.payments/ConsumerGroups/orderconsumergroup/Partitions/1 with token type servicebus.windows.net:sastoken and expiry of 1721284370
    Jul 18, 2024 04:32:50.471 [azeh.Auth] Response router has shut down
    Jul 18, 2024 04:32:50.475 [azeh.Auth] (mongo.payment.payments/ConsumerGroups/orderconsumergroup/Partitions/1) next refresh in 4m59.991934194s
    Jul 18, 2024 04:32:50.476 [azeh.Consumer] Enabling prefetch with 300 credits
    Jul 18, 2024 04:32:50.476 [azeh.Consumer] Creating receiver:   source:mongo.payment.payments/ConsumerGroups/orderconsumergroup/Partitions/1   instanceID: ea14d8c3-86bc-40ee-79ee-b226315fa602   owner level: 824637597136   offset: amqp.annotation.x-opt-sequence-number > '139053'   manual: false   prefetch: 0
    Jul 18, 2024 04:32:50.481 [azeh.Conn] (c:2,l:Jx8WM,p:1): Succesfully created link for partition ID '1'
  3. We have two consumers for different event hubs; one of them in this case also kept logging this error:

    Jul 18, 2024 04:38:02.417 [azeh.Auth] (mongo.commerce-kitchen-display.card/ConsumerGroups/ordering-cg/Partitions/0) refreshing claim
    Jul 18, 2024 04:38:02.417 [azeh.Auth] (mongo.commerce-kitchen-display.card/ConsumerGroups/ordering-cg/Partitions/0) negotiate claim, token expires on 2024-07-18T04:53:02Z
    Jul 18, 2024 04:38:02.417 [azeh.Conn] Error closing old client: *Error{Condition: amqp:connection:forced, Description: The connection was closed by container '79d31b61bdc241baa0fbc205f11ed405_G4' because it did not have any active links in the past 60000 milliseconds. TrackingId:79d31b61bdc241baa0fbc205f11ed405_G4, SystemTracker:gateway5, Timestamp:2024-07-18T04:35:10, Info: map[]}
    Jul 18, 2024 04:38:02.417 [azeh.Conn] Creating a new client (rev:3)
    Jul 18, 2024 04:38:02.417 [azeh.Conn] Creating new client, current rev: 3
    Jul 18, 2024 04:38:02.441 [azeh.Conn] Client created, new rev: 4, took 23ms
    Jul 18, 2024 04:38:02.441 [azeh.Auth] (mongo.commerce-kitchen-display.card/ConsumerGroups/ordering-cg/Partitions/0) negotiate claim, failed: *Error{Condition: amqp:connection:forced, Description: The connection was closed by container '79d31b61bdc241baa0fbc205f11ed405_G4' because it did not have any active links in the past 60000 milliseconds. TrackingId:79d31b61bdc241baa0fbc205f11ed405_G4, SystemTracker:gateway5, Timestamp:2024-07-18T04:35:10, Info: map[]}
    Jul 18, 2024 04:38:02.441 [azeh.Auth] (NegotiateClaimRefresh) Retry attempt 0 returned retryable error: *Error{Condition: amqp:connection:forced, Description: The connection was closed by container '79d31b61bdc241baa0fbc205f11ed405_G4' because it did not have any active links in the past 60000 milliseconds. TrackingId:79d31b61bdc241baa0fbc205f11ed405_G4, SystemTracker:gateway5, Timestamp:2024-07-18T04:35:10, Info: map[]}
    Jul 18, 2024 04:38:02.441 [azeh.Auth] (NegotiateClaimRefresh) Retry attempt 1 sleeping for 4.958336178s

Hope this helps!

richardpark-msft commented 3 months ago

Okay, I think I understand a bit more of what you're saying. The logs you just posted are "normal", in that network disruptions can and do occur.

AMQP (which is the underlying protocol here) differs from HTTP in that it's critical that it maintains a persistent connection. The library itself will reconnect on demand - if you're in the middle of an operation, for instance, it'll do the recovery logic that you see above.

What we can do is take a look at the errors that caused the problem in the first place. Let's walk through those real quick:

Network disruption/disconnects

Jul 29, 2024 20:23:54.115 [azeh.Consumer] (c:1,l:u_UeM,p:1) Retry attempt 0 returned retryable error: write tcp 10.1.1.142:35090->40.125.103.251:5671: write: broken pipe
Jul 29, 2024 20:23:54.115 [azeh.Consumer] (c:1,l:u_UeM,p:1) Retry attempt 1 sleeping for 4.497875431s

These errors indicate that the socket itself has died. This can happen, and our recovery action (which isn't shown in this particular snippet) will bring the connection back up. To avoid being too aggressive we sleep between retries, which is what 'sleeping for 4.497875431s' indicates.

Now, what might be happening is that the network in your infrastructure is getting disrupted for an extended period of time. If that's the case then you'll see that error come up on each retry, but that's not indicating infinite retries, only that we're attempting to recover.

Idle timeouts

Now, there are other kinds of disconnects that can happen, that are also expected:

Jul 18, 2024 04:38:02.441 [azeh.Auth] (mongo.commerce-kitchen-display.card/ConsumerGroups/ordering-cg/Partitions/0) negotiate claim, failed: *Error{Condition: amqp:connection:forced, Description: The connection was closed by container '79d31b61bdc241baa0fbc205f11ed405_G4' because it did not have any active links in the past 60000 milliseconds. TrackingId:79d31b61bdc241baa0fbc205f11ed405_G4, SystemTracker:gateway5, Timestamp:2024-07-18T04:35:10, Info: map[]}

This one is indicating that your connection was idle - no messages were being received and nothing was being sent. This is also normal - the client will automatically reconnect whenever you do another operation (sending more events, attempting to receive).

Final

The first segment you posted for the bug (the one with timestamp Jul 29, 2024 20:23:54.115) concerns me, because I don't see any recovery actions happening there.

Do you have more of that log still around?

carlfriess commented 3 months ago

Hi there! I think I might be having the same or a related issue. After the first attempt to re-establish the connection fails, no further attempt is made. This consistently seems to happen several times per day making the service very unreliable. I'm running this on an Azure Container Instance. Here is an instance of the logs for this issue:

[Aug  9 12:32:12.106142] azeh.Consumer: (c:1,l:6yfp-,p:0) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:12.106161] azeh.Consumer: (c:1,l:6yfp-,p:0) Retry attempt 1 sleeping for 4.309669693s
[Aug  9 12:32:12.333665] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:12.333809] azeh.Consumer: (c:1,l:YRuUC,p:1) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:12.333822] azeh.Consumer: (c:1,l:YRuUC,p:1) Retry attempt 1 sleeping for 3.970533436s
[Aug  9 12:32:12.490110] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:12.490234] azeh.Consumer: (c:1,l:nyEBa,p:2) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:12.490259] azeh.Consumer: (c:1,l:nyEBa,p:2) Retry attempt 1 sleeping for 4.317139172s
[Aug  9 12:32:12.813476] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:12.813588] azeh.Consumer: (c:1,l:XElsM,p:3) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:12.813597] azeh.Consumer: (c:1,l:XElsM,p:3) Retry attempt 1 sleeping for 3.888786064s
[Aug  9 12:32:13.107385] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:13.107488] azeh.Consumer: (c:1,l:6yfp-,p:0) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:13.107498] azeh.Consumer: (c:1,l:6yfp-,p:0) Retry attempt 1 sleeping for 3.944064426s
[Aug  9 12:32:13.334799] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:13.334950] azeh.Consumer: (c:1,l:YRuUC,p:1) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:13.334960] azeh.Consumer: (c:1,l:YRuUC,p:1) Retry attempt 1 sleeping for 4.025138557s
[Aug  9 12:32:13.490288] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:13.490492] azeh.Consumer: (c:1,l:nyEBa,p:2) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:13.490597] azeh.Consumer: (c:1,l:nyEBa,p:2) Retry attempt 1 sleeping for 4.151132665s
[Aug  9 12:32:13.814523] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:13.814691] azeh.Consumer: (c:1,l:XElsM,p:3) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:13.814704] azeh.Consumer: (c:1,l:XElsM,p:3) Retry attempt 1 sleeping for 3.571592323s
[Aug  9 12:32:14.108433] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:14.108573] azeh.Consumer: (c:1,l:6yfp-,p:0) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:14.108602] azeh.Consumer: (c:1,l:6yfp-,p:0) Retry attempt 1 sleeping for 4.177165246s
[Aug  9 12:32:14.335157] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:14.335401] azeh.Consumer: (c:1,l:YRuUC,p:1) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:14.335416] azeh.Consumer: (c:1,l:YRuUC,p:1) Retry attempt 1 sleeping for 4.825796069s
[Aug  9 12:32:14.491285] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:14.491445] azeh.Consumer: (c:1,l:nyEBa,p:2) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:14.491456] azeh.Consumer: (c:1,l:nyEBa,p:2) Retry attempt 1 sleeping for 3.725545778s
[Aug  9 12:32:14.814773] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:14.814897] azeh.Consumer: (c:1,l:XElsM,p:3) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:14.814907] azeh.Consumer: (c:1,l:XElsM,p:3) Retry attempt 1 sleeping for 3.962249427s
[Aug  9 12:32:14.937429] azeh.Auth: (emost-dev-iothub/$management) refreshing claim
[Aug  9 12:32:14.937498] azeh.Auth: (emost-dev-iothub/$management) negotiate claim, token expires on 2024-08-09T12:47:14Z
[Aug  9 12:32:14.937561] azeh.Conn: Error closing old client: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:14.937568] azeh.Conn: Creating a new client (rev:1)
[Aug  9 12:32:14.937573] azeh.Conn: Creating new client, current rev: 1
[Aug  9 12:32:14.976531] azeh.Conn: Client created, new rev: 2, took 38ms
[Aug  9 12:32:14.976583] azeh.Auth: (emost-dev-iothub/$management) negotiate claim, failed: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:14.976616] azeh.Auth: (NegotiateClaimRefresh) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:14.976629] azeh.Auth: (NegotiateClaimRefresh) Retry attempt 1 sleeping for 4.810257012s
[Aug  9 12:32:15.026431] azeh.Auth: (emost-dev-iothub/ConsumerGroups/telegraf_prod/Partitions/2) refreshing claim
[Aug  9 12:32:15.026511] azeh.Auth: (emost-dev-iothub/ConsumerGroups/telegraf_prod/Partitions/2) negotiate claim, token expires on 2024-08-09T12:47:15Z
[Aug  9 12:32:15.034716] azeh.Auth: negotiating claim for audience amqps://iothub-ns-emost-dev-1314556-381821673b.servicebus.windows.net/emost-dev-iothub/ConsumerGroups/telegraf_prod/Partitions/2 with token type servicebus.windows.net:sastoken and expiry of 1723213935
[Aug  9 12:32:15.037571] azeh.Auth: Response router has shut down
[Aug  9 12:32:15.039744] azeh.Auth: (emost-dev-iothub/ConsumerGroups/telegraf_prod/Partitions/2) next refresh in 4m59.986771599s
[Aug  9 12:32:15.109311] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:15.109480] azeh.Consumer: (c:1,l:6yfp-,p:0) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:15.109496] azeh.Consumer: (c:1,l:6yfp-,p:0) Retry attempt 1 sleeping for 4.026711049s
[Aug  9 12:32:15.336141] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:15.336290] azeh.Consumer: (c:1,l:YRuUC,p:1) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:15.336306] azeh.Consumer: (c:1,l:YRuUC,p:1) Retry attempt 1 sleeping for 4.025931186s
[Aug  9 12:32:15.491615] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:15.492010] azeh.Consumer: (c:1,l:nyEBa,p:2) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:15.492052] azeh.Consumer: (c:1,l:nyEBa,p:2) Retry attempt 1 sleeping for 4.872281597s
[Aug  9 12:32:15.815982] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:15.816147] azeh.Consumer: (c:1,l:XElsM,p:3) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:15.816161] azeh.Consumer: (c:1,l:XElsM,p:3) Retry attempt 1 sleeping for 4.701055497s
[Aug  9 12:32:16.110540] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:16.110714] azeh.Consumer: (c:1,l:6yfp-,p:0) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:16.110729] azeh.Consumer: (c:1,l:6yfp-,p:0) Retry attempt 1 sleeping for 3.992063348s
[Aug  9 12:32:16.336974] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:16.337194] azeh.Consumer: (c:1,l:YRuUC,p:1) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:16.337210] azeh.Consumer: (c:1,l:YRuUC,p:1) Retry attempt 1 sleeping for 4.364732737s
[Aug  9 12:32:16.492885] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:16.493026] azeh.Consumer: (c:1,l:nyEBa,p:2) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:16.493048] azeh.Consumer: (c:1,l:nyEBa,p:2) Retry attempt 1 sleeping for 4.086521338s
[Aug  9 12:32:16.817371] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:16.817483] azeh.Consumer: (c:1,l:XElsM,p:3) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:16.817492] azeh.Consumer: (c:1,l:XElsM,p:3) Retry attempt 1 sleeping for 3.97672526s
[Aug  9 12:32:17.111513] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:17.111831] azeh.Consumer: (c:1,l:6yfp-,p:0) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:17.111881] azeh.Consumer: (c:1,l:6yfp-,p:0) Retry attempt 1 sleeping for 3.567847306s
[Aug  9 12:32:17.337892] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:17.337998] azeh.Consumer: (c:1,l:YRuUC,p:1) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:17.338024] azeh.Consumer: (c:1,l:YRuUC,p:1) Retry attempt 1 sleeping for 4.379238298s
[Aug  9 12:32:17.494089] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:17.494205] azeh.Consumer: (c:1,l:nyEBa,p:2) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:17.494214] azeh.Consumer: (c:1,l:nyEBa,p:2) Retry attempt 1 sleeping for 3.864261723s
[Aug  9 12:32:17.817669] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:17.817849] azeh.Consumer: (c:1,l:XElsM,p:3) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:17.817859] azeh.Consumer: (c:1,l:XElsM,p:3) Retry attempt 1 sleeping for 4.636310059s
[Aug  9 12:32:18.112002] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:18.112107] azeh.Consumer: (c:1,l:6yfp-,p:0) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:18.112116] azeh.Consumer: (c:1,l:6yfp-,p:0) Retry attempt 1 sleeping for 3.306362705s
[Aug  9 12:32:18.338715] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:18.338826] azeh.Consumer: (c:1,l:YRuUC,p:1) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:18.338835] azeh.Consumer: (c:1,l:YRuUC,p:1) Retry attempt 1 sleeping for 4.497291541s
[Aug  9 12:32:18.495354] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:18.495459] azeh.Consumer: (c:1,l:nyEBa,p:2) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:18.495468] azeh.Consumer: (c:1,l:nyEBa,p:2) Retry attempt 1 sleeping for 4.102999542s
[Aug  9 12:32:18.818486] azeh.Consumer: No events received, returning error context deadline exceeded
[Aug  9 12:32:18.818600] azeh.Consumer: (c:1,l:XElsM,p:3) Retry attempt 0 returned retryable error: read tcp 10.20.30.6:46500->68.219.160.143:5671: i/o timeout
[Aug  9 12:32:18.818610] azeh.Consumer: (c:1,l:XElsM,p:3) Retry attempt 1 sleeping for 5.003270105s
[Aug  9 12:32:19.112686] azeh.Consumer: No events received, returning error context deadline exceeded

tinenbruno commented 3 months ago

@richardpark-msft Thanks for the response!

Now, what might be happening is that the network in your infrastructure is getting disrupted for an extended period of time. If that's the case then you'll see that error come up on each retry, but that's not indicating infinite retries, only that we're attempting to recover.

I omitted the retry part so it really wasn't clear, but it is very similar to what @carlfriess has posted. It goes retry attempt 0, then 1, and then stops there, resuming again at attempt 0 for all partitions. If I restart the service after seeing this error, it connects successfully and I don't see the error again.

Do you have more of that log still around?

I will go through the logs again to check if there is anything I missed.

richardpark-msft commented 3 months ago

Just a quick note, I'm typing up a rather long reply but, in any case, I'm still going to see if I can come up with some more test cases for this kind of scenario. I have a few, but I might have missed something.

richardpark-msft commented 3 months ago

@tinenbruno, @carlfriess - what do you set as the context deadline that you're passing in to the consumer or producer? (ie, how much time are you giving it?)

richardpark-msft commented 3 months ago

Okay, I think I have an idea of how this can happen. Thanks both of you for being patient!

TL;DR: The code currently has a bug which can lead to the "infinite" loop you're both seeing.

Details

The way our recovery works today is that when a call fails (say, with a network I/O error) we sleep and then attempt a recovery. You're seeing that in your logs, and that part is normal.

Once the sleep is complete we recover the connection. However, if your context is cancelled while we're sleeping (either by deadline or explicitly), it's possible to skip that entire process. In your logs the code claims it's going to sleep for ~4 seconds, but we return far earlier than that.
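
To make that concrete, here is an illustrative-only sketch of the failure mode; this is not the SDK's internal code, and maxRetries, retryDelay, attempt, and recoverConn are hypothetical stand-ins:

```go
package example

import (
	"context"
	"time"
)

const (
	maxRetries = 3
	retryDelay = 4 * time.Second
)

// retryWithRecovery illustrates the bug described above. If the caller's context is
// cancelled while we sleep between attempts and we return immediately, recoverConn()
// never runs, so the next call starts over at "Retry attempt 0" against the same
// broken connection.
func retryWithRecovery(ctx context.Context, attempt, recoverConn func(context.Context) error) error {
	var err error
	for i := 0; i < maxRetries; i++ {
		if err = attempt(ctx); err == nil {
			return nil
		}

		select {
		case <-time.After(retryDelay):
			// back-off finished; fall through and recover the connection
		case <-ctx.Done():
			return ctx.Err() // BUG: bails out here and skips recovery entirely
		}

		if recoverErr := recoverConn(ctx); recoverErr != nil {
			err = recoverErr
		}
	}
	return err
}
```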

Interim workarounds

While I work on the fix there are a few things you can do to alleviate/eliminate this bug in your setups.

  1. You can simply increase the amount of time you're willing to wait by adjusting the context.Context you pass in to ReceiveEvents(), giving the recovery (and the sleep before it) time to complete.

  2. Another option is to change the retry parameters so they don't sleep as long and try to recover sooner. It's still a balancing act with your context.Context timeout, however, so I'd recommend approach 1 over this one. Nevertheless:

    consumerClient, err = azeventhubs.NewConsumerClient("<ex: myeventhubnamespace.servicebus.windows.net>", "eventhub-name", azeventhubs.DefaultConsumerGroup, defaultAzureCred, &azeventhubs.ConsumerClientOptions{
        // NOTE: you don't need to configure these explicitly if you like the defaults.
        // For more information see:
        //  https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/messaging/azservicebus#RetryOptions
        RetryOptions: azeventhubs.RetryOptions{
            // MaxRetries specifies the maximum number of attempts a failed operation will be retried
            // before producing an error.
            MaxRetries: 3,
            // RetryDelay specifies the initial amount of delay to use before retrying an operation.
            // The delay increases exponentially with each retry up to the maximum specified by MaxRetryDelay.
            RetryDelay: 4 * time.Second,
            // MaxRetryDelay specifies the maximum delay allowed before retrying an operation.
            // Typically the value is greater than or equal to the value specified in RetryDelay.
            MaxRetryDelay: 120 * time.Second,
        },
    })

tinenbruno commented 3 months ago

Hey @richardpark-msft, you nailed it! And I'm feeling a bit dumb right now, because I reached the same conclusion a couple of weeks back and ended up changing the context deadline of one of the consumers we have inside the service. However, I forgot to change the other consumer in the same service, so when I saw the error in the log I assumed it was something else, not the context.

I took a closer look at the logs, and the consumer whose context deadline I had changed actually recovered and kept receiving events, but its logs were buried under all the retry logs from the consumer that was still failing.

Thanks for the help! Very much appreciated.

carlfriess commented 3 months ago

Thank you @richardpark-msft! This is a helpful analysis. The events I am ingesting are somewhat real-time, so increasing the context deadline isn't a great option here. Do you know when a fix might be available? Thank you again for the support!

richardpark-msft commented 3 months ago

Thank you @richardpark-msft! This is a helpful analysis. The events I am ingesting are somewhat real-time, so increasing the context deadline isn't a great option here.

Yes, that's definitely a difficult one. You might want to go down the "tune the retries" path since it'll let you reduce the delay.

What is your context timeout, btw? Just curious; I'd like to set up something similar on my end for a long-running test. We have a suite of them, and I try to keep them updated with customer scenarios.

Do you know when a fix might be available? Thank you again for the support!

Working on it - it's in one of those "inner inner loop" spots so I want to tread carefully. I'll update here when I get a PR in.

richardpark-msft commented 3 months ago

Hey @richardpark-msft, you nailed it! And I'm feeling a bit dumb right now, because I reached the same conclusion a couple of weeks back and ended up changing the context deadline of one of the consumers we have inside the service. However, I forgot to change the other consumer in the same service, so when I saw the error in the log I assumed it was something else, not the context.

I took a closer look at the logs, and the consumer whose context deadline I had changed actually recovered and kept receiving events, but its logs were buried under all the retry logs from the consumer that was still failing.

Thanks for the help! Very much appreciated.

Not a problem. Between your report and @carlfriess's, I was able to narrow it down. I really do appreciate you taking the time to file the issue.

carlfriess commented 3 months ago

Thanks @richardpark-msft! :)

What is your context timeout, btw? Just curious; I'd like to set up something similar on my end for a long-running test. We have a suite of them, and I try to keep them updated with customer scenarios.

I'm using a 1 second timeout at the moment. The background is that I am consuming from an Azure IoT Hub's endpoint, which we use to transmit real-time telemetry from devices in the field. We can tolerate short interruptions, but in general we want to deliver data to our customers with as low a latency as possible.

Since the number of active devices, and therefore the number of events, varies significantly, I rely on the timeout to maintain predictable end-to-end latency. As a workaround I can lower the number of events to receive at a time, at the cost of reduced downstream performance.
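
Roughly, that workaround looks like the following sketch, assuming a PartitionClient receive loop; the 1-second deadline and the maxEvents parameter are illustrative:

```go
package example

import (
	"context"
	"errors"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs"
)

// receiveBatch asks for at most maxEvents within a 1-second deadline. A smaller
// maxEvents keeps per-call latency predictable at the cost of more round trips.
func receiveBatch(partitionClient *azeventhubs.PartitionClient, maxEvents int) ([]*azeventhubs.ReceivedEventData, error) {
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	events, err := partitionClient.ReceiveEvents(ctx, maxEvents, nil)
	if err != nil && !errors.Is(err, context.DeadlineExceeded) {
		return nil, err
	}
	return events, nil
}
```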

richardpark-msft commented 3 months ago

@carlfriess, @tinenbruno , I believe I've nailed it - package published: https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs@v1.2.2

Let me know if you have any feedback, or if that didn't help.