Eventhub stops getting data when under load

Azure / azure-event-hubs-go

Golang client library for Azure Event Hubs https://azure.microsoft.com/services/event-hubs

MIT License


Eventhub stops getting data when under load #273

Open MartinKosicky opened 1 year ago

MartinKosicky commented 1 year ago

When we run the Event Hubs Go listener, after some time it gets stuck at https://github.com/Azure/azure-event-hubs-go/blob/master/receiver.go#L291 . It seems that the session gets broken. However, I know the connection is OK: I watched it in Wireshark, and when I call GetPartitionInfo (which uses the same connection, so the socket is not dead) I see that I am not at the end of the partition.

I would like to ask whether there should be some kind of timeout for the case where the session gets broken somehow. I saw such code in the C# variant of this library, but I don't see anything like that in this code. Maybe a call to Recover if no data arrives for some time?

MartinKosicky commented 1 year ago

I just changed that line to:

    // Bound the wait: this triggers a Recover on the session if no data arrives in 30 seconds.
    newContext, cancel := context.WithTimeout(ctx, 30*time.Second)
    msg, err := r.listenForMessage(newContext)
    cancel() // release the timer as soon as the call returns

and it works now. Should I make a PR, or can we discuss it first in case you have another idea?

richardpark-msft commented 1 year ago

The issue with this is that it'll force recovery every 'n' seconds (in your case 30) if there's no activity. So really we need to fix the core bug here, which appears to be that the Receiver is no longer "live" and so it's not responding to messages. There are a few reasons this could happen.

Can you give me a better idea of how you reproduce this? How long is "after some time"? Are we talking 3 days, 4 days, that kind of thing? Also, do you see this after longer idle periods, or does it happen even while there is activity?

MartinKosicky commented 1 year ago

I totally agree that the core issue should be fixed, although I'm worried about resilience here. I can reproduce this by reading from the start of an event hub with prefetch set to 2000. After a while (a few minutes at most) the reading stops. We have logic that, if I get no data for 30 seconds, checks whether I am at the end of the partition (over the same connection). If I'm not, we let the microservice crash and get restarted from a checkpoint. This happens only when there is a lot of activity. It could also be a server issue.