@amit12cool
We are regularly seeing the below errors in the processError method.
Do you notice any patterns (e.g. they get called every 30 minutes? All stolen from the same consumer?) Just to confirm, are you using the BlobCheckpointStore?
Are the above errors due to load balancing done by the Azure Event Hubs SDK?
The OwnershipLost error is expected when load balancing occurs. Load balancing is indeed an ongoing process. When using the BlobCheckpointStore with your consumer, every ~10 seconds (by default) the consumer will check to see which consumers have a claim for each partition, then run some logic to determine whether it needs to 'steal' a partition from another consumer.
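For reference, the consumer setup being described typically looks roughly like the following minimal sketch (the connection strings, container name, and event hub name are placeholders to substitute):

```typescript
import { EventHubConsumerClient } from "@azure/event-hubs";
import { ContainerClient } from "@azure/storage-blob";
import { BlobCheckpointStore } from "@azure/eventhubs-checkpointstore-blob";

// Placeholders below need to be substituted with real values.
const containerClient = new ContainerClient(
  "<storage-connection-string>",
  "<blob-container-name>"
);
const checkpointStore = new BlobCheckpointStore(containerClient);

// Every running instance of this code is one consumer that participates in
// load balancing; claims and checkpoints are written to the blob container.
const consumerClient = new EventHubConsumerClient(
  "$Default", // consumer group
  "<event-hubs-connection-string>",
  "<event-hub-name>",
  checkpointStore
);
```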
If yes, is load balancing an ongoing process even if none of our services restarted, i.e. does the ownership of partitions keep changing even if every application instance is running 24*7?
If a consumer doesn't reassert its claim over a partition within 60 seconds, then other consumers will assume that partition is no longer being read from and will attempt to claim it. So it is possible to see this occur if a consumer was unable to update Storage Blob within 60 seconds (this is separate from the calls to updateCheckpoint you make). There are 3 situations I can think of that could lead to this happening:
1) A temporary network issue prevented the consumer from calling Storage Blob to reassert its claims.
2) Some processing took up a lot of time on the thread, causing the Storage Blob call to take too long to occur.
3) One of the hosts a consumer is running on is experiencing clock skew and thus thinks that partition claims are stale.
If it's #3, then I'd expect a single consumer to be stealing partitions from other consumers periodically, so that may be something to consider if you see this as the pattern.
Aside from these 3, if a new EventHubConsumerClient was created and started reading, then as you say, you'd expect to see OwnershipLost as well.
Is the processInitialize method called each time the ownership is transferred from one application instance to another?
processInitialize should be called by whichever consumer is 'stealing' the partition, and it should occur prior to OwnershipLost on the consumer being stolen from. Do you never see processInitialize called before OwnershipLost on any of the other consumers? Do you see processInitialize when you start your application? Can you tell which consumer is reading from the partition after you see an OwnershipLost?
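If it helps with correlating your logs, a sketch of a subscription that logs each processInitialize, processClose, and processError per partition could look like this (assuming a consumerClient built as in the earlier sketch; the log messages are only illustrative):

```typescript
import { EventHubConsumerClient } from "@azure/event-hubs";

declare const consumerClient: EventHubConsumerClient; // built as in the earlier sketch

const subscription = consumerClient.subscribe({
  processInitialize: async (context) => {
    // Fires on the consumer that just claimed (or stole) the partition.
    console.log(`processInitialize: claimed partition ${context.partitionId}`);
  },
  processEvents: async (events, context) => {
    console.log(`partition ${context.partitionId}: received ${events.length} events`);
  },
  processClose: async (reason, context) => {
    // reason indicates whether ownership was lost or the client is shutting down.
    console.log(`processClose (${reason}) on partition ${context.partitionId}`);
  },
  processError: async (err, context) => {
    console.log(`processError on partition ${context.partitionId}: ${err.message}`);
  },
});
// subscription.close() can be called during shutdown.
```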
Do you notice any patterns (e.g. they get called every 30 minutes? All stolen from the same consumer?)
I have attached screenshots of 2 instances with my observations. My observations on the screenshots:
Just to confirm, are you using the BlobCheckpointStore?
yes, we use BlobCheckpointStore
A temporary network issue prevented the consumer from calling Storage Blob to reassert its claims.
We don't see any network issue in our service, as our services are functioning correctly.
Some processing took up a lot of time on the thread, causing the Storage Blob call to take too long to occur. (Since the calls occur every ~10 seconds, and it takes ~60 seconds to consider a claim as invalid, that would have to be a long blocking task)
For this: as soon as the instance gets an event, we process it using some business logic and make some MongoDB calls for reads/writes, plus get/update the twin on IoT Hub, and for some events we make a REST API call to other services. Each of these steps may or may not run for a specific event type received. So the processing time of an event in our service varies.
Question(s): At what point in time is the storage call made from the SDK to Blob Storage for claim reassertion? Will the claim be lost if I'm getting an empty array on a partition for more than 60 seconds? As per the screenshot I see the claim being lost randomly for any partition; can this be the cause of the claim loss?
One of the hosts a consumer is running on is experiencing clock skew and thus thinks that partition claims are stale.
Since we don't get the processInitialize method called, I could not tell whether it's been claimed by the same consumer.
Question: Even if one consumer has clock skew, it cannot claim a partition until the other consumer(s) lose their claim. Is that the correct understanding?
Do you never see processInitialize called before OwnershipLost on any of the other consumers?
processInitialize is called; I can see the logs now. Somehow the filter was not working on the logs. Changing the filter query made the processInitialize logs appear in Kibana.
Do you see processInitialize when you start your application?
Yes, they are there.
Can you tell which consumer is reading from the partition after you see an OwnershipLost?
Yes. Below are the logs captured for the claim lost on partition 22 and the first events received on that partition on another consumer.
My observation:
-> The first events on partition 22 were received on consumer 13f75a21-ca99-11eb-bad6-020335e365a1, and then the OwnershipLost came on consumer 05baebd9-ca99-11eb-bad6-020335e365a1, which had previously claimed partition 22.
So in totality, the flow seems to be working correctly. Now the only point to clear up is the reason for the frequent claim loss by consumers. I hope I was able to answer your questions.
@chradek Any update on this?
At what point in time is the storage call made from the SDK to Blob Storage for claim reassertion? Will the claim be lost if I'm getting an empty array on a partition for more than 60 seconds? As per the screenshot I see the claim being lost randomly for any partition; can this be the cause of the claim loss?
By default, the SDK will attempt to reassert its claims every 10 seconds. This is separate from checkpointing. Even if you receive 0 events for 60 seconds, the claim reassertion will continue to happen in the background. It takes ~60 seconds of not being reclaimed before a partition is considered not claimed. Now, each partition that is claimed/reclaimed is 1 call to storage-blob, so if for some reason it took more than 60 seconds to reclaim all the partitions, then another consumer could come in and steal one away.
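If you ever need to tune these intervals, they are configurable on the client (assuming @azure/event-hubs 5.4 or later, which exposes loadBalancingOptions; the values shown are the documented defaults):

```typescript
import { EventHubConsumerClient } from "@azure/event-hubs";
import { ContainerClient } from "@azure/storage-blob";
import { BlobCheckpointStore } from "@azure/eventhubs-checkpointstore-blob";

const checkpointStore = new BlobCheckpointStore(
  new ContainerClient("<storage-connection-string>", "<blob-container-name>")
);

// Defaults shown explicitly: claims are reasserted every updateIntervalInMs,
// and an unrefreshed claim is treated as expired after
// partitionOwnershipExpirationIntervalInMs.
const consumerClient = new EventHubConsumerClient(
  "$Default",
  "<event-hubs-connection-string>",
  "<event-hub-name>",
  checkpointStore,
  {
    loadBalancingOptions: {
      strategy: "balanced",
      updateIntervalInMs: 10000,
      partitionOwnershipExpirationIntervalInMs: 60000,
    },
  }
);
```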
Even if one consumer has clock skew, it cannot claim a partition until the other consumer(s) lose their claim. Is that the correct understanding?
The answer is a little complicated. If a consumer had clock skew such that it was falling behind the real time, then it would see the expiry times of the existing claims as being further in the future than they really are. However, it may still try to steal a partition anyway in order to balance the load. (e.g. If you have 32 partitions and 4 consumers, each one would try to get to a steady state of 8 partitions, even if all partitions are claimed.)
If a consumer had clock skew such that it was ahead of the real time, then it would see the expiry time of the existing claims as being further in the past than they really are. So, it may think that the other consumers shutdown for some reason as their claims appear to be expired, and attempt to steal. In this scenario, you'd see one consumer repeatedly stealing partitions away randomly from the other consumers.
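To illustrate why clock skew matters, the staleness check boils down to comparing a claim's last-modified time against the local clock, roughly like this simplified sketch (not the SDK's actual code; isClaimExpired and its parameters are illustrative names):

```typescript
// Simplified sketch of a claim-expiry check; not the SDK's actual code.
// A consumer whose clock runs ahead of real time sees a larger Date.now(),
// so other consumers' claims look expired sooner than they really are.
function isClaimExpired(
  lastModifiedTimeInMs: number,
  expirationIntervalInMs: number = 60 * 1000
): boolean {
  return Date.now() - lastModifiedTimeInMs > expirationIntervalInMs;
}
```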
It looks like you have just 1 consumer that keeps getting stolen from, so I don't think it matches these scenarios.
At this point I can't really say why one pod keeps losing some partitions throughout the day. Can you tell when the pod steals back some partitions? (You'd see this via the processInitialize calls)
Turning on debugging may help us determine if there's an issue with claiming the partitions within a certain amount of time. You can turn that on by setting the AZURE_LOG_LEVEL environment variable to verbose. This will generate a lot of logs to stderr though.
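For completeness, the same verbose logging can also be enabled programmatically through @azure/logger, which should be equivalent to setting the environment variable:

```typescript
import { setLogLevel } from "@azure/logger";

// Should be equivalent to setting AZURE_LOG_LEVEL=verbose before starting
// the process; Azure SDK libraries write these logs to stderr.
setLogLevel("verbose");
```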
It looks like you have just 1 consumer that keeps getting stolen from, so I don't think it matches these scenarios.
We have 8 consumers, and all 8 consumers are randomly losing claims to random partitions. I shared the screenshot above from one of the consumers.
Can you tell when the pod steals back some partitions? (You'd see this via the processInitialize calls)
Below is a screenshot for consumer id 05baebd9-ca99-11eb-bad6-020335e365a1 for processInitialize calls.
My observations:
-> This consumer lost the claim for partition 22 [screenshot shared in my last reply] and at around the same time this consumer claimed partition 10. Other than that, on that same day this consumer didn't claim any other partition.
-> On the next day this consumer claimed 3 partitions: 23, 25, 27.
Another screenshot for the same consumer losing a partition claim on the next day.
My observation:
-> The consumer lost the claim for partition 10 on 23rd June, which it had claimed on 22nd June [as per the above observations].
Turning on debugging may help us determine if there's an issue with claiming the partitions within a certain amount of time. You can turn that on by setting the AZURE_LOG_LEVEL environment variable to verbose. This will generate a lot of logs to stderr though.
Yes, I will set this soon and share the logs.
Meanwhile, is the above info of any help in finding a pattern/issue?
@amit12cool
Thanks for answering my questions. Do all 8 consumers lose/steal partitions at the same time? If so, that could indicate that your pods are having a temporary issue talking to storage blob, as I'm not sure what else would explain that kind of coincidence. Otherwise, I think we need logs to figure out what's going on further.
@chradek Thanks for your response.
Do all 8 consumers lose/steal partitions at the same time?
No, the pattern I see is that 2-3 consumers (any of them) lose partitions daily, and only 2-3, or at most 7, partitions at a time. Also, for the past 3 days I have not seen these errors at all.
Otherwise, I think we need logs to figure out what's going on further.
I have added the AZURE_LOG_LEVEL env variable in our service. Logs are coming in too. Please let me know where I can share those logs.
@chradek Any update on this?
@amit12cool If there's no sensitive information, you can share logs here. Otherwise (or if uncomfortable sharing logs here), please open a support ticket and we can take a look at your logs through there. If you open a support ticket, feel free to mention me (chradek) so I get notified more quickly.
@amit12cool I wanted to summarize some of what we found/discussed from our call here.
When subscribing to events, there is a loop that periodically attempts to claim ownership of partitions. The bulk of that loop is this: https://github.com/Azure/azure-sdk-for-js/blob/fb755e31805106a016e42da053ac596a1b0f4a72/sdk/eventhub/event-hubs/src/eventProcessor.ts#L433-L456
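Paraphrasing the linked code, the shape of each loop iteration is roughly the following (a simplified sketch for discussion only, not the SDK's actual source; the function and parameter names are approximate):

```typescript
// Simplified, paraphrased sketch of one iteration of the claim loop.
// This is NOT the SDK's actual source; names and shapes are approximate.
async function claimLoopIteration(
  listOwnership: () => Promise<string[]>,                 // checkpoint store call
  choosePartitionsToClaim: (owned: string[]) => string[], // load-balancing decision
  claimOwnership: (partitionId: string) => Promise<void>, // checkpoint store call
  onError: (err: Error) => Promise<void>                  // surfaces to the user's processError
): Promise<void> {
  try {
    const currentOwnership = await listOwnership();
    for (const partitionId of choosePartitionsToClaim(currentOwnership)) {
      // If this call never settles, the whole loop appears to hang.
      await claimOwnership(partitionId);
    }
  } catch (err) {
    // If the user's error handler never settles, the loop also hangs here.
    await onError(err as Error);
  }
}
```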
So there are a few non-blocking operations that could cause the loop to appear to hang if they never returned:
- The claimOwnership calls to the checkpoint store (when using the @azure/eventhubs-checkpointstore-blob package, this leads to calls to Storage Blob)
- The processError handler
What we were able to see in logs is that at some point in time, this Event Hubs loop seemed to hang. We looked at Storage Blob calls and didn't see any non-200 results. To be sure this isn't where we are hanging, we could look for the logs:
Attempting to claim ownership of partition
This will include the partition id being claimed. Then we should see a matching log that follows it:
Successfully claimed ownership of partition
One theory I had was that the processError handler may be hanging. If this is the case, then we should see
An error occured within the EventProcessor loop
without the following appearing after it:
Pausing the EventProcessor loop
So, we're still not sure where the hang was occurring. What we can do on our side is add timeouts to the claimOwnership calls so that there's no possibility of them hanging indefinitely. We can also add a timeout around the processError handler to ensure it can't hang the loop as well.
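As an illustration of what such a guard could look like, one generic way to bound an awaited call is to race it against a timer (a hedged sketch, not the actual change that would go into the SDK; withTimeout is an illustrative helper name):

```typescript
// Generic helper: reject if the given promise does not settle within timeoutInMs.
// "withTimeout" is an illustrative name, not an SDK API.
async function withTimeout<T>(promise: Promise<T>, timeoutInMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Operation timed out after ${timeoutInMs} ms`)),
      timeoutInMs
    );
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    if (timer !== undefined) {
      clearTimeout(timer);
    }
  }
}
```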
@chradek Thanks for summarizing our discussion. Any idea on when we can expect the optimization changes in the SDK?
Hi @amit12cool I don't have a clear timeline at this time. I'll discuss with the team and report back here with an update.
I don't know if I should add on to this thread. I get the exact same message:
com.azure.core.amqp.exception.AmqpException: New receiver 'nil' with higher epoch of '0' is created hence current receiver 'nil' with epoch '0' is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used.
And it is definitely caused by two EventProcessorClients consuming the same event hub using the same consumer group. In other words, as soon as I started another EventProcessorClient to get data using the same consumer group, I got this error from the error handler in my first process. I wonder if this is the same issue. I am using azure-messaging-eventhubs 5.10.1.
Here is the stacktrace:
[2021-10-22 00:02:21.759] [ERROR] [reactor.core.publisher.Operators] Operator called default onErrorDropped
com.azure.core.amqp.exception.AmqpException: New receiver 'nil' with higher epoch of '0' is created hence current receiver 'nil' with epoch '0' is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used. TrackingId:dec292a1000272a6001ca9de6171fcbb_G21_B30, SystemTracker:xxxxtesthub:eventhub:xxxxtest~32766|mz_consumer, Timestamp:2021-10-22T00:02:21, errorContext[NAMESPACE: xxxxtesthub.servicebus.windows.net. ERROR CONTEXT: N/A, PATH: xxxxtest/ConsumerGroups/xx_consumer/Partitions/0, REFERENCE_ID: 0_aad076_1634860219635, LINK_CREDIT: 86]
at com.azure.core.amqp.implementation.ExceptionUtil.toException(ExceptionUtil.java:85)
at com.azure.core.amqp.implementation.handler.LinkHandler.handleRemoteLinkClosed(LinkHandler.java:110)
at com.azure.core.amqp.implementation.handler.LinkHandler.onLinkRemoteClose(LinkHandler.java:61)
at com.azure.core.amqp.implementation.handler.ReceiveLinkHandler.onLinkRemoteClose(ReceiveLinkHandler.java:193)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:176)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
at com.azure.core.amqp.implementation.ReactorExecutor.run(ReactorExecutor.java:86)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Hi Team, are there any updates on this issue? I am getting this issue with the latest SDK. Thanks.
"@azure/event-hubs": "^5.8.0", "@azure/eventhubs-checkpointstore-blob": "^1.0.1",
Hi @tony0918, could you please share more details about the issue you're facing and how your environment is setup? What are the recent changes performed by the load balancer right up to the occurrence of the issue?
Please see the troubleshooting guide for more details on it.
Hi @tony0918, could you please share more details about the issue you're facing and how your environment is setup? What are the recent changes performed by the load balancer right up to the occurrence of the issue?
Please see the troubleshooting guide for more details on it.
@deyaaeldeen, I was running my Node.js application with the Azure Event Hubs SDK on my local laptop. The application connects to an Event Hub via connection string and saves checkpoints to a Storage account via a SAS connection string. The Event Hub has 32 partitions, but only one application was running. When the application had just started there were no error logs, but after ~30 seconds the error appeared repeatedly.
@tony0918 Thanks for the context, could you share a small repro code that I can use for testing on my end?
@chradek We got the same error when using the checkpointstore SDK to consume messages from Event Hubs. And we found that the entity on which the same issue occurs is an empty entity. We didn't send any messages to this entity during the past month; could you please help clarify whether the empty entity would cause this issue? Thanks so much!
Hi @chradek, we encountered this same issue. We have one namespace with 2 empty entities (besides those 2 empty entities, we also have some entities in use). So we are afraid those 2 empty entities could cause the Event Hub partition claim to time out? And we also saw lots of timeout errors in the logs.
And do we have any workaround? Any comments are welcome, thanks in advance!
Hi wstest123 and zhangyunbo1994, apologies for the late reply. Thanks a lot for reporting this issue.
Could you please share more details about the setup used on the consumer side?
Hi @deyaaeldeen, we encountered this same issue, and I am not sure whether I lose some logs when clock skew occurs. Also, how do I find out how many partitions we have? Thanks.
@godt21 Please follow the instructions in the [troubleshooting guide] to enable the logs. For your second question, the number of partitions in one event hub can be found either in the Azure portal or in the logs.
Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!
Hi all, I also have this error that happens very often in 2 environments
The error is:
New receiver 'P27-b2130d71-b77b-46bd-9ea8-e13e2aba7432' with higher epoch of '0' is created hence current receiver 'P27-cbb6f205-69fa-4c30-902f-f3013a8d4820' with epoch '0' is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used. TrackingId:fef5ffe50001924600095cb664467e65_G23S3B10S1, SystemTracker:[my hidden-event-hub-name]:eventhub:[hidden]~28671|sa-consumer-production, Timestamp:2023-04-24T13:21:31 (e[my-hidden-service-name]_). For troubleshooting information, see https://aka.ms/azsdk/net/eventhubs/exceptions/troubleshoot
how can we solve this?
Describe the bug
Currently we run 8 instances of our application. We are regularly seeing the below errors in the processError method. We know the reason for it being OwnershipLost, as we have the same number of messages of OwnershipLost and MessagingError at the same point in time.
We have some questions:
1) Are the above errors due to load balancing done by the Azure Event Hubs SDK? If yes, is load balancing an ongoing process even if none of our services restarted, i.e. does the ownership of partitions keep changing even if every application instance is running 24*7? As per this line, it seems load balancing is an ongoing process and the ownership keeps changing from time to time; our logs also suggest that, given the amount of error [Ownership] logs we get frequently. Please confirm.
2) Is the processInitialize method called each time the ownership is transferred from one application instance to another? As per our logs, we never received the processInitialize call after the MessagingError or OwnershipLost error log was received. As per this line, it should have been called.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
processInitialize should be called if ownership is changed between instances of the application.