@amit12cool
We are regularly seeing the below errors in the processError method.
Do you notice any patterns (e.g. they get called every 30 minutes? All stolen from the same consumer?) Just to confirm, are you using the BlobCheckpointStore?
Are the above errors due to load balancing done by the Azure Event Hubs SDK?
The OwnershipLost error is expected when load balancing occurs. Load balancing is indeed an ongoing process. When using the BlobCheckpointStore with your consumer, every ~10 seconds (by default) the consumer will check to see which consumers have a claim for each partition, then run some logic to determine whether it needs to 'steal' a partition from another consumer.
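For reference, the consumer setup being described typically looks roughly like the following minimal sketch (the connection strings, container name, and event hub name are placeholders to substitute):

```typescript
import { EventHubConsumerClient } from "@azure/event-hubs";
import { ContainerClient } from "@azure/storage-blob";
import { BlobCheckpointStore } from "@azure/eventhubs-checkpointstore-blob";

// Placeholders below need to be substituted with real values.
const containerClient = new ContainerClient(
  "<storage-connection-string>",
  "<blob-container-name>"
);
const checkpointStore = new BlobCheckpointStore(containerClient);

// Every running instance of this code is one consumer that participates in
// load balancing; claims and checkpoints are written to the blob container.
const consumerClient = new EventHubConsumerClient(
  "$Default", // consumer group
  "<event-hubs-connection-string>",
  "<event-hub-name>",
  checkpointStore
);
```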
If yes, is load balancing an ongoing process even if none of our services restarted, i.e. does the ownership of partitions keep changing even if every application instance is running 24*7?
If a consumer doesn't reassert its claim over a partition within 60 seconds, then other consumers will assume that partition is no longer being read from and will attempt to claim it. So it is possible to see this occur if a consumer was unable to update Storage Blob within 60 seconds (this is separate from the calls to updateCheckpoint you make). There are 3 situations I can think of that could lead to this happening:
1) A temporary network issue prevented the consumer from calling Storage Blob to reassert its claims.
2) Some processing took up a lot of time on the thread, causing the Storage Blob call to take too long to occur.
3) One of the hosts a consumer is running on is experiencing clock skew and thus thinks that partition claims are stale.
If it's #3, then I'd expect a single consumer to be stealing partitions from other consumers periodically, so that may be something to consider if you see this as the pattern.
Aside from these 3, if a new EventHubConsumerClient was created and started reading, then as you say, you'd expect to see OwnershipLost as well.
Is the processInitialize method called each time the ownership is transferred from one application instance to another?
processInitialize should be called by whichever consumer is 'stealing' the partition, and it should occur prior to OwnershipLost on the consumer being stolen from. Do you never see processInitialize called before OwnershipLost on any of the other consumers? Do you see processInitialize when you start your application? Can you tell which consumer is reading from the partition after you see an OwnershipLost?
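If it helps with correlating your logs, a sketch of a subscription that logs each processInitialize, processClose, and processError per partition could look like this (assuming a consumerClient built as in the earlier sketch; the log messages are only illustrative):

```typescript
import { EventHubConsumerClient } from "@azure/event-hubs";

declare const consumerClient: EventHubConsumerClient; // built as in the earlier sketch

const subscription = consumerClient.subscribe({
  processInitialize: async (context) => {
    // Fires on the consumer that just claimed (or stole) the partition.
    console.log(`processInitialize: claimed partition ${context.partitionId}`);
  },
  processEvents: async (events, context) => {
    console.log(`partition ${context.partitionId}: received ${events.length} events`);
  },
  processClose: async (reason, context) => {
    // reason indicates whether ownership was lost or the client is shutting down.
    console.log(`processClose (${reason}) on partition ${context.partitionId}`);
  },
  processError: async (err, context) => {
    console.log(`processError on partition ${context.partitionId}: ${err.message}`);
  },
});
// subscription.close() can be called during shutdown.
```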
Do you notice any patterns (e.g. they get called every 30 minutes? All stolen from the same consumer?)
I have attached screenshots of 2 instances with my observations. My observations on the screenshots:
Just to confirm, are you using the BlobCheckpointStore?
yes, we use BlobCheckpointStore
A temporary network issue prevented the consumer from calling Storage Blob to reassert its claims.
We don't see any network issue in our service, as our services are functioning correctly.
Some processing took up a lot of time on the thread, causing the Storage Blob call to take too long to occur. (Since the calls occur every ~10 seconds, and it takes ~60 seconds to consider a claim as invalid, that would have to be a long blocking task)
For this: as soon as the instance gets an event, we process it using some business logic and make some MongoDB calls for reads/writes, plus get/update the twin on IoT Hub, and for some events we make a REST API call to other services. Each of these steps may or may not run for a specific event type received. So the processing time of an event in our service varies.
Question(s): At what point in time is the storage call made from the SDK to Blob Storage for claim reassertion? Will the claim be lost if I'm getting an empty array on a partition for more than 60 seconds? As per the screenshot I see the claim being lost randomly for any partition; can this be the cause of the claim loss?
One of the hosts a consumer is running on is experiencing clock skew and thus thinks that partition claims are stale.
Since we don't get the processInitialize method called, I could not tell whether it's been claimed by the same consumer.
Question: Even if one consumer has clock skew, it cannot claim a partition until the other consumer(s) lose their claim. Is that the correct understanding?
Do you never see processInitialize called before OwnershipLost on any of the other consumers?
processInitialize is called; I can see the logs now. Somehow the filter was not working on the logs. Changing the filter query made the processInitialize logs appear in Kibana.
Do you see processInitialize when you start your application?
Yes, they are there.
Can you tell which consumer is reading from the partition after you see an OwnershipLost?
Yes. Below are the logs captured for the claim lost on partition 22 and the first events received on that partition on another consumer.
My observation:
-> The first events on partition 22 were received on consumer 13f75a21-ca99-11eb-bad6-020335e365a1, and then the OwnershipLost came on consumer 05baebd9-ca99-11eb-bad6-020335e365a1, which had previously claimed partition 22.
So in totality, the flow seems to be working correctly. Now the only point to clear up is the reason for the frequent claim loss by consumers. I hope I was able to answer your questions.
@chradek Any update on this?
At what point in time is the storage call made from the SDK to Blob Storage for claim reassertion? Will the claim be lost if I'm getting an empty array on a partition for more than 60 seconds? As per the screenshot I see the claim being lost randomly for any partition; can this be the cause of the claim loss?
By default, the SDK will attempt to reassert its claims every 10 seconds. This is separate from checkpointing. Even if you receive 0 events for 60 seconds, the claim reassertion will continue to happen in the background. It takes ~60 seconds of not being reclaimed before a partition is considered not claimed. Now, each partition that is claimed/reclaimed is 1 call to storage-blob, so if for some reason it took more than 60 seconds to reclaim all the partitions, then another consumer could come in and steal one away.
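If you ever need to tune these intervals, they are configurable on the client (assuming @azure/event-hubs 5.4 or later, which exposes loadBalancingOptions; the values shown are the documented defaults):

```typescript
import { EventHubConsumerClient } from "@azure/event-hubs";
import { ContainerClient } from "@azure/storage-blob";
import { BlobCheckpointStore } from "@azure/eventhubs-checkpointstore-blob";

const checkpointStore = new BlobCheckpointStore(
  new ContainerClient("<storage-connection-string>", "<blob-container-name>")
);

// Defaults shown explicitly: claims are reasserted every updateIntervalInMs,
// and an unrefreshed claim is treated as expired after
// partitionOwnershipExpirationIntervalInMs.
const consumerClient = new EventHubConsumerClient(
  "$Default",
  "<event-hubs-connection-string>",
  "<event-hub-name>",
  checkpointStore,
  {
    loadBalancingOptions: {
      strategy: "balanced",
      updateIntervalInMs: 10000,
      partitionOwnershipExpirationIntervalInMs: 60000,
    },
  }
);
```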
Even if one consumer has clock skew, it cannot claim a partition until the other consumer(s) lose their claim. Is that the correct understanding?
The answer is a little complicated. If a consumer had clock skew such that it was falling behind the real time, then it would see the expiry times of the existing claims as being further in the future than they really are. However, it may still try to steal a partition anyway in order to balance the load. (e.g. If you have 32 partitions and 4 consumers, each one would try to get to a steady state of 8 partitions, even if all partitions are claimed.)
If a consumer had clock skew such that it was ahead of the real time, then it would see the expiry time of the existing claims as being further in the past than they really are. So, it may think that the other consumers shutdown for some reason as their claims appear to be expired, and attempt to steal. In this scenario, you'd see one consumer repeatedly stealing partitions away randomly from the other consumers.
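To illustrate why clock skew matters, the staleness check boils down to comparing a claim's last-modified time against the local clock, roughly like this simplified sketch (not the SDK's actual code; isClaimExpired and its parameters are illustrative names):

```typescript
// Simplified sketch of a claim-expiry check; not the SDK's actual code.
// A consumer whose clock runs ahead of real time sees a larger Date.now(),
// so other consumers' claims look expired sooner than they really are.
function isClaimExpired(
  lastModifiedTimeInMs: number,
  expirationIntervalInMs: number = 60 * 1000
): boolean {
  return Date.now() - lastModifiedTimeInMs > expirationIntervalInMs;
}
```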
It looks like you have just 1 consumer that keeps getting stolen from, so I don't think it matches these scenarios.
At this point I can't really say why one pod keeps losing some partitions throughout the day. Can you tell when the pod steals back some partitions? (You'd see this via the processInitialize calls)
Turning on debugging may help us determine if there's an issue with claiming the partitions within a certain amount of time. You can turn that on by setting the AZURE_LOG_LEVEL environment variable to verbose. This will generate a lot of logs to stderr though.
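For completeness, the same verbose logging can also be enabled programmatically through @azure/logger, which should be equivalent to setting the environment variable:

```typescript
import { setLogLevel } from "@azure/logger";

// Should be equivalent to setting AZURE_LOG_LEVEL=verbose before starting
// the process; Azure SDK libraries write these logs to stderr.
setLogLevel("verbose");
```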
It looks like you have just 1 consumer that keeps getting stolen from, so I don't think it matches these scenarios.
We have 8 consumers, and all 8 consumers are randomly losing claims to random partitions. I shared the screenshot above from one of the consumers.
Can you tell when the pod steals back some partitions? (You'd see this via the processInitialize calls)
Below is a screenshot for consumer id 05baebd9-ca99-11eb-bad6-020335e365a1 for processInitialize calls.
My observations:
-> This consumer lost the claim for partition 22 [screenshot shared in my last reply] and at around the same time this consumer claimed partition 10. Other than that, on that same day this consumer didn't claim any other partition.
-> On the next day this consumer claimed 3 partitions: 23, 25, 27.
Another screenshot for the same consumer losing a partition claim on the next day.
My observation:
-> The consumer lost the claim for partition 10 on 23rd June, which it had claimed on 22nd June [as per the above observations].
Turning on debugging may help us determine if there's an issue with claiming the partitions within a certain amount of time. You can turn that on by setting the AZURE_LOG_LEVEL environment variable to verbose. This will generate a lot of logs to stderr though.
Yes, I will set this soon and share the logs.
Meanwhile, is the above info of any help in finding a pattern/issue?
@amit12cool
Thanks for answering my questions. Do all 8 consumers lose/steal partitions at the same time? If so, that could indicate that your pods are having a temporary issue talking to storage blob, as I'm not sure what else would explain that kind of coincidence. Otherwise, I think we need logs to figure out what's going on further.
@chradek Thanks for your response.
Do all 8 consumers lose/steal partitions at the same time?
No, the pattern I see is that 2-3 consumers (any of them) lose partitions daily, and only 2-3, or at most 7, partitions at a time. Also, for the past 3 days I have not seen these errors at all.
Otherwise, I think we need logs to figure out what's going on further.
I have added the AZURE_LOG_LEVEL env variable in our service. Logs are coming in too. Please let me know where I can share those logs.
@chradek Any update on this?
@amit12cool If there's no sensitive information, you can share logs here. Otherwise (or if uncomfortable sharing logs here), please open a support ticket and we can take a look at your logs through there. If you open a support ticket, feel free to mention me (chradek) so I get notified more quickly.
@amit12cool I wanted to summarize some of what we found/discussed from our call here.
When subscribing to events, there is a loop that periodically attempts to claim ownership of partitions. The bulk of that loop is this: https://github.com/Azure/azure-sdk-for-js/blob/fb755e31805106a016e42da053ac596a1b0f4a72/sdk/eventhub/event-hubs/src/eventProcessor.ts#L433-L456
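Paraphrasing the linked code, the shape of each loop iteration is roughly the following (a simplified sketch for discussion only, not the SDK's actual source; the function and parameter names are approximate):

```typescript
// Simplified, paraphrased sketch of one iteration of the claim loop.
// This is NOT the SDK's actual source; names and shapes are approximate.
async function claimLoopIteration(
  listOwnership: () => Promise<string[]>,                 // checkpoint store call
  choosePartitionsToClaim: (owned: string[]) => string[], // load-balancing decision
  claimOwnership: (partitionId: string) => Promise<void>, // checkpoint store call
  onError: (err: Error) => Promise<void>                  // surfaces to the user's processError
): Promise<void> {
  try {
    const currentOwnership = await listOwnership();
    for (const partitionId of choosePartitionsToClaim(currentOwnership)) {
      // If this call never settles, the whole loop appears to hang.
      await claimOwnership(partitionId);
    }
  } catch (err) {
    // If the user's error handler never settles, the loop also hangs here.
    await onError(err as Error);
  }
}
```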
So there are a few non-blocking operations that could cause the loop to appear to hang if they never returned:
- The claimOwnership calls to the checkpoint store (when using the @azure/eventhubs-checkpointstore-blob package, this leads to calls to Storage Blob)
- The processError handler
What we were able to see in logs is that at some point in time, this Event Hubs loop seemed to hang. We looked at Storage Blob calls and didn't see any non-200 results. To be sure this isn't where we are hanging, we could look for the logs:
Attempting to claim ownership of partition
This will include the partition id being claimed. Then we should see a matching log that follows it:
Successfully claimed ownership of partition
One theory I had was that the processError handler may be hanging. If this is the case, then we should see
An error occured within the EventProcessor loop
without the following appearing after it:
Pausing the EventProcessor loop
So, we're still not sure where the hang was occurring. What we can do on our side is add timeouts to the claimOwnership calls so that there's no possibility of them hanging indefinitely. We can also add a timeout around the processError handler to ensure it can't hang the loop as well.
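As an illustration of what such a guard could look like, one generic way to bound an awaited call is to race it against a timer (a hedged sketch, not the actual change that would go into the SDK; withTimeout is an illustrative helper name):

```typescript
// Generic helper: reject if the given promise does not settle within timeoutInMs.
// "withTimeout" is an illustrative name, not an SDK API.
async function withTimeout<T>(promise: Promise<T>, timeoutInMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`Operation timed out after ${timeoutInMs} ms`)),
      timeoutInMs
    );
  });
  try {
    return await Promise.race([promise, timeout]);
  } finally {
    if (timer !== undefined) {
      clearTimeout(timer);
    }
  }
}
```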
@chradek Thanks for summarizing our discussion. Any idea on when we can expect the optimization changes in the SDK?
Hi @amit12cool I don't have a clear timeline at this time. I'll discuss with the team and report back here with an update.
I don't know if I should add on to this thread. I get the exact same message:
com.azure.core.amqp.exception.AmqpException: New receiver 'nil' with higher epoch of '0' is created hence current receiver 'nil' with epoch '0' is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used.
And it is definitely caused by two EventProcessorClients consuming the same event hub using the same consumer group. In other words, as soon as I started another EventProcessorClient to get data using the same consumer group, I got this error from the error handler in my first process. I wonder if this is the same issue. I am using azure-messaging-eventhubs 5.10.1.
Here is the stacktrace:
[2021-10-22 00:02:21.759] [ERROR] [reactor.core.publisher.Operators] Operator called default onErrorDropped
com.azure.core.amqp.exception.AmqpException: New receiver 'nil' with higher epoch of '0' is created hence current receiver 'nil' with epoch '0' is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used. TrackingId:dec292a1000272a6001ca9de6171fcbb_G21_B30, SystemTracker:xxxxtesthub:eventhub:xxxxtest~32766|mz_consumer, Timestamp:2021-10-22T00:02:21, errorContext[NAMESPACE: xxxxtesthub.servicebus.windows.net. ERROR CONTEXT: N/A, PATH: xxxxtest/ConsumerGroups/xx_consumer/Partitions/0, REFERENCE_ID: 0_aad076_1634860219635, LINK_CREDIT: 86]
at com.azure.core.amqp.implementation.ExceptionUtil.toException(ExceptionUtil.java:85)
at com.azure.core.amqp.implementation.handler.LinkHandler.handleRemoteLinkClosed(LinkHandler.java:110)
at com.azure.core.amqp.implementation.handler.LinkHandler.onLinkRemoteClose(LinkHandler.java:61)
at com.azure.core.amqp.implementation.handler.ReceiveLinkHandler.onLinkRemoteClose(ReceiveLinkHandler.java:193)
at org.apache.qpid.proton.engine.BaseHandler.handle(BaseHandler.java:176)
at org.apache.qpid.proton.engine.impl.EventImpl.dispatch(EventImpl.java:108)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.dispatch(ReactorImpl.java:324)
at org.apache.qpid.proton.reactor.impl.ReactorImpl.process(ReactorImpl.java:291)
at com.azure.core.amqp.implementation.ReactorExecutor.run(ReactorExecutor.java:86)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:68)
at reactor.core.scheduler.SchedulerTask.call(SchedulerTask.java:28)
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.base/java.lang.Thread.run(Thread.java:829)
Hi Team, are there any updates on this issue? I am getting this issue with the latest SDK. Thanks.
"@azure/event-hubs": "^5.8.0", "@azure/eventhubs-checkpointstore-blob": "^1.0.1",
Hi @tony0918, could you please share more details about the issue you're facing and how your environment is setup? What are the recent changes performed by the load balancer right up to the occurrence of the issue?
Please see the troubleshooting guide for more details on it.
Hi @tony0918, could you please share more details about the issue you're facing and how your environment is setup? What are the recent changes performed by the load balancer right up to the occurrence of the issue?
Please see the troubleshooting guide for more details on it.
@deyaaeldeen, I was running my Node.js application with the Azure Event Hubs SDK on my local laptop. The application connects to an Event Hub via connection string and saves checkpoints to a Storage account via a SAS connection string. The Event Hub has 32 partitions, but only one application was running. When the application had just started there were no error logs, but after ~30 seconds the error appeared repeatedly.
@tony0918 Thanks for the context, could you share a small repro code that I can use for testing on my end?
@chradek We got the same error when using the checkpointstore SDK to consume messages from Event Hubs. And we found that the entity on which the same issue occurs is an empty entity. We didn't send any messages to this entity during the past month; could you please help clarify whether the empty entity would cause this issue? Thanks so much!
Hi @chradek, we encountered this same issue. We have one namespace with 2 empty entities (besides those 2 empty entities, we also have some entities in use). So we are afraid those 2 empty entities could cause the Event Hub partition claim to time out? And we also saw lots of timeout errors in the logs.
And do we have any workaround? Any comments are welcome, thanks in advance!
Hi wstest123 and zhangyunbo1994, apologies for the late reply. Thanks a lot for reporting this issue.
Could you please share more details about the setup used on the consumer side?
Hi @deyaaeldeen, we encountered this same issue, and I am not sure whether I lose some logs when clock skew occurs. Also, how do I find out how many partitions we have? Thanks.
@godt21 Please follow the instructions in the [troubleshooting guide] to enable the logs. For your second question, the number of partitions in one event hub can be found either in the Azure portal or in the logs.
Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!
Hi all, I also have this error that happens very often in 2 environments
The error is:
New receiver 'P27-b2130d71-b77b-46bd-9ea8-e13e2aba7432' with higher epoch of '0' is created hence current receiver 'P27-cbb6f205-69fa-4c30-902f-f3013a8d4820' with epoch '0' is getting disconnected. If you are recreating the receiver, make sure a higher epoch is used. TrackingId:fef5ffe50001924600095cb664467e65_G23S3B10S1, SystemTracker:[my hidden-event-hub-name]:eventhub:[hidden]~28671|sa-consumer-production, Timestamp:2023-04-24T13:21:31 (e[my-hidden-service-name]_). For troubleshooting information, see https://aka.ms/azsdk/net/eventhubs/exceptions/troubleshoot
how can we solve this?
Describe the bug
Currently we run 8 instances of our application. We are regularly seeing the below errors in the processError method. We know the reason for it being OwnershipLost, as we have the same number of messages of OwnershipLost and MessagingError at the same point in time.
We have some questions:
1) Are the above errors due to load balancing done by the Azure Event Hubs SDK? If yes, is load balancing an ongoing process even if none of our services restarted, i.e. does the ownership of partitions keep changing even if every application instance is running 24*7? As per this line, it seems load balancing is an ongoing process and the ownership keeps changing from time to time; our logs also suggest that, given the amount of error [Ownership] logs we get frequently. Please confirm.
2) Is the processInitialize method called each time the ownership is transferred from one application instance to another? As per our logs, we never received the processInitialize call after the MessagingError or OwnershipLost error log was received. As per this line, it should have been called.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
processInitialize should be called if ownership is changed between instances of the application.