MicrosoftDocs / azure-docs

Open source documentation of Microsoft Azure
https://docs.microsoft.com/azure
Creative Commons Attribution 4.0 International

Event hub event processed twice by two processors at the same time #121643

Closed · yp87 closed 4 months ago

yp87 commented 4 months ago

Hello, we ran into an issue where an event was processed twice for the same partition by two different processors at the same time. We know an event can be processed twice at different times, but we did not expect it to be processed twice at the same time.

The documentation says this:

Each event processor instance acquires ownership of a partition and starts processing the partition from last known checkpoint. If a processor fails (VM shuts down), then other instances detect it by looking at the last modified time. Other instances try to get ownership of the partitions previously owned by the inactive instance. The checkpoint store guarantees that only one of the instances succeeds in claiming ownership of a partition. So, at any given point of time, there is at most one processor that receives events from a partition.
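For reference, a minimal sketch of the setup that paragraph describes, using the Python azure-eventhub SDK with a blob checkpoint store (all connection strings and names below are placeholders):

```python
from azure.eventhub import EventHubConsumerClient
from azure.eventhub.extensions.checkpointstoreblob import BlobCheckpointStore

# The checkpoint store holds both checkpoints and partition ownership claims.
checkpoint_store = BlobCheckpointStore.from_connection_string(
    "<STORAGE_CONNECTION_STRING>", "<BLOB_CONTAINER_NAME>"
)

client = EventHubConsumerClient.from_connection_string(
    "<EVENT_HUB_CONNECTION_STRING>",
    consumer_group="$Default",
    eventhub_name="<EVENT_HUB_NAME>",
    checkpoint_store=checkpoint_store,
)

def on_event(partition_context, event):
    # ... process the event ...
    # A checkpoint is what a new owner resumes from after a hand-off, so
    # anything received after the last checkpoint can be replayed.
    partition_context.update_checkpoint(event)

with client:
    client.receive(on_event=on_event, starting_position="-1")
```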

We also see that there are two load balancing strategies (greedy, the default, and balanced). Would switching to balanced prevent such an issue? I think the documentation should be clearer on this subject.

Thank you



PesalaPavan commented 4 months ago

@yp87 Thanks for your feedback! We will investigate and update as appropriate.

Naveenommi-MSFT commented 4 months ago

Hello @yp87

Regarding the documentation you mentioned, it is correct that each event processor instance acquires ownership of a partition and starts processing the partition from the last known checkpoint. If a processor fails, other instances will try to acquire ownership of the partitions previously owned by the inactive instance. The checkpoint store guarantees that only one of the instances succeeds in claiming ownership of a partition, so at any given point in time, there should be at most one processor that receives events from a partition.

However, it is possible that two processors could acquire ownership of the same partition at the same time if there is a race condition or other issue with the checkpoint store. This could result in the same event being processed twice by different processors.

Regarding the load balancing strategies: with the "greedy" strategy (the default), each load-balancing cycle a processor claims as many of the partitions needed for a balanced state as it can, so ownership settles quickly. With the "balanced" strategy, a processor claims at most one partition per cycle, which spreads ownership changes out over time at the cost of slower convergence. Both strategies end with partitions distributed evenly across the event processor instances. Switching to "balanced" may reduce contention while partitions are being claimed, which may help with the issue you experienced.
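For illustration, a minimal sketch of selecting the strategy in the Python SDK (assuming azure-eventhub 5.3 or later, where the load_balancing_strategy keyword is available; checkpoint_store is the store from the earlier sketch):

```python
from azure.eventhub import EventHubConsumerClient, LoadBalancingStrategy

client = EventHubConsumerClient.from_connection_string(
    "<EVENT_HUB_CONNECTION_STRING>",
    consumer_group="$Default",
    eventhub_name="<EVENT_HUB_NAME>",
    checkpoint_store=checkpoint_store,  # as in the sketch above
    # Default is LoadBalancingStrategy.GREEDY; BALANCED claims at most one
    # partition per load-balancing cycle.
    load_balancing_strategy=LoadBalancingStrategy.BALANCED,
)
```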

If there are any further questions regarding the documentation, please tag me in your reply and we will be happy to continue the conversation.

jsquire commented 4 months ago

@yp87: This is normal and expected behavior for processors during scaling, node migrations, or host/node crash recovery. These activities cause partitions to move between owners, and the new owner will rewind to the last recorded checkpoint. Because there is no ordered hand-off, you may see overlap for a period of one or two batches, as the old owner is not aware that a new owner has taken over until the next time it attempts to read.
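Since a short overlap is expected, the usual mitigation is to make processing idempotent. A minimal sketch of one way to do that in the Python SDK's event callback (the dedupe store and the process function are placeholders, not part of the SDK):

```python
# The dict below stands in for a durable store shared by all processor
# instances (e.g., a database table); an in-memory dict only guards a
# single process and would not catch duplicates across two hosts.
last_seen = {}  # partition_id -> highest sequence number processed

def on_event(partition_context, event):
    pid = partition_context.partition_id
    seq = event.sequence_number
    # Sequence numbers increase monotonically per partition, so an event
    # replayed during an ownership overlap is at or below the high
    # watermark already recorded for that partition.
    if seq <= last_seen.get(pid, -1):
        return  # duplicate from the hand-off overlap; skip it
    process(event)  # placeholder for the application's real work
    last_seen[pid] = seq
    partition_context.update_checkpoint(event)
```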

We do not recommend adjusting the load balancing strategy; "greedy" is the right default for the vast majority of applications. Moving to "balanced" may reduce ownership changes when your application is first started and all of the nodes are first claiming partitions, but at best, it would prevent 1-2 migrations at the cost of greatly increasing the time that it takes for all partitions to find owners and begin processing. It would have no effect for applications that are already running.

If you are seeing partitions migrate between owners frequently outside of when you are scaling your application, that is a sign that your application is unhealthy. You'll want to review the Event Hubs Troubleshooting Guide, which discusses common causes and mitigations.

spelluru commented 4 months ago

Thank you @jsquire

please-close