Azure / azure-sdk-for-net

This repository is for active development of the Azure SDK for .NET. For consumers of the SDK we recommend visiting our public developer docs at https://learn.microsoft.com/dotnet/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-net.

[QUERY] Azure Event Hub Trigger reading duplicate records after increasing the partition count on Event Hub #43065

Closed ch798543 closed 4 months ago

ch798543 commented 5 months ago

Library name and version

Microsoft.Azure.WebJobs.Extensions.EventHubs (5.2.0)

Query/Question

We have an Event Hub on the dedicated tier with two consumer groups. We recently changed the partition count from 16 to 64 and observed that our associated Azure Event Hub trigger started reading duplicate events.

My function app is on an isolated App Service plan (Isolated v2, I2V2) and runs 8 instances. We are using the following settings in our function app host.json (see the sketch after this list):

maxEventBatchSize: 2048
batchCheckpointFrequency: 1
prefetchCount: 4096
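
For reference, a minimal host.json sketch carrying these settings, assuming the flat v5.x extension schema (the rest of the file is abbreviated):

```json
{
  "version": "2.0",
  "extensions": {
    "eventHubs": {
      "maxEventBatchSize": 2048,
      "batchCheckpointFrequency": 1,
      "prefetchCount": 4096
    }
  }
}
```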

I am using the FixedDelayRetry attribute on my Event Hub trigger function, with maxRetryCount set to 3 and delayInterval set to 5 minutes.
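
A minimal sketch of such a trigger, assuming the in-process model (matching the library named above); the function name, hub name, and connection setting are hypothetical:

```csharp
using Azure.Messaging.EventHubs;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class EventHubFunctions
{
    // Retry a failed invocation up to 3 times, waiting 5 minutes between attempts.
    [FunctionName("ProcessEvents")]
    [FixedDelayRetry(3, "00:05:00")]
    public static void Run(
        [EventHubTrigger("my-event-hub", Connection = "EventHubConnection")] EventData[] events,
        ILogger log)
    {
        foreach (EventData e in events)
        {
            log.LogInformation("Sequence {Seq}: {Body}", e.SequenceNumber, e.EventBody);
        }
    }
}
```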

My questions

  1. Is it possible that after increasing the partition count, the function app offset count got reset or existing checkpoints got removed?
  2. Is it possible that each event batch size is read by multiple function app instances?
  3. Any other information that would help me in triaging the issue?

Environment

No response

github-actions[bot] commented 5 months ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

jsquire commented 5 months ago

Hi @ch798543. Thanks for reaching out, and we regret that you're experiencing difficulties. There's not enough information available to comment on why you're seeing the behavior. A few questions:

When you say that you "changed the partition count", can you clarify what you mean? Did you delete and recreate the hub or dynamically increase the partitions against the existing hub?

Did you restart your Function after changing partitions or just leave it running?

Did you make any changes to your Function configuration?

To answer your questions:

1) Unless you deleted/recreated the Event Hub, physically removed data from the Azure Blob storage container, changed configuration to use a new Blob container, or changed configuration to use a new consumer group, your existing checkpoints would continue to exist and remain valid. Any one of those conditions, however, would invalidate your checkpoints. (One way to confirm which checkpoints exist is sketched after this list.)

2) No. You will see 1-2 batches of events potentially duplicated between instances each time your Function scales up or down, as partition ownership changes. You will also see duplication due to rewinds during scaling: there is no coordinated hand-off when ownership changes, and you cannot assume the old owner wrote a checkpoint that the new owner sees. The new owner will rewind to the last checkpoint written.

3) By default, Azure SDK logs are emitted by your Function into your Application Insights instance. Sharing a 5-minute slice of logs from the "Azure-Messaging-EventHubs" source at the time you observed the behavior, filtered to the Event Hubs processor events, would help us understand what the client saw and was reacting to.
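
As a triage aid for point 1, one way to confirm that checkpoints still exist is to list the checkpoint blobs directly. A rough sketch, assuming the v5 checkpoint layout ({namespace}/{hub}/{consumer-group}/checkpoint/{partition}) and placeholder names throughout:

```csharp
using System;
using Azure.Storage.Blobs;
using Azure.Storage.Blobs.Models;

class CheckpointInspector
{
    static void Main()
    {
        // Placeholders: use the storage account your Function is configured with.
        // "azure-webjobs-eventhub" is the container the Functions extension uses by default.
        var container = new BlobContainerClient(
            "<storage-connection-string>",
            "azure-webjobs-eventhub");

        // One checkpoint blob per partition lives under this prefix (v5 naming convention).
        string prefix = "<namespace>.servicebus.windows.net/<event-hub>/<consumer-group>/checkpoint/";

        foreach (BlobItem blob in container.GetBlobs(traits: BlobTraits.Metadata, prefix: prefix))
        {
            // The offset and sequence number are stored as blob metadata, not blob content.
            blob.Metadata.TryGetValue("sequencenumber", out var sequence);
            blob.Metadata.TryGetValue("offset", out var offset);
            Console.WriteLine($"{blob.Name}: offset={offset}, sequence={sequence}");
        }
    }
}
```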

github-actions[bot] commented 5 months ago

Hi @ch798543. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

ch798543 commented 5 months ago

Thanks @jsquire for the response.

Find below my responses inline.

When you say that you "changed the partition count", can you clarify what you mean? Did you delete and recreate the hub or dynamically increase the partitions against the existing hub?
-- We updated the partition count from the Azure Portal by going to the Configuration settings. We did not delete or recreate the hub.

Did you restart your Function after changing partitions or just leave it running?
-- We restarted the Function app after the partition count change.

Did you make any changes to your Function configuration?
-- Yes, we updated the following settings in host.json:
maxEventBatchSize: 2048 (previous value: 60)
batchCheckpointFrequency: 1
prefetchCount: 4096 (previous value: 60)

ch798543 commented 5 months ago

@jsquire I came across another setting in my function host.json file: initialOffsetOptions. We did not have the initialOffsetOptions property set, and its default value is fromStart. Does that mean that when I increased the partition count from 16 to 64, all new partitions started to read messages from the start of the stream?
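
For reference, the setting in question would look roughly like this in host.json (fromStart is shown explicitly here, though it is the documented default when the property is omitted):

```json
{
  "version": "2.0",
  "extensions": {
    "eventHubs": {
      "initialOffsetOptions": {
        "type": "fromStart"
      }
    }
  }
}
```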

jsquire commented 5 months ago

-- We restarted the Function app after the partition count change

That would have caused listeners to stop and each partition would restart from the last checkpoint written.

batchCheckpointFrequency : 1

What was this previously? This value would indicate that a checkpoint was written after each Function invocation. That doesn't sound like the behavior that you're seeing.

Does that means when I increased the partition count from 16 to 64 then all new partitions started to read the messages from start of the stream?

That means that if no checkpoint was found, processing for a partition would start at the beginning of the stream. The question that we need to answer is: why was there no checkpoint found?

ch798543 commented 5 months ago

@jsquire Thanks for your reply

Please find my response inline

batchCheckpointFrequency: 1
What was this previously? This value would indicate that a checkpoint was written after each Function invocation. That doesn't sound like the behavior that you're seeing.
--- Earlier it was set to 4; we changed it to 1 when we started seeing the duplicate event processing issue. What is the recommended value to use for batchCheckpointFrequency?

jsquire commented 5 months ago

Earlier it was set 4, we changed it to 1 when we started seeing the duplicate event processing issue.

Given your previous batch size setting (60), this would mean that you were checkpointing at most every 240 events (4 batches × 60 events per batch). Depending on how many events each read was able to slurp up, it may have been fewer. Under normal circumstances, this would mean I'd expect to see a rewind of between 0 and 300 events (the 240 between checkpoints, plus up to one in-flight batch) every time there was a scaling operation or a host migration in Functions. (Non-deterministic, based on state at the time of the change.)

What is the recommended value to be used for batchCheckpointFrequency?

There is none. It's a question that each application needs to answer for itself. If you checkpoint more frequently, processing will be slower, but you'll see smaller rewinds and fewer duplicates. If you checkpoint less frequently, you'll see higher throughput, but bigger rewinds and more duplicates when scaling or migrations happen. That's your trade-off.

In either case, it's important to keep the Event Hubs at-least-once guarantee in mind. There will be some number of duplicate events possible, no matter what you do. Your application must be tolerant of duplicates and should be idempotent when processing. The question that I'd ask myself is how expensive it is for your application to process events and whether you want to be able to process more events quickly or guard against having to deal with duplicates.
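
To illustrate what duplicate tolerance can look like, here is a minimal sketch that skips events it has already seen, keyed on a unique identifier. The in-memory set is purely illustrative; a real application would use durable, shared storage (such as a database unique constraint). The function and hub names are hypothetical:

```csharp
using System.Collections.Concurrent;
using Azure.Messaging.EventHubs;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class IdempotentEventHubFunction
{
    // Keys already processed. In-memory only for illustration: this resets on restart
    // and is not shared between scaled-out instances, so it is not a real dedup store.
    private static readonly ConcurrentDictionary<string, bool> Processed = new();

    [FunctionName("ProcessEventsIdempotently")]
    public static void Run(
        [EventHubTrigger("my-event-hub", Connection = "EventHubConnection")] EventData[] events,
        ILogger log)
    {
        foreach (EventData e in events)
        {
            // MessageId stands in for any unique business identifier carried by the event.
            string key = e.MessageId ?? e.SequenceNumber.ToString();

            if (!Processed.TryAdd(key, true))
            {
                // Expected under at-least-once delivery; safe to skip.
                log.LogDebug("Skipping duplicate event {Key}", key);
                continue;
            }

            // ... actual processing goes here ...
        }
    }
}
```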

ch798543 commented 5 months ago

@jsquire Thanks for your reply

I've updated my host.json settings as follows, resulting in fewer duplicates:

maxEventBatchSize: 2048
batchCheckpointFrequency: 1
prefetchCount: 4096
Incoming events: approx. 2000 per second

Previously, the following settings were causing numerous duplicate failures:

maxEventBatchSize: 2048
batchCheckpointFrequency: 4
prefetchCount: 4096
Incoming events: approx. 2000 per second

Regarding batchCheckpointFrequency, does a higher count mean we'll encounter more duplicates even without any scaling operations? Additionally, you mentioned host migration in Functions. Does this occur automatically?

jsquire commented 5 months ago

We generally advise at least a 3:1 ratio between prefetch count and batch size, though that will vary by application. If you're seeing your Function consistently invoked with your requested batch size, all is well. If you're consistently seeing fewer events than that, you'll want to bump up the prefetch count.
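
Applied to the batch size in this thread, that 3:1 guidance would look something like the following sketch (6144 is simply 3 × 2048, not an official recommendation):

```json
{
  "version": "2.0",
  "extensions": {
    "eventHubs": {
      "maxEventBatchSize": 2048,
      "prefetchCount": 6144
    }
  }
}
```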

Regarding the batchCheckpointFrequency, does a higher count mean we'll encounter more duplicates even without any scaling operations?

I can't answer this with any accuracy, as it depends on knowledge of how the Functions infrastructure manages where a Function lives and how often it moves around. This is outside of the insight and influence of the Event Hubs extension package.

The best that I can do is "maybe". The setting translates to "after how many batches are sent to your function should we write a checkpoint?" Your previous value, 4, meant "write a checkpoint after you call my Function 4 times." Your current value will checkpoint after each invocation of the Function.

Additionally, you mentioned host migration in Functions. Does this occur automatically?

Same answer - it depends and relies on platform knowledge that is outside the scope of the Azure SDK package.

That said, I would expect so. As with any orchestrator, Functions may rebalance work and move apps/instances around to accommodate load, handle outages, apply rolling updates/patches, recover from crashes/errors, or just because it feels like it. I would expect that they try to limit this, as it would cause slowdowns for most trigger types, but I don't have that insight.

ch798543 commented 5 months ago

@jsquire What is the typical percentage of duplicates that are considered acceptable?

jsquire commented 5 months ago

@jsquire What is the typical percentage of duplicates that are considered acceptable?

github-actions[bot] commented 5 months ago

Hi @ch798543, we're sending this friendly reminder because we haven't heard back from you in 7 days. We need more information about this issue to help address it. Please be sure to give us your input. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!