Azure / azure-sdk-for-go

This repository is for active development of the Azure SDK for Go. For consumers of the SDK we recommend visiting our public developer docs at:
https://docs.microsoft.com/azure/developer/go/

Event Hubs consumer receiving fewer events than were ingested. #22947

Closed: shourabhpayal closed this issue 1 month ago

shourabhpayal commented 4 months ago

Bug Report

Package: messaging/azeventhubs
SDK version: 1.18
Version: v1.0.3
Commit: e63fa2461487e16f6b54a54191db7ac838bcd216

Hi team, I am facing an issue where the incoming message count > outgoing message count (it seems like these messages just get dropped; I tried multiple runs): [image: metrics graph of incoming vs. outgoing messages]

  1. Are you seeing duplicate messages?
  2. Are you seeing any failures from partitionClient.UpdateCheckpoint? If that function fails, restarts will end up consuming the same events again.
  3. Are you running multiple Processor instances? And if so, are all of those instances using the same consumer group?
  4. Are you using the same Azure Storage Blob container for each run of the Processor?

Answering a few questions here:

  1. No
  2. No
  3. Yes, multiple instances using the same consumer group. I run 10 tasks whose consumers read from my event hub's 32 partitions (each consumer is assigned 6-7 partitions).
  4. Yes

Additionally, if I increase the number of consumers to, say, 20, the outgoing messages also double. I don't think this should be the behaviour here, as the Event Hubs documentation says throughput is 2 MB/s (egress) per throughput unit and I have 40 of them.

This is also not a capacity problem with the RAM or CPU of my consumer application, as I have achieved much higher throughput with the exact same configuration of 10 consumers against a premium Azure Event Hubs deployment.

Kindly help.

github-actions[bot] commented 4 months ago

Thank you for your feedback. Tagging and routing to the team member best able to assist.

shourabhpayal commented 4 months ago

I did some more testing and formed a theory around it. The docs mention that the standard tier can support 40 TUs and that each TU supports an ingress of 1 MB/s. This should mean I can ingest at most 40 MB/s with all TUs active. Since I am ingesting a lot of data, the TUs may be crashing and the messages getting lost. Can this happen?

richardpark-msft commented 4 months ago

Hi @shourabhpayal, I was on vacation; I'll take a look now.

One interesting thing about our Event Hubs library is that prefetching is on by default. This means there's an internal cache of messages that continually gets refreshed up to the limit configured in the Prefetch field when you create PartitionClients:

https://pkg.go.dev/github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs#PartitionClientOptions

By default this value is 300 per partition, so 32 * 300 (9,600 cached events) might be adding up, depending on how often you close your clients without having pulled all the messages that are sitting in the cache.

You can actually disable prefetching by setting the Prefetch value to -1 (or any value less than 0). This turns off any background fetching of events.
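For illustration, here's a minimal sketch of disabling prefetching when creating a PartitionClient. The connection string, event hub name, partition ID, batch size, and timeout are placeholders, not your setup:

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs"
)

func main() {
	// Placeholder connection details, for illustration only.
	consumerClient, err := azeventhubs.NewConsumerClientFromConnectionString(
		"<EVENT HUBS CONNECTION STRING>", "<EVENT HUB NAME>",
		azeventhubs.DefaultConsumerGroup, nil)
	if err != nil {
		log.Fatalf("failed to create consumer client: %s", err)
	}
	defer consumerClient.Close(context.TODO())

	// Prefetch < 0 disables the background cache entirely; events are only
	// requested from the service when ReceiveEvents is called.
	partitionClient, err := consumerClient.NewPartitionClient("0", &azeventhubs.PartitionClientOptions{
		Prefetch:      -1,
		StartPosition: azeventhubs.StartPosition{Latest: to.Ptr(true)},
	})
	if err != nil {
		log.Fatalf("failed to create partition client: %s", err)
	}
	defer partitionClient.Close(context.TODO())

	// Receive one batch, waiting at most 30 seconds.
	receiveCtx, cancel := context.WithTimeout(context.TODO(), 30*time.Second)
	defer cancel()

	events, err := partitionClient.ReceiveEvents(receiveCtx, 100, nil)
	if err != nil && !errors.Is(err, context.DeadlineExceeded) {
		log.Fatalf("failed to receive events: %s", err)
	}
	log.Printf("received %d events", len(events))
}
```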

shourabhpayal commented 4 months ago

I am not closing the connection; I have partitionClient.ReceiveEvents() in a for loop, as shown in the processEvents method of this example: https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-go-get-started-send#code-to-receive-events-from-an-event-hub
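For reference, this is roughly the shape of the loop I'm running (a sketch modeled on the quickstart's processEvents; the batch size, timeout, and logging are illustrative, not my exact code):

```go
package consumer

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/Azure/azure-sdk-for-go/sdk/messaging/azeventhubs"
)

// processEvents receives batches from a single partition, mirroring the
// quickstart's processEvents function. Batch size and timeout are illustrative.
func processEvents(partitionClient *azeventhubs.ProcessorPartitionClient) error {
	defer partitionClient.Close(context.TODO())

	for {
		// Wait up to a minute for up to 100 events.
		receiveCtx, cancel := context.WithTimeout(context.TODO(), time.Minute)
		events, err := partitionClient.ReceiveEvents(receiveCtx, 100, nil)
		cancel()

		if err != nil && !errors.Is(err, context.DeadlineExceeded) {
			return err
		}

		if len(events) == 0 {
			continue
		}

		for _, event := range events {
			log.Printf("partition %s: event with %d body bytes",
				partitionClient.PartitionID(), len(event.Body))
		}

		// Checkpoint after processing so a restart resumes past these events.
		if err := partitionClient.UpdateCheckpoint(context.TODO(), events[len(events)-1], nil); err != nil {
			return err
		}
	}
}
```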

I experimented with Prefetch but the results were the same.

richardpark-msft commented 4 months ago

@shourabhpayal, I made a mistake when I first looked at this because you and I had a similar conversation on another thread where it was the opposite: more events were being consumed than were being sent.

I am facing an issue where Incoming message count > Outgoing message count (Seems like these message just get dropped. I tried multiple runs.):

I'm not sure if I understand your conclusion here about messages being dropped. Looking at your usage graph it looks like your production of events is far outpacing your ability to consume them, which seems like the opposite problem.

Additionally if I increase the number for consumers for example 20 the outgoing messages also double. I don't think this should be the behaviour here as the azeventhub documentation mentions that throughput is 2 MB/s per Throughput unit and I have 40 of them.

This is an interesting point: if you doubled your consumers and that doubled your consumption rate, then perhaps you have a different bottleneck? Are you possibly network constrained, and did using more consumers spread you out across more hosts?

This is also not a problem of capacity with ram or cpu of my consumer application as I have achieved a much larger throughput using the exact config of 10 consumers using a premium azure event hub deployment.

One thing that can impact you (significantly) is if your events are not evenly distributed among your partitions. If that's the case then you can end up with some consumers being very busy, and others not reading much at all. We don't read from a single partition with multiple readers, so equal distribution is key.

Do you have some client-side statistics to see how evenly your events have been distributed amongst your consumers, per-partition?
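If you don't have those numbers handy, here is a rough sketch of a client-side counter you could drop into each partition's receive loop (the type and names are illustrative, not part of the SDK):

```go
package consumer

import (
	"log"
	"sync"
)

// partitionStats tracks how many events each partition has delivered.
// This is an illustrative helper, not part of azeventhubs.
type partitionStats struct {
	mu     sync.Mutex
	counts map[string]int64
}

func newPartitionStats() *partitionStats {
	return &partitionStats{counts: map[string]int64{}}
}

// Add records that n events were received from the given partition, e.g.
// stats.Add(partitionClient.PartitionID(), len(events)) after each batch.
func (s *partitionStats) Add(partitionID string, n int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.counts[partitionID] += int64(n)
}

// Log prints the running per-partition totals.
func (s *partitionStats) Log() {
	s.mu.Lock()
	defer s.mu.Unlock()
	for id, c := range s.counts {
		log.Printf("partition %s: %d events received so far", id, c)
	}
}
```

Logging those totals periodically would show whether some partitions (and therefore some consumers) are receiving far more than others.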

github-actions[bot] commented 2 months ago

Hi @shourabhpayal. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

github-actions[bot] commented 2 months ago

Hi @shourabhpayal, we're sending this friendly reminder because we haven't heard back from you in 7 days. We need more information about this issue to help address it. Please be sure to give us your input. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!