Azure / azure-sdk-for-go

This repository is for active development of the Azure SDK for Go. For consumers of the SDK we recommend visiting our public developer docs at:
https://docs.microsoft.com/azure/developer/go/

Long running consumer will not give up partition ownership when new consumer instance comes online #22666

Closed (dhasek00 closed this issue 5 months ago)

dhasek00 commented 7 months ago

Bug Report

I'm using azeventhubs to build a Benthos (benthos.dev) consumer plugin with checkpointing, following the example in example_consuming_with_checkpoints_test.go. In testing with a simple event hub containing 2 partitions, running 2 instances fails to load balance properly once the first Benthos consumer input processor has been running for a moderate amount of time.

For example, start consumer 1 and wait ~60 seconds. It should now be processing both partitions. Then start consumer 2. Consumer 1 will continue processing both partitions while consumer 2 will process a single partition, resulting in duplicate events.

However, if both consumers are started at around the same time, load balancing seems to operate normally. I can start or stop either one and they will drop or take over partitions as expected. The problem only occurs once a single consumer has been working on both partitions for a longer time.

I've tried both the balanced and greedy load balancing strategies.

No matter how long a consumer runs, I expect to be able to add or remove additional benthos consumers and see consistent load balancing.

Try to run a normal consumer outside of Benthos and see if the results are as expected. I'm not sure if it's because it's running as a Benthos plugin, but everything else behaves fine except for the balancing.

github-actions[bot] commented 7 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @jfggdl.

richardpark-msft commented 6 months ago

Hi @dhasek00, I've tried running this same test a few different ways and everything appears to be working correctly.

One thing I'm curious about is if it's possible that the two Benthos instances are NOT using the same Azure Storage container. If they were not, it's possible to see some of the behavior you describe where the two partition processors act as if they are unaware of each other.
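To illustrate that hypothesis, here is a minimal, self-contained sketch using only the Go standard library. It is not the azeventhubs load balancing implementation; `ownershipStore`, `rebalance`, and `ownedBy` are hypothetical names made up for this example, and the fair-share rule is an assumed simplification. The store stands in for the Azure Storage container: a map from partition ID to the consumer that currently claims it.

```go
package main

import (
	"fmt"
	"sort"
)

// ownershipStore is a hypothetical stand-in for one storage container:
// it maps a partition ID to the consumer that currently claims it.
type ownershipStore map[string]string

// rebalance is an assumed, simplified balancing rule (not the SDK's):
// the consumer claims unowned partitions, or takes one from an owner that
// holds more than the fair share, until it reaches its own fair share.
func rebalance(store ownershipStore, partitions []string, self string) {
	// Count partitions per owner visible in this store, including self.
	counts := map[string]int{self: 0}
	for _, p := range partitions {
		if owner, ok := store[p]; ok {
			counts[owner]++
		}
	}
	// Fair share is ceil(partitions / known consumers).
	fair := (len(partitions) + len(counts) - 1) / len(counts)

	for _, p := range partitions {
		if counts[self] >= fair {
			return
		}
		owner, owned := store[p]
		if !owned || (owner != self && counts[owner] > fair) {
			if owned {
				counts[owner]--
			}
			store[p] = self
			counts[self]++
		}
	}
}

// ownedBy returns the sorted partition IDs that self claims in store.
func ownedBy(store ownershipStore, self string) []string {
	var out []string
	for p, owner := range store {
		if owner == self {
			out = append(out, p)
		}
	}
	sort.Strings(out)
	return out
}

func main() {
	partitions := []string{"0", "1"}

	// Case 1: both consumers read and write the same store.
	shared := ownershipStore{}
	rebalance(shared, partitions, "consumer-1") // runs alone first, claims both
	rebalance(shared, partitions, "consumer-2") // joins later, takes one
	fmt.Println("shared store:", ownedBy(shared, "consumer-1"), ownedBy(shared, "consumer-2"))

	// Case 2: each consumer uses its own store and cannot see the other.
	storeA, storeB := ownershipStore{}, ownershipStore{}
	rebalance(storeA, partitions, "consumer-1")
	rebalance(storeB, partitions, "consumer-2")
	fmt.Println("separate stores:", ownedBy(storeA, "consumer-1"), ownedBy(storeB, "consumer-2"))
}
```

In the shared case each consumer ends up with one partition. With separate stores, each consumer sees every partition as unowned and claims both, which is the kind of overlap that would show up as duplicate events.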

github-actions[bot] commented 6 months ago

Hi @dhasek00. Thank you for opening this issue and giving us the opportunity to assist. To help our team better understand your issue and the details of your scenario please provide a response to the question asked above or the information requested above. This will help us more accurately address your issue.

github-actions[bot] commented 6 months ago

Hi @dhasek00, we're sending this friendly reminder because we haven't heard back from you in 7 days. We need more information about this issue to help address it. Please be sure to give us your input. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!