dotnet / orleans

Cloud Native application framework for .NET
https://docs.microsoft.com/dotnet/orleans
MIT License
10.06k stars 2.03k forks source link

Azure Queue stream provider does not delete messages if previous message has failed #7807

Open iamsamcoder opened 2 years ago

iamsamcoder commented 2 years ago

I believe I've found a bug in the Azure Queue Stream Provider package.

Scenario

Setup

Steps to reproduce

  1. Generate a stream message which will result in an exception by the consumer.
  2. Generate stream messages on the same queue as the failing message.
  3. Monitor dequeue count of messages, which increases indefinitely while failing message remains in queue. The failing message and good messages will all retry indefinitely.
  4. Manually delete the failing message from the queue.
  5. Monitor remaining messages on queue, which will process and be removed from queue by pulling agent.

Example:

Messages A and B

A throws exception in consumer and will be retried later. B is added to the same queue as A. The consumer processes B, but pulling agent does not remove B from the queue. A and B will remain in the queue indefinitely until A is manually deleted.

I was able to reproduce this in my environment.

Summary

I was able to determine the source of the exceptions for my failed messages. I can remediate through validating dependencies prior to pushing to stream, but I think there is bug in the pulling agent. Messages which are consumed without exceptions should be removed from the queue even if another previous message continues to fail in the consumer.

benjaminpetit commented 2 years ago

What do you think should happen with A? Ideally, B shouldn't be sent to the consumer until A was either processed or deleted from the queue.

iamsamcoder commented 2 years ago

I was looking for a "poison" option similar to how functions Azure Storage queue triggers work. I think it would be ideal for A to be sent to a poison queue after a configured number of retries. This way, you wouldn't have a failing message blocking other messages, but you could review/handle it later.

benjaminpetit commented 2 years ago

We could modify IStreamFailureHandler.OnDeliveryFailure to include the batch that couldn't be delivered.

The message should be deleted from the queue once it was purged from cache. Unfortunately, in this case, the cache keeps the event in memory, because it wasn't delivered... maybe we should modify SimpleQueueCache to purge event if they are too old.

If the event wasn't delivered to your consumer, you should get notified with OnErrorAsync. From there you could remove your subscription and resubscribe again with a null token to skip the event that caused the issue. Note that you might lose more than one event in this case (between the time you resubscribe, we might have purged some events from the cache).

iamsamcoder commented 1 year ago

Hi Benjamin,

I've seen that the message is delivered. What I've observed is that if Message A is received by a consumer and fails to complete (say it throws invalid operation), then Message A will get stuck in a retry loop. It keeps being delivered to the consumer, failing, and retrying. However, messages B, C, etc behind A on the same queue are also sent to a consumer and complete (I can see their result downstream). The issue for B, C, etc is that they also get stuck in a retry loop even though they succeed.

If I can, I'll try to create a simple experiment to test this.

On Thu, Jul 7, 2022 at 4:59 AM Benjamin Petit @.***> wrote:

We could modify IStreamFailureHandler.OnDeliveryFailure to include the batch that couldn't be delivered.

The message should be deleted from the queue once it was purged from cache. Unfortunately, in this case, the cache keeps the event in memory, because it wasn't delivered... maybe we should modify SimpleQueueCache to purge event if they are too old.

If the event wasn't delivered to your consumer, you should get notified with OnErrorAsync. From there you could remove your subscription and resubscribe again with a null token to skip the event that caused the issue. Note that you might lose more than one event in this case (between the time you resubscribe, we might have purged some events from the cache).

— Reply to this email directly, view it on GitHub https://github.com/dotnet/orleans/issues/7807#issuecomment-1177495957, or unsubscribe https://github.com/notifications/unsubscribe-auth/AB3BOUXYLHDK4FGONHTK4L3VS3BDDANCNFSM5ZVFNGWQ . You are receiving this because you authored the thread.Message ID: @.***>