Azure / durabletask

Durable Task Framework allows users to write long-running, persistent workflows in C# using its async/await capabilities.
Apache License 2.0

Poison messages handling #1040

Open saguiitay opened 4 months ago

saguiitay commented 4 months ago

As a knock-on effect of issue #1039, I've noticed that the message keeps being dequeued and handled indefinitely. I see the following traces in my logs:

PoisonMessageDetected: d8bdcf4d-ef40-4501-a99b-6e72309644ee: Message [TaskScheduled#1] with ID 6128efd4-8904-494d-9a2d-d2d8d01edbe7 has been dequeued 79 times and is now considered poison: {"Account":"...","TaskHub":"Services","...":"TaskScheduled","TaskEventId":1,"MessageId":"6128efd4-8904-494d-9a2d-d2d8d01edbe7","InstanceId":"d8bdcf4d-ef40-4501-a99b-6e72309644ee","ExecutionId":"b27db20915cf4e5db6cad3589dde88df","PartitionId":"...-workitems","DequeueCount":79}
AbandoningMessage: d8bdcf4d-ef40-4501-a99b-6e72309644ee: Abandoning [TaskScheduled#1] message back to ...-workitems and setting a visibility delay of 600ms: {"Account":"...","TaskHub":"...","EventType":"TaskScheduled","TaskEventId":1,"MessageId":"6128efd4-8904-494d-9a2d-d2d8d01edbe7","InstanceId":"d8bdcf4d-ef40-4501-a99b-6e72309644ee","ExecutionId":"b27db20915cf4e5db6cad3589dde88df","PartitionId":"services-workitems","SequenceNumber":208,"PopReceipt":"AgAAAAMAAAAAAAAAxVkNR+xc2gE=","VisibilityTimeoutSeconds":600}

Side note: the log message says "setting a visibility delay of 600ms", while the code actually delays for 600 s, not ms (the same trace reports "VisibilityTimeoutSeconds":600).

Notice the DequeueCount value of 79 (I've seen values much higher). Perhaps there should be a setting that controls the maximum number of dequeue attempts, after which the message is simply discarded?

@cgillum @davidmrdavid

cgillum commented 4 months ago

I think it makes sense to expose this as a setting. The reason we let it keep going by default is to avoid data loss and to give users a chance to fix the root cause, but it makes sense to allow overriding this behavior.
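To make the proposal concrete, here is a minimal, language-neutral sketch of the decision being discussed, written in Python. It is not Durable Task Framework code: `PoisonAction`, `decide_poison_action` and `max_dequeue_count` are hypothetical names made up for illustration, and the real setting (if one is added) may look quite different. It assumes the message has already been flagged as poison, and that leaving the limit unset keeps the current behavior of abandoning the message back to the queue indefinitely.

```python
from enum import Enum
from typing import Optional


class PoisonAction(Enum):
    """What to do with a message already flagged as poison (hypothetical)."""
    ABANDON = "abandon"   # put it back on the queue with a visibility delay
    DISCARD = "discard"   # drop it permanently (accepting the data loss)


def decide_poison_action(dequeue_count: int,
                         max_dequeue_count: Optional[int] = None) -> PoisonAction:
    """Pick an action for a poison message.

    max_dequeue_count is a hypothetical opt-in setting. None means
    "no limit", which mirrors the default described in the thread:
    keep retrying so the user can fix the root cause without losing data.
    """
    if dequeue_count < 0:
        raise ValueError("dequeue_count must be non-negative")
    if max_dequeue_count is not None and max_dequeue_count < 1:
        raise ValueError("max_dequeue_count must be at least 1 when set")

    # Discard only when a limit is configured and has been reached.
    if max_dequeue_count is not None and dequeue_count >= max_dequeue_count:
        return PoisonAction.DISCARD
    return PoisonAction.ABANDON


if __name__ == "__main__":
    # The message from the traces above had DequeueCount == 79.
    print(decide_poison_action(79))                        # no limit configured
    print(decide_poison_action(79, max_dequeue_count=50))  # limit reached
```

Keeping the limit optional and unset by default preserves the "avoid data loss" behavior for existing users, while those who prefer to drop such messages can opt in explicitly.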