Azure / durabletask

Durable Task Framework allows users to write long running persistent workflows in C# using the async/await capabilities.
Apache License 2.0
1.53k stars 296 forks source link

Poison messages handling #1040

Open saguiitay opened 9 months ago

saguiitay commented 9 months ago

As a rokochet of issue #1039 I've noticed that the message is being handled indefinitely. I see the following traces in my logs:

PoisonMessageDetected: d8bdcf4d-ef40-4501-a99b-6e72309644ee: Message [TaskScheduled#1] with ID 6128efd4-8904-494d-9a2d-d2d8d01edbe7 has been dequeued 79 times and is now considered poison: {"Account":"...","TaskHub":"Services","...":"TaskScheduled","TaskEventId":1,"MessageId":"6128efd4-8904-494d-9a2d-d2d8d01edbe7","InstanceId":"d8bdcf4d-ef40-4501-a99b-6e72309644ee","ExecutionId":"b27db20915cf4e5db6cad3589dde88df","PartitionId":"...-workitems","DequeueCount":79}
AbandoningMessage: d8bdcf4d-ef40-4501-a99b-6e72309644ee: Abandoning [TaskScheduled#1] message back to ...-workitems and setting a visibility delay of 600ms: {"Account":"...","TaskHub":"...","EventType":"TaskScheduled","TaskEventId":1,"MessageId":"6128efd4-8904-494d-9a2d-d2d8d01edbe7","InstanceId":"d8bdcf4d-ef40-4501-a99b-6e72309644ee","ExecutionId":"b27db20915cf4e5db6cad3589dde88df","PartitionId":"services-workitems","SequenceNumber":208,"PopReceipt":"AgAAAAMAAAAAAAAAxVkNR+xc2gE=","VisibilityTimeoutSeconds":600}

Side note: the message say setting a visibility delay of 600ms, while the code actually delays for 600s (not ms).

Notice the DequeueCount value of 79 (I've seen values much higher). Perhaps there should be a setting that controls the maximum number of dequeues attempts, after which the message should just be disposed?

@cgillum @davidmrdavid

cgillum commented 9 months ago

I think it makes sense to expose this as a setting. The reason that we allow it to keep going by default is to avoid data loss and allow users a chance to fix the root cause, but it makes sense to allow overriding this behavior.

sig5 commented 2 months ago

I'd like to work on this.

cgillum commented 2 months ago

@davidmrdavid you worked on a solution for this recently, right? My memory is fuzzy, but is your implementation broadly available and would it qualify for closing this issue?

davidmrdavid commented 2 months ago

@cgillum: I have an implementation for DTFx.AS v1, that's been recently superceded by @bachuv's re-implementation in DTFx.AS v2. My implementation has only been made available in private packages and has not been merged yet. The main blocker for @bachuv's PR today is for us to follow up on the action items we discussion after our internal design review: mostly removing some changes in DTFx.Core, and such.

I'm definitely interested in getting this moving again, but the group has been busy. I think at the very minimum what we should do is to in the open the remaining items for @bachuv's PR, and if the community wants to contribute those items, then I think we'd welcome it. @sig5 if you ping this thread again later this week, I can make sure to collect those remaining work items in case those are of interest/ you'd like to contribute to them. Does that sound like a good plan?

sig5 commented 2 months ago

Yes, that sounds good @davidmrdavid ! Looking forward to contribute 😺.

davidmrdavid commented 1 month ago

Just linking the poison message handling PR here: https://github.com/Azure/durabletask/pull/1130

I know I owe you some details here, @sig5. Let me try to first get some clarity on the next steps, it's unclear to me if someone from the core team is already working on this.