MetPX / sarracenia

https://MetPX.github.io/sarracenia
GNU General Public License v2.0
46 stars 22 forks source link

flakey_flow fails with 12 or 24 messages unacknowledged. #1132

Open petersilva opened 4 months ago

petersilva commented 4 months ago

the flakey_tests on github often fail with 12 or 24 messages unacked, but are otherwise correct. It seems to be a problem with:

so after reading a batch of messages, we then pause until the next scheduled polling interval... then we continue processing and eventually ack them.

It would seem better to acknolwedge them before we pause.

petersilva commented 4 months ago

Screenshot 2024-07-18 172550

petersilva commented 4 months ago

tagging it harmless because there does not seem to be any data loss or actual real-life problem that would result from this. This is an admittedly sub-optimal behaviour, but it results from an obscure condition and torture test that should not cause anything other than display issues in operations.

reidsunderland commented 2 months ago

I think I found the problem.

image

The messages just go back and forth between ready and unacked because the poll is receiving them and rejecting them. But when the poll is in poll_catching_up state, rejected messages don't get acked.

https://github.com/MetPX/sarracenia/blob/5482e4922c7c78fea5414daa1865e6f5f0f3e6c0/sarracenia/flow/__init__.py#L603-L606

And because those messages never go away, there's always incoming messages, so poll_catching_up just stays True.

reidsunderland commented 1 month ago

I didn't update this discussion. After adding the ack of worklist.rejected in 2f03697, the flakey test was still inconsistent, but since it's been too long since I actually tested it, I forget what was still wrong. :(

After that didn't work 100% of the time, I made the change in 32f3d56.

That change: 1) always does work, even if the poll is in catching up state. 2) work is the normal place where worklist.incoming, ok and rejected get acked, so since it always does work now, the special case (just for polls when catching up) of acking the worklists wasn't needed inside the run loop.

This change eliminated duplicate posts from the poll, and if I remember right also fixed the unacked messages issue. So I think it's at least partially a good change and worth working on some more.

But it's not ready yet, I think it's because the polls would occasionally not post some messages and I'm not quite sure why.


Whether or not my change in 32f3d56 is good, we should ack worklist.rejected to eliminate the unacked messages. Maybe 2f03697 + @petersilva's changes to the flakey test in https://github.com/MetPX/sr_insects/pull/63 are enough to get consistent results.