I've seen this happen as well. I think we need to handle this better and find a way to detect whether the work item is still being worked on and not stuck. Would you mind submitting a unit test for this to the base repo? I've seen this happen with the in-memory queues as well.
@ejsmith I just reviewed the fix in order to port it to the Redis queue (which my original issue referred to), and I have some concerns about it:
Specifically, in InMemoryQueue, even if the item state is updated to abandoned before it is removed from _dequeued, the user flow might pass the "if abandoned" check and then execution could switch to the maintenance task, which produces the same exception through a different race.
I couldn't think of any solution to the basic race problem this flow creates: a call to AbandonAsync in the StartWorkingImpl catch clause might "legitimately" fail with an exception. My suggestion is to add an inner try/catch inside the StartWorkingImpl catch clause, as sketched below. What do you think?
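Roughly what I have in mind, as a simplified sketch written from memory against the IQueue<T>/IQueueEntry<T> interfaces (the loop shape, names, and comments are placeholders, not the actual StartWorkingImpl code):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Foundatio.Queues;

public static class WorkerLoopSketch
{
    // Simplified worker loop with the inner try/catch I'm suggesting.
    public static async Task RunAsync<T>(
        IQueue<T> queue,
        Func<IQueueEntry<T>, Task> handler,
        CancellationToken cancellationToken) where T : class
    {
        while (!cancellationToken.IsCancellationRequested)
        {
            var entry = await queue.DequeueAsync(cancellationToken);
            if (entry == null)
                continue;

            try
            {
                await handler(entry);
                await entry.CompleteAsync();
            }
            catch (Exception)
            {
                if (!entry.IsCompleted && !entry.IsAbandoned)
                {
                    try
                    {
                        // AbandonAsync can legitimately fail here: maintenance may have
                        // auto-abandoned the entry between the flag check above and this
                        // call, so swallow the failure instead of killing the loop.
                        await entry.AbandonAsync();
                    }
                    catch (Exception)
                    {
                        // Log and keep the loop alive; maintenance already handled the entry.
                    }
                }
            }
        }
    }
}
```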
@Seadoo2017 finally got the tests working and merged the changes in. I think your concerns should be addressed. What do you think?
@Seadoo2017 there is a new 10.0.0-beta8 on nuget if you want to try it.
@ejsmith https://github.com/FoundatioFx/Foundatio/commit/43d975b8dd7f28034291681f3ed3cb2bbdbeb03f
does not resolve the issue I raised, neither for MemoryQueue nor for RedisQueue, as I explained here: https://github.com/FoundatioFx/Foundatio/issues/239#issuecomment-666264565
The fix does make the unit test that reproduces this issue pass, but the issue itself isn't solved and can still happen. It's much harder to reproduce in a unit test now, but the fundamental problem still exists and will show up randomly at runtime.
@Seadoo2017 I made some more changes to the in-memory queue. I don't think we can have a race condition, because only one of the TryRemove operations in the complete or abandon calls is going to succeed. Take a look and tell me what you think.
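The idea, stripped down to a sketch (not the actual InMemoryQueue code; the value type, exception, and method shapes are simplified):

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Stripped-down sketch: TryRemove on the dequeued dictionary is the single atomic
// "winner" check, so only one of complete/abandon/maintenance acts on an entry.
public class TinyInMemoryQueue
{
    private readonly ConcurrentDictionary<string, object> _dequeued =
        new ConcurrentDictionary<string, object>();

    public Task CompleteAsync(string id)
    {
        // Whoever removes the id first wins; every later caller sees false here.
        if (!_dequeued.TryRemove(id, out _))
            throw new InvalidOperationException($"Queue entry {id} was already completed or abandoned.");

        // ... mark completed, update stats ...
        return Task.CompletedTask;
    }

    public Task AbandonAsync(string id)
    {
        if (!_dequeued.TryRemove(id, out _))
            throw new InvalidOperationException($"Queue entry {id} was already completed or abandoned.");

        // ... re-enqueue or dead-letter the entry ...
        return Task.CompletedTask;
    }
}
```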
@ejsmith I just pushed a new pull request (https://github.com/FoundatioFx/Foundatio/pull/250) which demonstrates that the issue still exists. The test is a bit ugly (sorry :/) because the timing that triggers the issue is much harder to reproduce in a test, but it happens in real life as well. Please note that my test demonstrates one example of timing that reproduces the issue; as I explained above, there are more scenarios.
The fundamental issue is that I don't think we can trust the queue entry's flags inside the catch clause of the StartWorkingImpl method. I believe there is no way around an inner try/catch.
Another important note: this issue was first observed in RedisQueue; I only demonstrated it on MemoryQueue because that's what was asked for. In RedisQueue it happens even more frequently, since maintenance can run on a different machine than the one completing the item right at the timeout boundary.
@Seadoo2017 I just committed a change that fixed your test.
@ejsmith Thanks! Do you want me to create a pull request for porting the fix to RedisQueue?
Yes that would be really helpful. Thank you!
@Seadoo2017 I just committed changes to the Redis queue to match the behavior of the in-memory queue. We will be putting out new releases. Would love to hear your friend's thoughts on the cluster support issue.
@ejsmith Great, thanks!
Regarding RedisQueue on a cluster: when executing a Lua script, we must pass the keys it uses in the evaluate API call so the client knows which shard to run the script on. In addition, all keys used in the script must live on the same shard or the call will fail (the key names used in the script must share a { } hash tag prefix to force them onto the same shard). We also can't pre-load the script the way it's done in the cache implementation, since it can run on any of the shards depending on the actual keys passed when the script is called. We already have a working, tested fix for this, but our code contains more changes (we implemented a priority Redis queue). I will ask my friend to port the fix to RedisQueue and create a PR; he is very busy this week, but I hope he'll find time next week.
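To make that concrete, here is roughly what the call looks like with StackExchange.Redis; the key names, script body, and helper class are just an example of the pattern, not the actual Foundatio scripts:

```csharp
using System.Threading.Tasks;
using StackExchange.Redis;

public static class ClusterScriptExample
{
    // Example only: the {q:<name>} hash tag forces all of these keys onto the
    // same cluster shard, so one EVAL can touch all of them.
    private const string Script = @"
        redis.call('ZREM', KEYS[1], ARGV[1])
        redis.call('LREM', KEYS[2], 1, ARGV[1])
        return redis.call('DEL', KEYS[3])";

    public static async Task<long> CompleteAsync(IDatabase db, string queueName, string entryId)
    {
        // The keys are passed to ScriptEvaluateAsync (not hard-coded in the script text)
        // so the client can route the EVAL to the shard that owns their hash slot.
        var keys = new RedisKey[]
        {
            $"{{q:{queueName}}}:work",
            $"{{q:{queueName}}}:wait",
            $"{{q:{queueName}}}:{entryId}"
        };

        // Note: the script text is sent on each call instead of being pre-loaded with
        // SCRIPT LOAD, because on a cluster it may execute on any shard.
        var result = await db.ScriptEvaluateAsync(Script, keys, new RedisValue[] { entryId });
        return (long)result;
    }
}
```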
So you have Foundatio queues working on a Redis cluster setup? We haven't really tried it, but the last time we did, it seemed like there were a bunch of issues with it.
It's a new setup, so it's too early to say; so far we've encountered a few issues:
So far it seems to work :)
Consider the following scenario:
RedisQueue.CompleteAsync(IQueueEntry entry) will throw an exception (as expected) here:
which will be handled here:
Then, inside the catch, the following condition will be true (the work item is the original, non-abandoned item that we got when the handler was called):
The call to AbandonAsync is going to throw an unhandled exception and kill the task loop, because the following transaction will fail (the item was already auto-abandoned):
Consider adding a try/catch inside the original catch, or changing the catch logic? Thanks.
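To summarize the above in code, here is a rough sketch of the failing flow with placeholder names written from memory (not the actual Foundatio code):

```csharp
using System;
using System.Threading.Tasks;
using Foundatio.Queues;

public class WorkItem { }

public static class FailingFlowSketch
{
    public static async Task ProcessAsync(IQueueEntry<WorkItem> entry, Func<IQueueEntry<WorkItem>, Task> handler)
    {
        try
        {
            await handler(entry);
            // 1. Throws, because maintenance has already auto-abandoned the entry.
            await entry.CompleteAsync();
        }
        catch (Exception)
        {
            // 2. The entry's flags still look like they did when the handler was called,
            //    so this check passes...
            if (!entry.IsAbandoned)
            {
                // 3. ...and this throws as well (the underlying transaction fails because
                //    the item was auto-abandoned). Nothing catches it, so the task loop dies.
                await entry.AbandonAsync();
            }
        }
    }
}
```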