[RedisQueue] StartWorkingImpl worker loop throws unhandled exception in task which prevents further queue processing

Seadoo2017 commented 4 years ago

Consider the following scenario

StartWorkingImpl loop: Item is dequeued and user handle function is called with the queue entry
User handling function takes more time than queue WorkItemTimeout so the queue entry is "auto abandoned" (while in processing)
User handling function finish and calls CompleteAsync on the queue entry

RedisQueue::CompleteAsync(IQueueEntry entry) will throw exception (as expected) here:

    long result = await Run.WithRetriesAsync(() => Database.ListRemoveAsync(_workListName, entry.Id), logger: 
    _logger).AnyContext();
    if (result == 0)
        throw new InvalidOperationException("Queue entry not in work list, it may have been auto abandoned.");

which will be handled here:

                 try
                {
                    await handler(queueEntry, linkedCancellationToken.Token).AnyContext();
                    if (autoComplete && !queueEntry.IsAbandoned && !queueEntry.IsCompleted)
                        await queueEntry.CompleteAsync().AnyContext();
                }
                catch (Exception ex)
                {
                        Interlocked.Increment(ref _workerErrorCount);
                        if (_logger.IsEnabled(LogLevel.Error))
                            _logger.LogError(ex, "Worker error: {Message}", ex.Message);
                        if (!queueEntry.IsAbandoned && !queueEntry.IsCompleted)
                            await queueEntry.AbandonAsync().AnyContext();
                }

Then, inside the catch, the following condition will be true: (the work item is the original non abandoned item which we got when the handler was called)
```
if (!queueEntry.IsAbandoned && !queueEntry.IsCompleted)
           await queueEntry.AbandonAsync().AnyContext();
```

The call to AbandonAsync is going to throw the unhandled exception and kill the task loop, because the following transaction will fail (the item was auto abandoned...):

        var tx = Database.CreateTransaction();
        tx.AddCondition(Condition.KeyExists(GetRenewedTimeKey(entry.Id)));
        tx.ListRemoveAsync(_workListName, entry.Id);
        //tx.ListLeftPushAsync(_queueListName, entry.Id);
        tx.SortedSetAddAsync(_queueListName, entry.Id, entry.Value.Priority);
        tx.KeyDeleteAsync(GetRenewedTimeKey(entry.Id));
        bool success = await Run.WithRetriesAsync(() => tx.ExecuteAsync(), logger: _logger).AnyContext();
        if (!success)
            throw new InvalidOperationException("Queue entry not in work list, it may have been auto abandoned.");

Consider adding try / catch to the original catch? change catch logic? Thanks.

niemyjski commented 4 years ago

I've seen this happen as well. I think we need to better handle this but find a way to detect if the work item is still being worked on and not stuck. Would you mind submitting a unit test for this to the base repo. I've seen this happen to the in memory queues as well.

Seadoo2017 commented 4 years ago

@ejsmith I just reviewed the fix to port it to Redis queue (which my original issue referred to) and I have some concerns regarding the fix:

What happens when the maintenance is running in a different process than the dequeued item? the solution of having the same item instance won't help
Even with same instance item, I believe the issue can still happen but it's much harder to reproduce now, consider the following scenario: InMemoryQueue:AbandonAsync is called (in timer context which might run in parallel to the user handling function _dequeued.TryRemove is done and although entry instance is now the same, before we get to the entry.MarkAbandoned() execution, our user handling function is completed and will cause the same error (item not marked as abandoned yet and not in _dequeued list) - queue will stop handling items due to unhandled exception.

Specifically in InMemoryQueue, even if you try to update the item state to abandoned before removing from _dequeued, we might pass the "if abandoned" check in the user flow and then switch execution to the maintenance which will result the same exception in different race flow.

I couldn't think of a any solution to the basic race problem that this flow generates - a call to AbandonAsync in StartWorkingImpl catch clause might "legitimately" fail with exception. My suggestion is having inner try/catch in StartWorkingImpl catch clause. What do you think?

ejsmith commented 4 years ago

@Seadoo2017 finally got the tests working and merged the changes in. I think your concerns should be addressed. What do you think?

ejsmith commented 4 years ago

@Seadoo2017 there is a new 10.0.0-beta8 on nuget if you want to try it.

Seadoo2017 commented 4 years ago

@ejsmith https://github.com/FoundatioFx/Foundatio/commit/43d975b8dd7f28034291681f3ed3cb2bbdbeb03f

Does not resolve the issue I raised for MemoryQueue and not for RedisQueue as I explained here: https://github.com/FoundatioFx/Foundatio/issues/239#issuecomment-666264565

The fix does make the unit test which reproduce this issue pass, but the issue itself isn't solved and can still happen. It's much harder to reproduce in unit test now but the fundamental issue still exists and will happen during runs randomly.

ejsmith commented 4 years ago

@Seadoo2017 I made some more changes to the in memory queue. I don't think we can have a race condition because only 1 of the TryRemove operations in the complete or abandon calls is going to succeed. Take a look and tell me what you think.

Seadoo2017 commented 4 years ago

@ejsmith I just pushed a new pull request (https://github.com/FoundatioFx/Foundatio/pull/250) which demonstrates that the issue still exists. The test is a bit ugly (sorry) :/ because it's much harder to reproduce in test the timing which causes the issue but it happens in real life as well. Please note that my test demonstrates one example of timing issue which reproduce the issue, but as I explained above, there are more scenarios.

The fundamental issue is that I don't think we can assume the queue entry flags validity on the catch clause of the StartWorkingImpl method. I believe there is no way around having inner try/catch.

Another important note: this issue was first observed in RedisQueue, I just demonstrated it on MemoryQueue as was asked. In RedisQueue it happens even more frequent as maintenance can run on different machine then the item being completed on marginal time.

ejsmith commented 4 years ago

@Seadoo2017 I just committed a change that fixed your test.

Seadoo2017 commented 4 years ago

@ejsmith Thanks! Do you want me to create a pull request for porting the fix to RedisQueue?

ejsmith commented 4 years ago

Yes that would be really helpful. Thank you!

ejsmith commented 4 years ago

@Seadoo2017 I just committed changes to the Redis queue to match the behavior of the in memory queue. We will be putting out new releases. Would love to see what your friends thoughts on the cluster support issue are.

Seadoo2017 commented 4 years ago

@ejsmith Great, thanks!

Regarding RedisQueue cluster when executing LUA script, we must provide the keys used in the evaluate api call so it will know on which shard to run the script, in addition, all keys used in the script must be on the same shard or it will fail. (The names of the keys used in the script must have prefix with { } to force them on the same shard) In addition, we can't pre-load the script as done in Cache as it can run on any of the shards depending on the actual keys used when calling the script. We already have a working tested fix for it but our code has more changes (we implemented priority Redis queue). I will ask my friend to port the fix to the RedisQueue and create PR, he is very busy this week, I hope he'll find time next week.

ejsmith commented 4 years ago

So you have Foundatio queues working on a Redis cluster setup? We haven't tried it, but last time we did it seemed like there was a bunch of issues with it.

Seadoo2017 commented 4 years ago

It's a new setup so it's early to say, so far we encountered few issues:

My fix for (https://github.com/FoundatioFx/Foundatio.Redis/issues/64) is incompatible with cluster, we fixed this issue
Queue statistics (IMetricsClientStats.GetQueueStatsAsync) had some issues, we worked around it
RedisCacheClient has some methods which are not compatible with cluster but we are not using them at the moment

So far it seems to work :)

FoundatioFx / Foundatio

[RedisQueue] StartWorkingImpl worker loop throws unhandled exception in task which prevents further queue processing #239