Closed: frol closed this issue 7 years ago
Reading the implementation, I realized that it works, but it works not the way I expected :)
I assumed that `fail_delay` (with `requeue=False`) is the delay between retry attempts, but it turns out that this delay is the time before the task gets discarded: there is no delay between retries, and there is no way to steal the task to another node.

Is there a way to implement rejection/task stealing instead of discarding the failing tasks? In that case, the retry logic would be performed with a `fail_delay` pause between attempts.
I am not an expert in RabbitMQ, but I think there is no way to implement a global countdown for the retry counter. However, our use case implies that we never want to lose even a single task, so we are OK with having infinite retries.
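To make the expectation above concrete, here is a minimal pure-Python sketch of the behavior the report assumed (this is NOT what Kuyruk actually does): `fail_delay` interpreted as a pause between retry attempts, retrying indefinitely so no task is ever lost.

```python
import time

def retry_forever(func, fail_delay):
    """Re-run func until it succeeds, sleeping fail_delay seconds
    between attempts (the semantics the report expected)."""
    while True:
        try:
            return func()
        except Exception:
            time.sleep(fail_delay)  # expected: delay between attempts

# Example: a function that fails twice, then succeeds.
attempts = []
def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("transient failure")
    return "done"
```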
@cenkalti Any thoughts on this matter?
Hi @frol. Thanks for the detailed report. I haven't had a chance to take a look at this issue yet. I will do it as soon as I find some time. Sorry for the late reply.
Hi @frol. Sorry for the loooong delay. I would like to clarify the `retry`, `fail_delay` and `reject_delay` parameters first:

- `retry` logic runs only in a single worker. It just makes the worker run the same function again and again until the retry counter reaches zero or the function returns successfully. That's all.
- `reject_delay` is the number of seconds to wait before sending a rejected task back to its original queue. It is considered when the task function raises `kuyruk.exceptions.Reject`. We need it in this scenario: suppose you have a task that raises `Reject` every time (due to a permanent error or an unreachable external service). Without `reject_delay`, the task is sent back to the queue immediately and then consumed again; the task goes back and forth between the worker and the queue and burns CPU cycles. Because of this, we added the delay to keep the task in the worker for some time before sending it back to the queue.
- `fail_delay` was added to be consistent on parameters, but now I see that it is useless. I will remove it.

I guess a global retry counter could be implemented in RabbitMQ with AMQP transactions, but that would make the implementation more complex. I am against this if it is not necessary.
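The `reject_delay` behavior described above can be illustrated with a toy simulation (this is not Kuyruk's actual implementation; `Reject` is stubbed in, and the worker loop is simplified): when a task raises `Reject`, the worker waits `reject_delay` seconds before putting the task back on its queue, instead of requeueing it immediately.

```python
import time
from collections import deque

class Reject(Exception):  # stand-in for kuyruk.exceptions.Reject
    pass

def worker(queue, handler, reject_delay, max_iterations):
    """Consume tasks from queue; on Reject, hold the task in the
    worker for reject_delay seconds, then requeue it."""
    processed = 0
    for _ in range(max_iterations):
        if not queue:
            break
        task = queue.popleft()
        try:
            handler(task)
            processed += 1
        except Reject:
            time.sleep(reject_delay)  # keep task in worker before requeue
            queue.append(task)        # send it back to its original queue
    return processed

# Example: task "a" is rejected once, then succeeds on the second try.
queue = deque(["a", "b"])
seen = []
def handler(task):
    seen.append(task)
    if task == "a" and seen.count("a") == 1:
        raise Reject
```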
In Kuyruk, failed tasks are not requeued. If you don't want to lose any task, you can use a wrapper function that catches exceptions and raises `Reject`.
Please let me know if you have other questions.
Hmm, `fail_delay` was actually useful, and replacing it with a hardcoded `0` doesn't seem to bring any benefit. In fact, we used `fail_delay=sys.maxsize` to simply hold the task until we restart the server (which we will do anyway, to release a fix for the bug that caused the crash).
The idea is interesting. However, I don't recommend doing this because the task still consumes memory on the RabbitMQ server. If the workers never get restarted, such tasks will accumulate indefinitely.
@cenkalti We have just realized that the implementation for #49 is buggy.
Here is a simple reproduction:
Here is how I run it:
And here is the output (notice that all the retries happen immediately with no delays, and the traceback is printed only at the very end, after the retries are exceeded):
/cc @khorolets