petersilva opened 1 year ago
I implemented retryEmptyBeforeExit for the post retry queue, see https://github.com/MetPX/sarracenia/commit/102620a21f856687ade8f9c39ecf0923af86df4e. This might be a similar case: should we just augment that to include download_retry as well (use the same setting to cover both)?
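A minimal sketch of what a combined drain-before-exit check might look like, assuming the queue depths and the retry pass are exposed as callables. All names and parameters here are illustrative, not the actual sarracenia API:

```python
import time

def drain_retries_before_exit(post_retry_qty, download_retry_qty, do_one_pass,
                              retry_empty_before_exit=True, pause=1.0):
    """
    Sketch: keep working both retry queues before exiting when a combined
    retryEmptyBeforeExit setting is on.

    post_retry_qty / download_retry_qty: callables returning current queue depths.
    do_one_pass: callable that performs one retry-processing pass.
    Hypothetical names, not the real sarracenia interfaces.
    """
    if not retry_empty_before_exit:
        return  # current behaviour: exit immediately, leaving queued retries on disk

    while post_retry_qty() > 0 or download_retry_qty() > 0:
        do_one_pass()       # work through whatever is queued for retry
        time.sleep(pause)   # avoid a busy loop between passes
```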
background #620
Another thing we could do: the housekeeping code for the retry module could look at the retry state files on disk.
If it finds instance files for instances beyond the currently configured number, it could add those items to its own retry list. This could be done by each instance on a modulo-instances basis, e.g. if instances is 4, then instance 1 will look for instance 5, instance 2 will look for 6, etc...
When it finds a higher instance file, it then checks the next modulo: instance 1 checks for 5; if it finds an instance 5 retry file, it then looks for instance 9, and so on (sketched below). It processes files from the high numbers back down to the low ones.
also this would solve the problem in all cases, not just cloud.
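A rough sketch of the modulo scan, assuming a hypothetical retry-file naming pattern (the real file names in the state directory differ):

```python
import os

def orphaned_retry_files(state_dir, base_name, my_instance, instances):
    """
    Modulo scan sketch: with 'instances' configured, instance i looks for
    leftover retry files of instances i+instances, i+2*instances, ...
    stopping at the first gap, and returns them highest-numbered first.
    The file naming pattern here is illustrative only.
    """
    found = []
    n = my_instance + instances
    candidate = os.path.join(state_dir, f"{base_name}_{n:02d}.retry")
    while os.path.exists(candidate):
        found.append(candidate)
        n += instances
        candidate = os.path.join(state_dir, f"{base_name}_{n:02d}.retry")
    # process from the high numbers back down to the low ones
    return list(reversed(found))
```

With instances=4, instance 1 would look for 5, then 9, then 13, ... and instance 2 for 6, 10, 14, ... mirroring the description above.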
It should be really easy to add the download retry check before exiting as well. retry.py reports the number of messages in each queue to the flow.
I like the modulo idea! Instances can shut down without having to wait and the retry queues still get dealt with.
the only routine needing adjustment is on_housekeeping(), and all it has to do is look for the higher numbered retry files, and catenate them to the lower numbered ones before calling the normal housekeeping routine...
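A sketch of what that catenation step might look like in a housekeeping pass, reusing orphaned_retry_files() from the sketch above; locking and atomic renames, which real code would need, are omitted:

```python
import os

def adopt_orphaned_retries(my_retry_path, orphan_paths):
    """
    Sketch: append the contents of higher-numbered (orphaned) retry files
    onto this instance's own retry file, then remove the orphans, so the
    normal housekeeping pass picks the entries up on its next run.
    """
    with open(my_retry_path, "ab") as mine:
        for path in orphan_paths:
            with open(path, "rb") as orphan:
                mine.write(orphan.read())   # catenate orphan entries onto ours
            os.unlink(path)                 # orphan file has been absorbed
```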
The thing with forcing the retry queue to empty is that the usual reason for retries is that a remote server is down, so forcing the retries to empty means waiting until the remote server is back in service, which could take a very long time. It doesn't seem like a practical option in real life.
how does the instance setting reduction get propagated? ... likely have to do a restart cycle? or... have a SIGHUP handler (the linux/unix standard way to have daemons re-read their configurations).
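For the SIGHUP option, the standard unix pattern is just a signal handler that triggers a config re-read; a minimal sketch, where reload_config is a placeholder for whatever re-parses the configuration (not an existing sarracenia function):

```python
import signal

def install_sighup_reload(reload_config):
    """
    Standard unix daemon pattern: re-read the configuration on SIGHUP
    instead of requiring a full restart cycle. reload_config is a
    placeholder callable.
    """
    def on_sighup(signum, frame):
        reload_config()   # re-parse the config; adjust instances accordingly

    signal.signal(signal.SIGHUP, on_sighup)
```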
Also if docker is launching the instances... we don't know how many there are... every time we launch, the "instances" number will increase.... so it is a varying thing....
So the .pid file would be renamed rather than removed? 01.pid, 02.pid ... so the pid files contain the pid... could they contain keywords instead? "stopped", "retired" ...
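If the pid files were allowed to hold a keyword as well as a pid, reading one might look like this (the keywords are illustrative):

```python
def read_pid_file(path):
    """
    Sketch of a .pid file that may hold either a numeric pid or a status
    keyword such as "stopped" or "retired".
    Returns (pid, status): pid is None when the file holds a keyword.
    """
    with open(path) as f:
        content = f.read().strip()
    if content.isdigit():
        return int(content), "running"
    return None, content
```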
@reidsunderland @tysonkaufmann ... Another option to get around the instance count is to always use instance #1 to do this, but that means that instance 1 has to take over all the retries of all "retired" instances.
note: so... in the dev meeting, the consensus was that this is a blocker... but it's weird because:
* we already have this problem currently... it's an issue we have lived with since day 1, and not any form of regression.
* this is a fairly rare edge case: it requires the combination of retries and a reduction of instances. In the normal course of affairs, retries should be quite rare (only when a remote server is down), and a reduction of instances is a management intervention done in a planned way, so even rarer. The conjunction of these two does have this weakness... but making it a blocker seems unreasonable.
@tysonkaufmann is saying that because it involves data loss, it should be a high priority... not wrong... but very, very rare...
shuffling to in progress.
For #895 ... there is the concept of storing bindings in the state directory (.cache). One could also store the instances setting: on startup, compare the config setting to the cached one, and kill the extras, if any.
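A rough sketch of that startup comparison, assuming the cached instance count is available and a hypothetical instance_NN.pid naming (not the actual sarracenia layout):

```python
import os
import signal

def stop_extra_instances(state_dir, configured_instances, cached_instances):
    """
    Sketch for the #895 idea: if the previously cached instance count is
    larger than the currently configured one, ask the extra instances to
    stop. The .pid file naming here is illustrative.
    """
    for i in range(configured_instances + 1, cached_instances + 1):
        pid_file = os.path.join(state_dir, f"instance_{i:02d}.pid")
        if not os.path.exists(pid_file):
            continue
        with open(pid_file) as f:
            pid = int(f.read().strip())
        os.kill(pid, signal.SIGTERM)   # request a graceful stop of the extra instance
```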
if you have 10 instances configured, and there are some retries outstanding, but the load has dropped significantly, you might want to reduce the number of active instances, say down to 5. so:
Any retries queued by instances 6, 7, 8, 9 and 10 will never be picked up. These are called "orphaned retries" for the purposes of this discussion.
For safe reduction of instances, we need to either avoid orphaned retries, or refuse to stop until the retry queue is empty. Exploring options for how to do that.