MetPX / sarracenia

https://MetPX.github.io/sarracenia

Deal with orphaned retries when instances are decreased #618

Open petersilva opened 1 year ago

petersilva commented 1 year ago

If you have 10 instances configured and some retries are outstanding, but the load has dropped significantly, you might want to reduce the number of active instances, so:

    sr3 stop subscribe/qq
    sr3 edit subscribe/qq # change instances from 10 to 5
    sr3 start subscribe/qq

Any retries queued by instances 6, 7, 8, 9 and 10 will never be picked up. These are called "orphaned retries" for the purposes of this discussion.

For a safe reduction of instances, we need to either avoid orphaned retries or refuse to stop until the retry queue is empty. Exploring options for how to do that.

reidsunderland commented 1 year ago

I implemented retryEmptyBeforeExit for the post retry queue, see https://github.com/MetPX/sarracenia/commit/102620a21f856687ade8f9c39ecf0923af86df4e. This might be similar.
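
For reference, that option is enabled per configuration; a subscribe config could contain something like the line below (the option name is taken from the commit above; the exact spelling should be checked against the sr3 options documentation):

    retryEmptyBeforeExit True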

petersilva commented 1 year ago

Should we just augment that to include download_retry as well (using the same setting to cover both)?

petersilva commented 1 year ago

background #620

petersilva commented 1 year ago

Another thing we could do: the housekeeping code for the retry module could look at the retry state files left on disk by other instances.

If it finds instance files belonging to instances beyond the currently configured number, then it could add those items to its own retry list. This could be done by each instance on a modulo-of-instances basis, e.g. if instances is 4, then instance 1 will look for instance 5, instance 2 will look for 6, etc.

When it finds a higher instance file, it then checks the next modulo: instance 1 checks for 5, and if it finds an instance 5 retry file, it then looks for instance 9. It processes files from the high numbers back down to the low ones.
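
A rough sketch of that modulo scan in Python (the file naming and state-directory layout here are illustrative assumptions, not sarracenia's actual retry file names):

    import os

    def orphaned_retry_files(state_dir, my_no, configured,
                             component='subscribe', config='qq'):
        """Return retry state files left behind by instances above the
        configured count that this instance is responsible for, highest
        instance number first.

        With `configured` instances, instance my_no checks my_no + configured,
        then my_no + 2*configured, and so on, so the orphans are divided among
        the running instances (instance 1 checks 5, 9, ...; instance 2 checks
        6, 10, ...).
        """
        found = []
        candidate = my_no + configured
        while True:
            # hypothetical naming convention for per-instance retry state files
            path = os.path.join(state_dir,
                                f"{component}_{config}_{candidate:02d}.retry")
            if not os.path.exists(path):
                break                    # stop at the first missing modulo slot
            found.append(path)
            candidate += configured      # next modulo: 1 -> 5 -> 9 -> ...
        return list(reversed(found))     # process high numbers back down to low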

petersilva commented 1 year ago

Also, this would solve the problem in all cases, not just in the cloud.

reidsunderland commented 1 year ago

It should be really easy to add the download retry check before exiting as well. retry.py reports the number of messages in each queue to flow:

https://github.com/MetPX/sarracenia/blob/7a89be129870532ed7815bed80e10d47b5338b0b/sarracenia/flowcb/retry.py#L105-L111

I like the modulo idea! Instances can shut down without having to wait and the retry queues still get dealt with.

petersilva commented 1 year ago

The only routine needing adjustment is on_housekeeping(), and all it has to do is look for the higher-numbered retry files and concatenate them onto the lower-numbered ones before calling the normal housekeeping routine...
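
A minimal sketch of that adoption step, assuming retry state files are append-able line-per-message files and reusing the hypothetical orphaned_retry_files() helper from the earlier comment (the real on_housekeeping() in sarracenia/flowcb/retry.py is more involved):

    import os

    def adopt_orphaned_retries(my_retry_file, orphan_files):
        """Append the contents of orphaned retry files onto this instance's
        own retry file, then remove the orphans so they are adopted only once."""
        with open(my_retry_file, 'ab') as dst:
            for orphan in orphan_files:
                with open(orphan, 'rb') as src:
                    dst.write(src.read())   # concatenate orphan onto our own file
                os.unlink(orphan)           # this instance now owns those retries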

petersilva commented 1 year ago

The thing with forcing an empty retry queue is that the usual reason for retries is that a remote server is down, so forcing the retries to empty means waiting until the remote server is back in service, which could take a very long time. It doesn't seem like a practical option in real life.

petersilva commented 1 year ago

How does the instance setting reduction get propagated? We would likely have to do a restart cycle... or have a SIGHUP handler (it is a Linux/Unix standard for daemons to re-read their configuration on SIGHUP).
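
For the SIGHUP option, the usual Unix pattern in Python looks roughly like this (a generic sketch; where sarracenia would hook this in is not decided here):

    import signal

    def on_sighup(signum, frame):
        # re-read the configuration file; a changed `instances` value could be
        # noticed here and surplus instances told to retire
        print("SIGHUP received, re-reading configuration")

    signal.signal(signal.SIGHUP, on_sighup)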

Also, if Docker is launching the instances, we don't know how many there are... every time we launch, the "instances" number will increase, so it is a varying thing.

So would the .pid file be renamed rather than removed? The pid files (01.pid, 02.pid, ...) contain the pid... could they contain keywords instead, such as "stopped" or "retired"?

@reidsunderland :

@tysonkaufmann ...Another option to get around the instance count is to always use instance #1 to do this, but that means that instance 1 has to take over all the retries of all "retired" instances.

note:

petersilva commented 1 year ago

So... in the dev meeting, the consensus was that this is a blocker... but it's weird because:

  1. We already have this problem currently... it's an issue we have lived with since day 1, and not any form of regression.

  2. This is a fairly rare edge case: it requires the combination of retries and a reduction of instances. In the normal course of affairs, retries should be quite rare (only when a remote server is down), and a reduction of instances is a management intervention done in a planned way, so even rarer. The conjunction of these two does have this weakness... but making it a blocker seems unreasonable.

petersilva commented 1 year ago

@tysonkaufmann is saying that because it involves data loss, it should be a high priority... not wrong... but very, very rare...

Shuffling to in progress.

petersilva commented 8 months ago

For #895 ... there is the concept of storing bindings in the state directory (.cache); one could also store the instances setting there. On startup, compare the config setting to the cached one, and kill the extras, if any.
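
A sketch of that startup check, with hypothetical names for the cached-value file and the pid files (the actual #895 state-directory layout may differ):

    import os
    import signal

    def kill_extra_instances(state_dir, configured):
        """Compare the configured instance count with the count cached in the
        state directory and terminate any instances above the new count."""
        cache = os.path.join(state_dir, 'instances')           # hypothetical cache file
        previous = configured
        if os.path.exists(cache):
            previous = int(open(cache).read().strip())

        for i in range(configured + 1, previous + 1):
            pidfile = os.path.join(state_dir, f"{i:02d}.pid")  # 01.pid, 02.pid, ...
            if os.path.exists(pidfile):
                pid = int(open(pidfile).read().strip())
                os.kill(pid, signal.SIGTERM)                   # stop the retired instance

        with open(cache, 'w') as f:                            # remember the new count
            f.write(str(configured))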