MetPX / sarracenia

https://MetPX.github.io/sarracenia

Deal with orphaned retries when instances are decreased #618

Open petersilva opened 1 year ago

petersilva commented 1 year ago

If you have 10 instances configured and some retries are outstanding, but the load has dropped significantly, you might want to reduce the number of active instances, so:

    sr3 stop subscribe/qq
    sr3 edit subscribe/qq # change instances from 10 to 5
    sr3 start subscribe/qq

Any retries queued by instances 6, 7, 8, 9 and 10 will never be picked up. These are called "orphaned retries" for the purposes of this discussion.

For a safe reduction of instances, we need to either avoid orphaned retries or refuse to stop until the retry queue is empty. Exploring options for how to do that.

reidsunderland commented 1 year ago

I implemented retryEmptyBeforeExit for the post retry queue, see https://github.com/MetPX/sarracenia/commit/102620a21f856687ade8f9c39ecf0923af86df4e. This might be similar.
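
For reference, that option is enabled per configuration; a subscribe config could contain something like the line below (the option name is taken from the commit above; the exact spelling should be checked against the sr3 options documentation):

    retryEmptyBeforeExit True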

petersilva commented 1 year ago

Should we just augment that to include download_retry as well (using the same setting to cover both)?

petersilva commented 1 year ago

background #620

petersilva commented 1 year ago

Another thing we could do: the housekeeping code for the retry module could look at the retry state files left on disk by other instances.

If it finds instance files belonging to instances beyond the currently configured number, then it could add those items to its own retry list. This could be done by each instance on a modulo-of-instances basis, e.g. if instances is 4, then instance 1 will look for instance 5, instance 2 will look for 6, etc.

When it finds a higher instance file, it then checks the next modulo: instance 1 checks for 5, and if it finds an instance 5 retry file, it then looks for instance 9. It processes files from the high numbers back down to the low ones.
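
A rough sketch of that modulo scan in Python (the file naming and state-directory layout here are illustrative assumptions, not sarracenia's actual retry file names):

    import os

    def orphaned_retry_files(state_dir, my_no, configured,
                             component='subscribe', config='qq'):
        """Return retry state files left behind by instances above the
        configured count that this instance is responsible for, highest
        instance number first.

        With `configured` instances, instance my_no checks my_no + configured,
        then my_no + 2*configured, and so on, so the orphans are divided among
        the running instances (instance 1 checks 5, 9, ...; instance 2 checks
        6, 10, ...).
        """
        found = []
        candidate = my_no + configured
        while True:
            # hypothetical naming convention for per-instance retry state files
            path = os.path.join(state_dir,
                                f"{component}_{config}_{candidate:02d}.retry")
            if not os.path.exists(path):
                break                    # stop at the first missing modulo slot
            found.append(path)
            candidate += configured      # next modulo: 1 -> 5 -> 9 -> ...
        return list(reversed(found))     # process high numbers back down to low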

petersilva commented 1 year ago

Also, this would solve the problem in all cases, not just in the cloud.

reidsunderland commented 1 year ago

It should be really easy to add the download retry check before exiting as well. retry.py reports the number of messages in each queue to flow:

https://github.com/MetPX/sarracenia/blob/7a89be129870532ed7815bed80e10d47b5338b0b/sarracenia/flowcb/retry.py#L105-L111

I like the modulo idea! Instances can shut down without having to wait and the retry queues still get dealt with.

petersilva commented 1 year ago

The only routine needing adjustment is on_housekeeping(), and all it has to do is look for the higher-numbered retry files and concatenate them onto the lower-numbered ones before calling the normal housekeeping routine...
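
A minimal sketch of that adoption step, assuming retry state files are append-able line-per-message files and reusing the hypothetical orphaned_retry_files() helper from the earlier comment (the real on_housekeeping() in sarracenia/flowcb/retry.py is more involved):

    import os

    def adopt_orphaned_retries(my_retry_file, orphan_files):
        """Append the contents of orphaned retry files onto this instance's
        own retry file, then remove the orphans so they are adopted only once."""
        with open(my_retry_file, 'ab') as dst:
            for orphan in orphan_files:
                with open(orphan, 'rb') as src:
                    dst.write(src.read())   # concatenate orphan onto our own file
                os.unlink(orphan)           # this instance now owns those retries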

petersilva commented 1 year ago

The thing with forcing an empty retry queue is that the usual reason for retries is that a remote server is down, so forcing the retries to empty means waiting until the remote server is back in service, which could take a very long time. It doesn't seem like a practical option in real life.

petersilva commented 1 year ago

How does the instance setting reduction get propagated? We would likely have to do a restart cycle... or have a SIGHUP handler (it is a Linux/Unix standard for daemons to re-read their configuration on SIGHUP).
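
For the SIGHUP option, the usual Unix pattern in Python looks roughly like this (a generic sketch; where sarracenia would hook this in is not decided here):

    import signal

    def on_sighup(signum, frame):
        # re-read the configuration file; a changed `instances` value could be
        # noticed here and surplus instances told to retire
        print("SIGHUP received, re-reading configuration")

    signal.signal(signal.SIGHUP, on_sighup)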

Also, if Docker is launching the instances, we don't know how many there are... every time we launch, the "instances" number will increase, so it is a varying thing.

So would the .pid file be renamed rather than removed? The pid files (01.pid, 02.pid, ...) contain the pid... could they contain keywords instead, such as "stopped" or "retired"?

@reidsunderland :

@tysonkaufmann ...Another option to get around the instance count is to always use instance #1 to do this, but that means that instance 1 has to take over all the retries of all "retired" instances.

note:

petersilva commented 1 year ago

So... in the dev meeting, the consensus was that this is a blocker... but it's weird because:

  1. We already have this problem currently... it's an issue we have lived with since day 1, and not any form of regression.

  2. This is a fairly rare edge case: it requires the combination of retries and a reduction of instances. In the normal course of affairs, retries should be quite rare (only when a remote server is down), and a reduction of instances is a management intervention done in a planned way, so even rarer. The conjunction of these two does have this weakness... but making it a blocker seems unreasonable.

petersilva commented 1 year ago

@tysonkaufmann is saying that because it involves data loss, it should be a high priority... not wrong... but very, very rare...

Shuffling to in progress.

petersilva commented 8 months ago

For #895 ... there is the concept of storing bindings in the state directory (.cache); one could also store the instances setting there. On startup, compare the config setting to the cached one, and kill the extras, if any.
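
A sketch of that startup check, with hypothetical names for the cached-value file and the pid files (the actual #895 state-directory layout may differ):

    import os
    import signal

    def kill_extra_instances(state_dir, configured):
        """Compare the configured instance count with the count cached in the
        state directory and terminate any instances above the new count."""
        cache = os.path.join(state_dir, 'instances')           # hypothetical cache file
        previous = configured
        if os.path.exists(cache):
            previous = int(open(cache).read().strip())

        for i in range(configured + 1, previous + 1):
            pidfile = os.path.join(state_dir, f"{i:02d}.pid")  # 01.pid, 02.pid, ...
            if os.path.exists(pidfile):
                pid = int(open(pidfile).read().strip())
                os.kill(pid, signal.SIGTERM)                   # stop the retired instance

        with open(cache, 'w') as f:                            # remember the new count
            f.write(str(configured))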