Patryk27 / lxd-snapper

LXD snapshots, automated
MIT License

wish: implement a retry policy #12

Open · aivanise opened this issue 1 year ago

aivanise commented 1 year ago

Hi,

Thanks for a wonderful tool, it saved my life a couple of times already :)

I have a large(ish) cluster (6 nodes, 150+ containers), and there is always something going on: a backup, devs playing around and overloading individual nodes, upgrades, maintenance, etc. More often than not lxc times out and then the complete service "fails", like this:

Dec 06 09:02:27 lxd10.2e-systems.com lxd-snapper[30497]: -> deleting snapshot: auto-20221206-040026
Dec 06 09:02:27 lxd10.2e-systems.com lxd-snapper[30497]: error: lxc returned a non-zero status code and said:
Dec 06 09:02:27 lxd10.2e-systems.com lxd-snapper[30497]: -> [ FAILED ]
...
Dec 06 09:03:37 lxd10.2e-systems.com lxd-snapper[30497]: Error: Some instances couldn't be pruned

In some cases this is a problem: a snapshot that is not deleted on time uses disk space, which is sometimes scarce. Would it be possible to implement some kind of retry policy, preferably configurable, like:

retry: 5
retry-interval: 30s
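
For illustration, a rough sketch of what such a retry wrapper around a single lxc call could look like on the Rust side. Everything here (the RetryPolicy struct, the run_lxc_with_retry helper, the example snapshot name) is made up for this comment and is not lxd-snapper's actual code:

use std::process::Command;
use std::thread;
use std::time::Duration;

// Hypothetical counterpart of the proposed `retry` / `retry-interval` keys.
struct RetryPolicy {
    retries: u32,
    interval: Duration,
}

// Run `lxc` with the given arguments, retrying on a non-zero exit status.
fn run_lxc_with_retry(args: &[&str], policy: &RetryPolicy) -> std::io::Result<()> {
    let mut attempt = 0;

    loop {
        let status = Command::new("lxc").args(args).status()?;

        if status.success() {
            return Ok(());
        }

        attempt += 1;

        if attempt > policy.retries {
            return Err(std::io::Error::new(
                std::io::ErrorKind::Other,
                format!("lxc still failing after {} retries", policy.retries),
            ));
        }

        // Wait a bit before the next attempt (`retry-interval`).
        thread::sleep(policy.interval);
    }
}

fn main() -> std::io::Result<()> {
    let policy = RetryPolicy { retries: 5, interval: Duration::from_secs(30) };
    run_lxc_with_retry(&["delete", "some-container/auto-20221206-040026"], &policy)
}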

Patryk27 commented 1 year ago

Oh, that's nice - yeah, should be doable 🙂

Would you see such retry & retry-interval as a global configuration for all containers, or do you have a use case for specifying different retry options for different policies / containers / remotes?

aivanise commented 1 year ago

Whatever is easier to implement. I don't actually have a use case for having it separate per policy, as I don't see how exec(lxc) could fail differently depending on the policy. Maybe if snapshot removal at the ZFS level is slower depending on the snapshots around it, but that is a stretch.

although... https://github.com/openzfs/zfs/issues/11933, but still a stretch ;)

aivanise commented 1 year ago

One more thing here, somewhat related: lxc can also get stuck and never exit, so it would be nice to have a timeout on exec calls to lxc. This happened to me just now, and since it was in a systemd unit that was missing TimeoutStartSec, it was happily hanging in there "as a service" for two weeks until I realized there were no more snapshots ;)
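
Something along these lines is what I have in mind - a purely illustrative sketch (the run_lxc_with_timeout helper is hypothetical; it uses the wait-timeout crate so we don't block forever on wait()):

use std::io;
use std::process::Command;
use std::time::Duration;
use wait_timeout::ChildExt; // external crate: wait-timeout

// Spawn `lxc` and kill it if it doesn't finish within `timeout`.
fn run_lxc_with_timeout(args: &[&str], timeout: Duration) -> io::Result<()> {
    let mut child = Command::new("lxc").args(args).spawn()?;

    match child.wait_timeout(timeout)? {
        // lxc finished in time and succeeded.
        Some(status) if status.success() => Ok(()),

        // lxc finished in time but returned a non-zero status code.
        Some(status) => Err(io::Error::new(
            io::ErrorKind::Other,
            format!("lxc exited with {}", status),
        )),

        // lxc is still running: kill and reap it so it doesn't linger.
        None => {
            child.kill()?;
            child.wait()?;
            Err(io::Error::new(io::ErrorKind::TimedOut, "lxc timed out"))
        }
    }
}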

Patryk27 commented 1 year ago

it would be nice to have a timeout on exec calls to lxc.

Oh, this I can implement pretty quickly! 😄

Check out the current master; I've just added lxc-timeout (with a default of 10 minutes), which lets you specify the maximum waiting time for each invocation of lxc.

aivanise commented 1 year ago

I've tried it out, and it actually makes every call to lxc take the full lxc-timeout instead of timing it out ;)

# stdbuf -i0 -o0 -e0 time /tmp/lxd-snapper -c /tmp/lxd-snapper.conf backup-and-prune | awk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }'
[2023-03-20 12:00:52] Backing-up
[2023-03-20 12:00:52] ----------
[2023-03-20 12:00:52]
[2023-03-20 12:00:53] AEE/aee-qc
[2023-03-20 12:00:53]   - creating snapshot: auto-20230320-110053 [ OK ]
[2023-03-20 12:00:53]
[2023-03-20 12:03:53]
[2023-03-20 12:03:53] Pruning
[2023-03-20 12:03:53] -------
[2023-03-20 12:03:53]
[2023-03-20 12:03:54] AEE/aee-qc
[2023-03-20 12:03:54]
^CCommand terminated by signal 2
0.29user 0.88system 4:26.23elapsed 0%CPU (0avgtext+0avgdata 28168maxresident)k
0inputs+0outputs (0major+20946minor)pagefaults 0swaps

# head -3 /tmp/lxd-snapper.conf
# this is yaml
lxc-timeout: 3 min
policies:

Patryk27 commented 1 year ago

Huh, that's pretty random - I've just re-checked on my machine and everything seems to be working as intended, i.e. the commands complete without any extra delay:

pwy@ubu:~/Projects/lxd-snapper$ stdbuf -i0 -o0 -e0 time ./target/release/lxd-snapper backup-and-prune | awk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }'
[2023-03-20 14:25:38] Backing-up
[2023-03-20 14:25:38] ----------
[2023-03-20 14:25:38] 
[2023-03-20 14:25:38] test
[2023-03-20 14:25:38]   - creating snapshot: auto-20230320-132538 [ OK ]
[2023-03-20 14:25:38] 
[2023-03-20 14:25:38] Backing-up summary
[2023-03-20 14:25:38] ------------------
[2023-03-20 14:25:38]   processed instances: 1
[2023-03-20 14:25:38]   created snapshots: 1
[2023-03-20 14:25:38] 
[2023-03-20 14:25:38] Pruning
[2023-03-20 14:25:38] -------
[2023-03-20 14:25:38] 
[2023-03-20 14:25:38] test
[2023-03-20 14:25:38]   - keeping snapshot: auto-20230320-132538
[2023-03-20 14:25:38]   - deleting snapshot: auto-20230320-132510 [ OK ]
[2023-03-20 14:25:38] 
[2023-03-20 14:25:38] Pruning summary
[2023-03-20 14:25:38] ---------------
[2023-03-20 14:25:38]   processed instances: 1
[2023-03-20 14:25:38]   deleted snapshots: 1
[2023-03-20 14:25:38]   kept snapshots: 1
0.14user 0.20system 0:00.50elapsed 68%CPU (0avgtext+0avgdata 27472maxresident)k
0inputs+0outputs (0major+50077minor)pagefaults 0swaps
pwy@ubu:~/Projects/lxd-snapper$ cat config.yaml 
lxc-timeout: 3 min

policies:
  every-instance:
    keep-last: 1
pwy@ubu:~/Projects/lxd-snapper$ 

Which OS and kernel are you using? 👀

aivanise commented 1 year ago

I'm on CentOS Stream 8, kernel 4.18.0-408.el8.x86_64.

Maybe you should add at least one more machine to be able to see it, as in my case the delays are between the machines, i.e. [ OK ] appears immediately, but it then waits the full lxc-timeout before moving on to the next one.

Patryk27 commented 1 year ago

Yeah, I did check on multiple machines - even with a few different kernel versions (4.14, 4.9 & 5.4) 🤔

Would you mind checking this binary?

(it's lxd-snapper built via Nix, through nix build .#packages.x86_64-linux.default && cp ./result/bin/lxd-snapper . - that's to make sure the compiler or dynamic libraries aren't playing any tricks here 😄)

aivanise commented 1 year ago

Same result. I have noticed, however, that according to 'ps -e f' it spawns lxc list and hangs there for the duration of the timeout. An identical lxc list command issued on the command line returns within seconds. So it might be something else, not the timeout per se. The version I use that works is the latest release (v1.3.0), so it might be something added to master after that.

1512271 pts/0    S+     0:00  |           \_ time /tmp/lxd-snapper -c /tmp/lxd-snapper.conf backup-and-prune
1512282 pts/0    S+     0:00  |           |   \_ /tmp/lxd-snapper -c /tmp/lxd-snapper.conf backup-and-prune
1512582 pts/0    Sl+    0:00  |           |       \_ lxc list local: --project=default --format=json
1512272 pts/0    S+     0:00  |           \_ awk { print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }
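
Just guessing, but one generic way an exec-with-timeout wrapper can show exactly this symptom is when the child's stdout is piped and only read after the wait: lxc list --format=json for many containers produces a lot of JSON, so once the pipe buffer fills up, lxc blocks on write and the parent then sits out the full timeout. A purely illustrative sketch of draining stdout on a separate thread while waiting (the run_lxc_capture helper is made up, not lxd-snapper's code):

use std::io::{self, Read};
use std::process::{Command, Stdio};
use std::thread;
use std::time::Duration;
use wait_timeout::ChildExt; // external crate: wait-timeout

// Run `lxc`, capturing stdout, without letting a full pipe buffer
// stall the child until the timeout expires.
fn run_lxc_capture(args: &[&str], timeout: Duration) -> io::Result<String> {
    let mut child = Command::new("lxc")
        .args(args)
        .stdout(Stdio::piped())
        .spawn()?;

    // Drain stdout concurrently so the child never blocks on a full pipe.
    let mut stdout = child.stdout.take().expect("stdout was piped");
    let reader = thread::spawn(move || {
        let mut buf = String::new();
        stdout.read_to_string(&mut buf).map(|_| buf)
    });

    match child.wait_timeout(timeout)? {
        Some(status) if status.success() => {
            reader.join().expect("reader thread panicked")
        }
        Some(status) => Err(io::Error::new(
            io::ErrorKind::Other,
            format!("lxc exited with {}", status),
        )),
        None => {
            child.kill()?;
            child.wait()?;
            Err(io::Error::new(io::ErrorKind::TimedOut, "lxc timed out"))
        }
    }
}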

Patryk27 commented 1 year ago

Okie, I've just prepared a different implementation - feel free to check out the current master branch if you find a minute 🙂