Open aivanise opened 1 year ago

Hi,

Thanks for a wonderful tool, it has saved my life a couple of times already :)

I have a large(ish) cluster: 6 nodes, 150+ containers, and there is always something going on, either a backup, devs playing around and overloading individual nodes, upgrades, maintenance, etc., so more often than not lxc times out and then the complete service "fails", like this:

In some cases this is a problem, as a snapshot that is not deleted on time uses disk space, which is sometimes scarce. So, would it be possible to implement some kind of retry policy, preferably configurable, like:

retry: 5
retry-interval: 30s
Oh, that's nice - yeah, should be doable 🙂
Would you see such retry & retry-interval as a global configuration for all containers, or would you have some use-case for specifying different retry options for different policies / containers / remotes?
Whatever is easier to implement. I don't actually have a use case for keeping it separate per policy, as I don't see how exec(lxc) could fail differently depending on the policy; maybe if snapshot removal at the ZFS level is slower depending on the snapshots around it, but that is a stretch.
although... https://github.com/openzfs/zfs/issues/11933, but still a stretch ;)
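A minimal sketch of what such a configurable retry could look like around an lxc invocation (this is not lxd-snapper's actual implementation; run_lxc_with_retries and its retries / retry_interval parameters are hypothetical and only mirror the proposed retry / retry-interval config keys):

```rust
use std::io;
use std::process::{Command, ExitStatus};
use std::thread;
use std::time::Duration;

// Hypothetical sketch, not lxd-snapper's code: run `lxc` with `args`,
// retrying failed invocations up to `retries` extra times and sleeping
// `retry_interval` between attempts.
fn run_lxc_with_retries(
    args: &[&str],
    retries: u32,
    retry_interval: Duration,
) -> io::Result<ExitStatus> {
    let mut attempt = 0;

    loop {
        let status = Command::new("lxc").args(args).status()?;

        // Success, or all retries used up: return whatever we got.
        if status.success() || attempt >= retries {
            return Ok(status);
        }

        attempt += 1;
        thread::sleep(retry_interval);
    }
}
```

Under this reading, retry: 5 with retry-interval: 30s would mean at most six attempts spread over roughly two and a half minutes.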
One more thing here, somewhat related: lxc can also get stuck and never exit, so it would be nice to have a timeout on exec calls to lxc. It happened to me just now, and since it was in a systemd unit that was missing TimeoutStartSec, it was happily hanging in there "as a service" for two weeks until I realized there were no more snapshots ;)
it would be nice to have a timeout on exec calls to lxc.
Oh, this I can implement pretty quickly! 😄
Check out current master, I've just added lxc-timeout there (with a default of 10 minutes), which allows specifying the maximum waiting time for each invocation of lxc.
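For reference, a rough sketch of one way such a per-invocation timeout can be implemented, by polling the spawned process and killing it once a deadline passes; run_lxc_with_timeout is a made-up name and this is not the code that actually landed on master:

```rust
use std::io;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

// Illustrative sketch only: spawn `lxc` with `args` and poll it until it
// exits or `timeout` elapses; on timeout the child is killed and `Ok(None)`
// is returned so the caller can report the failure.
fn run_lxc_with_timeout(args: &[&str], timeout: Duration) -> io::Result<Option<ExitStatus>> {
    let mut child = Command::new("lxc")
        .args(args)
        .stdin(Stdio::null())
        .spawn()?;

    let deadline = Instant::now() + timeout;

    loop {
        // `try_wait` returns immediately, with `Some(status)` once the child has exited.
        if let Some(status) = child.try_wait()? {
            return Ok(Some(status));
        }

        if Instant::now() >= deadline {
            child.kill()?;
            child.wait()?; // reap the killed process
            return Ok(None);
        }

        thread::sleep(Duration::from_millis(100));
    }
}
```

The 100 ms polling interval is arbitrary; an event-driven wait (e.g. a crate such as wait-timeout) would avoid the busy loop.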
I've tried it out and it actually makes every call to lxc take the full lxc-timeout time instead of timing it out ;)
# stdbuf -i0 -o0 -e0 time /tmp/lxd-snapper -c /tmp/lxd-snapper.conf backup-and-prune | awk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }'
[2023-03-20 12:00:52] Backing-up
[2023-03-20 12:00:52] ----------
[2023-03-20 12:00:52]
[2023-03-20 12:00:53] AEE/aee-qc
[2023-03-20 12:00:53] - creating snapshot: auto-20230320-110053 [ OK ]
[2023-03-20 12:00:53]
[2023-03-20 12:03:53]
[2023-03-20 12:03:53] Pruning
[2023-03-20 12:03:53] -------
[2023-03-20 12:03:53]
[2023-03-20 12:03:54] AEE/aee-qc
[2023-03-20 12:03:54]
^CCommand terminated by signal 2
0.29user 0.88system 4:26.23elapsed 0%CPU (0avgtext+0avgdata 28168maxresident)k
0inputs+0outputs (0major+20946minor)pagefaults 0swaps
# head -3 /tmp/lxd-snapper.conf
# this is yaml
lxc-timeout: 3 min
policies:
Huh, that's pretty random - I've just re-checked on my machine and everything seems to be working as intended, i.e. the commands complete without any extra delay:
pwy@ubu:~/Projects/lxd-snapper$ stdbuf -i0 -o0 -e0 time ./target/release/lxd-snapper backup-and-prune | awk '{ print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }'
[2023-03-20 14:25:38] Backing-up
[2023-03-20 14:25:38] ----------
[2023-03-20 14:25:38]
[2023-03-20 14:25:38] test
[2023-03-20 14:25:38] - creating snapshot: auto-20230320-132538 [ OK ]
[2023-03-20 14:25:38]
[2023-03-20 14:25:38] Backing-up summary
[2023-03-20 14:25:38] ------------------
[2023-03-20 14:25:38] processed instances: 1
[2023-03-20 14:25:38] created snapshots: 1
[2023-03-20 14:25:38]
[2023-03-20 14:25:38] Pruning
[2023-03-20 14:25:38] -------
[2023-03-20 14:25:38]
[2023-03-20 14:25:38] test
[2023-03-20 14:25:38] - keeping snapshot: auto-20230320-132538
[2023-03-20 14:25:38] - deleting snapshot: auto-20230320-132510 [ OK ]
[2023-03-20 14:25:38]
[2023-03-20 14:25:38] Pruning summary
[2023-03-20 14:25:38] ---------------
[2023-03-20 14:25:38] processed instances: 1
[2023-03-20 14:25:38] deleted snapshots: 1
[2023-03-20 14:25:38] kept snapshots: 1
0.14user 0.20system 0:00.50elapsed 68%CPU (0avgtext+0avgdata 27472maxresident)k
0inputs+0outputs (0major+50077minor)pagefaults 0swaps
pwy@ubu:~/Projects/lxd-snapper$ cat config.yaml
lxc-timeout: 3 min
policies:
  every-instance:
    keep-last: 1
pwy@ubu:~/Projects/lxd-snapper$
Which OS and kernel are you using? 👀
I'm on CentOS 8 Stream, 4.18.0-408.el8.x86_64
Maybe you should add at least one more machine to be able to see it; in my case the delays are between the machines, i.e. [OK] appears immediately, but then it waits the full lxc-timeout before skipping to the next one.
Yeah, I did check on multiple machines - even with a few different kernel versions (4.14, 4.9 & 5.4) 🤔
Would you mind checking this binary? (It's lxd-snapper built via Nix, through nix build .#packages.x86_64-linux.default && cp ./result/bin/lxd-snapper . - that's to make sure the compiler or dynamic linking isn't playing any tricks here 😄)
Same result. I have noticed, however, that according to 'ps -e f' it spawns lxc list and hangs there for the duration of the timeout. An identical lxc list command issued on the command line returns within seconds, so it might be something else, not the timeout per se. The version that works, which I use, is the last release (v1.3.0), so it might be something added to master after that.
1512271 pts/0 S+ 0:00 | \_ time /tmp/lxd-snapper -c /tmp/lxd-snapper.conf backup-and-prune
1512282 pts/0 S+ 0:00 | | \_ /tmp/lxd-snapper -c /tmp/lxd-snapper.conf backup-and-prune
1512582 pts/0 Sl+ 0:00 | | \_ lxc list local: --project=default --format=json
1512272 pts/0 S+ 0:00 | \_ awk { print strftime("[%Y-%m-%d %H:%M:%S]"), $0 }
Okie, I've just prepared a different implementation - feel free to check out the current master branch if you find a minute 🙂