NixOS / nixpkgs

Nix Packages collection & NixOS
MIT License
17.81k stars 13.91k forks source link

borgbackup unit fails if borg issues any warnings #177733

Open justinas opened 2 years ago

justinas commented 2 years ago

Describe the bug

borg uses exit code 1 to indicate not errors, but warnings, which are not fatal to the program (see https://borgbackup.readthedocs.io/en/stable/usage/general.html#return-codes).

Since systemd.services.*.script is prefixed by set -e, this makes the backup script exit after e.g. borg create, even though it should continue.

The default silent behavior of borg also made this failure more mysterious. I needed to set extraCreateArgs = "--verbose"; to get Borg to log the warning.

Steps To Reproduce

Steps to reproduce the behavior:

  1. Have a borgbackup.jobs.foo
  2. The unit will occasionally exit with an error as Borg issues warnings. One warning I encounter frequently is this in borg create:
    /root/.config/borg/security/048c71a26c7d15b4890012f52aa70a0f97f0e8d5c0603a8a006baf497a698116/nonce: file changed while we backed it up

    which happens if you try to back-up the home directory of the user that Borg is running as (/root when running as root).

  3. The unit exits early, does not execute prune, postCreate, etc. and is marked as failed.

Expected behavior

The script of borgbackup unit should continue even after receiving exit code of 1 from an individual borg command.

Notify maintainers

@dotlambda

Metadata

Please run nix-shell -p nix-info --run "nix-info -m" and paste the result.

[root@calor:~]# nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 5.15.43, NixOS, 22.05 (Quokka), 22.05-11e805f`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.8.1`
 - nixpkgs: `/nix/store/hm5n6bfsl9grg6r2fxynyrxv896wq5x4-1390c1iw8j28msp2k2pbamq6mqjpgw03-nixos-22.05-2022-06-07`
dotlambda commented 2 years ago

I'm not sure what the best action would be. SuccessExitStatus is related but won't work around set -e. cc @ncfavier @Stunkymonkey @blargg @winterqt @erictapen @pennae @mweinelt

pennae commented 2 years ago

borg documentation states that exit code 1 is only used to signal that warnings have happened, exit code 2 is used for errors. maybe we should just wrap borg in the module to treat 0 and 1 as success?

mweinelt commented 2 years ago

Dropping the scripts in favor of ExecStartPre and ExitStopPost like proposed in #120571 would help with actually running hooks even if things exit uncleanly.

winterqt commented 2 years ago

Dropping the scripts in favor of ExecStartPre and ExitStopPost like proposed in #120571 would help with actually running hooks even if things exit uncleanly.

Does anyone know why ExecStartPre and ExecStopPost weren't used in the first place, was there a specific reason?

winterqt commented 2 years ago

maybe we should just wrap borg in the module to treat 0 and 1 as success?

This is probably the best approach considering those status codes are documented behavior.

mweinelt commented 2 years ago

That's the same as setting SuccessExitStatus=1 as proposed earlier.

pennae commented 2 years ago

ExecStart{Pre,Post} would certainly be a good thing to have, but they wouldn't solve the whole problem since appendFailedSuffix is a thing. SuccessExitStatus=1 wouldn't allow that to work either

dotlambda commented 2 years ago

We can ~append~ remove the suffix in ExecStartPost if either exit code is nonzero, or if it is neither 0 nor 1, depending on how a new option considerWarningFailure is set.

pennae commented 2 years ago

to do that we'd have to generate the archive name in StartPre and somehow convey it to the other commands—is that even possible without using state files?

winterqt commented 2 years ago

I'm not sure what the best action would be. SuccessExitStatus is related but won't work around set -e.

@dotlambda I don't see any set -e call in the module, only a set -o pipefail. What were you referencing here?

dotlambda commented 2 years ago

I'm not sure what the best action would be. SuccessExitStatus is related but won't work around set -e.

@dotlambda I don't see any set -e call in the module, only a set -o pipefail. What were you referencing here?

https://github.com/NixOS/nixpkgs/blob/7cdc4dd5d1a2869d6e3c089f960ed6feae06ae77/nixos/lib/systemd-lib.nix#L250

pennae commented 2 years ago

another problem we'd have with splitting the current monolithic script into pre/main/post commands is that preHook can now set shell state that postHook consumes. we'd have to be very clear that such state will have to through files when we split the script. 😕

priegger commented 2 years ago

I use a wrapper function for now, based on the above commit by @calbrecht.

preHook = ''
  realBorg="$(${pkgs.which}/bin/which borg)"

  borg(){
    returnCode=0
    "$realBorg" "$@" || returnCode=$?

    if [[ $returnCode -eq 1 ]]; then
      return 0
    else
      return $returnCode
    fi
  }
'';
mweinelt commented 2 years ago

With roughly two months left until release, can we agree on a bandaid solution, so that the next stable distribution isn't affected?

This makes monitoring the unit for failure pretty much pointless, because 99% of the time it reminds me that I failed to setup a system with snapshotting in mind, but the backup succeeds. But the remaining 1% is issues like this, that prevent the backup entirely:

Oct 08 00:04:45 ganymede borgbackup-job-rootBackup-start[598343]: Failed to create/acquire the lock /root/.cache/borg/4242fd7b9a339cf44257506dca7ddeb9f467cf39da23cc682ff50c1075dffe0d/lock.>

At this point I'd also consider patching to get piece of mind.

https://github.com/borgbackup/borg/blob/1.2.2/src/borg/archiver.py#L568 https://github.com/borgbackup/borg/blob/1.2.2/src/borg/archiver.py#L830


diff --git a/src/borg/archiver.py b/src/borg/archiver.py
index 8afc4079..766ec12a 100644
--- a/src/borg/archiver.py
+++ b/src/borg/archiver.py
@@ -565,7 +565,7 @@ def create_inner(archive, cache, fso):
                         self.print_warning('%s: %s', path, e)
                         status = 'E'
                     if status == 'C':
-                        self.print_warning('%s: file changed while we backed it up', path)
+                        print(f'{path}: file changed while we backed it up', file=sys.stderr)
                     self.print_file_status(status, path)
                 if args.paths_from_command:
                     rc = proc.wait()
@@ -827,7 +827,7 @@ def _rec_walk(self, *, path, parent_fd, name, fso, cache, matcher,
             self.print_warning('%s: %s', path, e)
             status = 'E'
         if status == 'C':
-            self.print_warning('%s: file changed while we backed it up', path)
+            print(f'{path}: file changed while we backed it up', file=sys.stderr)
         if not recurse_excluded_dir:
             self.print_file_status(status, path)
RaitoBezarius commented 1 year ago

Just encountered this again, I would be in favor of your patch @mweinelt which is a OK bandaid solution, I reread https://github.com/borgbackup/borg/issues/6379 and the solution seems to have "individual files retries", uncertain how does it look like in Borgbackup 2+ though.