NixOS / nixpkgs

services.restic systemd service does not wait for network #196547

Open jocelynthode opened 1 year ago

jocelynthode commented 1 year ago

Describe the bug

When using services.restic with a remote repository, such as an S3 repository, the service does not wait for the network to be ready. This causes failures when the service starts at boot, for example because the timer sets Persistent.

Steps To Reproduce

Steps to reproduce the behavior:

Use the following config:

  services.restic.backups = {
    persist = {
      user = "root";
      # repository is in environmentFile
      paths = [ "/persist" ];
      extraBackupArgs = [ "--exclude-file=/etc/restic/exclude.txt" ];
      initialize = true;
      passwordFile = config.sops.secrets."restic/password".path;
      timerConfig = {
        OnCalendar = "*-*-* 12:00:00";
        Persistent = true;
      };
      pruneOpts = [
        "--keep-daily 7"
        "--keep-weekly 4"
        "--keep-monthly 3"
      ];
      environmentFile = config.sops.secrets."restic/env".path;
    };
  };

Because of the Persistent setting, the backup runs at boot if the OnCalendar time was missed:

oct 18 08:27:03 host restic-backups-persist-pre-start[2600]: Fatal: unable to open repository at swift:repo:/nix: conn.Authenticate: Post "https://swift.example.com/identity/v3/auth/tokens": dial tcp: lookup swift.example.com: Temporary failure in name resolution
oct 18 08:27:03 host restic-backups-persist-pre-start[2640]: Fatal: create repository at swift:repo:/nix failed: conn.Authenticate: Post "https://swift.example.com/identity/v3/auth/tokens": dial tcp: lookup swift.example.com: Temporary failure in name resolution

Expected behavior

The service should wait for network-online.target so that it can resolve and contact the remote repository.
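
A minimal sketch of that expected behavior as a local override. It assumes the generated unit is named restic-backups-persist (the name the log lines above suggest) and that the module on this nixpkgs revision does not already order the unit after network-online.target:

```nix
{
  # Assumption: the restic module generates a unit called
  # "restic-backups-<name>", here "restic-backups-persist".
  systemd.services."restic-backups-persist" = {
    # Pull network-online.target into the transaction ...
    wants = [ "network-online.target" ];
    # ... and only start the backup once that target has been reached.
    after = [ "network-online.target" ];
  };
}
```

Note that network-online.target only reflects what the network manager considers "online", so this may not be sufficient if DNS or the remote endpoint becomes reachable later.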

Notify maintainers

@thiagokokada @pennae

Metadata

❯ nix-shell -p nix-info --run "nix-info -m"
 - system: `"x86_64-linux"`
 - host os: `Linux 6.0.0, NixOS, 22.11 (Raccoon), 22.11.20221013.ba187fb`
 - multi-user?: `yes`
 - sandbox: `yes`
 - version: `nix-env (Nix) 2.11.0`
 - channels(root): `"nixos"`
 - nixpkgs: `/nix/var/nix/profiles/per-user/root/channels/nixos`

korfuri commented 4 months ago

Looking at my restic setup, the restic systemd service does wait on network-online.target, but that is not always enough: network-online.target does not guarantee that my remote backup locations are reachable.

This is problematic as the backup is by default attempted on a "daily" timer. In practice, this means "at midnight" and "on startup, if we missed the midnight run". But if the machine is never on at midnight, and backups are attempted at startup as soon as network-online.target is reached (and possibly before my backup repository is reachable), then backups never complete.

I am not familiar enough with systemd timers - would there be a way to express "run to completion once a day, but in case of failures, retry every hour"? And would that be a reasonable default?

korfuri commented 3 months ago

It turns out that since 2019, oneshot services support Restart=on-failure. This does not solve the problem in the issue title, but it's a decent enough workaround for my needs. I just add:

  systemd.services."restic-backups-to-GCS" = {
    serviceConfig = {
      Restart = "on-failure";
      RestartSec = "60s";
    };
    unitConfig = {
      StartLimitIntervalSec = 3600;
      StartLimitBurst = 15;
    };
  };

This means the restic unit will start at some point after network-online.target is reached. It may still fail for any number of reasons, but it will now retry every minute for about 15 minutes. This is enough for my main use case of "laptop booted but is not connected to Wi-Fi yet".

workflow commented 3 months ago

As an alternative to @korfuri's workaround, the restic service also has a backupPrepareCommand option that you could abuse to do something like this:

backupPrepareCommand = ''
  while ! /run/current-system/sw/bin/ping -c 1 1.0.0.1; do
    echo "Waiting for internet connection..."
    sleep 60
  done

  echo "Internet is up, let's upload ~raccoon memes~ some backups!"
'';
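
The loop above waits forever if the network never comes up. A bounded variant might look like the sketch below. It assumes the backup is named persist as in the original report, that pkgs.iputils provides ping, and that a non-zero exit from the prepare command makes the unit fail; the attempt count and delays are hypothetical values, not recommendations:

```nix
{ pkgs, ... }:
{
  services.restic.backups.persist.backupPrepareCommand = ''
    # Hypothetical bound: 30 attempts, 10 seconds apart (about 5 minutes).
    attempts=0
    until ${pkgs.iputils}/bin/ping -c 1 -W 5 1.0.0.1 > /dev/null; do
      attempts=$((attempts + 1))
      if [ "$attempts" -ge 30 ]; then
        echo "Network still unreachable, giving up." >&2
        exit 1
      fi
      echo "Waiting for internet connection..."
      sleep 10
    done
  '';
}
```

Failing after a bounded wait, instead of blocking, should let a Restart=on-failure policy or the next timer run take over. Pinging a public IP address also does not prove that DNS works or that the repository host is reachable, so this remains a heuristic.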