creativeprojects / resticprofile

Configuration profiles manager and scheduler for restic backup
https://creativeprojects.github.io/resticprofile/
GNU General Public License v3.0

Prometheus metric resticprofile_backup_status is 2 even when backups fail #332

Open deviantintegral opened 4 months ago

deviantintegral commented 4 months ago

To test alerting on the resticprofile_backup_status metric, I changed my AWS access key to an invalid value and triggered a backup. Although the job errored out, I see a fresh resticprofile_backup_status metric with a value of 2.

Luckily, the Last Backup timestamp isn't changed, so I can probably alert on that. However, I expected the status to be 0.

creativeprojects commented 4 months ago

You're right, I wouldn't expect the status to be 2 🤔 Can you please post your profile configuration (with any repository information redacted) so I can get a better idea of what is happening?

deviantintegral commented 4 months ago

Sure, here it is. I have several other backup sets but they all have the same config.

version: "1"

global:
  scheduler: crond
  priority: low

base:
  initialize: true
  password-file: key
  prometheus-push: "http://metrics-docker.lan:9091/"
  prometheus-save-to-file: "{{ .Profile.Name }}.prom"
  prometheus-labels:
    - host: {{ .Hostname }}
  backup:
    exclude-caches: true
    one-file-system: true
    check-before: true
    extended-status: true
  retention:
    after-backup: true
    keep-daily: 30
    keep-weekly: 4
    keep-monthly: 13
    prune: true

photos:
  inherit: base
  lock: /tmp/photos.lock
  force-inactive-lock: true
  restic-stale-lock-age: 5m
  repository: REDACTED-S3-ENDPOINT-ON-B2
  env:
    AWS_ACCESS_KEY_ID: REDACTED_ACCESS_KEY
    AWS_SECRET_ACCESS_KEY: REDACTED_SECRET_KEY
  backup:
    source:
      - '/source/photos'
    schedule: "04:00"
    schedule-permission: system

creativeprojects commented 4 months ago

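Right, I see what's happening: your profile has check-before: true, so the check command runs first and fails before the backup command even starts. But only the backup command generates Prometheus metrics. So at that point it keeps the existing metrics and doesn't generate new ones.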

I think that to fix this issue, we would need to generate a status line for each part (check, forget, etc.).
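To make the idea a bit more concrete, here is a minimal sketch of what one status line per command could look like in the Prometheus text exposition format. It is not resticprofile's actual implementation: the metric name resticprofile_command_status, the command label, and the function commandStatusLines are all hypothetical, and the status values simply reuse the convention from this thread (0 for a failure, 2 for a success).

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Status values as used in this thread: 0 = failed, 2 = succeeded.
const (
	statusFailed  = 0
	statusSuccess = 2
)

// commandStatusLines renders one sample per command in the Prometheus text
// exposition format. The metric name and the "command" label are hypothetical;
// they are not existing resticprofile metrics.
func commandStatusLines(profile string, statuses map[string]int) string {
	commands := make([]string, 0, len(statuses))
	for name := range statuses {
		commands = append(commands, name)
	}
	sort.Strings(commands) // map iteration order is random, so sort for stable output

	var b strings.Builder
	b.WriteString("# TYPE resticprofile_command_status gauge\n")
	for _, name := range commands {
		fmt.Fprintf(&b, "resticprofile_command_status{profile=%q,command=%q} %d\n",
			profile, name, statuses[name])
	}
	return b.String()
}

func main() {
	// The check step failed, so the backup never ran and reports no sample.
	fmt.Print(commandStatusLines("photos", map[string]int{"check": statusFailed}))
}
```

With something along these lines, a failed check would show up as its own sample with a value of 0 instead of leaving the previous backup status of 2 in place.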