backube / volsync

Asynchronous data replication for Kubernetes volumes
https://volsync.readthedocs.io
GNU Affero General Public License v3.0

Restic mover job should fail if restic repo is locked #1429

Open onedr0p opened 10 hours ago

onedr0p commented 10 hours ago

> [!IMPORTANT]
> This request requires restic 0.17.0 for the locked-repository exit code and Kubernetes 1.31 for the stable podFailurePolicy feature.

The Restic mover should not retry the backup over and over if the repo is locked; we could use the podFailurePolicy feature and restic's exit codes to achieve this.

```yaml
podReplacementPolicy: Failed # required when using podFailurePolicy
podFailurePolicy:
  rules:
    - action: FailJob
      onExitCodes:
        containerName: restic
        operator: In
        values: [11] # exit code 11 indicates a locked restic repo
```

That should mark the job as failed so it is not retried until the next volsync schedule.

The current behavior is that, on a locked restic repo, the same job retries over and over until the backoffLimit is reached.
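For context, a minimal sketch of roughly where those fields would sit in the mover's batch/v1 Job spec. The Job name, image, and command below are placeholders, not volsync's actual mover definition; only the podFailurePolicy block mirrors the snippet above.

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: volsync-src-example        # placeholder name, not the real mover Job
spec:
  backoffLimit: 2                  # current-style retry budget
  podReplacementPolicy: Failed     # the only allowed value when podFailurePolicy is set
  podFailurePolicy:
    rules:
      - action: FailJob            # fail the whole Job instead of recreating the pod
        onExitCodes:
          containerName: restic
          operator: In
          values: [11]             # restic >= 0.17.0: repository is already locked
  template:
    spec:
      restartPolicy: Never         # podFailurePolicy requires restartPolicy: Never
      containers:
        - name: restic
          image: quay.io/backube/volsync:latest   # placeholder image
          command: ["/mover-restic/entry.sh"]     # placeholder entrypoint
```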

Originally posted by @onedr0p in #1415

PrivatePuffin commented 9 hours ago

I want to add that this issue is even worse than it looks. I ran into a case where a mover repeatedly fetched 99GB(!) from S3 storage by retrying over and over while trying to fit 120GB into a 100GB PVC.

The sizing issue was my mistake, but it shouldn't try to repeatedly restart and burn through my S3 funds either.

onedr0p commented 9 hours ago

Yeah, it also feels like backoffLimit should be set to 0 (or at least be configurable); if it fails the first time, I highly doubt it will ever succeed.
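A sketch of the "configurable" option, assuming it were exposed on the ReplicationSource; the moverJobBackoffLimit field below is purely hypothetical and does not exist in volsync today:

```yaml
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: database-backup
spec:
  sourcePVC: database-data
  trigger:
    schedule: "0 * * * *"
  restic:
    repository: restic-config
    copyMethod: Snapshot
    moverJobBackoffLimit: 0   # hypothetical field: passed through to the mover Job's backoffLimit
```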

PrivatePuffin commented 9 hours ago

Yeah, maybe one retry by default is excusable, but certainly with the S3 backend users can get into BOATLOADS of financial trouble if things go horribly wrong.

tesshuflower commented 9 hours ago

Currently the design of volsync is very Kubernetes-centric, preferring to retry until things work.

If we just fail the job, then we have no chance to retry even for something like a network hiccup. However, if we could detect specific errors from restic, this is an interesting idea. If we did something here, it would definitely need to be an opt-in feature.

One issue is that simply stopping the job from retrying will not solve the scheduling issue, as volsync will not schedule a new synchronization until the previous one has completed (i.e. the job has completed); VolSync will still reschedule the job after the backoffLimit is hit. We would need modifications to the scheduling code to mark the synchronization as completed with an error. Then there's still the question of whether we hang forever or retry on a schedule (which could still get someone into the same situation).
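For reference, when a podFailurePolicy rule with action: FailJob matches, the Job carries a distinguishable Failed condition that the scheduling code could key on to mark the synchronization as completed-with-error rather than rescheduling it. The status excerpt below is illustrative; the exact message wording varies by Kubernetes version:

```yaml
status:
  failed: 1
  conditions:
    - type: Failed
      status: "True"
      reason: PodFailurePolicy
      message: >-
        Container restic for pod ns/volsync-src-example-xxxxx failed with
        exit code 11 matching FailJob rule at index 0
```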

PrivatePuffin commented 9 hours ago

@tesshuflower I agree, a few retries are not a bad thing. But having it endlessly retry is going to cause more trouble than it solves.

samip5 commented 4 hours ago

Yeah, my S3 bill is how I noticed the whole problem after the fact...

onedr0p commented 59 minutes ago

> However, if we could detect specific errors from restic, this is an interesting idea. If we did something here, it would definitely need to be an opt-in feature.

That's good to hear, and I'm glad you agree. There's no point in retrying if we know from restic's exit codes that the job will never succeed. Hopefully there can be some improvements in this area.
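To make the opt-in concrete, one purely hypothetical shape for it on the ReplicationSource restic spec (no such field exists today) could be:

```yaml
spec:
  restic:
    repository: restic-config
    # Hypothetical opt-in flag: when true, the operator would add a
    # podFailurePolicy FailJob rule for restic's "repository already locked"
    # exit code (11) to the mover Job, so a locked repo fails fast instead
    # of retrying until backoffLimit.
    failFastOnLockedRepo: true
```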