backube / volsync

Asynchronous data replication for Kubernetes volumes
https://volsync.readthedocs.io
GNU Affero General Public License v3.0

My trigger schedule doesn't seem to be honored (v0.10.0) #1415

Open samip5 opened 1 month ago

samip5 commented 1 month ago

Describe the bug
I have specified a trigger schedule to back up at 2 AM, but it doesn't seem to be honored.

Steps to reproduce
Set a ReplicationSource with a trigger schedule of 0 2 * * *.

Expected behavior
I expected it to honor my trigger.

Actual results
It doesn't seem to honor it.

Additional context

I'm using Volume Populator, and my replication source is here:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  creationTimestamp: "2024-07-28T14:35:32Z"
  generation: 2
  labels:
    kustomize.toolkit.fluxcd.io/name: dev-forego
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: forego-r2
  namespace: dev
  resourceVersion: "221092293"
  uid: 6107f4d4-56a4-499a-aaea-f66511200091
spec:
  restic:
    cacheCapacity: 15Gi
    cacheStorageClassName: zfs
    copyMethod: Snapshot
    moverSecurityContext:
      fsGroup: 1000
      runAsGroup: 1000
      runAsUser: 1000
    pruneIntervalDays: 7
    repository: forego-volsync-r2
    retain:
      daily: 7
      weekly: 1
    storageClassName: zfs
    volumeSnapshotClassName: zfs
  sourcePVC: forego-data
  trigger:
    schedule: 0 2 * * *

Logs that show amount of snapshots: https://gist.github.com/samip5/b7a9bdf8345a5957cb659718812ffd6c

It seems that, for whatever reason, it's not following the trigger, as you can clearly see from:

3840aa4c  2024-10-10 00:03:46  volsync                 /data  500.240 MiB
1fa0a425  2024-10-10 00:10:02  volsync                 /data  500.240 MiB
629bf12a  2024-10-10 00:17:47  volsync                 /data  500.240 MiB
c5de5a39  2024-10-10 00:28:07  volsync                 /data  500.240 MiB

So why would that be?

tesshuflower commented 1 month ago

@samip5 Is the replication completing successfully? I can't tell from the container log. Do you have the full replicationsource (including the status) from your system? It should update the .status.nextSyncTime which should tell you when it thinks the next sync should be.
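For reference, a quick way to check those status fields (a sketch; the name and namespace are taken from the manifest above):

kubectl -n dev get replicationsource forego-r2 -o yaml
# or only the scheduling-related status fields:
kubectl -n dev get replicationsource forego-r2 \
  -o jsonpath='{.status.lastSyncStartTime}{"\n"}{.status.lastSyncTime}{"\n"}{.status.nextSyncTime}{"\n"}'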

samip5 commented 1 month ago

It seems to be the case, which is a frequently recurring theme, that it gets into a state where the repository is locked by an older backup job (whose pod is nowhere to be found) and requires manual intervention to resolve.

It was stuck for quite a while, and it seems the next sync time is also still wrong?


apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  creationTimestamp: "2024-07-28T14:35:32Z"
  generation: 2
  labels:
    kustomize.toolkit.fluxcd.io/name: dev-forego
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: forego-r2
  namespace: dev
  resourceVersion: "221149061"
  uid: 6107f4d4-56a4-499a-aaea-f66511200091
spec:
  restic:
    cacheCapacity: 15Gi
    cacheStorageClassName: zfs
    copyMethod: Snapshot
    moverSecurityContext:
      fsGroup: 1000
      runAsGroup: 1000
      runAsUser: 1000
    pruneIntervalDays: 7
    repository: forego-volsync-r2
    retain:
      daily: 7
      weekly: 1
    storageClassName: zfs
    volumeSnapshotClassName: zfs
  sourcePVC: forego-data
  trigger:
    schedule: 0 2 * * *
status:
  conditions:
  - lastTransitionTime: "2024-10-10T02:00:00Z"
    message: Synchronization in-progress
    reason: SyncInProgress
    status: "True"
    type: Synchronizing
  lastSyncDuration: 1762h33m30.311652358s
  lastSyncStartTime: "2024-10-10T02:00:00Z"
  lastSyncTime: "2024-10-10T01:09:02Z"
  latestMoverStatus:
    logs: |-
      using parent snapshot c5de5a39
      Added to the repository: 1.303 MiB (482.964 KiB stored)
      processed 2173 files, 500.240 MiB in 0:42
      snapshot ad360e4d saved
      Restic completed in 1802s
    result: Successful
  nextSyncTime: "2024-10-10T02:00:00Z"
  restic:
    lastPruned: "2024-10-10T01:09:01Z"

tesshuflower commented 1 month ago

It seems to be the case, which is a frequently recurring theme, that it gets into a state where the repository is locked by an older backup job (whose pod is nowhere to be found) and requires manual intervention to resolve.

I'm not sure I understand this part: is the repo getting locked after a failed job? Is the mover pod getting killed while backing up (from OOM or something else)? If the backup doesn't complete due to an error, VolSync will keep retrying, which is potentially why you see it keep attempting.
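One way to check whether the mover pod is being killed rather than exiting cleanly (a sketch; only the namespace is taken from this issue, the rest is generic kubectl):

# Show the last termination reason (e.g. OOMKilled, Error) for containers in the namespace
kubectl -n dev get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}'
# Recent events may also show OOM kills, evictions, or disk-pressure issues
kubectl -n dev get events --sort-by=.lastTimestamp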

nextSyncTime: "2024-10-10T02:00:00Z"

It looks to me like the next sync time is set correctly? In this case the ReplicationSource is still trying to perform the backup; as you can see, lastSyncStartTime is also set to 2 AM:

lastSyncStartTime: "2024-10-10T02:00:00Z"

If it completes successfully, then it should set the next sync start time to the next one in the schedule.

samip5 commented 1 month ago

It seems to be the case, which is a frequently recurring theme, that it gets into a state where the repository is locked by an older backup job (whose pod is nowhere to be found) and requires manual intervention to resolve.

I'm not sure I understand this part: is the repo getting locked after a failed job? Is the mover pod getting killed while backing up (from OOM or something else)? If the backup doesn't complete due to an error, VolSync will keep retrying, which is potentially why you see it keep attempting.

I really wish I knew how it happens (it does require me to run unlock), but the pod is nowhere to be found even when it fails, so I can't figure it out. When it fails, VolSync keeps trying and trying but fails due to the lock, and before we know it there are 5+ pods trying to do it; and since the backup itself isn't instant (it takes more than 3 minutes), the plot thickens.

tesshuflower commented 1 month ago

VolSync's mover uses a Job and should only be running one pod at a time. You mention unlock; do you have logs from any of the failed pods indicating that the lock is the issue?

If you need to run an unlock, there is now an unlock option for a restic job you could look at; see the unlock backup option here: https://github.com/backube/volsync/issues/1421
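For illustration, a sketch of how that unlock option would look on this ReplicationSource (based on the option described in issue #1421; it may not be available in v0.10.0, and the field value is arbitrary):

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: forego-r2
  namespace: dev
spec:
  sourcePVC: forego-data
  trigger:
    schedule: 0 2 * * *
  restic:
    repository: forego-volsync-r2
    # Changing this string asks the mover to run a restic unlock before the next sync
    unlock: unlock-1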

onedr0p commented 1 month ago

VolSync's mover uses a Job and should only be running one pod at a time.

I think what @samip5 is saying is that when the job fails, a new pod takes its place, fails, creates a new lock, and stays in an errored state; then another new pod takes its place, fails, creates a new lock, stays in an errored state, and so on. New pods will be created by the job until the restic repo is in a healthy state without any locks. Once there is a successful backup, all the errored pods will be deleted as the job is cleaned up by VolSync.

From what I can tell reading here and here, the way the unlock feature currently works with VolSync is that restic only removes locks that are stale (older than 30 minutes, or created by the "same machine"), but there could still be locks that restic doesn't consider stale, which requires restic unlock --remove-all to be run.
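For reference, the difference at the restic CLI level (standard restic flags; the repository path is a placeholder):

# Removes only locks restic considers stale
restic -r <repository> unlock
# Removes ALL locks, including ones that may belong to other live processes
restic -r <repository> unlock --remove-all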

It looks like restic unlock --remove-all was previously experimented with, as I see in this comment and in PR https://github.com/backube/volsync/pull/713.

FWIW, with the restic mover I don't know if it's even worth trying to start a new pod after the previous one failed. Is there any way to tweak restartPolicy: Never and backoffLimit: 0 on the job spec? I highly doubt the next pod will succeed if the first one fails, especially with locks coming into play.

samip5 commented 1 month ago

Yes, that's what I was trying to say, but onedr0p did a better job explaining it.

onedr0p commented 1 month ago

FWIW, with the restic mover I don't know if it's even worth trying to start a new pod after the previous one failed. Is there any way to tweak restartPolicy: Never and backoffLimit: 0 on the job spec? I highly doubt the next pod will succeed if the first one fails, especially with locks coming into play.

I do see restartPolicy is Never, but backoffLimit is set to 8. Ideally I would like this to be 0, so it doesn't try again until the next schedule and gives me time to unlock it using the spec.restic.unlock option.
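For illustration, an assumed fragment of what those fields look like on the mover Job (not VolSync's actual job template; the names and image are placeholders):

apiVersion: batch/v1
kind: Job
metadata:
  name: volsync-src-forego-r2   # assumed mover job name pattern
spec:
  backoffLimit: 0               # currently 8; 0 would stop retries after one failed pod
  template:
    spec:
      restartPolicy: Never      # already Never today
      containers:
        - name: restic
          image: example.invalid/volsync-mover:latest   # placeholder image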

onedr0p commented 1 month ago

@tesshuflower I wonder if we could make use of the pod failure policy feature and restic's exit codes for a locked repo?

e.g. set this on the job...

podFailurePolicy:
  rules:
    - action: FailJob
      onExitCodes:
        containerName: restic
        operator: In
        values: [11]

That should mark the job as failed and not retried until the next VolSync schedule, instead of the same job retrying over and over on a locked restic repo.

Edit: created a new issue on this https://github.com/backube/volsync/issues/1429

tesshuflower commented 1 month ago

I think we never passed in the restic --remove-all flag as it seemed dangerous: it would also remove locks that could be set by someone else accessing the restic repo at the same time.

@samip5 To debug, I'd suggest running with a manual trigger and looking at the pod logs to see if there is any more detail. A restic lock doesn't normally get left behind if restic exits normally, so it sounds like a crash of some sort, or possibly the pod itself getting killed for OOM or some other reason. Another possibility, I suppose, is disk-space issues on the repo or the cache.

To debug in more detail, you could try adding the following annotation to the ReplicationSource:

volsync.backube/enable-debug-mover (with any value, even "").

When the next synchronization runs (you can use a manual trigger to run it immediately), the pod will sit in "debug" mode (it will be idle and not run any backup), and you can exec into it. The pod log has instructions; the mover script will have been copied to /tmp, where you can run it. You can then manually run the script, which will invoke the backup, and see if you get any more detail. You can even modify the script in /tmp before running it and add -vvv to the restic backup command.
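A sketch of both pieces together, the debug annotation plus a manual trigger to kick off a run immediately (the trigger value is arbitrary; restore the schedule trigger afterwards):

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: forego-r2
  namespace: dev
  annotations:
    volsync.backube/enable-debug-mover: ""   # any value enables debug mode
spec:
  sourcePVC: forego-data
  trigger:
    manual: debug-1   # change this string to start another manual sync
  restic:
    repository: forego-volsync-r2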