samip5 opened this issue 1 month ago
@samip5 Is the replication completing successfully? I can't tell from the container log. Do you have the full replicationsource (including the status) from your system? It should update .status.nextSyncTime, which should tell you when it thinks the next sync should be.
It would seem to be the case - a somewhat frequently recurring theme - that it gets into a state where it's locked by an older backup job (which is nowhere to be found anymore) and requires manual intervention to resolve.
It was stuck for quite a while, and it seems the next sync time is also still wrong?
apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  creationTimestamp: "2024-07-28T14:35:32Z"
  generation: 2
  labels:
    kustomize.toolkit.fluxcd.io/name: dev-forego
    kustomize.toolkit.fluxcd.io/namespace: flux-system
  name: forego-r2
  namespace: dev
  resourceVersion: "221149061"
  uid: 6107f4d4-56a4-499a-aaea-f66511200091
spec:
  restic:
    cacheCapacity: 15Gi
    cacheStorageClassName: zfs
    copyMethod: Snapshot
    moverSecurityContext:
      fsGroup: 1000
      runAsGroup: 1000
      runAsUser: 1000
    pruneIntervalDays: 7
    repository: forego-volsync-r2
    retain:
      daily: 7
      weekly: 1
    storageClassName: zfs
    volumeSnapshotClassName: zfs
  sourcePVC: forego-data
  trigger:
    schedule: 0 2 * * *
status:
  conditions:
  - lastTransitionTime: "2024-10-10T02:00:00Z"
    message: Synchronization in-progress
    reason: SyncInProgress
    status: "True"
    type: Synchronizing
  lastSyncDuration: 1762h33m30.311652358s
  lastSyncStartTime: "2024-10-10T02:00:00Z"
  lastSyncTime: "2024-10-10T01:09:02Z"
  latestMoverStatus:
    logs: |-
      using parent snapshot c5de5a39
      Added to the repository: 1.303 MiB (482.964 KiB stored)
      processed 2173 files, 500.240 MiB in 0:42
      snapshot ad360e4d saved
      Restic completed in 1802s
    result: Successful
  nextSyncTime: "2024-10-10T02:00:00Z"
  restic:
    lastPruned: "2024-10-10T01:09:01Z"
It would seem to be the case - a somewhat frequently recurring theme - that it gets into a state where it's locked by an older backup job (which is nowhere to be found anymore) and requires manual intervention to resolve.
I'm not sure I understand this part - is the repo getting locked after a failed job? Is the mover pod getting killed while backing up (from OOM or something else)? If the backup doesn't complete due to an error, volsync will keep retrying, which is potentially why you see it keep attempting.
nextSyncTime: "2024-10-10T02:00:00Z"
It looks to me like the next sync time is set correctly? In this case the replicationsource is still trying to perform the backup, as you can see the lastSyncStartTime is also set to 2am:
lastSyncStartTime: "2024-10-10T02:00:00Z"
If it completes successfully, then it should set the next sync start time to the next one in the schedule.
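To illustrate the scheduling logic with hypothetical values: after a successful 2am run, the status would be expected to advance to something like:

status:
  lastSyncStartTime: "2024-10-10T02:00:00Z"
  lastSyncTime: "2024-10-10T02:30:00Z"   # hypothetical completion time
  nextSyncTime: "2024-10-11T02:00:00Z"   # the next 2am slot from the cron schedule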
It would seem to be the case - a somewhat frequently recurring theme - that it gets into a state where it's locked by an older backup job (which is nowhere to be found anymore) and requires manual intervention to resolve.
I'm not sure I understand this part - is the repo getting locked after a failed job? Is the mover pod getting killed while backing up (from OOM or something else)? If the backup doesn't complete due to an error, volsync will keep retrying, which is potentially why you see it keep attempting.
I really wish I knew how it happens (it does require me to run unlock), but the pod is nowhere to be found, even when it failed, so I cannot figure it out. It results in many pods when it fails: volsync keeps trying and trying but fails due to the lock, and before we know it there are 5+ pods trying to do it, and since the backup itself is not instant, it takes more than 3 minutes to run... The plot thickens.
VolSync's mover uses a job, and should only be running 1 pod at a time. You mention unlock - do you have logs from any of the failed pods indicating the lock is an issue?
If you need to run an unlock, there is now an unlock option for a restic job you could look at - see the unlock backup option here: https://github.com/backube/volsync/issues/1421
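As a rough sketch (based only on the option referenced in that issue; the string value itself is arbitrary), that would look something like this on the ReplicationSource:

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: forego-r2
  namespace: dev
spec:
  restic:
    repository: forego-volsync-r2
    # Arbitrary string; changing the value is what requests a restic unlock
    # before the next backup (per the unlock option referenced above)
    unlock: unlock-2024-10-10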
VolSync's mover uses a job, and should only be running 1 pod at a time.
I think what @samip5 is saying is that when the job fails, a new pod will take its place, fail, create a new lock, and stay in an errored state; another new one will take its place, fail, create a new lock, stay in an errored state, and so on. New pods will be created by the job until the restic repo is in a healthy state without any locks. Once there is a successful backup, all the errored pods will be deleted as the job is cleaned up by volsync.
From what I can tell reading here and here, the way the unlock feature currently works with volsync is that restic only removes locks that are stale (older than 30 minutes, or created by the "same machine"), but there could still be locks that aren't considered stale by restic, which requires restic unlock --remove-all to be run.
It looks like restic unlock --remove-all was previously experimented with, as I see in this comment and in the PR https://github.com/backube/volsync/pull/713
FWIW, with the restic mover I don't know if it's even worth trying to start a new pod after the previous one failed. Is there any way to tweak restartPolicy: Never and backoffLimit: 0 on the job spec? I highly doubt the next pod will succeed if the first one fails, especially with locks coming into play.
Yes, that's what I was trying to say but onedr0p did a better job explaining it.
FWIW, with the restic mover I don't know if it's even worth trying to start a new pod after the previous one failed. Is there any way to tweak restartPolicy: Never and backoffLimit: 0 on the job spec? I highly doubt the next pod will succeed if the first one fails, especially with locks coming into play.
I do see restartPolicy is Never, but backoffLimit is set to 8. Ideally I would like this to be 0, so it doesn't try again until the next schedule and gives me time to unlock it using the spec.restic.unlock option.
@tesshuflower I wonder if we could make use of the pod failure policy feature and restic's exit codes for a locked repo?
e.g. set this on the job...
podFailurePolicy:
  rules:
  - action: FailJob
    onExitCodes:
      containerName: restic
      operator: In
      values: [11]
That should mark the job as failed so it isn't retried until the next volsync schedule, instead of the same job retrying over and over on a locked restic repo.
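For illustration only, a hypothetical sketch of how backoffLimit: 0 and that podFailurePolicy might sit together on the mover Job (the Job name and container image below are placeholders, not what VolSync actually generates, and podFailurePolicy needs a reasonably recent Kubernetes version):

apiVersion: batch/v1
kind: Job
metadata:
  name: volsync-src-forego-r2   # placeholder name for illustration
spec:
  backoffLimit: 0               # don't retry failed pods within the same sync
  podFailurePolicy:
    rules:
    - action: FailJob
      onExitCodes:
        containerName: restic
        operator: In
        values: [11]            # restic's exit code for a locked repository
  template:
    spec:
      restartPolicy: Never      # required for exit-code based pod failure policies
      containers:
      - name: restic
        image: quay.io/backube/volsync:latest   # placeholder; the real mover image and args are set by volsync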
Edit: created a new issue on this https://github.com/backube/volsync/issues/1429
I think we never passed in the restic --remove-all flag as it seemed dangerous - this would also remove locks that could be set by someone else who's accessing the restic repo at the same time.
@samip5 To debug, I'd suggest running with a manual trigger and looking at the pod logs to see if there is any more detail. A restic lock doesn't normally get left behind if restic exits normally, so it sounds like a crash of some sort, or possibly the pod itself getting killed for OOM or some other reason. Another possibility, I suppose, is disk space issues on the repo or the cache.
To debug in more detail, you could try adding the following annotation to the ReplicationSource: volsync.backube/enable-debug-mover (with any value, even ""). When the next synchronization runs (you can use a manual trigger to run it immediately), the pod will sit in "debug" mode (it will be idle and not run any backup), and you can then exec into it. The pod log has instructions, but the mover script will have been copied to /tmp, where you can run it. You could then manually run the script, which will invoke the backup, and see if you get any more details. You can even modify the script in /tmp before running it and add -vvv to the restic backup command.
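Putting that together as a rough sketch (the annotation is the one mentioned above; the manual trigger string is arbitrary, assuming spec.trigger.manual can be used here to kick off an immediate sync in place of the cron schedule):

apiVersion: volsync.backube/v1alpha1
kind: ReplicationSource
metadata:
  name: forego-r2
  namespace: dev
  annotations:
    # Any value works; the presence of the annotation puts the mover pod into debug mode
    volsync.backube/enable-debug-mover: ""
spec:
  trigger:
    # Arbitrary string; setting or changing it requests an immediate sync
    manual: debug-2024-10-10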
Describe the bug
I have specified a trigger schedule to back up at 2AM, but it seems it's not honored?

Steps to reproduce
Set a replicationsource with a trigger schedule of 0 2 * * *

Expected behavior
I expected it to honor my trigger.

Actual results
It doesn't seem to honor it.

Additional context
I'm using Volume Populator, and my replication source is here:
Logs that show the number of snapshots: https://gist.github.com/samip5/b7a9bdf8345a5957cb659718812ffd6c
It seems that for whatever reason, it's not following the trigger, as you can clearly see from:
So why would that be?