Open jrabinow opened 2 years ago
Let's make sure to merge this PR in after #10 so that merge conflicts are avoided.
There's an issue with this PR which I'm unsure how to solve without your input. This new approach looks in the borg repo and identifies the currently backed-up snapshots to know whether any given snapshot was backed up.
However, this won't work in fault-tolerant mode: if there's an error of some sort, there's a good chance the borg repo is inaccessible, so snapborg can't check whether the snapshot was backed up, and fault-tolerant mode breaks down in this case.
I have a couple of ideas to solve this, but they don't feel very clean. I'm wondering whether a) you could provide me with your service/timer files so I can better understand the context around fault tolerance, and b) you have any ideas on how to solve this; I'd be interested.
Also, a separate issue: I ran into an `Archive $archivename already exists` error while reverting to a previous code version. This tells me there may be issues the first time you run this after the upgrade. Once this is merged in, I recommend running `snapborg clean-snapper` followed by `snapborg backup --recreate` if errors occur.
> There's an issue with this PR which I'm unsure how to solve without your input. This new approach looks in the borg repo and identifies the currently backed-up snapshots to know whether any given snapshot was backed up.
> However, this won't work in fault-tolerant mode: if there's an error of some sort, there's a good chance the borg repo is inaccessible, so snapborg can't check whether the snapshot was backed up, and fault-tolerant mode breaks down in this case.
> I have a couple of ideas to solve this, but they don't feel very clean. I'm wondering whether a) you could provide me with your service/timer files so I can better understand the context around fault tolerance, and b) you have any ideas on how to solve this; I'd be interested.
I'm not sure anymore whether the benefit that fault-tolerant mode brings (being able to work with backup targets that are not permanently reachable while still more or less enforcing regular backups) justifies implementing it within snapborg. If you have a hard limit on the maximum time between backups, you could also use some external monitoring system; and if you don't, then all fault-tolerant mode gives you is some kind of warning that your backup is getting outdated.
So if it makes handling multiple repositories easier, I have no problem with ditching the fault tolerant mode.
Without fault tolerant mode, it could be interesting to define how snapborg should behave if one borg repository is reachable whereas another one is not. Should it be a hard error if any repository fails? Or should it be possible to define "mandatory" and "optional" repositories?
Another option would be to store a list of the repositories each snapshot has already been transferred to alongside it in its snapper userdata. This could lead to synchronization issues (snapborg backup comments would have to be kept in sync with snapper snapshot userdata) but might work quite well. I also think fault-tolerant mode could be kept this way.
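For what it's worth, the userdata idea could look roughly like this. A minimal Python sketch, not snapborg's actual code: the `snapborg_repos` key name is made up, and I'm assuming snapper's `modify --userdata` flag is the right way to write userdata.

```python
import subprocess


def merge_repo(repos, new_repo):
    """Pure helper: record new_repo once, preserving order."""
    return repos if new_repo in repos else repos + [new_repo]


def write_repo_list(config, snapshot, repos):
    """Persist the list as comma-separated userdata via `snapper modify -u`.
    (The 'snapborg_repos' key name is hypothetical.)"""
    subprocess.run(
        ["snapper", "-c", config, "modify",
         "--userdata", "snapborg_repos=" + ",".join(repos), str(snapshot)],
        check=True,
    )
```

After a successful transfer you'd call `merge_repo` and then `write_repo_list`, so the userdata always reflects the repos that actually received the snapshot.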
In general I like the idea of creating a UUID for each snapper snapshot, because it provides an exact one-to-many mapping from snapshots to borg backups (using the snapshot ID, as I suggested in #6, would lead to issues if multiple snapper configs, possibly on different machines, back up to the same borg repository).
> If you have a hard limit on the maximum time between backups, you could also use some external monitoring system. And if you haven't then all that fault tolerant mode gives you is some kind of warning that your backup is getting outdated. So if it makes handling multiple repositories easier, I have no problem with ditching the fault tolerant mode.
Cool :-) But let's see what the options are before ditching it; I don't like the idea of forcing an external monitoring system on people if this is a common use case (which, for the time being, it is 😉)
> Without fault tolerant mode, it could be interesting to define how snapborg should behave if one borg repository is reachable whereas another one is not. Should it be a hard error if any repository fails? Or should it be possible to define "mandatory" and "optional" repositories?
According to the principle of least surprise, there should never be any surprises; with a backup system, the biggest surprise someone can have is attempting to restore from a backup they discover doesn't exist (even if the backup exists elsewhere). Following that train of thought, I think the best option in case of failure would be to keep backing up to the other repos as much as is feasible, but to treat a failure to back up to any repo as a hard failure. Whoever set the system up should investigate why the failure occurred and take steps to mitigate it, which means alerting them properly.
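To make the intent concrete, here's a rough Python sketch of that policy (`backup_one` is a stand-in for snapborg's real per-repo backup logic, which I'm not reproducing here): try every repo, report every failure, and return non-zero if anything failed so the service unit still registers a hard error.

```python
import sys


def backup_all(repos, backup_one):
    """Back up to every reachable repo; treat any failure as a hard error.

    backup_one(repo) stands in for snapborg's actual per-repo backup logic.
    Returns the number of failed repos, suitable as a process exit status.
    """
    failures = []
    for repo in repos:
        try:
            backup_one(repo)
        except Exception as exc:
            failures.append(repo)
            print(f"backup to {repo} failed: {exc}", file=sys.stderr)
    # Non-zero exit status makes the failure visible to systemd/monitoring.
    return len(failures)
```

The caller would end with `sys.exit(backup_all(...))`, so a single unreachable repo still fails the unit even though the other repos were backed up.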
If optional repos are of interest, that could happen too, but it would probably be more appropriate under an optional repo key in the config file.
> Another option would be to store a list of repositories it has already been transferred to alongside each snapper snapshot in its userdata.
This could work without the sync issues you're afraid of, if we agree on the convention that the source of truth is the borg repo - effectively, the snapper data for fault-tolerance would only be looked at if the borg repo were inaccessible. However, I'm worried about referring to individual borg repos in the snapper userdata: if the repo was moved, or the remote repo changed IP address, that could cause failures.
The way I'm thinking about this is that snapborg could save `snapborg_backup_date_${SNAPPER_CONFIG_NAME}=${DATE}` in the userdata, and refer to that if the borg repo can't be accessed. What do you think?
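Sketched in Python (the helper names are mine, and I'm assuming snapper's `modify --userdata` flag is the right way to write the entry):

```python
import datetime
import subprocess


def backup_date_userdata(config, date):
    """Build the userdata entry, e.g. 'snapborg_backup_date_root=2021-11-02'."""
    return f"snapborg_backup_date_{config}={date.isoformat()}"


def record_backup_date(config, snapshot):
    """Attach today's date to the snapshot's userdata, so snapborg can fall
    back on it when the borg repo is unreachable."""
    entry = backup_date_userdata(config, datetime.date.today())
    subprocess.run(
        ["snapper", "-c", config, "modify", "--userdata", entry, str(snapshot)],
        check=True,
    )
```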
Alternatively, if we ditch fault-tolerant mode, there's a need to set up an alerting system. I've never used systemd-alert, but it looks like it could be a good replacement. Assuming it works, of course 😄
Coming back to the config file format for a second: I'm afraid I'm going to have to change it in this PR, simply to list multiple repos in the same entry. I'll do my best to make this backwards-compatible, but I can't promise anything until I've done it.
I will start making changes to the config file when I get the chance, and I'll leave fault-tolerance mode alone for the time being. If you decide that there's a good replacement (such as systemd-alert linked above, or something else), then fault-tolerance can be removed. Otherwise, we'll have to fix fault-tolerant mode so that it works correctly even if the borg repo is unavailable.
This PR solves #6
The mechanism used is the "snapborg id": each snapshot gets a generated UUID, and each borg backup gets the same UUID as a comment. To check whether a snapshot is backed up, compare the UUIDs in the borg repo with those in the snapper userdata: a snapshot is backed up if its UUID appears in the borg repo being looked at.
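A minimal sketch of the check (`borg list --format "{comment}{NL}"` prints one archive comment per line; the `snapborg_id` userdata key is illustrative, not necessarily the name used in the code):

```python
import subprocess


def backed_up_ids(repo):
    """Collect the snapborg ids already stored as archive comments in a repo."""
    out = subprocess.run(
        ["borg", "list", "--format", "{comment}{NL}", repo],
        check=True, capture_output=True, text=True,
    ).stdout
    return {line for line in out.splitlines() if line}


def needs_backup(userdata, repo_ids):
    """A snapshot is backed up iff its uuid appears in the borg repo."""
    sid = userdata.get("snapborg_id")
    return sid is None or sid not in repo_ids
```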
Notes:
> Write the snapper snapshot ID in the borg backup metadata (name) and, on each execution of snapborg, run `borg list` beforehand to determine which snapshots have already been backed up
I considered doing this, but the advantage of using comments is that the code is a lot cleaner and less prone to parsing errors. However, it also means that one must use a command such as `sudo borg list --json --format "{comment}" $REPO` to see the comment, so it's harder for users to know which snapshots are backed up and which ones aren't.
Snapshots backed up before this change only carry the `snapborg_backup=true` metadata, so they will be backed up again. Borg will deduplicate the data and save space, but there will be a one-time performance hit: the first post-upgrade snapborg run will probably take a long time, as it will back up every snapshot that fits within the retention policy.