More thoughts from IRC:
I.e. borg would need to remove everything from the known-files cache that changed since it started, but I don't remember any code that does this. Especially since the logic does not assume that local time and fs-reported file times even come from the same clock.
So how could borg detect the clock value at which it was started? And what happens when different mounts use different clocks? Of course, assuming a global clock, it should be easy to avoid by just not storing known-files cache entries with max(mtime, ctime) > archive start time. For a real fix, the existing exclusion of the latest in-backup timestamp should be preserved as well.
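To make that concrete, here is a minimal sketch of the "don't memorize entries newer than the archive start time" idea, assuming a single global clock; may_memorize and archive_start_ns are illustrative names, not borg internals:

```python
import os
import time

# Taken once, when the backup (archive creation) starts.
archive_start_ns = time.time_ns()

def may_memorize(st: os.stat_result) -> bool:
    """Only store a files-cache entry if the file's timestamps are not newer
    than the archive start time, i.e. drop entries where
    max(mtime, ctime) > archive start time."""
    return max(st.st_mtime_ns, st.st_ctime_ns) <= archive_start_ns
```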
OK, I guess I see the race now.
It requires a file being changed twice within the mtime/ctime granularity (depends on the filesystem, but usually nanoseconds to microseconds):
This cannot happen if we are working on a snapshot (and this is what the existing "kick the latest timestamp from the cache" code had in mind).
I guess it might happen when backing up rather heavily used files without a snapshot, though the time window is rather narrow. But if a file is used that heavily, more mtime/ctime changes are likely and then the problem would vanish again...
Yes, snapshots solve the problem of course. And the window is rather narrow.
Unless you happen to use VFAT, which IIRC has a 2-second resolution or something horrendous like that.
But users rightfully expect that a later backup run picks up new data unless they are changing file timestamps in bad ways (ctime inclusion has reduced that possibility).
Maybe we can avoid kicking all files with timestamps >= backup_start_time from the files cache.
The critical ones are the ones with a timestamp in [st - gran, et + gran], with st/et being the start and end of the backup time of that file. Or simply "it changed while we backed it up" +/- gran.
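As a rough sketch (illustrative names only, not borg code), the per-file check could look like this, with gran_ns standing for the assumed timestamp granularity / clock-difference margin:

```python
import os

def changed_while_backing_up(st: os.stat_result,
                             file_backup_start_ns: int,
                             file_backup_end_ns: int,
                             gran_ns: int) -> bool:
    """True if the file's newest timestamp falls into [st - gran, et + gran],
    i.e. it (may have) changed while we backed it up, +/- gran."""
    newest_ns = max(st.st_mtime_ns, st.st_ctime_ns)
    return (file_backup_start_ns - gran_ns) <= newest_ns <= (file_backup_end_ns + gran_ns)
```

Only files for which this returns True would need to be kicked from (or never stored in) the files cache; all other entries could stay.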
Yes, assuming a global clock. I think for network filesystems with NTP (or similar) we need max(gran, expected_drift), where expected_drift needs to be decently big (milliseconds? seconds?).
And I think network filesystems with bad clocks would require a huge value of expected_drift, so I expect we don't want to do that (of course those don't work with >= backup_start_time either). And then there is the amount of fighting I seem to do to get people to actually deploy NTP on all systems.
On the other hand, I'm not sure how bad kicking all files with timestamps >= backup_start_time from the files cache would really be. I know we had complaints about kicking the latest timestamp, but that one could be well in the past; this one would at least go away after the next backup.
Hi guys,
What about kicking out of the cache all files with mtime/ctime >= backup_start_time + granularity? I suppose a default granularity of 2000ms would be safe enough. And maybe making the granularity a setting for network filesystems would be a "good enough" solution?
Maybe such an algorithm could be a new mode for --files-cache, so people who want extreme caution for their backups can choose it even if the backups get slower?
For "changed while being backed up" cases that borg2 can detect, it raises a BackupError internally and does MAX_RETRIES retries in _process_any (with increasing wait times in between).
Currently, the detection is "ctime after reading != ctime before reading", so this does not catch the issue from the top post yet.
But I guess the detection could be extended to also cover the case where the ctime falls into the "while borg was backing up THIS file" interval (+/- some safety margin for clock differences and granularity).
If the retries don't help, borg does not write an entry for the file into the archive, logs a non-fatal warning about that file and continues with the next file.
See link directly above this post.
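Roughly, the detect-and-retry flow described above looks like the following sketch (illustrative only, not borg2's actual _process_any; read_and_chunk stands in for reading and chunking the file):

```python
import os
import time

MAX_RETRIES = 3

class BackupError(Exception):
    pass

def backup_once(path: str, read_and_chunk) -> None:
    ctime_before = os.stat(path).st_ctime_ns
    read_and_chunk(path)
    if os.stat(path).st_ctime_ns != ctime_before:
        # "ctime after reading != ctime before reading"
        raise BackupError(f"{path} changed while being backed up")

def backup_with_retries(path: str, read_and_chunk) -> bool:
    for attempt in range(MAX_RETRIES):
        try:
            backup_once(path, read_and_chunk)
            return True
        except BackupError:
            # wait a bit longer before each retry
            time.sleep(0.1 * (attempt + 1))
    # Retries did not help: write no archive entry for this file,
    # log a non-fatal warning and continue with the next file.
    print(f"warning: {path} kept changing while being backed up, skipped")
    return False
```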
I first tried with a much bigger TIME_DIFFERS_NS (like 3s), but that of course led to massive issues with the tests, because they quickly create files and then run borg on them, so the file ctime could easily be within that widened interval.
Thus, with the smaller value, it does not detect the race on FAT (2s granularity) or with out-of-sync clocks (more than 20ms difference).
If we want to deal with bigger granularity / unsynced clocks, we really need to kick everything from the files cache that has timestamps after backup_start - quite_some_seconds. That also has issues, e.g. if you shut down a VM or a database and immediately start a backup of their files, they will already fall into that interval. A counteraction could be a longer sleep between the shutdown and starting the backup.
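A sketch of that coarse fallback (the names and the 30-second value are just for illustration, not borg's defaults):

```python
QUITE_SOME_SECONDS_NS = 30 * 1_000_000_000  # must cover FAT granularity + clock skew

def prune_files_cache(files_cache: dict, backup_start_ns: int) -> dict:
    """Keep only entries whose timestamps are safely older than backup start."""
    cutoff_ns = backup_start_ns - QUITE_SOME_SECONDS_NS
    return {
        path: entry
        for path, entry in files_cache.items()
        if max(entry["mtime_ns"], entry["ctime_ns"]) < cutoff_ns
    }
```

Anything pruned this way simply gets re-read and re-chunked on the next run, so the cost is performance, not correctness.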
A completely different way to deal with this ticket would be to just close it as wontfix, because it is not borg's problem if you back up live filesystems instead of using a snapshot.
I also implemented the "discard files cache entries since start_of_whole_backup - delta_t" approach in the same PR. I need to do some testing first, then I will push it to GitHub.
Investigate whether the known-files cache can contain entries with matching stat data but mismatched chunks when multiple files are modified while creating an archive.
Scenario:
Now the cache does not contain (b), but contains (a) with matching stat data but chunks from an older state.
Can this happen? If so, for which scenarios can we fix it?
¹ ctime should also be checked.
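To make the failure mode in the top post concrete, here is a rough sketch of the kind of stat-data comparison a files cache performs (illustrative names and entry layout, not borg's actual cache format); the bug scenario is exactly a cache hit whose stored chunk list belongs to older file content:

```python
import os

def cache_hit(entry: dict, st: os.stat_result) -> bool:
    """File is assumed unchanged if all stat data matches the cached entry."""
    return (
        entry["size"] == st.st_size
        and entry["inode"] == st.st_ino
        and entry["mtime_ns"] == st.st_mtime_ns
        and entry["ctime_ns"] == st.st_ctime_ns  # ctime checked as well
    )

def chunks_for(path: str, files_cache: dict, read_and_chunk):
    st = os.stat(path)
    entry = files_cache.get(path)
    if entry is not None and cache_hit(entry, st):
        return entry["chunk_ids"]   # reuse cached chunk list; stale if content changed unnoticed
    return read_and_chunk(path)     # otherwise (re)read and chunk the file
```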