
file modified check race condition (files cache) #3536

Open · textshell opened this issue 6 years ago

textshell commented 6 years ago

Investigate whether the known-files cache can contain entries with matching stat data but mismatched chunks when multiple files are modified while creating an archive.

Scenario:

Now the cache does not contain (b) but contains (a) with matching stat data but chunks from an older state.

Can this happen? If so, for which scenarios can we fix it?

¹ ctime should also be checked.
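
To make the race concrete, here is a minimal sketch of a stat-based files cache lookup (conceptual only, not borg's actual implementation; `FileCacheEntry` and `lookup` are hypothetical names): if the recorded stat data still matches, the cached chunk list is reused and the file is not re-read. The race discussed above is about the cache holding stat data that matches the current file while the chunk list still describes older content.

```python
import os
from collections import namedtuple

# Hypothetical cache entry; field names are illustrative, not borg's.
FileCacheEntry = namedtuple("FileCacheEntry",
                            "size inode mtime_ns ctime_ns chunk_ids")

def lookup(files_cache: dict, path: str, st: os.stat_result):
    """Return cached chunk ids if the file looks unchanged, else None."""
    entry = files_cache.get(path)
    if entry is None:
        return None  # unknown file -> must be read and chunked
    if (entry.size, entry.inode, entry.mtime_ns, entry.ctime_ns) == (
        st.st_size, st.st_ino, st.st_mtime_ns, st.st_ctime_ns
    ):
        return entry.chunk_ids  # stat matches -> file assumed unchanged
    return None  # stat mismatch -> re-read and re-chunk the file
```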

textshell commented 6 years ago

More thoughts from IRC:

i.e. borg would need to remove everything from the known-files cache that changed since it started, but I don't remember anything that does this. Especially given the logic of not believing that local time and fs-reported file times are even from the same clock.

So how could borg detect the clock value at which it was started? And what happens when different mounts use different clocks? Of course, assuming a global clock, it should be easy to avoid by just not storing known-files cache entries with max(mtime, ctime) > archive start time. For a real fix, the "latest timestamp in backup" exclusion should be preserved as well.
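
Assuming that single global clock, the "do not store fresh entries" idea could look roughly like this (a sketch only; `should_memorize` and the parameter names are made up for illustration, not borg code):

```python
import os

def should_memorize(st: os.stat_result, archive_start_ns: int) -> bool:
    """Only cache entries whose timestamps are safely older than the archive start.

    If max(mtime, ctime) is at or after the start time, the file may have been
    modified while the archive was being created, so a future stat comparison
    cannot be trusted and the entry is simply not remembered.
    """
    return max(st.st_mtime_ns, st.st_ctime_ns) < archive_start_ns
```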

ThomasWaldmann commented 6 years ago

OK, I guess I see the race now.

It requires a file being changed twice within the mtime/ctime granularity (depends on the filesystem, but usually nanoseconds to microseconds):

This cannot happen if we are working on a snapshot (and that is what the existing "kick the latest timestamp from the cache" code had in mind).

I guess it might happen when backing up rather heavily used files without a snapshot, although the time window is rather narrow. But if a file is used that heavily, more mtime/ctime changes are likely, and then the problem would vanish again...
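
The window depends on the effective timestamp resolution of the filesystem. A quick way to get a feel for it (an ad-hoc experiment, not borg code) is to modify a file twice back to back and compare the reported st_mtime_ns values; a delta of 0 means both modifications fell into the same timestamp tick, which is exactly the situation this race needs.

```python
import os
import tempfile

# Write to the same file twice in quick succession and compare the mtimes
# reported by the filesystem. On ext4/xfs the delta is usually nanoseconds
# to microseconds; on coarse-grained filesystems it can be 0.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name

with open(path, "w") as f:
    f.write("first version")
t1 = os.stat(path).st_mtime_ns

with open(path, "w") as f:
    f.write("second version")
t2 = os.stat(path).st_mtime_ns

print("mtime delta (ns):", t2 - t1)  # 0 -> both writes share one timestamp
os.unlink(path)
```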

textshell commented 6 years ago

Yes, snapshots solve the problem, of course. And the window is rather narrow.

Unless you happen to use VFAT, which IIRC has a 2-second resolution or something similarly horrendous.

But users rightfully expect that a later backup run picks up new data unless they are changing file timestamps in bad ways. (Including ctime in the check has reduced that possibility.)

ThomasWaldmann commented 6 years ago

Maybe we can avoid kicking all files with timestamps >= backup_start_time from the files cache.

The critical ones are those with a timestamp in [st - gran, et + gran], with st/et being the start and end of the backup time of that file. Or simply: "it changed while we backed it up", +/- gran.
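
A sketch of that narrower rule, assuming per-file read start/end times are available (the function and parameter names are hypothetical, not borg's):

```python
import os

def in_critical_window(st: os.stat_result,
                       read_start_ns: int, read_end_ns: int,
                       gran_ns: int) -> bool:
    """True if the file's timestamps fall into "it changed while we read it" +/- gran.

    Only entries in this window would be dropped from (or never added to) the
    files cache; everything clearly older or newer could be kept.
    """
    ts = max(st.st_mtime_ns, st.st_ctime_ns)
    return read_start_ns - gran_ns <= ts <= read_end_ns + gran_ns
```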

textshell commented 6 years ago

Yes, assuming a global clock. I think for network filesystems with NTP (or similar) we need max(gran, expected_drift), where expected_drift needs to be decently big (milliseconds? seconds?).

And I think network filesystems with bad clocks would require a huge value of expected_drift, so I expect we don't want to do that (of course those don't work with >= backup_start_time either). And then there is the amount of fighting I seem to do to get people to actually deploy NTP on all systems.

On the other hand, I'm not sure how bad kicking all files with timestamps >= backup_start_time from the files cache would really be. I know we had complaints about kicking the latest timestamp, but that one could be well in the past. This one would at least go away after the next backup.
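
In numbers, the safety margin for the critical window would then be something like the following (the concrete values are illustrative guesses for the sake of discussion, not measured or recommended defaults):

```python
# Safety margin = max(timestamp granularity, expected clock drift).
GRANULARITY_NS = 2 * 10**9       # e.g. a FAT-like 2 s timestamp resolution
EXPECTED_DRIFT_NS = 1 * 10**9    # assumed worst-case drift with NTP in place

safety_margin_ns = max(GRANULARITY_NS, EXPECTED_DRIFT_NS)
print("safety margin:", safety_margin_ns / 1e9, "s")
```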

ioguix commented 5 years ago

Hi guys,

What about kicking files with mtime/ctime >= backup_start_time + granularity out of the cache? I suppose a default granularity of 2000 ms would be safe enough. And maybe making the granularity configurable for network filesystems would be a "good enough" solution?

Maybe such an algorithm could be a new mode for --files-cache, so people who want extreme caution for their backups can choose it even if the backups are slower?
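
A rough sketch of how such a cautious mode could behave (the function name, the option plumbing, and the interpretation of the granularity as a safety margin around backup_start_time are my assumptions, not an actual borg --files-cache mode):

```python
import os

DEFAULT_GRANULARITY_NS = 2_000_000_000  # 2000 ms, per the suggestion above

def keep_in_files_cache(st: os.stat_result, backup_start_ns: int,
                        granularity_ns: int = DEFAULT_GRANULARITY_NS) -> bool:
    """Cautious mode: only remember files clearly untouched before the backup started.

    Anything with mtime/ctime at or after (backup start - granularity) is
    kicked, so the next run re-reads it; a larger granularity could be
    configured for network filesystems with less trustworthy clocks.
    """
    ts = max(st.st_mtime_ns, st.st_ctime_ns)
    return ts < backup_start_ns - granularity_ns
```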