Closed kosch closed 6 years ago
Root cause was an invalid cache, apparently caused by creating the LV snapshot while the ext4 journal still had uncommitted entries. For ext4 the commit interval was reduced, and for xfs, xfs_freeze was introduced before creating the snapshot. Newly created backups did not show similar problems. I'll keep testing next week, but it looks like it's not a borg issue.
I'm a bit worried about bit rot and viruses changing the data within files (there was a pretty bad one that would encrypt all of your treasured family photos at the disk level without changing the timestamps).
So I'm in the process of setting up a script that runs "create --files-cache=rechunk,ctime" and re-reads ALL files, so that I at least have an old copy and a new copy of the contents of every file. The script will update a single large directory per hour (rechunking takes a long time, and I don't want to do the entire filesystem at once), and I'm hoping it'll catch any problems like the one you had. When the script is stable, I'll post it somewhere.
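A per-directory rechunk loop might look roughly like this; the repo path, queue file, and archive naming are placeholders I made up, not from the actual script:

```shell
#!/bin/sh
# Hypothetical hourly rechunk job: re-read and rechunk one directory
# per run, rotating through a queue file. All paths are placeholders.
REPO=/backup/borg-repo
QUEUE=/var/lib/rechunk-queue

# Take the first directory from the queue.
DIR=$(head -n1 "$QUEUE")

# Re-read and rechunk everything under it, bypassing the files cache.
borg create --files-cache=rechunk,ctime \
    "$REPO::rechunk-$(basename "$DIR")-{now:%Y-%m-%d}" \
    "$DIR"

# Move the processed directory to the end of the queue.
{ tail -n +2 "$QUEUE"; echo "$DIR"; } > "$QUEUE.tmp"
mv "$QUEUE.tmp" "$QUEUE"
```

Run hourly from cron, this spreads the expensive rechunk over many days instead of hammering the whole filesystem at once.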
borg 1.1 uses ctime by default, which can't be set to arbitrary values from userspace. So maybe you don't need rechunk,ctime, but can just use the defaults?
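The point about ctime is easy to verify: backdating a file's mtime with touch still bumps its ctime to "now", so a ctime-based files cache will notice the change. A quick demonstration (the filename is arbitrary; requires GNU stat):

```shell
# Show that ctime cannot be reset from userspace: backdating mtime
# with touch still updates ctime to the current time.
f=$(mktemp)
old_ctime=$(stat -c %Z "$f")
sleep 2
# Backdate the modification time, as a crypto trojan might:
touch -m -d '2000-01-01 00:00:00' "$f"
new_ctime=$(stat -c %Z "$f")
# ctime moved forward even though mtime now points to the past.
[ "$new_ctime" -gt "$old_ctime" ] && echo "ctime changed despite backdated mtime"
rm -f "$f"
```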
bit rot: if a file was OK and then bit-rots, borg would not notice the change and would not back it up again. But in that case that is fine, because the backup still holds the intact copy.
crypto trojan: if malware encrypts your files and resets the timestamps (from userspace this is only possible for mtime and atime, not ctime), borg in --files-cache=mtime mode would not back up the encrypted file. But the encrypted file is useless anyway, isn't it?
I've located the issue that causes an invalid cache. Scenario: ext4 filesystem on a logical volume, with MongoDB running. MongoDB's fsyncLock is executed, but file handles to the database files are kept open. Even when waiting until ext4 commits its changes, the ctime may not change until these handles are closed. Inode and size of the files do not change either (they are preallocated by MongoDB). When the LVM snapshot is then created and borg runs over the files of the filesystem, the ctime is still the old one. My workaround for ext4 is now simply to "touch" these opened files on the read-write mounted snapshot before performing borg create. I don't use the rechunk cache option in this case, because other areas (not only the MongoDB files) are also included in this borg create run. The created archives are now always valid.
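The workaround described above might look roughly like this; volume group, snapshot name, mount points, and the repo path are placeholders, not taken from the actual setup:

```shell
# Hypothetical ext4 snapshot-plus-touch sequence; all names below are
# placeholders, not from the original configuration.
lvcreate --snapshot --size 5G --name dc-snap /dev/vg0/dc-storage
mount /dev/vg0/dc-snap /mnt/dc-snap      # read-write mount

# Force a ctime update on the still-open, preallocated database files
# so borg's ctime,size,inode files cache does not treat them as unchanged.
find /mnt/dc-snap/var/lib/mongo -type f -exec touch {} +

borg create --compression lz4 --files-cache ctime,size,inode \
    /backup/borg-repo::'{hostname}-{now}' /mnt/dc-snap

umount /mnt/dc-snap
lvremove -f /dev/vg0/dc-snap
```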
Luckily this does not happen with xfs; xfs_freeze + LVM snapshot work very well together.
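For reference, the xfs variant is just a freeze/snapshot/thaw bracket; the device and mount-point names here are placeholders:

```shell
# Hypothetical xfs sequence: quiesce the filesystem, snapshot, thaw.
xfs_freeze -f /mnt/dc-storage                                  # freeze writes
lvcreate --snapshot --size 5G --name dc-snap /dev/vg0/dc-storage
xfs_freeze -u /mnt/dc-storage                                  # thaw

# The snapshot shares the origin's filesystem UUID, so xfs needs
# -o nouuid to mount it alongside the original.
mount -o ro,nouuid /dev/vg0/dc-snap /mnt/dc-snap
```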
So I guess this ticket can be closed.
In the initial ticket you said "mongodb stopped".
Now it sounds like you had mongodb still running.
If the latter is correct, please edit the top post so that, in case somebody stumbles over it (even though closed), the ticket reflects what in fact happened.
I investigated an issue where restore tests of database files fail because the restored files are in an inconsistent state.

Situation: ext4 (data=journal) filesystem on a logical volume on CentOS 6 hosts. The database files belong to MongoDB. borg is version 1.1.4.

Steps done to create an archive:
1) MongoDB running, db.fsyncLock()
2) Filesystem buffers are synced
3) LV snapshot is created
4) borg create --compression lz4 --files-cache ctime,size,inode ...
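The steps above, sketched as commands (device, mount point, and repo names are placeholders; the fsyncUnlock step is my assumption about what typically follows the snapshot):

```shell
# Hypothetical command sequence for the four steps; names are placeholders.
mongo --eval 'db.fsyncLock()'                 # 1) flush and block writes
sync                                          # 2) sync filesystem buffers
lvcreate --snapshot --size 5G \
    --name dc-snap /dev/vg0/dc-storage        # 3) create the LV snapshot
mongo --eval 'db.fsyncUnlock()'               #    unlock once the snapshot exists
mount /dev/vg0/dc-snap /mnt/dc-snap
borg create --compression lz4 \
    --files-cache ctime,size,inode \
    /backup/borg-repo::'{hostname}-{now}' \
    /mnt/dc-snap                              # 4) back up from the snapshot
```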
Problem: At least one file was different after extract and caused mongod to report an error on startup. I mounted the archive and compared the files:

Original file:
-rw-------. 1 mongod mongod 2146435072 11. Jan 13:50 /mnt/dc-storage/var/lib/mongo/0.9

File inside the mounted archive:
-rw-------. 1 mongod mongod 2146435072 11. Jan 13:50 /mnt/dc-backup-restorepoint/var/lib/mongo/0.9

Compared:
cmp -b /mnt/dc-storage/var/lib/mongo/0.9 /mnt/dc-backup-restorepoint/var/lib/mongo/0.9
/mnt/dc-storage/var/lib/mongo/0.9 /mnt/dc-backup-restorepoint/var/lib/mongo/0.9 differ: byte 2134909114, line 23742726 is 63 3 41 !
The file inside the LV snapshot was identical to the original one, so the problem seemed to be located in borg's caching. First I tried to create additional archives, but the file inside the archive stayed the same. Then I "touch"ed this file and created another archive, and now the file is also valid inside the archive.
Do you have any advice on how to avoid this situation? Avoiding the cache is not really a solution, because the source data can be several hundred GBs.