Open safinaskar opened 2 years ago
I'm curious as to why you use touch -r. You're deliberately modifying a file, then telling the system to pretend that it hasn't been modified by setting the mtime to the previous value?
Anyway, you're correct that tarsnap will avoid reading a file whose inode number, size, and mtime haven't changed, as noted on https://www.tarsnap.com/efficiency.html.
If you want to disable the cache entirely, you could use the (admittedly non-obviously-named) --verylowmem option.
(This issue arose in another context a few months ago, so I was considering adding some documentation about this.)
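The skip rule described in this comment can be sketched roughly as follows. This is a minimal illustration in Python, not tarsnap's actual code: the names Fingerprint, fingerprint, and looks_unchanged are hypothetical, and the only detail taken from the thread is that inode number, size, and mtime are compared.

```python
import os
from typing import NamedTuple


class Fingerprint(NamedTuple):
    """Hypothetical cache record holding the three fields named in the thread."""
    inode: int
    size: int
    mtime_ns: int


def fingerprint(path: str) -> Fingerprint:
    """Collect inode number, size, and mtime for a path."""
    st = os.stat(path)
    return Fingerprint(st.st_ino, st.st_size, st.st_mtime_ns)


def looks_unchanged(cached: Fingerprint, path: str) -> bool:
    """True if the metadata matches, so a metadata-only check would skip the file.

    The file's contents are never read here, which is where both the speed
    and the blind spot discussed in this thread come from.
    """
    return cached == fingerprint(path)
```

A backup tool built this way would store one such record per file after each run and call looks_unchanged on the next run before deciding whether to read the file.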
I'm curious as to why you use touch -r. You're deliberately modifying a file, then telling the system to pretend that it hasn't been modified by setting the mtime to the previous value?
Yes. I often record video from the screen using ffmpeg -f x11grab and similar tools. Sometimes I want to re-encode a video with better compression, and I intentionally keep the metadata the same using touch -r, so that I can still see that the video was created long ago.
If you want to disable the cache entirely, you could use the (admittedly non-obviously-named) --verylowmem option.
I tried to find such an option. I opened https://www.tarsnap.com/man-tarsnap.1.html and used "find in page" to search for the strings "checksum", "mtime" and "ctime", but found nothing. It would be great if you added something like this to the --verylowmem entry in the man page: "this option doesn't look at mtime".
There are programs that manipulate mtime. I know at least one of them, xz, and I think there are many more. xz creates the compressed file with the same mtime as the source file. I can construct a scenario in which this can be exploited to trick tarsnap. Here is the full log (I use tarsnap 1.0.40, Linux kernel 4.19, and an ext4 file system; it is easy to get the same inode number on ext4):
user@comp:~/exp$ echo a > a
user@comp:~/exp$ xz -0 a
user@comp:~/exp$ stat a.xz
File: a.xz
Size: 60 Blocks: 8 IO Block: 4096 regular file
Device: 81ah/2074d Inode: 33988611 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ user) Gid: ( 1000/ user)
Access: 2022-05-12 23:41:37.429926508 +0300
Modify: 2022-05-12 23:41:37.429926508 +0300
Change: 2022-05-12 23:41:39.305922434 +0300
Birth: -
user@comp:~/exp$ md5sum a.xz
0da74eb254586cec9888dd700905795c a.xz
user@comp:~/exp$ sudo tarsnap -c -f t-2022-05-11-18 --cachedir /usr/local/tarsnap-cache-2 --keyfile /root/tarsnap-2.key a.xz
Total size Compressed size
All archives 142643279 134351393
(unique data) 58562759 49934797
This archive 2599 1872
New data 2599 1872
user@comp:~/exp$ xz -d a.xz
user@comp:~/exp$ xz -1 a
user@comp:~/exp$ stat a.xz
File: a.xz
Size: 60 Blocks: 8 IO Block: 4096 regular file
Device: 81ah/2074d Inode: 33988611 Links: 1
Access: (0644/-rw-r--r--) Uid: ( 1000/ user) Gid: ( 1000/ user)
Access: 2022-05-12 23:41:44.857910380 +0300
Modify: 2022-05-12 23:41:37.429926508 +0300
Change: 2022-05-12 23:42:48.873771880 +0300
Birth: -
user@comp:~/exp$ md5sum a.xz
fa25f1983bf7153c34a11ca1edf964dc a.xz
user@comp:~/exp$ sudo tarsnap -c -f t-2022-05-11-19 --cachedir /usr/local/tarsnap-cache-2 --keyfile /root/tarsnap-2.key a.xz
Total size Compressed size
All archives 142645878 134353265
(unique data) 58563278 49935908
This archive 2599 1872
New data 519 1111
user@comp:~/exp$ mkdir extract
user@comp:~/exp$ cd extract/
user@comp:~/exp/extract$ sudo tarsnap -x -f t-2022-05-11-19 --cachedir /usr/local/tarsnap-cache-2 --keyfile /root/tarsnap-2.key
user@comp:~/exp/extract$ md5sum ~/exp/a.xz ~/exp/extract/a.xz
fa25f1983bf7153c34a11ca1edf964dc /home/user/exp/a.xz
0da74eb254586cec9888dd700905795c /home/user/exp/extract/a.xz
So, I think the current tarsnap behavior is simply wrong: tarsnap can easily be tricked by such innocent tools as xz. So, from a correctness point of view, the default tarsnap behavior should be changed to always use checksums. But I think this would cause performance degradation, so I propose keeping the default behavior as is, adding an option to always use checksums, and telling everyone to nearly always use it. And of course, if the file system has its own checksumming or some kind of integrity checking, then tarsnap should use it.
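The "always checksum" idea can be illustrated with a small sketch. It assumes a POSIX-like system and is not how tarsnap or rsync implement anything. The helper names sha256_of and demo are hypothetical. The sketch rewrites a file in place with same-size content, restores the old mtime (similar in spirit to touch -r), and then compares what a metadata check and a content hash each report.

```python
import hashlib
import os
import tempfile


def sha256_of(path, bufsize=1 << 16):
    """Hash the whole file in blocks (hypothetical helper)."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(bufsize), b""):
            h.update(block)
    return h.hexdigest()


def demo():
    """Return (metadata_same, content_differs) for a file changed 'quietly'."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "a.bin")
        with open(path, "wb") as f:
            f.write(b"level-0 output")
        old_stat, old_hash = os.stat(path), sha256_of(path)

        # Overwrite in place: same inode, same length, different bytes.
        with open(path, "r+b") as f:
            f.write(b"level-1 output")
        # Put the original timestamps back.
        os.utime(path, ns=(old_stat.st_atime_ns, old_stat.st_mtime_ns))

        new_stat, new_hash = os.stat(path), sha256_of(path)
        metadata_same = (
            (old_stat.st_ino, old_stat.st_size, old_stat.st_mtime_ns)
            == (new_stat.st_ino, new_stat.st_size, new_stat.st_mtime_ns)
        )
        return metadata_same, old_hash != new_hash
```

The cost of the hash-based check is that every file must be read in full on every run, which is the performance concern raised in this comment.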
I'm tempted to say "don't do that then" -- assuming that "unmodified inode" means "unmodified file" is very common, and as you note it provides a very large performance benefit. In your case of recompressing a video, tarsnap will do the right thing anyway, since in addition to the modification time it also checks if the file size has changed.
But yes, we could add an --always-read-files option. This would actually be faster than relying on --verylowmem, since tarsnap can compare data against the chunkification cache and bypass the CPU-intensive chunking step.
I found a perfect solution! Just always use ctime instead of mtime. I was not able to find a way to set ctime to some past date. And modern versions of borg-backup use ctime instead of mtime for this purpose; here is their rationale: https://borgbackup.readthedocs.io/en/stable/usage/create.html#borg-create
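A quick way to see the difference between mtime and ctime is the sketch below. It assumes a POSIX system, where st_ctime is the inode change time (on Windows it means something else), and a file system whose timestamp granularity is finer than the pause used. The function name demo_ctime is hypothetical.

```python
import os
import tempfile
import time


def demo_ctime(pause=0.05):
    """Return (mtime_same, ctime_changed) after a rewrite plus mtime restore."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "f")
        with open(path, "wb") as f:
            f.write(b"aaaa")
        before = os.stat(path)

        # Let the clock move past the file system's timestamp granularity.
        time.sleep(pause)

        with open(path, "r+b") as f:
            f.write(b"bbbb")
        # Restoring mtime is allowed; the kernel still bumps ctime to "now".
        os.utime(path, ns=(before.st_atime_ns, before.st_mtime_ns))

        after = os.stat(path)
        return (before.st_mtime_ns == after.st_mtime_ns,
                before.st_ctime_ns != after.st_ctime_ns)
```

On such a system the restored mtime hides the edit while ctime still records it, which matches the Change: lines in the stat output earlier in the thread.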
Wontfix would be my vote.
UNIX™ file management and backup philosophy would lean to using the filesystem as the ultimate source of truth.
if (file_now->nodump) return NOPE;
if (file_now->mtime_in_tai64 <= last->mtime_in_tai64) return NOPE;
if (file_now->size == last->size) {
    file_now->hash = hash(file_now);
    if (file_now->hash == last->hash)
        return NOPE;
}
return OK_FINE_I_WILL_BACK_THIS_UP;
Mitigating this category of edge case would require exhaustively scanning every byte of content and/or hooking fs change notifications for continuous or periodic background sync tracking, both very expensive in different ways compared to the added value. ROI approaching epsilon.
As far as I understand, tarsnap takes file metadata into account when deciding whether to skip a file. Unfortunately, this means that tarsnap can wrongly skip a modified file if its metadata was changed in an unusual way. I was able to change file metadata using touch -r and trick tarsnap into not making a backup. Here is the full log (b is a file with randomly generated content):
I use tarsnap 1.0.40.
So, it seems any future tarsnap invocations will not back up the new c version.
This breaks my workflow, because I actually sometimes use the touch -r command. Also, I think it is possible that some broken utilities may create files with wrong metadata, and I want my backup software to be absolutely reliable in such cases.
So, please always checksum files or add an option to force checksumming. rsync has such an option; it is named --checksum.