RJVB / afsctool

This is a version of "brkirch"'s afsctool utility that allows end-users to leverage HFS+ compression.
https://brkirch.wordpress.com/afsctool
GNU General Public License v3.0

"0 byte" files after specifying the same directory twice #32

Closed: biziclop closed this issue 5 years ago

biziclop commented 5 years ago

Hi! I found afsctool yesterday and generally it works great! Thanks for keeping it alive!
Because it's so useful I tried to make it into an Automator service to be available from Finder's contextual menu. Unfortunately I fucked it up a bit:
[screenshot: the Automator workflow]
The Get Selected Finder Items action (which I added to make testing the service easier, as suggested by Automator itself) is unnecessary, because it adds the very same items to the argument list a second time, so every path gets passed twice.

This way the command that actually got executed was the following (fortunately I used a non-important directory):
afsctool -cvvv -9 -J8 ~/.cache ~/.cache
I had to kill afsctool manually because it never finished. When I looked into the directory afterwards, it had many files that the Finder reported as "Zero bytes". Examining them with xattr, they still had a non-empty com.apple.decmpfs attribute, so I guess the data was still there in them, but afsctool attempted to process them concurrently:

xattr -l ~/.cache2/fontconfig/<random chars>.cache-4 
com.apple.decmpfs:
00000000  66 70 6D 63 03 00 00 00 C8 1B 00 00 00 00 00 00  |fpmc............|
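(Side note, for reference only: assuming the standard com.apple.decmpfs on-disk header layout from XNU's decmpfs.h, those first 16 bytes still record a non-zero original file size. The snippet below is just an illustration of that layout, not afsctool code.)

/* Illustration only (not afsctool code): decode the 16 header bytes shown
 * above using the com.apple.decmpfs on-disk layout from XNU's decmpfs.h.
 * All fields are little-endian on disk. */
#include <stdint.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    uint32_t compression_magic;   /* 'cmpf', appears as "fpmc" in a hex dump */
    uint32_t compression_type;    /* 3 = zlib, payload stored in the xattr itself */
    uint64_t uncompressed_size;   /* the original (uncompressed) file size */
} decmpfs_disk_header;

int main(void)
{
    const unsigned char raw[16] = {
        0x66, 0x70, 0x6D, 0x63,  0x03, 0x00, 0x00, 0x00,
        0xC8, 0x1B, 0x00, 0x00,  0x00, 0x00, 0x00, 0x00,
    };
    decmpfs_disk_header hdr;
    memcpy(&hdr, raw, sizeof hdr);
    /* On a little-endian Mac this prints type=3, uncompressed_size=7112:
     * the header still claims a 7112-byte original even though Finder
     * reports "Zero bytes". */
    printf("type=%u uncompressed_size=%llu\n",
           hdr.compression_type, (unsigned long long)hdr.uncompressed_size);
    return 0;
}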

Suggestion:
I have no clue how to work with files atomically, but to make accidental concurrent processing safer, it could be useful to rename each file before processing it and rename it back once compression is done (a rough sketch of the idea follows below). Maybe a naming template like this could be used:
original-file.txt-afsc-<process id>-<thread number>-<per thread incrementing number>-<and maybe a random number too>

There could also be a new command line argument that scans for files left with such temporary names and attempts to repair them.
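(Just to make the idea concrete, here is a hypothetical sketch of that rename-based claim; the names afsc_claim/afsc_release and the exact temporary-name format are made up and nothing here is actual afsctool code.)

/* Hypothetical sketch of the suggested rename-before-processing scheme.
 * A worker "claims" a file by renaming it to a unique temporary name;
 * a second worker trying to claim the same path fails the rename and
 * skips the file. rename(2) is atomic within one filesystem, so only
 * one worker can win. The original name is restored when done. */
#include <errno.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* build "<path>-afsc-<pid>-<thread>-<seq>" and rename the file to it */
static bool afsc_claim(const char *path, int thread, long seq,
                       char *tmpName, size_t tmpLen)
{
    snprintf(tmpName, tmpLen, "%s-afsc-%d-%d-%ld",
             path, (int)getpid(), thread, seq);
    if (rename(path, tmpName) != 0) {
        /* ENOENT here most likely means another worker claimed it first */
        return false;
    }
    return true;
}

/* give the file its original name back after compression finished */
static void afsc_release(const char *path, const char *tmpName)
{
    if (rename(tmpName, path) != 0) {
        fprintf(stderr, "could not restore %s: %s\n", path, strerror(errno));
    }
}

A recovery pass (the suggested new command line argument) could then simply look for leftover "-afsc-" names and rename them back.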

Thank you!

RJVB commented 5 years ago

I don't think it ever occurred to me to reject duplicates from the list of files to process (but then again, maybe I did!).

Either way, a priori afsctool shouldn't be able to process the same file multiple times at once, because it opens files in exclusive mode. That's intended as a protection against processing files that are currently open by someone else (imperfect, because there is no guarantee that that someone else also used exclusive mode). That said, I don't actually know whether O_EXLOCK gives any protection when the same file is opened multiple times from within one process!
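(A quick standalone check could answer that; the sketch below only probes the behaviour and is not anything from afsctool.)

/* Sketch only: probe whether O_EXLOCK blocks a second open of the same
 * file from within one process. O_NONBLOCK makes a conflicting lock fail
 * with EAGAIN/EWOULDBLOCK instead of hanging. O_EXLOCK is a BSD/macOS
 * extension, not POSIX. */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    const char *path = (argc > 1) ? argv[1] : "testfile";

    int fd1 = open(path, O_RDWR | O_EXLOCK | O_NONBLOCK);
    if (fd1 < 0) {
        fprintf(stderr, "first open failed: %s\n", strerror(errno));
        return 1;
    }
    printf("first open with exclusive lock OK (fd=%d)\n", fd1);

    int fd2 = open(path, O_RDWR | O_EXLOCK | O_NONBLOCK);
    if (fd2 < 0) {
        /* EAGAIN here would mean the lock does protect against a
         * second open from the same process */
        fprintf(stderr, "second open failed: %s\n", strerror(errno));
    } else {
        printf("second open also succeeded (fd=%d): no intra-process protection\n", fd2);
        close(fd2);
    }
    close(fd1);
    return 0;
}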

Regardless of all that, you chose to compress a directory that might have open files in it, or where other processes could be doing things with files while you are working on them too. The fontconfig cache files are examples of that.

Afsctool tries very hard to be safe and "transparent" for directories and files that are "offline" (unused) when they're being compressed, but isn't designed for any other kind of use. That's basically impossible anyway (OS X has no reliable and, above all, no efficient way of telling if a file is open by someone else; this is probably also why ejecting a disk that has in-use files on it is so cumbersome and can really rev up the fans).

biziclop commented 5 years ago

Maybe file locking only locks the main data fork, but not the other forks/xattrs/etc?

Anyway, I made a simple test with text files filled with aaa…, and even in the simplest case it zaps a single file specified twice:
afsctool -cvvv -J1 ~/test/0/0.txt ~/test/0/0.txt
but it also "works" with the entire directory:
afsctool -cvvv -J1 ~/test ~/test

If I set any thread option (-[jJ][123…]) it zeroes the files, otherwise it doesn't. I tried it on my main Fusion Drive, on a pendrive, and inside a dmg on an external NTFS HD; they all produced the phenomenon.

Here is the test archive:
test.tar.gz

RJVB commented 5 years ago

Clearly I should have thought of and tested this long ago, so thanks for stumbling across this issue! I'm testing a fix as we speak, which will queue each file only once for compression.
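(For illustration only, not the actual patch: the idea is to remember the (device, inode) pair of each file that has already been queued and silently skip anything seen before, which also catches hard links and duplicate path arguments.)

/* Illustration of a queue filter, not the real afsctool code: record the
 * (st_dev, st_ino) pair of every file added to the work queue and skip
 * files that were already queued. A hash set would scale better than the
 * linear scan used here. */
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

typedef struct { dev_t dev; ino_t ino; } FileID;

static FileID *seen = NULL;
static size_t seenCount = 0, seenCap = 0;

/* returns true exactly once per physical file; false for duplicates or errors */
static bool queueOnce(const char *path)
{
    struct stat st;
    if (lstat(path, &st) != 0) {
        perror(path);
        return false;
    }
    for (size_t i = 0; i < seenCount; ++i) {
        if (seen[i].dev == st.st_dev && seen[i].ino == st.st_ino) {
            return false;  /* already queued: skip it */
        }
    }
    if (seenCount == seenCap) {
        seenCap = seenCap ? seenCap * 2 : 64;
        seen = realloc(seen, seenCap * sizeof *seen);
        if (!seen) {
            perror("realloc");
            exit(1);
        }
    }
    seen[seenCount++] = (FileID){ st.st_dev, st.st_ino };
    return true;
}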

What must be happening here is a classic race condition: a thread that gets to a file-being-compressed just a bit later than the other thread(s) reads an empty file and writes the compressed 0 bytes back, overwriting the actual contents. That should trigger certain protections too, and shouldn't even produce 0-byte files in the first place (because zip will be adding its header).

My existing exclusive lock protection can also cause a deadlock if you specify a directory with lots of files twice; I'm not certain why but it no longer happens with the fix.

Debugging this to figure out exactly why you got empty files without warning (or why I got deadlocks) won't be trivial, and hopefully it becomes unnecessary with the queue filter.

Serial processing was never affected, because the algorithm rejects files that are already compressed (or just recompresses them as if they weren't compressed at all).

biziclop commented 5 years ago

Thank you for looking into it!
Anyway, it's f*g amazing, it already freed 15GB of space! Why didn't Apple enable it by default… Maybe they want us to buy the larger built-in soldered-on storage. :)

gingerbeardman commented 5 years ago

> Thank you for looking into it! Anyway, it's f*g amazing, it already freed 15GB of space! Why didn't Apple enable it by default… Maybe they want us to buy the larger built-in soldered-on storage. :)

I think it's easier said than done. It should only be used on files that are (mostly?) read-only, so apps are a good fit. Apps from the MAS and system files from the OS install already have it applied. Where else would you recommend applying it?

biziclop commented 5 years ago

Non-MAS applications and their Application Support folders, especially those ridiculously huge gigabyte-eating apps which contain at least 3 Chromium engines just for fun; email programs' databases which store each message in a separate file (mine went down from 14GB to 9GB); various downloads; PlayOnMac's wine prefixes; other things in Library that aren't expected to change often; stupid WordPress and whatever installations with many, many well-compressible source files; various sh*t in /usr/local, …

There is always something.

I wonder when it will fail catastrophically. :)