RJVB / afsctool

This is a version of "brkirch"'s afsctool utility that allows end-users to leverage HFS+ compression.
https://brkirch.wordpress.com/afsctool
GNU General Public License v3.0

Support for big files #17

Open kenorb opened 6 years ago

kenorb commented 6 years ago
```
$ afsctool -c somebigfile
Skipping file somebigfile with unsupportable size 6678124800
Unable to compress file.
```
RJVB commented 6 years ago

6678124800

That's about 6 GB. Afsctool reads the entire file into RAM and then hands that buffer off to libz for compression into another buffer. If memory serves me well, the size limitation is not mine but imposed by that mode of operation and the libz API. Maybe even by HFS itself; I've never looked at that.
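For reference, a minimal sketch (illustrative C, not the actual afsctool code) of the whole-file-in-RAM approach described above, using zlib's one-shot compress2():

```c
/* Illustrative only: read an entire file into one buffer and hand it to zlib
 * in a single call. Error handling is kept minimal on purpose. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <zlib.h>

int compress_whole_file(const char *path)
{
    FILE *f = fopen(path, "rb");
    if (!f) return -1;
    fseeko(f, 0, SEEK_END);
    off_t size = ftello(f);
    rewind(f);

    unsigned char *in = malloc((size_t)size);
    uLong outCap = compressBound((uLong)size);   /* zlib's worst-case output size */
    unsigned char *out = malloc(outCap);
    if (!in || !out || fread(in, 1, (size_t)size, f) != (size_t)size) {
        fclose(f); free(in); free(out);
        return -1;
    }
    fclose(f);

    uLong outLen = outCap;
    int ret = compress2(out, &outLen, in, (uLong)size, Z_BEST_COMPRESSION);
    /* ... on success, the compressed data would then be stored in the resource
     * fork and the decmpfs extended attribute set; omitted here ... */
    free(in);
    free(out);
    return ret == Z_OK ? 0 : -1;
}
```

With this mode of operation a 6 GB input needs roughly twice that much address space (input plus worst-case output), which is why a size check exists at all.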

In practice I have so few compressible files that are 2 GB or larger that I never felt the need to work around this limitation. Patches welcome though.

wmertens commented 5 years ago

I have a 13GB sqlite db that compresses down to 800MB. It would be cool to be able to keep it compressed, but I wonder how that will perform, being stored in the resource fork.

On my server I use btrfs and that compresses extents, so by "block". That works great.

RJVB commented 5 years ago

I have a 13GB sqlite db that compresses down to 800MB. It would be cool to be able to keep it compressed, but I wonder how that will perform, being stored in the resource fork.

I don't think you can notice it being in the resource fork instead of elsewhere, but the question is a bit moot. You can only keep HFS compression if you only open the file read-only (or are otherwise certain that no changes are made to it at all).

I recently made some changes to afsctool that make it possible to compress files that are > 2 GB. That support is a bit experimental because I'm not 100% certain that HFS compression is supported above that limit. I've only tried it with a 2.5 GB file myself; did you try it on your 13 GB monster?

On my server I use btrfs and that compresses extents, so by "block". That works great.

On Mac you could try ZFS (www.o3x.org), the industrial-strength (original) alternative to btrfs. That works great too nowadays, as long as you don't expect the same performance as you'd get on HFS or APFS. Then again, sqlite isn't the best choice either if you want performance and huge databases, from what I hear ;)

wmertens commented 5 years ago

argh of course, read-only. Hmm, ZFS would indeed be nice.

I wonder about the ZFS performance, I'd use it to put my projects tree, but I should probably move the compile cache off it. And then of course I'll need to pick an appropriate volume size, and upgrading to new macOS versions is gated by ZFS. Hmmm. Why can't Apple be like the Apple of 2005, when they were building the best Unix laptop there was?

SQLite performs admirably on large DBs :) in the end, it comes down to algorithms, and SQLite is very well implemented, and isn't burdened by a network layer. If your filesystem can hold it, SQLite can manage it.

RJVB commented 5 years ago

I wonder about the ZFS performance,

The best thing is to test it. Raw throughput (to an external 3.5" 7200 RPM disk on USB3 via Thunderbolt) as measured with Blackmagic's Disk Speed Test is comparable to what I get with HFS+ on an adjacent partition. For things like git checkout firstcommit ; git checkout HEAD on big repositories like macports-ports (lots of small files, lots of commits) I also hardly notice a difference. On Linux you can get better performance if you give entire devices to ZFS, but I don't know if that applies to the Mac too (where keeping a partition with an officially recognised filesystem on the disk can avoid Finder dialogs about unrecognised disks).

I only don't use ZFS for everyday work on Mac because it is really RAM hungry. But my linux notebook has been running off ZFS for years.

I'd use it to put my projects tree, but I should probably move the compile cache off it.

Ccache? Not necessarily; that directory compresses amazingly well.

And then of course I'll need to pick an appropriate volume size,

You can resize ZFS pools (increasing at least).

and upgrading to new macOS versions is gated by ZFS.

Why would it be? The O3X team seems to be on top of those things, and I for one never install a major OS upgrade before there's at least an X.1 release ;)

Hmmm. Why can't Apple be like the Apple of 2005, when they were building the best Unix laptop there was?

Because that Apple died when Jobs started going downhill? All their products after 2011 appealed less and less to me (fortunately my MBP from that year is still hanging on, and with a Sandy Bridge i7 it's only just becoming acceptably slow). It's still on 10.9; I haven't yet dared to go beyond that.

SQLite performs admirably on large DBs :) in the end, it comes down to algorithms, and SQLite is very well implemented, and isn't burdened by a network layer. If your filesystem can hold it, SQLite can manage it.

Well, it isn't suitable for everything. For KDE PIM (Akonadi) it's best avoided, and I remember phpBB forum crashes where the entire forum (history) was lost because of a glitch in SQLite. Multiple times...

wmertens commented 5 years ago

I'll give ZFS a whirl as a weekend project. Been quite a while since I used it.

My 2015 MBP just can't seem to handle modern web development any more, but maybe that's hardware issues. I should have stuck to 10.9 too, the last few releases didn't bring anything useful. 😞

Any extra info about those DB crashes? https://www.sqlite.org/howtocorrupt.html is good reading. SQLite is probably the most installed database in the world, so data corruption bugs are rare now.

lucianmarin commented 5 years ago

It would be great if there were some kind of support for large files. I have to work with the HaveIBeenPwned password files and they eat a lot of space.

gingerbeardman commented 5 years ago

@lucianmarin have you tried it? see https://github.com/RJVB/afsctool/issues/17#issuecomment-453027732 which says that it's been added since Jan, but needs testing.

RJVB commented 5 years ago

As mentioned above, HFS compression is only interesting for files that are not modified regularly. The compression step is rather expensive, so you don't want to do it regularly on really large files (for which afsctool itself isn't optimised either).

I notice that the pwned files are 7zipped; those files will not compress with HFS compression.

lucianmarin commented 5 years ago

I notice that the pwned files are 7zipped; those files will not compress with HFS compression.

Tried it on the uncompressed (*.txt) version of those files. I got a system crash at the end. afsctool -v says it's compressed (46.2% savings), but I get system crashes while reading it.

Previously I used the Homebrew version of afsctool (1.6.4) which said: Unable to compress file.

RJVB commented 5 years ago

A system crash, meaning a kernel panic that obliged you to reboot?

I haven't yet seen that happen, but as said the feature is experimental. Maybe I should remove it if it can lead to KPs...

What type of compression did you use, and did you use the -L option? The -L option is in a sense experimental too but never led to issues that I know of. Other than that I'm fairly certain that the compression algorithm is correct, so any crashes while trying to read a compressed file are due to boundary conditions (like file size limits) not being observed.

IIRC the old version refused to compress files > 2 GB because it allocated memory to read the entire file, and it ran into a 32-bit limit with larger files.

In case you used the standard ZIP compression, do you also get a crash when you decompress using afsctool itself?

lucianmarin commented 5 years ago

Yes, the standard kernel panic message and a reboot. I didn't use the -L option; instead I used -1 -c, which from what I read triggers the ZIP compression.

I have a 2017 MBP with 16 GB RAM. The memory usage of afsctool was 9 GB, with 8.9 GB of that being compressed memory. Swap increased to 15.9 GB at the end. The file I was trying to compress is 19.9 GB.

I think you should calculate the memory usage in advance based on file size. If the system can allocate that much memory, then it should be allowed to compress the file. I'm not sure if this is down to the file system implementation, but ordinary ZIP tools don't use that kind of memory.

Anyway, I used the tool on 3GB of CSV files with 96.2% savings. That's huge! Thank you for your work.

wmertens commented 5 years ago

IMHO the HFS compression is a dirty hack by Apple, and you shouldn't expect it to work well beyond a few tens of MB.

Since you're using text files, can't you simply split them?

RJVB commented 5 years ago

I think you should calculate the memory usage in advance based on file size.

What do you think the code does? Determining whether or not the system "can allocate that kind of memory" is tricky; the usual way is simply to try it, if you don't want to make assumptions based on the amount of RAM present (cf. overcommit under Linux). Afsctool probably uses more memory than strictly required because of attempts to maximise throughput and built-in failsafes, but also because of the chunked compression format. All compressors can use considerable amounts of memory when you let them loose on large enough files and you want maximum compression.
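Purely for illustration (this is not how afsctool actually sizes its buffers), an up-front worst-case estimate along the lines suggested could look like this; the 64 KiB chunk size is the one HFS compression is generally documented to use, and the block-table overhead is a guess:

```c
#include <stdint.h>
#include <zlib.h>

#define HFS_CHUNK_SIZE 65536ULL   /* assumed 64 KiB chunks used by HFS compression */

/* Hypothetical helper: worst-case memory for "read the whole file, then
 * compress it chunk by chunk into a second buffer". */
uint64_t estimate_peak_memory(uint64_t fileSize)
{
    uint64_t nChunks    = (fileSize + HFS_CHUNK_SIZE - 1) / HFS_CHUNK_SIZE;
    uint64_t worstChunk = (uint64_t)compressBound((uLong)HFS_CHUNK_SIZE);

    return fileSize                          /* input buffer holding the whole file  */
         + nChunks * worstChunk              /* output buffer, worst case per chunk  */
         + nChunks * 2 * sizeof(uint32_t);   /* block table: offset + size per chunk */
}
```

Even with such an estimate, knowing whether the system will actually hand over that much memory without swapping is a different question, which is the point made above.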

you shouldn't expect it to work well beyond a few tens of MB.

It works perfectly fine up to the 2 GB limit.

Nowadays resource forks are just a special kind of extended attribute, so using them to store compressed file content may be surprising but is hardly a dirty hack. The format itself appears to be designed to be as transparent for the end-user as possible: there is indeed very little performance cost to it. It is my pet peeve that they went for a read-only variant without official, easy compression tools because they also sell disk space. That became even more apparent after they failed to include proper compression in their new filesystem.

As for those .txt files: there are other ways to work with them in compressed form, depending on exactly what you do with them. If you're just searching them with grep and family: the GNU versions of those have been able to search compressed files for a long time, and most compressors can decompress from stdin to stdout (or come with gzcat/bzcat/xzcat utilities). Alternatively you can use one of the several libraries that provide an API for the usual operations on compressed files, but you can also use popen/pclose to read the output of, say, xzcat as if it were a file. If you only have to read the files (and download new versions periodically) you can put the content on a bzip2-compressed disk image (DMG); that's what I did before I had access to HFS compression. And then of course you can install ZFS.
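For example, the popen/pclose route mentioned above can be as simple as this sketch (the file name and pattern handling are just placeholders):

```c
#include <stdio.h>
#include <string.h>

/* Read a compressed text file through xzcat as if it were a plain FILE*. */
int search_compressed(const char *pattern)
{
    FILE *p = popen("xzcat pwned-passwords.txt.xz", "r");   /* example file name */
    if (!p) return -1;

    char line[4096];
    long matches = 0;
    while (fgets(line, sizeof line, p)) {
        if (strstr(line, pattern)) {
            fputs(line, stdout);
            ++matches;
        }
    }
    pclose(p);
    return matches > 0 ? 0 : 1;
}
```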

wmertens commented 5 years ago

It is my pet peeve that they went for a read-only variant without official, easy compression tools because they also sell disk space. That became even more apparent after they failed to include proper compression in their new filesystem.

This. So much. The devtools eat like 13GB of your disk, and if you run them through afsctool, you recover 7GB. Basically they're stealing 2.5% of developers' SSD drives.

RJVB commented 5 years ago

You can probably recover even more if you get rid of the SDKs and simulators that you don't need. But it's a hassle to figure out how much to leave (and you may want to re-sign the bundle afterwards).

gingerbeardman commented 5 years ago

This. So much. The devtools eat like 13GB of your disk, and if you run them through afsctool, you recover 7GB. Basically they're stealing 2.5% of developers' SSD drives.

Only true if you download and install manually.

If you download the dev tools (Xcode) — or indeed any other app — from the Mac App Store, they are HFS+ compressed (LZVN).

I also use CleanMyMacX to keep on top of unused/old SDKs, Simulators, builds, and other cruft.

wmertens commented 5 years ago

FWIW, if you download the dev tools (Xcode) — or indeed any other app — from the Mac App Store, they are HFS+ compressed (LZVN)

Oh? Mine wasn't compressed, maybe because of migration assistant. Hmmm, must investigate.

Dr-Emann commented 1 year ago

I believe the real limit is probably 2/4 GiB of compressed size, approximately (depending on whether the values are actually unsigned 32-bit), since it seems all of the formats use a 32-bit offset from near the start of the file to store the location of the compressed blocks.

If the file @lucianmarin was trying to compress was 19.9 GB and reported 46.2% savings, that would still be well over 4 GiB when compressed, which would overflow the 32-bit offsets, and I wouldn't be surprised if there were kernel-side code that crashes in the presence of wrap-around, especially if the kernel actually interprets them as 32-bit SIGNED values, since the blocks would then be interpreted as starting at a negative offset.
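A simplified sketch of the kind of per-chunk table being described; the field names and exact layout are illustrative, not the verbatim on-disk decmpfs/resource-fork format:

```c
#include <stdint.h>

typedef struct {
    uint32_t offset;   /* location of this chunk's compressed data: only 32 bits */
    uint32_t size;     /* compressed size of this chunk */
} chunk_entry_t;

typedef struct {
    uint32_t      numChunks;
    chunk_entry_t chunks[];   /* once the compressed payload passes UINT32_MAX
                               * (or INT32_MAX, if read as signed) these offsets
                               * wrap, which matches the crash theory above */
} chunk_table_t;
```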

RJVB commented 1 year ago

Good point!

But still something that could have been avoided by using relative offsets and a wide enough accumulator "register".

Dr-Emann commented 1 year ago

Nope! Looks like it really is based on the uncompressed size: removing the size check and writing files of all zeros, a length of 4000000000 works, while a length of 4500000000 fails (or... succeeds in writing, but then kernel panics on read). And since it's all zeros, it compresses super well, so the compressed size is nowhere near any issues.
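For anyone wanting to repeat that experiment, a sketch along these lines should do (the sizes are the ones quoted above; the outcomes in the comments are as reported there, and on filesystems with sparse-file support the files may be created sparse):

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an all-zero file of the given length; ftruncate() extends with zeros. */
static int make_zero_file(const char *path, off_t length)
{
    int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
    if (fd < 0) return -1;
    int rc = ftruncate(fd, length);
    close(fd);
    return rc;
}

int main(void)
{
    make_zero_file("zeros-4.0e9.bin", 4000000000LL);   /* reported to compress fine */
    make_zero_file("zeros-4.5e9.bin", 4500000000LL);   /* reported to kernel panic
                                                        * on read after compression */
    return 0;
}
```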

DanielSmedegaardBuus commented 1 year ago

Quick question (gotta leave for the ferry): is there any way to make afsctool compress a large file (in this case 2 GB vmdk slices that are on average 90% zeroes) even if it instantly decides it's incompressible due to, I'm guessing, the first megabyte or so being incompressible?

Dr-Emann commented 1 year ago

By default, afsctool will give up if even a single block does not compress (grows even slightly when compressed). If you pass -L ("Allow larger-than-raw compressed chunks"), it will keep going, even if the whole file doesn't compress.
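In other words, the per-chunk decision amounts to something like this (illustrative, not the actual afsctool code):

```c
#include <stdbool.h>
#include <stddef.h>

/* Decide whether to keep a chunk's compressed form. Without -L, one chunk that
 * fails to shrink aborts compression of the whole file. */
bool keep_chunk(size_t compressedSize, size_t rawSize, bool allowLargerThanRaw)
{
    if (compressedSize >= rawSize && !allowLargerThanRaw) {
        return false;    /* default behaviour: give up on the entire file */
    }
    return true;         /* with -L: keep the chunk even though it grew */
}
```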

RJVB commented 1 year ago

Note that it won't compress files above (IIRC) 2 GB.

Dr-Emann commented 1 year ago

Right, yeah, the size check is currently for 2 GiB: https://github.com/RJVB/afsctool/blob/5e0f4be4eb1d7d0de69161408cc2ecaad9e2c4fa/src/afsctool.h#L59