RJVB / afsctool

This is a version of "brkirch"'s afsctool utility that allows end-users to leverage HFS+ compression.
https://brkirch.wordpress.com/afsctool
GNU General Public License v3.0

Incorrect display of folder sizes in human readable form #4

Open · szhorvat opened this issue 7 years ago

szhorvat commented 7 years ago

When I run afsctool -v . in a large folder, I usually get output like this:

Folder size (uncompressed; reported size by Mac OS 10.6+ Finder): 10175898032 bytes / 1.5 TB (terabytes) / 1.37 TiB (tebibytes)
Folder size (compressed - decmpfs xattr; reported size by Mac OS 10.0-10.5 Finder): 5726786210 bytes / 339.74 GB (gigabytes) / 316.41 GiB (gibibytes)
Folder size (compressed): 5956483940 bytes / 339.97 GB (gigabytes) / 316.62 GiB (gibibytes)

Notice the mismatch between the byte count and the human readable form.

10175898032 bytes is actually 9.48 GiB or 10.18 GB.

The byte count is correct (as compared with e.g. the Finder's reporting). The human readable form is wrong.
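
For reference, the conversion I'd expect is just a plain base-1024 scaling; a minimal sketch (my illustration, not the actual getSizeStr code):

#include <stdio.h>

/* Sketch only: base-1024 human-readable formatting, not afsctool's code. */
static void print_human_size(long long bytes)
{
    const char *units[] = { "bytes", "KiB", "MiB", "GiB", "TiB" };
    double v = (double)bytes;
    int i = 0;
    while (v >= 1024.0 && i < 4) {
        v /= 1024.0;
        i++;
    }
    printf("%lld bytes / %.2f %s\n", bytes, v, units[i]);
}

int main(void)
{
    print_human_size(10175898032LL); /* prints 9.48 GiB, not 1.37 TiB */
    return 0;
}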

RJVB commented 7 years ago

Yeah, I've been meaning to do something about that output display, which is way too complex and "TMI". Part of the reason I didn't is that I don't want to get drawn into the debate over whether to use 1024 or 1000 as a kilo unit, and how to abbreviate that. To me, a megabyte is 1024 x 1024 bytes, abbreviated as MB (or Mb), a gigabyte is 1024 times more, and the 1000-base scheme is nothing but a cheap trick used by disk makers ;)

szhorvat commented 7 years ago

It's cosmetic, so obviously not a big deal.

The biggest reason to fix it is that when someone notices it for the first time, it is not immediately obvious that it's cosmetic, and people may get worried about possible data loss (afsctool can be scary to use). I was certainly a bit worried until I realized that the byte count is correct.

As for 1024 vs 1000, it's really not a big deal. Most other command line tools, like ls -lh, use 1024, so why not go with that? (The Finder uses 1000, which is annoying.)

RJVB commented 7 years ago

Can you check and close the issue if this now gets your approval? :)

szhorvat commented 7 years ago

Sorry for the late response. I am not sure what is happening, but it still doesn't report sizes correctly. I double-checked that I am using the version I just built (and the source I just downloaded).

For example, for the folder where I built it, it says

Folder size (uncompressed; reported size by Mac OS 10.6+ Finder): 1387738 bytes / 276.8 MB (megabytes, base-10)
Folder size (compressed - decmpfs xattr; reported size by Mac OS 10.0-10.5 Finder): 1387738 bytes / 276.8 MB (megabytes, base-10)
Folder size (compressed): 1387738 bytes / 264 MiB

RJVB commented 7 years ago

So basically it's a factor 1000 (1024) off? I don't understand this, and in fact I remember that I couldn't even reproduce your original observation.

Looking at the function that does the conversion, my first reaction was to rewrite it entirely using a different approach. I cannot really wrap my head around how it's being done currently; can you? But then I noticed that it works for me, so I went the easier route.

How do you build, i.e. with what compiler and options? I've tried with Apple's clang from Xcode 6.2 (Clang 600.0.57 or so) as well as with a stock clang 4.0, in both cases using -O3 -march=native and (evidently) 64-bit mode. Are you by any chance building in 32-bit mode?

szhorvat commented 7 years ago

I'm not really familiar with cmake, so all I did was

cmake .
make

It would complain about not finding sparsehash/dense_hash_map, so I added /opt/local/include to target_include_directories, which seemed to work.

I built it on OS X 10.13.1 with the latest dev tools ("Apple LLVM version 9.0.0 (clang-900.0.38)")

szhorvat commented 7 years ago

But since it's just a misreporting, it's really not a big deal.

RJVB commented 7 years ago

It is until we understand what's going on. If it miscompiles this code, who knows what else it miscompiles?

szhorvat commented 7 years ago

Weird indeed. I can't see the problem in the function. Will look more later. Here the output is fine: http://coliru.stacked-crooked.com/a/e04c6526167906df

szhorvat commented 7 years ago

I won't have more time for this today, but I put a printf at the beginning of getSizeStr to see what arguments it's getting, and I see this:

size: 1382668, size_rounded: 276824064

This doesn't look right. The problem seems to be with the size_rounded value that's passed in.
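
(The debug line was just something along these lines at the top of getSizeStr; the parameter names are assumed to match what gets printed:)

fprintf(stderr, "size: %lld, size_rounded: %lld\n", (long long)size, (long long)size_rounded);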

RJVB commented 7 years ago

> size: 1382668, size_rounded: 276824064
>
> This doesn't look right. The problem seems to be with the size_rounded value that's passed in.

No, that doesn't look right, esp. since size and size_rounded are supposed to be equal in at least the last call, "Folder size (compressed):" (I am claiming this from memory!).

RJVB commented 7 years ago

A thought: what filesystem are you seeing this with? Is that by any chance on APFS?

szhorvat commented 7 years ago

Yes, APFS.

I just tried on HFS+ (external drive) and it looks fine there:

size: 15329972233, size_rounded: 15330181120
Folder size (uncompressed; reported size by Mac OS 10.6+ Finder): 15329972233 bytes / 15.33 GB (gigabytes, base-10)

RJVB commented 7 years ago

Thanks for confirming.

Still a considerable difference (over 200000 bytes) but this does look more reasonable, no?

Apparently we have to find the HFS-specific contribution to the rounded file size estimate that goes awry on APFS. I've already seen one candidate (something using HFS attributes). But maybe I should just drop that rounding step on APFS altogether.

szhorvat commented 7 years ago

st_blksize (for one file) is 4096 on the HFS+ volume I tried. On the APFS one it is 4194304 = 2^22. That explains the large reported sizes, but I do not know much about file systems, so I don't know what this value means in practice (I prefer not to guess in public :) )

szhorvat commented 7 years ago

I find the code a bit too noisy to follow with confidence, but there are several repeated bits that seem to round up filesize to a multiple of fileinfo->st_blksize, like here:

filesize_rounded = filesize = fileinfo->st_size;
filesize_rounded += (filesize_rounded % fileinfo->st_blksize) ? fileinfo->st_blksize - (filesize_rounded % fileinfo->st_blksize) : 0;
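
If I read those two lines right, they just round a size up to the next multiple of st_blksize. Restated as a standalone helper (my restatement, not the actual afsctool code):

/* Round size up to the next multiple of blksize (my restatement of the above). */
static long long roundUpToBlkSize(long long size, long long blksize)
{
    long long rem = size % blksize;
    return rem ? size + (blksize - rem) : size;
}

With st_blksize = 4096 (HFS+), 1382668 becomes 1384448, a difference of 1780 bytes. With st_blksize = 4194304 (the value I see on APFS), 1382668 becomes 4194304, so even a tiny file gets counted as a full 4 MiB.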

RJVB commented 7 years ago

You're right that the code is very noisy and repetitive. I don't feel very much like giving it the required overhaul, though ...

Could I ask you to try to determine whether the Finder still requires any special size calculations? If we can avoid that whole rounding step at least for files on APFS the problem is "solved" ...

szhorvat commented 7 years ago

I don't know the background context.

Do you mean that the whole purpose of this complicated calculation and rounding to "block sizes" (whatever those may be) is just to reproduce the Finder's results?

So you are asking to look at what the Finder reports on 10.13 / APFS and see which calculation it matches with?

RJVB commented 7 years ago

Oh, and if it really interests you, try a disk benchmarking utility that allows you to use 4Mb (and larger) block sizes. I don't know what kind of drive you got that value from nor what size the file had (the value can be file-specific) but if the system is right you should see optimum throughput with 4Mb blocks.

This is of course unrelated to solving the issue at hand.

RJVB commented 7 years ago

Meanwhile, I pushed a couple of changes, including one which should let cmake find sparsehash if you build for installing into the same prefix (/opt/local, right?).

I think I found the bug that was biting you. The rounded file sizes are in fact the sizes in terms of disk blocks. For instance, a file that's only 1024 bytes will still take up 4096 bytes if the disk uses that block size. The previous algorithm would calculate (size % st_blksize) and then add the complement of that to the file size; a 1024-byte file was thus counted as (1024 + (4096 - 1024)). A small error for small blocksizes, but a possibly significant error for a 4Mb block size.

There is of course also the possibility that we shouldn't be doing this rounding at all on APFS. That 4Mb value probably has little to do with the underlying disk block size; I really hope that APFS doesn't make you lose 4Mb per file!

Here's another small experiment: compare fileinfo->st_blocks with the actual file size (and fileinfo->st_blksize) on HFS+ and APFS, and then with the disk sector size (printed by smartmontools, for instance). Many modern disks have a physical sector size of 4096 (coinciding with st_blksize on HFS+) but a 512-byte logical sector size (the legacy value). As far as I can tell fileinfo->st_blocks gives the file size in multiples of 512 bytes on the disks I have here. I've added a bit of (commented-out) debug printf to roundToBlkSize() which makes this easier.
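
A standalone check along these lines (just for the experiment, not part of afsctool) prints the values to compare:

#include <stdio.h>
#include <sys/stat.h>

/* Print st_size, st_blksize and st_blocks for each argument;
   st_blocks is in 512-byte units according to the stat(2) man page. */
int main(int argc, char *argv[])
{
    struct stat st;
    for (int i = 1; i < argc; i++) {
        if (lstat(argv[i], &st) == 0) {
            printf("%s: st_size=%lld st_blksize=%ld st_blocks=%lld (= %lld bytes)\n",
                argv[i], (long long)st.st_size, (long)st.st_blksize,
                (long long)st.st_blocks, (long long)st.st_blocks * 512LL);
        } else {
            perror(argv[i]);
        }
    }
    return 0;
}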

RJVB commented 6 years ago

> Do you mean that the whole purpose of this complicated calculation and rounding to "block sizes" (whatever those may be) is just to reproduce the Finder's results?

Yes. I think the reason is that the Finder includes size information from resource forks, metadata etc.
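
(For what it's worth, one way to see that extra metadata is to sum the sizes of a file's extended attributes; on macOS that includes com.apple.ResourceFork and, with XATTR_SHOWCOMPRESSION, the com.apple.decmpfs header of compressed files. A rough standalone sketch, not the afsctool code:)

#include <string.h>
#include <sys/xattr.h>

/* Rough sketch: sum the sizes of all extended attributes of a file. */
static long long xattrBytes(const char *path)
{
    char names[4096]; /* enough for a sketch; real code should size this dynamically */
    ssize_t len = listxattr(path, names, sizeof(names), XATTR_SHOWCOMPRESSION);
    long long total = 0;
    for (ssize_t off = 0; off < len; off += (ssize_t)strlen(names + off) + 1) {
        /* getxattr with a NULL buffer returns the attribute's size */
        ssize_t sz = getxattr(path, names + off, NULL, 0, 0, XATTR_SHOWCOMPRESSION);
        if (sz > 0) total += sz;
    }
    return total;
}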

> So you are asking to look at what the Finder reports on 10.13 / APFS and see which calculation it matches with?

Yes. Take a couple of files and/or directories, see what the Finder reports before compression and what after compression and how that compares to the "unrounded" sizes afsctool reports. Comparison to the rounded sizes is a bonus, but probably not a very useful one since we already know that the rounding operation is flawed.

Oh, and may I also ask you to take a quick look at the lstat man page to see if it says anything interesting about st_blksize that explains the difference? It's supposed to be the blocksize for optimum file I/O; the value you report seems improbable in that context...