BLAKE3-team / BLAKE3

the official Rust and C implementations of the BLAKE3 cryptographic hash function
Apache License 2.0

Slow HDD read. #390

Open lelik107 opened 2 months ago

lelik107 commented 2 months ago

Hello, BLAKE3 team! I found out that every b3sum_windows_x64_bin.exe reads from an HDD slowly. What exactly I did: I calculated the sums of Windows 10 ISO images (5.3 GB and 5.8 GB files) on a WD RE 1 TB drive, timing with a simple stopwatch. The sequential read speed of this HDD is 140-80 MB/s. And it's not about the disk cache, because I rebooted after every test. Results: v1.5.1 and v1.3.3 took ~1 m 45 s for the 5.3 GB file and ~1 m 57 s for the 5.8 GB one; v1.0.0 was a bit faster, ~1 m 22 s and ~1 m 31 s. And I could hear the disk spinning. With b2sum-bin_20130305.zip, or 7-Zip's `h` command with any algorithm, the 5.3 GB file took 45-48 s and the 5.8 GB one 55 s - 1 m, and I didn't hear the disk spinning. So my simple question is: why is b3sum this much slower?

lulcat commented 1 month ago

Yes. Internally I have been solving this with some ideas in the C code. I forget whether the Rust b3sum actually mmaps properly or is slow too, but IF you are on slow-IOPS media AND hashing a big file, BLAKE3 is sadly the slowest hasher of them all... until the file is cached, then it's the fastest (C or Rust). I notice that with the Rust b3sum I think the mmapping saturates my test spinner, which does about 210 MB/s, but when I used the C version it does 60 MB/s when the file is not cached. Using some custom mmapping in front is what I/we do here. I can't say anything about how things work on Windows. If you rerun b3sum immediately after a first run, it should take 200-300 ms from what I see. Your times are what I would see too, given the file sizes on a spinning disk. My guess is b3sum_windows is C?

I don't recall if I ever found out exactly why b3sum was slower than the others on the first (uncached) read, but I suspected something to do with chunking and buffer sizes :D but the authors will obviously have a better idea. Either way, this is the only case I am aware of where it suffers (sadly). Hence I DO wish, as they mentioned once, that they did an OpenMP implementation in C as well.

lelik107 commented 1 month ago

@lulcat Yes, reading from the disk cache is OK with the Windows b3sum. IDK exactly which languages are used, but there might be assembly as well. Most third-party Windows software, for example https://github.com/OV2/RapidCRC-Unicode and Total Commander, uses the single-threaded reference C code because there's no multi-threaded C code from the developers yet. As for raw reading from the HDD, RapidCRC Unicode and TC show just the same speed as any other algorithm: if your drive does 210 MB/s, you'll get 210 MB/s. For buffered (cached) reading, the single-threaded C code gives 1500-1700 MB/s with all the SIMD paths except AVX-512F (which I don't have). That's fine for any SATA-3 device but may be a bottleneck for M.2 drives. And yes, it would be nice to have multi-threaded C code, maybe with OpenMP.

lulcat commented 1 month ago

Hi again. Let me correct myself; I just checked now. The actual RUST binary as well, which does mmap AND multithreading, runs AT 60 MB/s in an example case I have.

I then evict the file from the cache and run ANY other hash sum, which beats it: b2sum, sha*sum, any of them will run at the maximum ingress speed of the file. E.g.:

time b2sum same_example => 210 MB/s, over 3 times faster!!

I meant that the C code won't do any better because it isn't optimised like the Rust version (no mmapping and no multithreading, i.e. no OpenMP).

THIS is the issue you noticed and reported, I think, and it's an edge case of BLAKE3 which I haven't figured out the WHY of yet.

b2sum, as you see, does NOT suffer from this. And I DO have AVX-512; I can run the C version with it, but the problem persists.

It is slow spinners plus big files that trigger it. Fast I/O will not expose this case.

Now, again, as I said, I have solved this in my environment with something on top, but it requires heuristically determining whether we have that edge case. I'd obviously prefer it to be solved in the BLAKE3 code, but I am not sure where to touch it. I don't want to touch crypto code with my edits anyway. :)

In all cases, once said file is in memory, both the C and Rust versions will be the fastest (Rust being faster due to multithreading). I am fairly sure I saw this case with both the reference/portable code and the various machine-instruction versions (e.g. AVX2 or AVX-512). I can't quite figure out why it happens; my own guesstimate was something to do with how chunks are fed into memory, since it can't keep up with the full read speed. Anyway, this is something the authors can (and hopefully will) address properly.

SO in your case you are using a 1 TB spinner ("slow I/O") and a BIG file (on the order of GBs), which is why this is triggered.

TO NOTE: this is very important, in fact, because it is a very common layout to keep archived (thus often large) files on slower but larger backup media, which renders BLAKE3 useless until this is addressed, IMO. "Fun fact:" I switched the default hash in my systems from BLAKE3 to SHA3-224 precisely because of this uncached-slow-IO-big-file issue several years ago, but I'm testing re-introducing it (with the heuristics I mentioned).

EDIT: OH LORD... I just tested my bsum -a:b3 example, which is in C, and it DOES also read at 210 MB/s, so it's the RUST binary which is messing up :p OK, I care less then, but on other systems this will/can be an issue.

So more likely it's the mmapping. YUP, confirmed: passing --no-mmap makes b3sum run at 210 MB/s, which is 3.5x faster, give or take.

Damn, I never realised this, haha. Although I figured it had to do with mmapping, since my "solution" involved it.

But yeah, in the proper system b3sum won't be Rust anyway, so this problem will go away for me in my native environment (but not in guest ones, which use the Rust b3sum).

lelik107 commented 1 month ago

@lulcat I'm glad you've found a solution, but I'm neither a developer nor a coder myself, rather an end user on Windows; we just don't build software very often :)