BLAKE3-team / BLAKE3

the official Rust and C implementations of the BLAKE3 cryptographic hash function
Apache License 2.0

Too much memory usage #305

Open terrancewong opened 1 year ago

terrancewong commented 1 year ago

How to reproduce

fallocate -x -l 1T test1t
/usr/bin/time -f "%M kBpeak, %Us user, %I FSI, %O FSO, %P CPU, %es real" b3sum test1t
9387bd9b2ff4c3d9baa8c65d491f51789b6ed2a000aa45b679f80d21c3cc5013  test1t      
504482044 kBpeak, 1078.56s user, 88 FSI, 0 FSO, 3439% CPU, 703.66s real 

About 500 GB of memory was used, causing very high load and making the system less usable.

oconnor663 commented 1 year ago

My guess is that this is just what happens when we mmap the sparse file and then read the mmap?

terrancewong commented 1 year ago

Sounds like it. File t is around 61 GB, and peak memory is around 60 GB.

 % /usr/bin/time -f "%M kBpeak, %Us user, %I FSI, %O FSO, %P CPU, %es real" b3sum t                    
ab5d590789635ed6444bbf901f81fa8611ecbd8a6581b156ccb3752cc46dbe49  t
59595172 kBpeak, 36.24s user, 0 FSI, 0 FSO, 2735% CPU, 1.44s real

This does not happen when hashing through a pipe.

oconnor663 commented 1 year ago

Yes that's expected. b3sum doesn't try to mmap standard input, so hashing with | or < works around this. Passing --no-mmap should have the same effect. It's also possible that --num-threads=1 would cut down on the allocations, but that's mostly up to your OS and not something we control.
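
To illustrate why reading from a pipe keeps memory flat, here is a minimal Python sketch of hashing a stream through one fixed-size buffer. It is not b3sum's code: `hash_stream` and `buf_size` are hypothetical names, and BLAKE2b from `hashlib` stands in for BLAKE3, which is not in the Python standard library.

```python
import hashlib


def hash_stream(stream, buf_size=1 << 16):
    """Hash a binary stream while holding only one buf_size-byte buffer."""
    hasher = hashlib.blake2b()  # assumption: stand-in for BLAKE3
    buf = bytearray(buf_size)
    view = memoryview(buf)
    while True:
        # readinto() refills the same buffer, so memory use does not grow
        # with the size of the input.
        n = stream.readinto(buf)
        if not n:
            break
        hasher.update(view[:n])
    return hasher.hexdigest()
```

Calling `hash_stream(sys.stdin.buffer)` would hash piped input this way. A single sequential reader like this is also roughly why such a path ends up using one core.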

terrancewong commented 1 year ago

But --no-mmap makes it significantly slower, with only 100% CPU utilized.

terrancewong commented 1 year ago

And in theory it could get away with storing only O(log N) nodes of the Merkle tree.

oconnor663 commented 1 year ago

Yep :(

musicinmybrain commented 1 week ago

This issue came up downstream in Fedora.

In my testing on a 6.11.4 kernel in Fedora 40 with a file larger than system memory on a fast SSD, the kernel seemed to flush pages from the memory-mapped file to keep resident memory around 90-95%. When I started multiple b3sum processes, the total resident memory usage still adapted pretty nicely to available memory.

However, on a spinning HDD with much slower I/O, I found that the control group containing the b3sum process was killed by the OOM killer. I don’t have a swap file or partition configured (only zswap) so the pain was over pretty quickly, but I suspect a system with swap might have suffered a significant period of unresponsiveness.

The user reported some other issues when b3sum exhausted memory – garbage checksum output, and confusing results from other checksum utilities that made them suspect “possible memory corruption” – but I was not able to reproduce any of these on my own system.

I’m not in a position to contribute a fix for this, but it would be really nice if b3sum could parallelize usefully without memory-mapping the entire input file at once. Even though it appears the kernel can sometimes partially compensate by flushing pages, b3sum is conceptually asking for the entire input file to be loaded into memory. That’s a lot to ask for huge input files, and b3sum’s high performance makes it particularly likely that people will want to use it on huge inputs.
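
One possible shape for this, sketched in Python under stated assumptions, is to map a bounded window of the file at a time instead of the whole file. This is not how b3sum works today and not a proposed patch: `hash_in_windows` and `window` are hypothetical names, and BLAKE2b from `hashlib` stands in for BLAKE3.

```python
import hashlib
import mmap
import os


def hash_in_windows(path, window=64 * 1024 * 1024):
    """Hash a file while mapping at most `window` bytes of it at a time."""
    # mmap offsets must be multiples of the allocation granularity, so
    # round the window down to a multiple of it (at least one unit).
    gran = mmap.ALLOCATIONGRANULARITY
    window = max(gran, window - window % gran)
    hasher = hashlib.blake2b()  # assumption: stand-in for BLAKE3
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        offset = 0
        while offset < size:
            length = min(window, size - offset)
            # Map only this window; it is unmapped when the block exits.
            with mmap.mmap(f.fileno(), length,
                           access=mmap.ACCESS_READ, offset=offset) as mm:
                hasher.update(mm)
            offset += length
    return hasher.hexdigest()
```

The sketch is serial and only shows how the mapping can be bounded. A parallel hasher could presumably spread its threads across each window, so the amount of file data requested at once would be capped by `window` rather than by the file size.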