BLAKE3-team / BLAKE3

the official Rust and C implementations of the BLAKE3 cryptographic hash function
Apache License 2.0

Hash speed drops on XEON (hashing to standard length, 32 bytes) #172

Closed fire-rab-bit closed 3 years ago

fire-rab-bit commented 3 years ago

Dear Blake3 Team,

BACKGROUND:

My coursework is about hashing speed. I tested BLAKE3, SHAKE, and K12 with different inputs and outputs. Input sizes tested: 1 KB to 10 GB; output sizes tested: standard output length (i.e. 32 bytes) to 1 GB, dumped as hex. The speed test was done on a desktop Intel i7 (x64), a mobile ARMv8 device, and a Xeon (x64) server. Each hash operation was repeated 10 times and the average time was calculated to smooth out outliers during the measurement.

ISSUE:

For BLAKE3 on the Xeon I got very confusing speed results, which I do not understand yet. Could you please support me in understanding the issue better? I've found that hashing to the 32-byte standard length takes significantly more time than squeezing a longer output (e.g. 500 bytes). My script tested many constellations of input and output sizes and ran for approx. 1.5 days in total on the server.

MY RESULTS ON XEON:

1 KB file input, 32-byte output: 0.0124169347 seconds on average (10 repetitions)
1 KB file input, 500-byte output: 0.0023664945 seconds on average (10 repetitions)

Hashing to a 32-byte output takes longer than hashing to a 500-byte output. Manual tests confirmed that. My script incrementally increases the input and output constellations.

The gap between the 32-byte and 500-byte outputs increases significantly the longer the script runs. Finally, when it came to a 10 GB file:

10 GB file input, 32-byte output: 14.5859859031 seconds on average (10 repetitions)
10 GB file input, 500-byte output: 1.8065879838 seconds on average (10 repetitions)
...
10 GB file input, 10,000,000-byte output: 2.0749627773 seconds on average (10 repetitions)
...

MY COMMAND:

b3sum $key_file -l $output_size > speedtest.txt
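For context, the measurement boils down to a loop like the following sketch (the bc-based averaging and the variable values are illustrative assumptions, not the exact script; the real script is linked below):

#!/bin/bash
# Illustrative sketch of the timing loop described above: hash $key_file
# to $output_size bytes 10 times and report the average wall-clock time.
key_file=key_10GB.file   # hypothetical test file
output_size=32
total=0
for i in $(seq 1 10); do
    start=$(date +%s.%N)
    b3sum $key_file -l $output_size > speedtest.txt
    end=$(date +%s.%N)
    total=$(echo "$total + $end - $start" | bc)
done
echo "AVERAGE: $(echo "$total / 10" | bc -l) seconds"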

CONCLUSIONS?

I could imagine that the difference increases on the Xeon because I am running each constellation 10 times and calculating the average, and it seems the Xeon CPU does not like this. Manual tests confirmed a gap, BUT not on this scale. On ARM and the Intel i7 there is also a visible gap between the 32-byte and 500-byte outputs, but it is comparably tiny (approx. 150 milliseconds on average). And it does not matter whether a 1 KB input is hashed or 10 GB; the gap stays in a very tiny range of approx. 150 milliseconds.

If I run the script manually for the 10 GB file input, the gap is much smaller than above. I can imagine that the issue above occurs when hashing over a longer period of time.

HINT:

Please be aware when you screen results-blake3-XEON.txt: the first two lines state a "4K" file output. This is just a display issue.

SETUP, RAW RESULTS:

Script in use: https://korac.info/speedtest-blake3.sh
Speed test protocol for BLAKE3: https://korac.info/results-blake3-XEON.txt

System: x64 architecture, Xeon-based cloud server (dedicated)
CPU: Intel Xeon E31230; 4 cores @ max. 3200 MHz
GPU: none
RAM: 4 GB
OS: Ubuntu 20.04
Disk: 2 TB HDD

I would be glad to get into a conversation with you about this observation. If you have any questions, you are very welcome to ask and I will provide more info immediately; I can provide you with my script and the full protocol.

Cheers, Pille

oconnor663 commented 3 years ago

10 GB file input, 32-byte output: 14.5859859031 seconds on average (10 repetitions)
10 GB file input, 500-byte output: 1.8065879838 seconds on average (10 repetitions)

Wow, that's a mystery! It's got to be a measurement issue of some kind, since there's no difference in how b3sum processes the 10 GB of input in either case, and everything after that will be in the noise.

My first guess in any situation like this is run order. Is there any chance that switching the order in which you run those cases changes the outcome? If so that would be a hint that it has something to do with disk caching.

Another thing to check: are you disabling TurboBoost on your Intel processors? There are lots of architecture- and OS-specific knobs you might want to take a look at, but TurboBoost is the big one. That said, I don't think it can explain anything like the ~10x gap you're seeing here. (But the difference between your disk throughput and RAM throughput could certainly be in the 10x range.)
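On Linux, both of these knobs can be poked at roughly like this (a sketch; the no_turbo path assumes the intel_pstate driver, and both writes require root):

# Disable TurboBoost under the intel_pstate driver (requires root).
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

# Drop the page cache between runs, so a reordered run can't benefit
# from data cached by the previous one.
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches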

One thing to be aware of with very large files, which exceed the size of available RAM, is that b3sum by default will memory-map the whole thing and do a ton of random reads from many threads. In this regime, what you're benchmarking is actually your disk read throughput under heavy thrashing. I think SSDs tend to be OK here, but we have a known (really bad) problem with large files on spinning disks. (A future version of b3sum will fix this, but it's nontrivial to fix without slowing down the common case.) An easy way to avoid this behavior is to pass the --num-threads=1 command line flag. It might be interesting to see how that affects these results, especially for your large files.
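A quick way to compare, reusing the 10 GB test file from this thread (a sketch, not a rigorous benchmark):

# Default: memory-mapped, multi-threaded (random reads across the file).
time b3sum key_10GB.file -l 32

# Single-threaded: avoids the many-threads random-read pattern on HDDs.
time b3sum --num-threads=1 key_10GB.file -l 32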

oconnor663 commented 3 years ago

An off-topic note: BLAKE3's XOF stream hasn't been properly optimized yet with SIMD, so our output speed is generally several times slower than our input speed on x86. (And it can't currently be multithreaded either.) There's very little demand for large outputs, and we haven't found time to do the refactoring necessary to share optimizations across both phases. But something like this would be neat if we wanted to e.g. use BLAKE3 as a stream cipher.
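The asymmetry is visible from the command line; for example (a sketch; the sizes are arbitrary and the timings machine-dependent):

# Input phase: hash ~1 GB of input down to 32 bytes (SIMD + multithreaded).
head -c 1000000000 /dev/zero > big.file
time b3sum big.file -l 32 > /dev/null

# Output phase: squeeze ~1 GB of XOF output out of a tiny input
# (currently unoptimized and single-threaded, so noticeably slower).
printf 'x' > small.file
time b3sum small.file -l 1000000000 > /dev/null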

fire-rab-bit commented 3 years ago

Thank you, O'Connor, for the very fast reply and support! I will consider your input during my retests and leave a comment with new insights around the end of the week. Best, Pille

fire-rab-bit commented 3 years ago

Dear O'Connor, I have extended my tests as you proposed. For the retests, I just tested the first ten output sizes. This is the summary:

1) A simple retest to ensure that no other issue is present. Result: same behaviour, same results.
2) I skipped the 32-byte output step and went directly to a 500-byte output. Result: same behaviour, just shifted.
3) I changed the order to descending: largest output size first, down to the smallest. Result: same behaviour.
4) I turned TurboBoost off. Result: same behaviour.
5) I set --num-threads=1. Result: same behaviour.
6) Instead of repeating each hash operation 10 times, I calculated each hash only once. Result: same behaviour.

CONCLUSION as of now: to ensure correct measurement, I aimed to isolate unknown factors with an iterative approach. I can confirm that the use of an HDD instead of an SSD is the main influence here, NOT the Xeon CPU itself. When using SSDs on other systems, there is still a small gap visible, but in a very tiny range. Does that make sense from your point of view?

FYI, evidence:
1) https://korac.info/blake3-retests/1_results-blake3-XEON-retest-without-any-changes.txt
2) https://korac.info/blake3-retests/2_results-blake3-without-32bytes-output.txt
3) https://korac.info/blake3-retests/3_results-blake3-without-32bytes-output-only-1-time-descending.txt
4) https://korac.info/blake3-retests/4_results-blake3--only-1-time-ascending-turbo-disabled.txt
5) https://korac.info/blake3-retests/5_results-blake3--only-1-time-ascending-1-thread.txt
6) https://korac.info/blake3-retests/6_results-blake3-without-32bytes-output-only-1-time.txt

oconnor663 commented 3 years ago

I skipped the 32 byte output step and directly output to 500 bytes.

To me, this was the most mysterious part of your results: the fact that extracting more output was faster (and actually dramatically faster) than extracting less output. I'd love to get to the bottom of it.

For example, if you could give me a script that just reproduces this result in your environment, I'd love to try it in my environment:

1 KB file input, 32-byte output: 0.0124169347 seconds on average (10 repetitions)
1 KB file input, 500-byte output: 0.0023664945 seconds on average (10 repetitions)
oconnor663 commented 3 years ago

Looking at speedtest-blake3.sh more carefully, I think one thing I'd change is that I wouldn't write the output to disk. So like instead of:

b3sum $key_file -l $output_size  > speedtest.txt

I'd write:

b3sum $key_file -l $output_size  > /dev/null

The issue here is that disk writing is much more expensive than disk reading, and it means that you're probably touching the disk itself, rather than just reading cached files from RAM after your first loop iteration. The speed at which b3sum's output can be written to disk isn't really what you want to benchmark, since it's purely a function of your filesystem and your disk and doesn't have much to do with b3sum itself. What you probably want to benchmark is how quickly b3sum can produce output, and you'll get a more meaningful number there by redirecting to /dev/null.
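Concretely, timing both redirections side by side should make the gap visible (a sketch reusing the 10 GB file from above, with a large output length so the write cost shows up):

# Includes the cost of writing the hex-encoded output to disk.
time b3sum key_10GB.file -l 10000000 > speedtest.txt

# Isolates how fast b3sum itself can produce the output.
time b3sum key_10GB.file -l 10000000 > /dev/null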

Other things to think about:

fire-rab-bit commented 3 years ago

Dear O'Connor,

I have adapted the script per your comments. I redirected the output to /dev/null, with no difference.

You can find the script you asked for here and can try it yourself: https://korac.info/blake3-retests/retest-blake3.sh
You can choose your own file for the tests or create the same one as mine:

dd if=/dev/zero of=key_8GB.file bs=1024 count=8000000

I also called b3sum manually and wrote to /dev/null. As you can see, running the script twice in direct succession reduces the hashing time significantly.

root@s82493:~/retest# ./retest-blake3.sh
Input key_10GB.file Output 32 AVERAGE 9.03456048580000000000
Input key_10GB.file Output 500 AVERAGE 6.60389203540000000000

root@s82493:~/retest# ./retest-blake3.sh
Input key_10GB.file Output 32 AVERAGE 1.39118891920000000000
Input key_10GB.file Output 500 AVERAGE .19507304710000000000

In another manual test, I alternated the output size. You can see a continuous decrease in hashing time the more often b3sum is called; see https://korac.info/blake3-retests/8_retest-manually-blake3-XEON-point_to_DEV_NULL.txt for example.

CONCLUSION:

BTW: I also checked the results against my x64 laptop and my ARM tablet. The first uses an SSD and the second eMMC (Galaxy Tab S5e). On both systems only a very, very tiny gap is visible. This supports the argument that the identified gap mainly arises from reading and caching when an HDD is in use.

fire-rab-bit commented 3 years ago

FULL RESULTS FOR OTHER PLATFORMS:

x64 ZBook i7:

K12: https://korac.info/x64-i7-zBook-360-G5/k12-results.txt
SHAKE: https://korac.info/x64-i7-zBook-360-G5/shake-results.txt
BLAKE3: https://korac.info/x64-i7-zBook-360-G5/speedtest-results-blake3-zBook.txt

ARM, Tab S5e:

K12: https://korac.info/ARM-Tab-S5e/results-k12-ARM-all.txt
SHAKE: https://korac.info/ARM-Tab-S5e/results-shake-ARM-all.txt
BLAKE3: https://korac.info/ARM-Tab-S5e/results-blake3-ARM-all.txt

oconnor663 commented 3 years ago

The case is: the first hash operation is slower, all hash operations after are faster

Ah, I think you might've been telling me this above when you said "same result", but I might've misunderstood. And yes, this is pretty much what we'd expect.

One good generic technique for dealing with these problems is to always "throw away your first run". That is, if you're timing a loop of 10 iterations, ignore the first one and only average the times from the last 9. That's often enough to get rid of simple caching effects like this one. Fancy benchmarking frameworks like Criterion.rs will do a few whole seconds of "warm-up" runs before they even begin to count time. (That might be enough to get your processor out of TurboBoost, for example, if you've left TurboBoost on.)
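In shell terms, "throwing away your first run" might look like this sketch (the file name and repeat count are illustrative):

# One untimed warm-up run: populates the page cache (and, with luck,
# gets the CPU past any initial frequency transitions).
b3sum key_10GB.file -l 32 > /dev/null

# Time only the remaining 9 iterations and average those.
total=0
for i in $(seq 1 9); do
    start=$(date +%s.%N)
    b3sum key_10GB.file -l 32 > /dev/null
    end=$(date +%s.%N)
    total=$(echo "$total + $end - $start" | bc)
done
echo "average over 9 warm runs: $(echo "$total / 9" | bc -l) seconds"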

I'll go ahead and close this issue, but feel free to comment with more questions or open new ones.

fire-rab-bit commented 3 years ago

Dear @oconnor663

I am in the final stage of my coursework. Would it be OK with you if I mention the interaction we had in my critical appraisal, like the draft below?


_Critical appraisal of the executed speed tests

It was quite difficult to measure all constellations and isolate the influencing factors to reduce noise. It was necessary to adapt the speed tests several times during the coursework.

At the beginning, each hash operation was executed only once. It was identified that this approach carries a high risk of noise. Therefore these results had to be withdrawn, which meant discarding approx. 2 days of work. To reduce noise, it was necessary to repeat the measurements and execute each hash operation 10 times to calculate the average time.

Secondly, after screening the results on system 3 (i.e. the Xeon with HDD), it was recognized that the first operation always took significantly longer. The first operation was hashing to the standard hash length of 32 bytes, but the second operation, hashing to 500 bytes, took only a fraction of the time of the previous one. For example, BLAKE3 hashed a 10 GB file to a 32-byte output in 14.59 seconds. Directly after this, the same file was hashed to 500 bytes in 1.80 seconds, so the second operation needed 12.79 seconds less. This behavior looked quite strange at first. O'Connor from the BLAKE3 team was asked for support in understanding the behavior; see issue 172 at https://github.com/BLAKE3-team/BLAKE3. After several manual retests and assumptions, the most likely reason for the behavior is the reading and data-caching process when HDDs are in use. Each hash operation pointed to the same input file, and the system then works with cached data when the next operation points to the same file (name). This big difference (e.g. 12.79 seconds in the case of a 10 GB input) was only observed when HDDs were in use. If SSDs were in use, the difference was in most cases only 10-200 milliseconds. In consequence, all tests for 32-byte outputs were repeated on the systems to ensure comparability.

Ideally, each hash operation would use a file with a different name, even if the files are identical and of the same size (e.g. "key_10GB.file.A", "key_10GB.file.B", etc.). That way, the tests would always include the reading speed. However, this was not necessary for the coursework, since it aimed to compare the three hash functions against each other. The setup was the same for every hash function on each system, so comparability between the hash functions is ensured and the noise level is reduced.

The hash result was written to a file after each hash operation. Finally, writing the output to /dev/null instead of to disk was discussed. After manual retests it was identified that this would not solve the aforementioned problem, and it was therefore not implemented._


What do you think? If I do not get feedback by tomorrow 2 pm German time, I will not include your name, to be safe. Of course you can get a copy of the final/draft coursework if you wish.

oconnor663 commented 3 years ago

I'm not sure what it is you're asking for here. Maybe let's take this part to email. You can reach me at oconnor663@gmail.com.