Closed · Rudxain closed this 1 year ago
It seems the problem is the file byte iterator. I only did the benchmark comparison with 1 filepath on Linux; I didn't check stdin.
I guess we need a buffer to iterate over u128s. That's like loop unrolling, but at the data level.
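The "data-level unrolling" idea might look something like the sketch below: XOR-fold the input 16 bytes at a time as u128 lanes instead of byte-by-byte. This is an illustration of the technique, not xorsum's actual code; `xor_fold_u128` is a hypothetical name.

```rust
// Sketch: XOR-fold input 16 bytes at a time as u128 "lanes",
// instead of one byte per loop iteration.
fn xor_fold_u128(data: &[u8]) -> u128 {
    let mut acc = 0u128;
    let mut chunks = data.chunks_exact(16);
    for chunk in &mut chunks {
        // chunk is exactly 16 bytes, so the conversion cannot fail
        acc ^= u128::from_le_bytes(chunk.try_into().unwrap());
    }
    // fold any tail bytes into the low lanes
    for (i, &b) in chunks.remainder().iter().enumerate() {
        acc ^= (b as u128) << (8 * i);
    }
    acc
}

fn main() {
    // two identical 16-byte blocks cancel each other out
    let mut v = vec![0xABu8; 16];
    v.extend_from_slice(&[0xABu8; 16]);
    assert_eq!(xor_fold_u128(&v), 0);
}
```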
This is strange: I did another bench with stdin, and XS was almost as fast as MD5, but providing a file path was much slower than MD5. The bottleneck is definitely the file byte iterator.
After adding buffering, it's still somewhat slow. I read about automatic vectorization, and learned that explicitly using u128s won't help performance at all. Even with no mmap nor concurrency, it should be as fast as MD5 (if my calculations are correct), so I don't know what's wrong with my code.
I think it may be a good idea to manually use an array as a buffer. Every time a file is opened, a new BufReader is created. I haven't checked the internal implementation, but I guess this means a new allocation is done for every file.
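A minimal sketch of that fix: let the caller allocate one buffer and feed every input through it, instead of letting each per-file BufReader allocate its own. `consume` is a hypothetical name; the digest step is omitted.

```rust
use std::io::{self, Read};

// Sketch: drain any Read source through one caller-owned, reusable buffer.
// Returns the total number of bytes that would have been fed to the digest.
fn consume<R: Read>(mut src: R, buf: &mut [u8]) -> io::Result<u64> {
    let mut total = 0u64;
    loop {
        let n = src.read(buf)?;
        if n == 0 {
            break; // EOF
        }
        // the digest update on &buf[..n] would go here
        total += n as u64;
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    let mut buf = vec![0u8; 8 * 1024]; // one allocation for ALL inputs
    let data = [7u8; 20_000];
    assert_eq!(consume(&data[..], &mut buf)?, 20_000);
    Ok(())
}
```

Opening a `File` per path and passing it to `consume` with the same `buf` would keep the allocation count at one regardless of how many files are hashed.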
Another nice optimization would be to check the file size and load the whole file before processing, but only if the file is "small enough". That would require getting info about free memory at runtime.
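A sketch of the "small enough" heuristic, using a fixed threshold as a placeholder (the comment above wants it derived from free memory at runtime, which this does not do). `maybe_slurp` and `SMALL_FILE_LIMIT` are illustrative names.

```rust
use std::fs;
use std::io;

const SMALL_FILE_LIMIT: u64 = 16 * 1024 * 1024; // 16 MiB, arbitrary placeholder

// Returns Some(contents) when the file is small enough to slurp whole;
// None means the caller should fall back to streaming through a buffer.
fn maybe_slurp(path: &str) -> io::Result<Option<Vec<u8>>> {
    let len = fs::metadata(path)?.len();
    if len <= SMALL_FILE_LIMIT {
        fs::read(path).map(Some) // one allocation, few syscalls
    } else {
        Ok(None)
    }
}

fn main() -> io::Result<()> {
    let p = std::env::temp_dir().join("xorsum_demo.bin");
    fs::write(&p, b"tiny")?;
    assert_eq!(maybe_slurp(p.to_str().unwrap())?.as_deref(), Some(&b"tiny"[..]));
    fs::remove_file(&p)?;
    Ok(())
}
```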
I realized one of the reasons it's slower than MD5 is the heap-allocated sbox. MD5's state has a size that's constant at compile time; XOR-hash's doesn't. I read somewhere that reading the heap is slower than reading static memory.
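The distinction being drawn might be sketched like this: a compile-time-sized state can live in a plain array with no allocation, while a runtime-sized sbox forces a heap allocation plus a pointer indirection on every access. The names and sizes below are illustrative, not xorsum's actual constants.

```rust
const DIGEST_LEN: usize = 8; // illustrative, not xorsum's real default

// Size known at compile time: plain array, no allocation.
fn fixed_state() -> [u8; DIGEST_LEN] {
    [0u8; DIGEST_LEN]
}

// Size only known at runtime: heap allocation, accessed through a pointer.
fn dynamic_state(len: usize) -> Vec<u8> {
    vec![0u8; len]
}

fn main() {
    assert_eq!(fixed_state().len(), DIGEST_LEN);
    assert_eq!(dynamic_state(DIGEST_LEN).as_slice(), &fixed_state()[..]);
}
```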
eugene2k notified me that there are too many allocs in u8vec_to_hex.
Update 1: while I'm not using base16 or hex, I should preallocate using String::with_capacity.
Update 2: done
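A sketch of what the preallocation fix could look like: reserve 2 hex chars per byte up front, so the String never reallocates mid-loop. The function name comes from the thread; the body is an assumed shape, not the repo's actual implementation.

```rust
use std::fmt::Write;

// Sketch: hex-encode with one up-front allocation instead of many.
fn u8vec_to_hex(bytes: &[u8]) -> String {
    // 2 hex chars per byte; with_capacity avoids repeated reallocation
    let mut s = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        // write! into a String is infallible, so unwrap is safe here
        write!(s, "{b:02x}").unwrap();
    }
    s
}

fn main() {
    assert_eq!(u8vec_to_hex(&[0x50, 0x65, 0xa2]), "5065a2");
}
```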
print! and println! lock stdout on every call. This comment has useful info on how to handle this latency.
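The usual fix is to take the stdout lock once and buffer writes, rather than paying a lock per println! in a hot loop. A minimal sketch; `write_digests` is a hypothetical helper standing in for the output loop:

```rust
use std::io::{BufWriter, Write};

// Writes all output lines through a single Write sink.
fn write_digests<W: Write>(mut out: W) -> std::io::Result<()> {
    for i in 0..3 {
        writeln!(out, "digest {i}")?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let stdout = std::io::stdout();
    // lock once, buffer, write many lines, flush once at the end
    let mut out = BufWriter::new(stdout.lock());
    write_digests(&mut out)?;
    out.flush()?;
    Ok(())
}
```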
Related: #19
IT'S OFFICIAL!!! It became faster than MD5 AND BLAKE3! Huge thanks, congratulations, and shout-out to @Measter! Here are my benchmark results for v3.1.5:
# for reference
head -c $(( 0x10000000 )) /dev/urandom > rand; time xorsum rand
5065a2feeebfc9bf rand
real 0m0.139s
user 0m0.114s
sys 0m0.025s
# very fast! but now the comparison...
head -c $(( 0x10000000 )) /dev/urandom > rand; time md5sum rand
eb87700e153590adb5061cfe67a098b1 rand
real 0m0.916s
user 0m0.471s
sys 0m0.045s
# I couldn't believe my eyes, I had to test again
head -c $(( 0x10000000 )) /dev/urandom > rand; time md5sum rand
452d4dd04702b9eab9a48bbefc9ee5ee rand
real 0m0.733s
user 0m0.486s
sys 0m0.032s
# and again
head -c $(( 0x10000000 )) /dev/urandom > rand; time md5sum rand
3b34a0f7706d7bf970d99d0ec7725d42 rand
real 0m0.718s
user 0m0.494s
sys 0m0.024s
# and again! but now back to xorsum
head -c $(( 0x10000000 )) /dev/urandom > rand; time xorsum rand
9b34b2e20b4dcb29 rand
real 0m0.140s
user 0m0.092s
sys 0m0.048s
# damn! that's fast
# but now the ULTIMATE opponent... BLAKE3 (I know XXH is faster)
head -c $(( 0x10000000 )) /dev/urandom > rand; time b3sum rand
765c98be6f0118666aea263827b86260ac59d86008189ee2a2b6079d35446d93 rand
real 0m0.447s
user 0m0.397s
sys 0m0.029s
# "FLAWLESS VICTORY!"
I did some benchmarks, and this is CONSIDERABLY SLOWER than MD5 and Blake2. This is unacceptable. How come such a simple algorithm is slower than a (presumably single-threaded; I didn't check) MD5?