Closed · Rudxain closed this 1 year ago
It seems the problem is the file byte iterator. I only did the benchmark comparison with 1 filepath on Linux; I didn't check stdin.
I guess we need a buffer to iterate over u128s. That's like loop unrolling, but at the data level.
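The "data-level unrolling" idea might look something like the sketch below: XOR-fold the input 16 bytes at a time as u128 lanes instead of byte-by-byte. This is an illustration of the technique, not xorsum's actual code; `xor_fold_u128` is a hypothetical name.

```rust
// Sketch: XOR-fold input 16 bytes at a time as u128 "lanes",
// instead of one byte per loop iteration.
fn xor_fold_u128(data: &[u8]) -> u128 {
    let mut acc = 0u128;
    let mut chunks = data.chunks_exact(16);
    for chunk in &mut chunks {
        // chunk is exactly 16 bytes, so the conversion cannot fail
        acc ^= u128::from_le_bytes(chunk.try_into().unwrap());
    }
    // fold any tail bytes into the low lanes
    for (i, &b) in chunks.remainder().iter().enumerate() {
        acc ^= (b as u128) << (8 * i);
    }
    acc
}

fn main() {
    // two identical 16-byte blocks cancel each other out
    let mut v = vec![0xABu8; 16];
    v.extend_from_slice(&[0xABu8; 16]);
    assert_eq!(xor_fold_u128(&v), 0);
}
```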
This is strange: I did another bench with stdin, and XS was almost as fast as MD5, but providing a file path was much slower than MD5. The bottleneck is definitely the file byte iterator.
After adding buffering, it's still somewhat slow. I read about automatic vectorization, and learned that explicitly using u128s won't help performance at all. Even with no mmap nor concurrency, it should be as fast as MD5 (if my calculations are correct), so I don't know what's wrong with my code.
I think it may be a good idea to manually use an array as a buffer. Every time a file is opened, a new BufReader is created. I haven't checked the internal implementation, but I guess this means a new allocation is done for every file.
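A minimal sketch of that fix: let the caller allocate one buffer and feed every input through it, instead of letting each per-file BufReader allocate its own. `consume` is a hypothetical name; the digest step is omitted.

```rust
use std::io::{self, Read};

// Sketch: drain any Read source through one caller-owned, reusable buffer.
// Returns the total number of bytes that would have been fed to the digest.
fn consume<R: Read>(mut src: R, buf: &mut [u8]) -> io::Result<u64> {
    let mut total = 0u64;
    loop {
        let n = src.read(buf)?;
        if n == 0 {
            break; // EOF
        }
        // the digest update on &buf[..n] would go here
        total += n as u64;
    }
    Ok(total)
}

fn main() -> io::Result<()> {
    let mut buf = vec![0u8; 8 * 1024]; // one allocation for ALL inputs
    let data = [7u8; 20_000];
    assert_eq!(consume(&data[..], &mut buf)?, 20_000);
    Ok(())
}
```

Opening a `File` per path and passing it to `consume` with the same `buf` would keep the allocation count at one regardless of how many files are hashed.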
Another nice optimization would be to check the file size and load the whole file before processing, but only if the file is "small enough". That would require getting info about free memory at runtime.
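A sketch of the "small enough" heuristic, using a fixed threshold as a placeholder (the comment above wants it derived from free memory at runtime, which this does not do). `maybe_slurp` and `SMALL_FILE_LIMIT` are illustrative names.

```rust
use std::fs;
use std::io;

const SMALL_FILE_LIMIT: u64 = 16 * 1024 * 1024; // 16 MiB, arbitrary placeholder

// Returns Some(contents) when the file is small enough to slurp whole;
// None means the caller should fall back to streaming through a buffer.
fn maybe_slurp(path: &str) -> io::Result<Option<Vec<u8>>> {
    let len = fs::metadata(path)?.len();
    if len <= SMALL_FILE_LIMIT {
        fs::read(path).map(Some) // one allocation, few syscalls
    } else {
        Ok(None)
    }
}

fn main() -> io::Result<()> {
    let p = std::env::temp_dir().join("xorsum_demo.bin");
    fs::write(&p, b"tiny")?;
    assert_eq!(maybe_slurp(p.to_str().unwrap())?.as_deref(), Some(&b"tiny"[..]));
    fs::remove_file(&p)?;
    Ok(())
}
```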
I realized one of the reasons it's slower than MD5 is the heap-allocated sbox. MD5's state has a size that's constant at compile time; XOR-hash's doesn't. I read somewhere that reading the heap is slower than reading static memory.
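The distinction being drawn might be sketched like this: a compile-time-sized state can live in a plain array with no allocation, while a runtime-sized sbox forces a heap allocation plus a pointer indirection on every access. The names and sizes below are illustrative, not xorsum's actual constants.

```rust
const DIGEST_LEN: usize = 8; // illustrative, not xorsum's real default

// Size known at compile time: plain array, no allocation.
fn fixed_state() -> [u8; DIGEST_LEN] {
    [0u8; DIGEST_LEN]
}

// Size only known at runtime: heap allocation, accessed through a pointer.
fn dynamic_state(len: usize) -> Vec<u8> {
    vec![0u8; len]
}

fn main() {
    assert_eq!(fixed_state().len(), DIGEST_LEN);
    assert_eq!(dynamic_state(DIGEST_LEN).as_slice(), &fixed_state()[..]);
}
```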
eugene2k notified me that there are too many allocs in u8vec_to_hex.
Update 1: while I'm not using base16 or hex, I should preallocate using String::with_capacity.
Update 2: done
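A sketch of what the preallocation fix could look like: reserve 2 hex chars per byte up front, so the String never reallocates mid-loop. The function name comes from the thread; the body is an assumed shape, not the repo's actual implementation.

```rust
use std::fmt::Write;

// Sketch: hex-encode with one up-front allocation instead of many.
fn u8vec_to_hex(bytes: &[u8]) -> String {
    // 2 hex chars per byte; with_capacity avoids repeated reallocation
    let mut s = String::with_capacity(bytes.len() * 2);
    for b in bytes {
        // write! into a String is infallible, so unwrap is safe here
        write!(s, "{b:02x}").unwrap();
    }
    s
}

fn main() {
    assert_eq!(u8vec_to_hex(&[0x50, 0x65, 0xa2]), "5065a2");
}
```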
print! and println! lock stdout on every call. This comment has useful info on how to handle this latency.
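The usual fix is to take the stdout lock once and buffer writes, rather than paying a lock per println! in a hot loop. A minimal sketch; `write_digests` is a hypothetical helper standing in for the output loop:

```rust
use std::io::{BufWriter, Write};

// Writes all output lines through a single Write sink.
fn write_digests<W: Write>(mut out: W) -> std::io::Result<()> {
    for i in 0..3 {
        writeln!(out, "digest {i}")?;
    }
    Ok(())
}

fn main() -> std::io::Result<()> {
    let stdout = std::io::stdout();
    // lock once, buffer, write many lines, flush once at the end
    let mut out = BufWriter::new(stdout.lock());
    write_digests(&mut out)?;
    out.flush()?;
    Ok(())
}
```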
Related: #19
IT'S OFFICIAL!!! It became faster than MD5 AND BLAKE3! Huge thanks, congratulations, and shout-out to @Measter! Here are my benchmark results for v3.1.5:
# for reference
head -c $(( 0x10000000 )) /dev/urandom > rand; time xorsum rand
5065a2feeebfc9bf rand
real 0m0.139s
user 0m0.114s
sys 0m0.025s
# very fast! but now the comparison...
head -c $(( 0x10000000 )) /dev/urandom > rand; time md5sum rand
eb87700e153590adb5061cfe67a098b1 rand
real 0m0.916s
user 0m0.471s
sys 0m0.045s
# I couldn't believe my eyes, I had to test again
head -c $(( 0x10000000 )) /dev/urandom > rand; time md5sum rand
452d4dd04702b9eab9a48bbefc9ee5ee rand
real 0m0.733s
user 0m0.486s
sys 0m0.032s
# and again
head -c $(( 0x10000000 )) /dev/urandom > rand; time md5sum rand
3b34a0f7706d7bf970d99d0ec7725d42 rand
real 0m0.718s
user 0m0.494s
sys 0m0.024s
# and again! but now back to xorsum
head -c $(( 0x10000000 )) /dev/urandom > rand; time xorsum rand
9b34b2e20b4dcb29 rand
real 0m0.140s
user 0m0.092s
sys 0m0.048s
# damn! that's fast
# but now the ULTIMATE opponent... BLAKE3 (I know XXH is faster)
head -c $(( 0x10000000 )) /dev/urandom > rand; time b3sum rand
765c98be6f0118666aea263827b86260ac59d86008189ee2a2b6079d35446d93 rand
real 0m0.447s
user 0m0.397s
sys 0m0.029s
# "FLAWLESS VICTORY!"
I did some benchmarks, and this is CONSIDERABLY SLOWER than MD5 and Blake2. This is unacceptable. How come such a simple algorithm is slower than a (presumably single-threaded; I didn't check) MD5?