dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.
https://docs.microsoft.com/dotnet/csharp/roslyn-sdk/
MIT License

use different hash algorithm than SHA1 in checksum for better performance #33411

Closed heejaechang closed 5 days ago

heejaechang commented 5 years ago

@sharwell believes SHA1 is too slow for our checksums, so he wants us to use a different algorithm to make hashing faster.

CyrusNajmabadi commented 4 years ago

> That can be done by pushing individual case-normalized characters

Could you link me to where it supports pushing characters? Thanks!

> I assume you don't care about endianness if you're wanting to hash System.String values directly?

Correct. :)

saucecontrol commented 4 years ago

I've updated the API since the last version published to NuGet, but you can try it out with the latest CI build.

You can use Blake2b.CreateIncrementalHasher(), which returns the hash state struct. That has an Update() method that accepts a single value or a Span of values:

https://github.com/saucecontrol/Blake2Fast/blob/master/src/Blake2Fast/Blake2b/Blake2bHashState.cs#L155-L183

So you can call that with aString.AsSpan(), or you could case-normalize a string a chunk at a time into a fixed buffer, or just grab a character at a time to update the hash state. Updating the state simply pushes new bytes into a buffer until a block is full, at which point the actual hash state is updated, so it's very lightweight.
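The chunk-at-a-time approach described above can be sketched as follows. This is an illustration in Python using the standard library's `hashlib.blake2b`, not the Blake2Fast API. The function name `hash_case_normalized`, the 32-byte digest size, the chunk size, the use of `casefold()` for case normalization, and the UTF-16LE encoding (chosen to resemble the in-memory layout of a System.String) are all assumptions made for the example.

```python
import hashlib


def hash_case_normalized(text: str, chunk_chars: int = 4096) -> bytes:
    """Hash a string case-insensitively without building a normalized copy.

    Hypothetical helper: feeds case-folded chunks into an incremental
    BLAKE2b state, mirroring the "push a chunk at a time" idea.
    """
    hasher = hashlib.blake2b(digest_size=32)
    for start in range(0, len(text), chunk_chars):
        # casefold() maps each character independently of its neighbors,
        # so the result does not depend on where the chunks are split.
        chunk = text[start:start + chunk_chars].casefold()
        # update() only appends bytes to the running state; the digest
        # is computed once at the end.
        hasher.update(chunk.encode("utf-16-le"))
    return hasher.digest()


if __name__ == "__main__":
    a = hash_case_normalized("Hello World", chunk_chars=3)
    b = hash_case_normalized("HELLO WORLD", chunk_chars=1024)
    print(a == b)
```

Because the hash state is updated incrementally, the digest is the same as hashing the fully normalized string in one call, whatever chunk size is used.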

GrabYourPitchforks commented 4 years ago

FYI, on 64-bit platforms SHA512 tends to outperform SHA256, especially for larger inputs (anything over a few dozen bytes), as is the case here. If you're going for raw speed and you need to use something built in, it could be a stop-gap measure.

saucecontrol commented 4 years ago

For that matter, MD5 is faster than both SHA2 variants on both platforms if you don't require cryptographic security.

> on 64-bit platforms, SHA512 tends to outperform SHA256

BLAKE2 has similar characteristics. With scalar implementations, the 256-bit BLAKE2s variant runs faster on 32-bit while 512-bit BLAKE2b runs faster on 64-bit. With SIMD implementations, BLAKE2b is always faster.
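The relative speeds mentioned in these comments depend on the platform and the implementation, so they are worth measuring rather than assuming. Below is a minimal timing sketch in Python using `hashlib`. It measures Python's bundled implementations, not the .NET ones discussed in this thread, and the helper name `throughput_mib_per_s`, the 1 MiB payload, and the round count are arbitrary choices for illustration.

```python
import hashlib
import time

# Algorithms named in the thread that Python's standard library provides.
ALGORITHMS = ("md5", "sha1", "sha256", "sha512", "blake2s", "blake2b")


def throughput_mib_per_s(name: str, payload: bytes, rounds: int = 20) -> float:
    """Return a rough throughput figure (MiB/s) for one hash algorithm."""
    start = time.perf_counter()
    for _ in range(rounds):
        hashlib.new(name, payload).digest()
    elapsed = max(time.perf_counter() - start, 1e-9)  # avoid division by zero
    return (len(payload) * rounds) / (1024 * 1024) / elapsed


if __name__ == "__main__":
    payload = bytes(1024 * 1024)  # 1 MiB of zero bytes
    for name in ALGORITHMS:
        print(f"{name:8s} {throughput_mib_per_s(name, payload):10.1f} MiB/s")
```

The numbers will vary with CPU, word size, and whether the underlying library uses SIMD or hardware SHA instructions, so treat a single run as a rough indication only.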

CyrusNajmabadi commented 4 years ago

I'm fine with any system that meets the requirements stated in https://github.com/dotnet/roslyn/issues/33411#issuecomment-465272354. We don't need cryptographic security. We just want a reasonable hasher.

tmat commented 4 years ago

@CyrusNajmabadi In addition to those requirements, it can't be MD5 or SHA1.

CyrusNajmabadi commented 4 years ago

Interesting. Is that an external requirement/mandate, @tmat? Nothing about the scenarios where we use these hashes seems like it would preclude those (at least from Roslyn's perspective).

CyrusNajmabadi commented 4 years ago

Thanks @tmat. I have added that to our criteria.

jack-pappas commented 3 years ago

It may be worth taking a look at BLAKE3 here: it's much faster than SHA-1, SHA-256, MD5, and BLAKE2.

https://github.com/BLAKE3-team/BLAKE3

CyrusNajmabadi commented 5 days ago

Closing. We moved to xxhash128.