dotnet / roslyn

The Roslyn .NET compiler provides C# and Visual Basic languages with rich code analysis APIs.

use different hash algorithm than SHA1 in checksum for better performance #33411

Closed: heejaechang closed this issue 5 days ago

heejaechang commented 5 years ago

@sharwell believes SHA1 is too slow for our checksum, so he wants us to use a different algorithm to make hashing faster.

sharwell commented 5 years ago

We are reviewing MurmurHash 3 internally. The implementation I'm most familiar with is the one I used in ANTLR 4: https://github.com/antlr/antlr4/blob/1c6c62afc7cbef464ba71d332c8c1ae50d54e8d5/runtime/Java/src/org/antlr/v4/runtime/misc/MurmurHash.java https://github.com/tunnelvisionlabs/antlr4cs/blob/4874e010bafe828192b6bf436a2a7d02bab161ec/runtime/CSharp/Antlr4.Runtime/Misc/MurmurHash.cs

/cc @VladimirReshetnikov
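
For reference, here is a minimal sketch of the MurmurHash 3 core (the 32-bit x86 variant the linked ANTLR helpers build on), hashing a sequence of 32-bit words. The class and method names are illustrative only; this is not Roslyn or ANTLR code.

```csharp
using System;

internal static class MurmurHash3Sketch
{
    private const uint C1 = 0xcc9e2d51;
    private const uint C2 = 0x1b873593;

    private static uint RotateLeft(uint value, int bits)
        => (value << bits) | (value >> (32 - bits));

    public static uint Hash(ReadOnlySpan<uint> words, uint seed = 0)
    {
        uint hash = seed;

        foreach (uint word in words)
        {
            // Mix each 32-bit block into the running hash.
            uint k = word * C1;
            k = RotateLeft(k, 15);
            k *= C2;

            hash ^= k;
            hash = RotateLeft(hash, 13);
            hash = hash * 5 + 0xe6546b64;
        }

        // Finalization: fold in the input length (in bytes) and avalanche the bits.
        hash ^= (uint)(words.Length * 4);
        hash ^= hash >> 16;
        hash *= 0x85ebca6b;
        hash ^= hash >> 13;
        hash *= 0xc2b2ae35;
        hash ^= hash >> 16;
        return hash;
    }
}
```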

sharwell commented 5 years ago

📝 I'll also review the keys for collision resilience and set up an A/B test that uses a very poor hash to ensure collisions occur periodically during our testing so we understand the impact it would have at less frequent occurrences in the wild.

davidwengier commented 5 years ago

It would be worthwhile to validate assumptions around SHA256 at the same time, to settle that argument once and for all.

sharwell commented 5 years ago

Alternative proposal: unique ID generation. This would have essentially the runtime behavior of lazily calculating the hash of an object by simply calling Guid.NewGuid(), but could be done in a much faster manner with fewer stored bits since we only need the number to be unique within the devenv+OOP processes.
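
A minimal sketch of what that unique-ID alternative could look like, with hypothetical names (this is not an existing Roslyn type): lazily assign each object a number from a process-wide counter instead of hashing its content.

```csharp
using System.Threading;

internal static class AssetIdGeneratorSketch
{
    private static long s_nextId;

    // Far cheaper than hashing content and far fewer bits than Guid.NewGuid(),
    // but the ID says nothing about the content it was assigned to.
    public static long GetNextId() => Interlocked.Increment(ref s_nextId);
}
```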

heejaechang commented 5 years ago

The checksum is designed to be content based so it can be stateless. We already have a unique ID (VersionStamp) that is only valid within the same VS session, with an attempt to reuse it between sessions using the file stamp.

Having a system at the lower layer that can't guarantee data integrity between processes or between VS sessions makes any system built on top of it hard to build and maintain. Basically, it requires something to guarantee the validity of the data by holding extra info, which makes it a stateful system.

A content-based checksum lets one pick up data such as diagnostics, bloom filters, etc. by recalculating the checksum of the source content itself; it doesn't matter who generated the data as long as the checksum is the same, which makes it easy to distribute (no central control). Content cached by checksum can also be validated by recalculating the checksum again, making the whole system stateless.
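
A small sketch of the stateless, content-addressed lookup described above, with hypothetical names. Any process can recompute the key from the content alone and look up or validate derived data without knowing who produced it.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

internal static class ContentAddressedCacheSketch
{
    public static string ComputeKey(string sourceText)
    {
        using var sha = SHA256.Create();
        byte[] hash = sha.ComputeHash(Encoding.UTF8.GetBytes(sourceText));
        return BitConverter.ToString(hash); // hex string used as the cache key
    }

    public static bool TryGetCachedData<T>(IReadOnlyDictionary<string, T> cache, string sourceText, out T data)
        // No version stamps or ownership tracking: the key is derived purely from the content.
        => cache.TryGetValue(ComputeKey(sourceText), out data);
}
```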

heejaechang commented 5 years ago

I already know the impact of a collision. We use the checksum as a key for data; that data can be the content itself, or information calculated from the content.

A collision means the wrong content (source, metadata, options, or any other data the solution uses) is used to build the solution, or calculated information such as a symbol index or syntax tree index for the wrong content is used by features.

The user-facing impact would be things like wrong diagnostics showing up, FAR showing the wrong results, Add Using not showing up, Go To not showing a match, and so on, or a crash if completely wrong data gets picked up from the cache.

The system assumes there is no collision.

...

By the way, if you want to see it for yourself, change the checksum type to have the same value for everything and see what happens; there's no reason to use a very poor hash to see the impact.

heejaechang commented 5 years ago

we only need the number to be unique within the devenv+OOP processes.

This is not true. That is just one usage case for the checksum. For example, once things are synced between devenv and OOP, the checksum is used to verify the content is correct.

sharwell commented 5 years ago

For example, once things are synced between devenv and OOP, the checksum is used to verify the content is correct.

Right, that's what I meant. Collisions are bad if they happen in either process, but maybe don't matter if they happen in an unrelated process.

sharwell commented 5 years ago

The checksum is designed to be content based so it can be stateless.

Is there any correctness requirement that the checksum be based on content, or is that just an implementation detail? Using a unique ID assignment to avoid the recalculation of checksums from content could save us a substantial amount of processing.

Having a system at the lower layer that can't guarantee data integrity between processes or between VS sessions makes any system built on top of it hard to build and maintain. Basically, it requires something to guarantee the validity of the data by holding extra info, which makes it a stateful system.

Failure to correctly keep track of unique IDs would result in cache misses across processes, but would not result in the use of invalid data (or more specifically, the incorrect reuse of data from one context in a different context).

By the way, if you want to see it for yourself, change the checksum type to have the same value for everything and see what happens; there's no reason to use a very poor hash to see the impact.

Sometimes this degrades performance to the point that the system becomes unusable. Also, some systems that produce incorrect results in a collision scenario self-correct when the checksums no longer collide. These systems rely on the infrequent occurrence of collisions and are not able to handle cases where collisions are constant. It's easy to ensure collisions are rare but difficult to make them non-existent.

heejaechang commented 5 years ago

The whole system is built on top of the promised collision resistance strength of SHA. What you are saying is like asking what is so wrong with two digital certificates having the same signature.

gafter commented 5 years ago

What usage of SHA1 are we talking about? In the compiler?

davidwengier commented 5 years ago

@gafter There are uses in the compiler and IDE side, though this task is specifically for IDE. The compiler work item is at https://devdiv.visualstudio.com/DevDiv/_workitems/edit/788330

CyrusNajmabadi commented 5 years ago

@sharwell believes SHA1 is too slow for our checksum

Do we have any data for that? How much time do we spend on sha1 hashing currently?

sharwell commented 5 years ago

Do we have any data for that? How much time do we spend on sha1 hashing currently?

It's shown up several times internally, but hasn't risen to the point that it took priority over other perf work. A specialized testing harness will need to be created that covers the checksum process (this issue) and serialization process (#23331).

blowdart commented 5 years ago

Also, please come talk to @GrabYourPitchforks and me about this. Given that we use marvin32 in strings, it'd be an interesting discussion around why murmur and not marvin.

CyrusNajmabadi commented 5 years ago

Note: roslyn already has an impl of murmur hash (v2 though) here: https://github.com/dotnet/roslyn/blob/master/src/Workspaces/Core/Portable/Shared/Utilities/BloomFilter.cs

I tested it originally, and found that on all of Roslyn's inputs we never collided with a suitably high number of bits (which I think was less than the 160 used by SHA1). Note that for Roslyn's needs, we're not going for cryptographic strength. Nor do we have any sort of attacker/adversary. All we're trying to do is hash fast with an extremely low chance of collisions. We then use that hash for content lookup. An inappropriate collision is undesirable, but isn't something that would be disastrous.

blowdart commented 5 years ago

A few things I care about here:

The system assumes there is no collision.

That's not true for any hashing algorithm. A good algorithm makes it unlikely, but it's never impossible.

And then there's the phrase "guarantee data integrity".

Both of which become security flags :) Admittedly this could be a misunderstanding based on wording, and given what Roslyn does it probably is, but still, I'd like to get a clear concept of what the hashes are used for.

CyrusNajmabadi commented 5 years ago

Both of which become security flags

This is not a security domain. We are just using this for content hashing: a simple way to have our editor refer to a snapshot of a file. We've gone through the threat modeling here in the past. The use of a hashing algorithm here is just to get a content ID; it's not for a cryptographic security purpose. Please don't make it into one.

but still, I'd like to get a clear concept of what the hashes are used for.

It's very simple. We often persist harmless data to disk (like an index of all the names used in a file). We want to know if we should update that index, so we store a hash of the original file that was used to compute the index. If that hash changes, we recompute the index. If it doesn't, we keep the old index around.

The worst case here is that we somehow get a staggeringly unlikely collision. In that case, our index will just be a little incorrect. And all that means is that if you do something like run "find all references" in VS, you get a slightly wrong answer. Note that to even get a collision would mean the user was somehow trying to subvert us, as any sort of normal edit/change has less chance of colliding than the world spontaneously imploding tomorrow.
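
A sketch of that persisted-index pattern with hypothetical names: store the content hash alongside the index and rebuild only when the stored hash no longer matches the current file content.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

internal sealed class NameIndexCacheSketch
{
    private readonly Dictionary<string, (string ContentHash, string[] Names)> _byFilePath =
        new Dictionary<string, (string ContentHash, string[] Names)>();

    public string[] GetNames(string filePath, string currentText)
    {
        string currentHash = ComputeHash(currentText);

        if (_byFilePath.TryGetValue(filePath, out var entry) && entry.ContentHash == currentHash)
            return entry.Names;                       // content unchanged: reuse the old index

        string[] names = BuildIndex(currentText);     // content changed: recompute
        _byFilePath[filePath] = (currentHash, names);
        return names;
    }

    private static string ComputeHash(string text)
    {
        using var sha = SHA256.Create();
        return BitConverter.ToString(sha.ComputeHash(Encoding.UTF8.GetBytes(text)));
    }

    // Stand-in for a real name index: just the whitespace-separated tokens.
    private static string[] BuildIndex(string text)
        => text.Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries);
}
```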

CyrusNajmabadi commented 5 years ago

That's not true for any hashing algorithm. A good algorithm makes it unlikely, but it's never impossible.

It's fine for our purposes. The chance of collisions under any sort of realistic use case is so low as to be effectively impossible. There's a better chance of RAM bits being flipped by cosmic rays than there is of a collision here, and we don't worry about or try to deal with the former.

blowdart commented 5 years ago

Ah now I remember. SHA1 just puts things on my radar.

Moving away from sha1 or even sha256 would remove the need for the endless exemption. Still wondering why murmur over marvin, might be interesting to compare performance there.

(But you know if you say SHA three times in front of a mirror then either Levi or I will appear)

CyrusNajmabadi commented 5 years ago

Ah now I remember. SHA1 just puts things on my radar.

:)

Still wondering why murmur over marvin, might be interesting to compare performance there.

I don't remember the history (even though I did the murmur impl). Murmur was just extremely simple to get right, doesn't allocate, and has lots of available information on the net, as well as public domain code to refer to. I'm looking for info on marvin hashing, and not much is coming up. So it probably was just a decision to use something that fit our needs and would be supremely easy to use.

If marvin is better, I would have no problem with Roslyn switching to it. Is there a nice .NET version available somewhere?

CyrusNajmabadi commented 5 years ago

Roslyn's critical requirements for a hash function:

  1. Cannot be MD5 or SHA1. These are banned for all purposes (even non-cryptographic scenarios).
  2. Non-allocating.
  3. Can provide a reasonably uniform distribution of hash results from inputs that are System.Strings (not bytes/arrays).
  4. Needs to handle strings in both a case-sensitive and case-insensitive manner (again, without allocating).
  5. Fast. We potentially need to hash millions of strings, and we are constantly rehashing as users edit.
  6. Deterministic across sessions, i.e. we need people to be able to close VS, log in/out, reboot, etc. and get the exact same result on the same inputs.

Non-goals for the hash function:

  1. Cryptographically secure.
  2. Handle non-strings.

murmur2 fit all of the above, and asking the BCL for an equivalent at the time yielded nothing built-in that we could use.
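
One way to meet requirements 2 through 6 without allocating is to feed case-normalized characters into a simple deterministic hash, sketched below with FNV-1a purely for illustration (this is not the murmur2 code Roslyn actually uses).

```csharp
internal static class CaseInsensitiveStringHashSketch
{
    private const uint FnvOffsetBasis = 2166136261;
    private const uint FnvPrime = 16777619;

    public static uint Hash(string text, bool ignoreCase)
    {
        uint hash = FnvOffsetBasis;
        foreach (char c in text)
        {
            char ch = ignoreCase ? char.ToLowerInvariant(c) : c; // per-char normalization, no string allocation
            hash = (hash ^ ch) * FnvPrime;
        }
        return hash; // same input always yields the same value, across processes and reboots
    }
}
```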

blowdart commented 5 years ago

We use marvin for string hash codes in Core, I believe.

Not really fussed here, it's not crypto related :)

CyrusNajmabadi commented 5 years ago

We use marvin for string hash codes in Core, I believe.

Do you know if it would satisfy all of the above requirements? For example, could you get a case-insensitive hash using marvin in a non-allocating fashion?

Not really fussed here, it's not crypto related :)

No worries! I'm just trying to learn more myself. I would be happy to get rid of our hand-written code if there is framework code that can do it for us!

blowdart commented 5 years ago

I honestly don't know, because that dips into implementation details, and I'm just a PM :)

@GrabYourPitchforks might.

GrabYourPitchforks commented 5 years ago

https://apisof.net/catalog/System.String.GetHashCode(ReadOnlySpan%3CChar%3E,StringComparison) can give you a case-insensitive hash code without allocations, but unfortunately it's a 3.0-only API. If you have downlevel requirements I don't think there's anything built in that would satisfy your exact requirements.

Edit: Ah, didn't see that you need stability. Will think on this a bit more.
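
A small usage sketch of the API linked above, available on .NET Core 3.0+. Note that string hash codes in .NET Core are randomized per process, so this meets the non-allocating and case-insensitive requirements but not the cross-session stability requirement mentioned in the edit.

```csharp
using System;

ReadOnlySpan<char> identifier = "SomeSymbolName".AsSpan();
int hash = string.GetHashCode(identifier, StringComparison.OrdinalIgnoreCase);
Console.WriteLine(hash); // differs between runs because of hash code randomization
```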

heejaechang commented 5 years ago

Most of our hashing is for streams (byte[]). We don't use strings directly, but go through SourceText, which doesn't store text as a string but as chunks, or through an abstraction over a stream (byte[]). Also, we get the checksum from SourceText through its API directly, which uses SHA1 and will keep using SHA1 as long as the compiler does.

So the new hash algorithm we want is for the other kinds of data, such as options, metadata, analyzers, and so on.

If we have a hash algorithm whose collision resistance we can rely on, with better perf than SHA even though its theoretical collision resistance is weaker, then I am +1 for using that hash algorithm as long as we don't implement it ourselves.

If we need to implement it ourselves, I would just take SHA256. With this change (https://github.com/dotnet/roslyn/pull/33496), the perf of SHA1 vs SHA256 (CPU and memory) should be comparable.

We can also reduce allocations once we move to 3.0, as @GrabYourPitchforks said.
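
A sketch of the kind of allocation reduction being discussed: reuse an IncrementalHash instance (available on .NET Framework 4.6.2+ and .NET Core) rather than creating a new hash object per checksum. The names here are illustrative, not the actual change in the linked PR.

```csharp
using System;
using System.Security.Cryptography;

internal static class ChecksumSketch
{
    [ThreadStatic]
    private static IncrementalHash t_sha256;

    public static byte[] ComputeChecksum(byte[] data)
    {
        IncrementalHash sha = t_sha256 ?? (t_sha256 = IncrementalHash.CreateHash(HashAlgorithmName.SHA256));
        sha.AppendData(data);
        return sha.GetHashAndReset(); // resets the internal state so the instance can be reused
    }
}
```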

GrabYourPitchforks commented 5 years ago

How much collision resistance are you going for here? If the hashes aren't being used for security purposes you have wide berth to do whatever you need in order to get the performance goals you seek. On one extreme end, you could even pull in the old String.GetHashCode implementation, and we could do crazy things to it like SIMD-optimize it. It wouldn't be secure, but if you're only ever operating on trusted input I don't see why anybody would block it for process reasons.

Once you drop from 160-bit hashes (SHA-1) down to 32-bit hashes your odds of seeing a collision increase substantially. So you would still need some way to distinguish two files which happen to have the same hash code, not because anybody purposefully generated such files, but because given a large enough customer base and enough invocations of the tool it's bound to happen at some point just by chance alone.
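
Back-of-the-envelope birthday arithmetic behind that point: the expected number of colliding pairs among n random b-bit hashes is roughly n^2 / 2^(b+1). The input counts below are illustrative.

```csharp
using System;

double ExpectedCollidingPairs(double n, int bits) => (n * n) / Math.Pow(2, bits + 1);

Console.WriteLine(ExpectedCollidingPairs(1e6, 32));  // ~116: collisions are expected among a million 32-bit hashes
Console.WriteLine(ExpectedCollidingPairs(1e6, 160)); // ~3e-37: effectively zero with 160-bit hashes
```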

CyrusNajmabadi commented 5 years ago

How much collision resistance are you going for here?

A fair amount. It's not good if we get a collision. We can assume no one is trying to force a collision, but we want to avoid collisions happening as part of normal processing. 32-bit hash codes are generally too weak for our needs, as it's way too easy to collide.

SHA1 is totally sufficient for our needs. The chance of a collision (since we don't have an attacker trying to force one) is effectively nil.

CyrusNajmabadi commented 5 years ago

then I am +1 for using that hash algorithm as long as we don't implement it ourselves.

Totally agreed. I would prefer we never have to write this ourselves.

CyrusNajmabadi commented 5 years ago

Edit: Ah, didn't see that you need stability. Will think on this a bit more.

Thanks!

Note: it's acceptable if a rare event causes new hashes, e.g. a new version of a framework. We don't absolutely need stability for all time. However, we definitely need stability across common events, i.e. people using VS day in and day out for months.

If the hashes change, we have to regenerate all our indices. That's acceptable if it happens rarely (1-2 times a year), but really bad if it happens weekly or more, as the indices may be quite expensive to recompute.

GrabYourPitchforks commented 5 years ago

I spent some time looking at other uses and internal recommendations. It looks like xxHash32 and xxHash64 (https://github.com/Cyan4973/xxHash) are very popular. .NET uses this for its own HashCode class (but again, we randomize it by choosing a random seed and not allowing you to control it, so you'd need to carry your own logic). It also looks like other teams within MS are having success with it. Might be worth investigating?

I bet if there's demand we could provide a non-cryptographic hash code class to the BCL. It's something that's come up in discussion a few times but until now we never had a good forcing function.
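
For illustration, a sketch of what using xxHash might look like. XxHash64 here comes from the System.IO.Hashing NuGet package the BCL team published after this thread; unlike the HashCode class, it takes no per-process random seed, so the result is stable across sessions.

```csharp
using System;
using System.IO.Hashing; // NuGet package System.IO.Hashing (shipped after this discussion)
using System.Text;

byte[] content = Encoding.UTF8.GetBytes("class C { }");

ulong stableHash = XxHash64.HashToUInt64(content);    // deterministic across processes and reboots
Console.WriteLine(stableHash);

int perProcessHash = HashCode.Combine("class C { }"); // randomized per process, so it cannot be persisted
Console.WriteLine(perProcessHash);
```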

tmat commented 5 years ago

@GrabYourPitchforks I'm not sure what you mean by saying there was no good forcing function. The moment it was decided that SHA1 needs to be avoided even for non-crypto purposes, an alternative should have been provided. Besides that, we asked for this functionality at least a year ago.

CyrusNajmabadi commented 5 years ago

it was decided that SHA1 needs to be avoided even for non-crypto purposes

Can someone explain the reasoning here? It's not crypto... so what's wrong with using SHA1? Is there something about it that makes it worse for non-crypto use versus murmur/marvin/xxHash, etc.?

an alternative should have been provided.

I agree. This is a pretty simple use case, and it's unclear what the alternatives should be.

GrabYourPitchforks commented 5 years ago

Ok, I've engaged some of the other runtime folks about what solutions we can provide here to unblock you. Will keep you updated as things progress.

GrabYourPitchforks commented 5 years ago

BTW can you give us the lengths of input (in bytes) that are typical of your scenarios? I imagine if we're talking source files there's going to be a very wide range, but it'd still be good to know what the "typical" input looks like.

heejaechang commented 5 years ago

For collisions, regardless of theoretical strength, we want no collision ever happening in practice, not just within one repo but across repos, as long as someone is not purposely trying to make things collide. And that is why we chose SHA1.

..

I just ran this on my roslyn src folder

```powershell
$foo = (Get-ChildItem -path "c:\your\file\path" -recurse | measure-object | select -expand Count)
$bar = ((Get-ChildItem -path "c:\your\file\path" -recurse | Measure-Object -property length -sum).sum / 1MB)
$avrg = $bar / $foo
$avrg
```

and it gives 16KB as average.

...

But I don't think we want a hash that shows different characteristics based on the size of the input.

sharwell commented 5 years ago

Currently the implementation assumes there will not be collisions, and does not implement collision detection or handling in the synchronization algorithm. The most likely observed side effect following a collision would be an application crash. However, assuming this was corrected, the characteristics would change substantially. The application would be highly tolerant of short-lived collisions (e.g. a checksum generated while typing that changes the next time a character is entered), but less tolerant of long-lived collisions (e.g. a user completes typing while the document is in a colliding state). In addition, the observed impact of collisions between objects of different types could be restricted to performance degradation proportional to the serialized size of the objects.

The majority of hashes generated by the system are used for synchronization of transient assets between processes on the same system. All of these hashes could be replaced with an object tagging system (rather than a content-based hash) to substantially reduce the startup cost and guarantee collisions are avoided.

CyrusNajmabadi commented 5 years ago

Currently the implementation assumes there will not be collisions, and does not implement collision detection or handling in the synchronization algorithm.

Right. All of that would be unnecessary (given that we don't have hostile actors we have to be worried about). The chance of a collision is so low as to be practically meaningless. As mentioned above, it would make more sense for us to worry about bitflips and how to deal with them versus needing to be concerned with hash collisions here.

GrabYourPitchforks commented 5 years ago

Right. All of that would be unnecessary (given that we don't have hostile actors we have to be worried about). The chance of a collision is so low as to be practically meaningless.

Are the hashes computed over source documents? If so, are they persisted anywhere? (You mentioned needing something stable across reboots.) The scenario I'm thinking is that somebody opens a .cs file provided by an adversary, and it persists something to the file system that causes VS to misbehave from then forward due to a collision.

CyrusNajmabadi commented 5 years ago

The scenario I'm thinking is that somebody opens a .cs file provided by an adversary, and it persists something to the file system that causes VS to misbehave from then forward due to a collision.

This is already covered by the VS threat model. Opening a .cs file from an adversary has never been safe (for many reasons**), so VS even warns in this case to only open things you trust.

--

** For example, things like designers will start up and will literally examine and just execute code in the .cs file. The threat from code execution is staggeringly higher than any sort of concern about producing slightly incorrect IDE assistance because we got a file that collided.

GrabYourPitchforks commented 5 years ago

Hi all, spent some time digging around. There's support on the runtime side for exposing non-cryptographic hash algorithms via their own APIs (see https://github.com/dotnet/corefx/issues/25666 for one proposal). Just pulling things out of thin air my gut tells me that if this API gets off the ground we'd probably start with CRC32, Marvin32/64, and xxHash32/64 as a first pass, since the worker code to all of those already exists in Core in some fashion. I'm not optimistic about this making it in to 3.0, but we could ask @terrajobst and others to fast-track it when 3.1 opens up.

tmat commented 5 years ago

@GrabYourPitchforks @terrajobst Would it be possible to publish the implementation as a source package? Also, we need the impl to work on .NET 4.7.2, at least once it has a hash algorithm that we can use as a replacement for SHA1.

terrajobst commented 5 years ago

Would it be possible to publish the implementation as a source package?

Possible yes, but not desirable. We just ran through this exercise for JSON. It's too fragile, even for internal use.

bcronce commented 5 years ago

I was doing some prototyping on a system where I needed a uniqueness guarantee but didn't really care about crypto. I found a pure native implementation of Blake2 from their official site that outperformed the .NET SHA1 for me.

Blake2 was a SHA3 finalist and only lost because it's more difficult to implement in hardware, but it's really fast in software. It was made by some of the best cryptographers in the business, who have a track record of creating crypto and finding faults in other crypto.

I consider Blake2 my go-to hashing algorithm when I want the properties of crypto strength but don't care about attackers.

CyrusNajmabadi commented 4 years ago

@sharwell do we need this issue? We have moved to SHA256 here:

https://github.com/dotnet/roslyn/blob/153fec7d2ada17d32788e60e97dd9c297ce610b4/src/Workspaces/Core/Portable/Workspace/Solution/Checksum_Factory.cs#L24

If we have an actual scenario where we see hash perf showing up, we can use that as a driving function. However, this issue seems speculative and not necessary to just have floating around.

saucecontrol commented 4 years ago

SHA256 is still quite slow. If this is a bottleneck, I've published a BLAKE2 implementation that's roughly 5x faster than Windows CNG's SHA256: https://github.com/saucecontrol/Blake2Fast#blake2fast-vs-net-in-box-algorithms-md5-and-sha2

That's making use of AVX2 intrinsics in netcoreapp3.x, but the scalar fallback is still at least twice as fast as SHA256, even on netfx 32-bit.
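
A hedged usage sketch, assuming Blake2Fast exposes a static Blake2b.ComputeHash over bytes under the SauceControl.Blake2Fast namespace; check the linked README for the exact API, since names may differ by version.

```csharp
using System;
using System.Text;
using SauceControl.Blake2Fast; // assumed namespace for the Blake2Fast NuGet package

byte[] content = Encoding.UTF8.GetBytes("class C { }");
byte[] digest = Blake2b.ComputeHash(32, content); // request a 32-byte (256-bit) digest
Console.WriteLine(BitConverter.ToString(digest));
```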

CyrusNajmabadi commented 4 years ago

@saucecontrol does this work for these requirements: https://github.com/dotnet/roslyn/issues/33411#issuecomment-465272354 ?

saucecontrol commented 4 years ago

It works as-is for everything except case-insensitive string hashing. That can be done by pushing individual case-normalized characters (or perhaps fixed windows of case-normalized substrings) into the hash state. The API is Span-based, so it's non-allocating and flexible.

saucecontrol commented 4 years ago

I assume you don't care about endianness if you're wanting to hash System.String values directly? Or would you want something that normalizes both case and byte order?