dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.29k stars 4.74k forks source link

Proposal: add EqualityComparer for ReadOnlyMemory<char> to enable collections with memory backed strings #27039

Open cdmihai opened 6 years ago

cdmihai commented 6 years ago

MSBuild has a bunch of HashSet and Dictionaries with string keys. Due to how Equals and HashCode on ReadOnlyMemory are implemented, it's not possible to convert these collections to have ReadOnlyMemory as keys. An EqualityComparer which derived from the actual view in the string would help here.

var set = new HashSet<ReadOnlyMemory<char>>(ReadOnlyMemory.StringComparer.OrdinalIgnoreCase);
set.Add("FooBAR".Slice(3));
set.Contains("ZarBar".Slice(3)); // awesome if this returned true
Wraith2 commented 6 years ago

I think this would tie in with https://github.com/dotnet/corefx/issues/31302 in some way but would probably require a specific ReadOnlyMemory<char> comparer to be able to take advantage of the extension methods.

There is also a slight worry that while the name ReadOnlyMemory claims to be readonly it isn't actually guaranteeing anything about the underlying storage being immutable. String instances are (within managed bounds) immutable so using them as keys is ok, can you be certain that any ReadOnlyMemory<T> wrapping some object won't be mutated elsewhere?

benaadams commented 6 years ago

ReadOnlyMemory as actual keys would perform badly and be problematic (wrapping mutable memory, holding reference to entire object just for a slice, lifetime issues)

However maybe as an extension for lookups on a string dictionary or hashset? Though Span would likely be more flexible at that point...

public static CollectionExtensions<TValue>
{
    bool TryGetValue(this Dictionary<string, TValue> dict, ReadOnlySpan<char> key, out TValue value);
    bool ContainsKey(this Dictionary<string, TValue> dict, ReadOnlySpan<char> key);
}

public static CollectionExtensions
{
    bool TryGetValue(this HashSet<string> dict, ReadOnlySpan<char> equalValue, out string actualValue);
    bool Contains(this HashSet<string> dict, ReadOnlySpan<char> key);
}
Wraith2 commented 6 years ago

So those methods would take the span convert to a string and then pass it to the dictionary? Unless the string borrowed the underlying storage that'd require allocation of a new string each time which doesn't seem desirable.

benaadams commented 6 years ago

So those methods would take the span convert to a string and then pass it to the dictionary?

No they'd need to be tied to the implementation; but they'd also only be valid for char types. Would need to use the same hashcode as the string used in the Collection and also same comparer

Wraith2 commented 6 years ago

I don't think that would work. The keys are stored in the entries in the dictionary so there do actually have to be real strings unless you change TKey to be something more generic like IEnumerable.

benaadams commented 6 years ago

The keys in the Dictionary would be strings; doesn't mean the values used to look up have to be strings; they could be ReadOnlySpan.

e.g. use case an intern Hashtable that returns strings for a ROS is the string has already been created

Wraith2 commented 6 years ago

In Dictionary.FindEntry the stored and lookup keys are passed to the provided comparer. Both must be of type TKey for IEqualityComparer.Equals and string is sealed so we can't derive our way out. Is there a way around this that I can't see?

cdmihai commented 6 years ago

True, putting string based view structs in hash based collections is hard because string is sealed.

One simplification might be to reduce the problem to just view struct lookups in string collections. The collection is prepopulated with strings, but lookup is done via view structs (*Segment, Span, Memory, Sequence, etc). In the end, TKey is reduced to an int via GetHashCode, so maybe Dictionary and HashSet could have an int based lookup?

public HashSet<T>
{
    bool TryGetValue(int hashcode, out T actualValue);
    bool ContainsKey(int hashCode);
}

It looks like it's exposing private implementation detail. But hashing is a core concept in lookup collections so that's ok, the only leaky part is the int.

It also does not address the need to have case insensitive hash codes for strings. Having String / StringComparer helpers for this would help: https://github.com/dotnet/corefx/issues/31302#issuecomment-409668448

benaadams commented 6 years ago

In the end, TKey is reduced to an int via GetHashCode, so maybe Dictionary and HashSet could have an int based lookup?

It also needs equality testing; the hashcode int is a candidate check (in addition to bucket location) can be more than one entry with the same hashcode (it isn't a form of perfect compression)

cdmihai commented 6 years ago

Good point, forgot about collisions. Stack confined lambdas would help here, but that would need language modifications:

public HashSet<T>
{
    bool TryGetValue(int hashcode, ref Func<T, bool> equalityChecker, out T actualValue);
}
var set = new HashSet<string>(){"foo", "bar"};
var slice = "fooBAR".Slice(3);
set.TryGetValue(String.GetHashCode(slice, StringComparison.OrdinalIgnoreCase), ref s => slice.Equals(s.AsSpan(), StringComparison.OrdinalIgnoreCase), out var value);

An alternative to ref lambdas would be introducing a structural typing idiom like foreach and Enumerable which would lookup Equals and GetHashCode on structs. This way, one would pass an IEqualityComparer looking struct to the lookup collection.

Wraith2 commented 6 years ago

Using existing features we could have a new type StringLikeKey which contains a reference to something that's string like and can compare that to other string like things. The performance on comparison won't be perfect because there will be type detection but actual comparison is hopefully a rare event. The GetHashCode result can be cached in it which would slightly mitigate underlying mutation. it'll should probably be a class because we know we're going to store it in a bucket and there's no point copying it about. Sprinkle in some implicit conversions and it might work.

The original comment by @benaadams on perf would still stand though, if you add something to Dictionary<StringLikeKey,TValue> with a ReadOnlyMemory<char> as the backing value you've got to be careful to clear out that dictionary or risk causing problems. I can see it being a useful tool when parsing though, pull in a file as a single blob use it in lookups without reallocating pieces of it as strings and then discard the entire set of related structures.

The slicing string hashcode overload you mention would be dealt with by the hashcode extension for ReadOnlySpan<char> in https://github.com/dotnet/corefx/issues/31302 , you just slice to span and hash that.

When doing xml parsing I've hit this same sort of issue. NameTable can be used to lookup common tokens but I can't use it with span without going to an int key and losing the collision capability.

As useful as span is I always seem to end up doing something slightly annoying at the intersection of span to string based code. It's a bit like going from generic to non-generic collections, slightly jarring.

benaadams commented 6 years ago

Might be better as a specific feature of a "Power Collection" https://github.com/dotnet/corefxlab/issues/2406 e.g. where the key is a string-type rather than trying to get the generic collections to work with this caveat for string-types?

KrzysztofCwalina commented 6 years ago

Can we just implement IEqualityComparer<Memory<char>> on StringComparer?

ahsonkhan commented 5 years ago

Can we just implement IEqualityComparer<Memory<char>> on StringComparer?

If StringComparer implemented IEqualityComparer<Memory<char>>, would we need any additional APIs then?

cc @GrabYourPitchforks

GrabYourPitchforks commented 3 years ago

FWIW, during the Utf8String prototype I tried making StringComparer implement both IComparer<string> and IComparer<Utf8String>. Ultimately this resulted in an issue that certain call sites were ambiguous due to overload resolution not being able to determine a best match, and our unit tests failed to compile. The incident count of these was very small, but it was non-zero. In a larger project I imagine adding extra IComparer<T> implementations would be more problematic.

udlose commented 9 months ago

looks like this can now be accomplished via MemoryExtensions