Open cdmihai opened 6 years ago
I think this would tie in with https://github.com/dotnet/corefx/issues/31302 in some way but would probably require a specific ReadOnlyMemory<char>
comparer to be able to take advantage of the extension methods.
There is also a slight worry that while the name ReadOnlyMemory claims to be readonly it isn't actually guaranteeing anything about the underlying storage being immutable. String instances are (within managed bounds) immutable so using them as keys is ok, can you be certain that any ReadOnlyMemory<T>
wrapping some object won't be mutated elsewhere?
ReadOnlyMemory
as actual keys would perform badly and be problematic (wrapping mutable memory, holding reference to entire object just for a slice, lifetime issues)
However maybe as an extension for lookups on a string
dictionary or hashset? Though Span would likely be more flexible at that point...
public static CollectionExtensions<TValue>
{
bool TryGetValue(this Dictionary<string, TValue> dict, ReadOnlySpan<char> key, out TValue value);
bool ContainsKey(this Dictionary<string, TValue> dict, ReadOnlySpan<char> key);
}
public static CollectionExtensions
{
bool TryGetValue(this HashSet<string> dict, ReadOnlySpan<char> equalValue, out string actualValue);
bool Contains(this HashSet<string> dict, ReadOnlySpan<char> key);
}
So those methods would take the span convert to a string and then pass it to the dictionary? Unless the string borrowed the underlying storage that'd require allocation of a new string each time which doesn't seem desirable.
So those methods would take the span convert to a string and then pass it to the dictionary?
No they'd need to be tied to the implementation; but they'd also only be valid for char
types. Would need to use the same hashcode as the string used in the Collection and also same comparer
I don't think that would work. The keys are stored in the entries in the dictionary so there do actually have to be real strings unless you change TKey to be something more generic like IEnumerable
The keys in the Dictionary would be strings; doesn't mean the values used to look up have to be strings; they could be ReadOnlySpan.
e.g. use case an intern Hashtable that returns strings for a ROS is the string has already been created
In Dictionary.FindEntry the stored and lookup keys are passed to the provided comparer. Both must be of type TKey for IEqualityComparer.Equals and string is sealed so we can't derive our way out. Is there a way around this that I can't see?
True, putting string based view structs in hash based collections is hard because string is sealed.
One simplification might be to reduce the problem to just view struct lookups in string collections. The collection is prepopulated with strings, but lookup is done via view structs (*Segment, Span, Memory, Sequence, etc). In the end, TKey is reduced to an int via GetHashCode, so maybe Dictionary and HashSet could have an int based lookup?
public HashSet<T>
{
bool TryGetValue(int hashcode, out T actualValue);
bool ContainsKey(int hashCode);
}
It looks like it's exposing private implementation detail. But hashing is a core concept in lookup collections so that's ok, the only leaky part is the int.
It also does not address the need to have case insensitive hash codes for strings. Having String / StringComparer helpers for this would help: https://github.com/dotnet/corefx/issues/31302#issuecomment-409668448
In the end, TKey is reduced to an int via GetHashCode, so maybe Dictionary and HashSet could have an int based lookup?
It also needs equality testing; the hashcode int
is a candidate check (in addition to bucket location) can be more than one entry with the same hashcode (it isn't a form of perfect compression)
Good point, forgot about collisions. Stack confined lambdas would help here, but that would need language modifications:
public HashSet<T>
{
bool TryGetValue(int hashcode, ref Func<T, bool> equalityChecker, out T actualValue);
}
var set = new HashSet<string>(){"foo", "bar"};
var slice = "fooBAR".Slice(3);
set.TryGetValue(String.GetHashCode(slice, StringComparison.OrdinalIgnoreCase), ref s => slice.Equals(s.AsSpan(), StringComparison.OrdinalIgnoreCase), out var value);
An alternative to ref lambdas would be introducing a structural typing idiom like foreach and Enumerable which would lookup Equals and GetHashCode on structs. This way, one would pass an IEqualityComparer looking struct to the lookup collection.
Using existing features we could have a new type StringLikeKey
which contains a reference to something that's string like and can compare that to other string like things. The performance on comparison won't be perfect because there will be type detection but actual comparison is hopefully a rare event. The GetHashCode result can be cached in it which would slightly mitigate underlying mutation. it'll should probably be a class because we know we're going to store it in a bucket and there's no point copying it about. Sprinkle in some implicit conversions and it might work.
The original comment by @benaadams on perf would still stand though, if you add something to Dictionary<StringLikeKey,TValue>
with a ReadOnlyMemory<char>
as the backing value you've got to be careful to clear out that dictionary or risk causing problems. I can see it being a useful tool when parsing though, pull in a file as a single blob use it in lookups without reallocating pieces of it as strings and then discard the entire set of related structures.
The slicing string hashcode overload you mention would be dealt with by the hashcode extension for ReadOnlySpan<char>
in https://github.com/dotnet/corefx/issues/31302 , you just slice to span and hash that.
When doing xml parsing I've hit this same sort of issue. NameTable
can be used to lookup common tokens but I can't use it with span without going to an int
key and losing the collision capability.
As useful as span is I always seem to end up doing something slightly annoying at the intersection of span to string based code. It's a bit like going from generic to non-generic collections, slightly jarring.
Might be better as a specific feature of a "Power Collection" https://github.com/dotnet/corefxlab/issues/2406 e.g. where the key is a string-type rather than trying to get the generic collections to work with this caveat for string-types?
Can we just implement IEqualityComparer<Memory<char>>
on StringComparer?
Can we just implement
IEqualityComparer<Memory<char>>
on StringComparer?
If StringComparer implemented IEqualityComparer<Memory<char>>
, would we need any additional APIs then?
cc @GrabYourPitchforks
FWIW, during the Utf8String
prototype I tried making StringComparer
implement both IComparer<string>
and IComparer<Utf8String>
. Ultimately this resulted in an issue that certain call sites were ambiguous due to overload resolution not being able to determine a best match, and our unit tests failed to compile. The incident count of these was very small, but it was non-zero. In a larger project I imagine adding extra IComparer<T>
implementations would be more problematic.
looks like this can now be accomplished via MemoryExtensions
MSBuild has a bunch of HashSet and Dictionaries with string keys. Due to how Equals and HashCode on ReadOnlyMemory are implemented, it's not possible to convert these collections to have ReadOnlyMemory as keys. An EqualityComparer which derived from the actual view in the string would help here.