dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.96k stars 4.65k forks source link

Vectorize IndexOfAny on more than 5 chars #68328

Closed stephentoub closed 1 year ago

stephentoub commented 2 years ago

EDITED BY @stephentoub on 10/26/2022 to add revised API shape:

namespace System
{
    public static class IndexOfAnyValues
    {
        public static IndexOfAnyValues<T> Create<T>(ReadOnlySpan<T> values)  where T : IEquatable<T>?;
        // in the future, could add an `IndexOfAnyValues<char> Create(ReadOnlySpan<string> values) overload
    }

    public class IndexOfAnyValues<T> where T : IEquatable<T>?
    {
        // no public/protected members, including ctors
    }

    public static partial class MemoryExtensions
    {
        // These are identical to the existing overloads, just with IndexOfAnyValues<T> instead of ReadOnlySpan<T> for the values parameter type
        public static int IndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        public static int IndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        public static int LastIndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        public static int LastIndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        // ... repeat for Span overloads
    }
}

API Usage

internal static class HttpRuleParser
{
    private static readonly IndexOfAnyValues<char> s_tokenChars =
        IndexOfAnyValues.Create("!#$%&'*+-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ^_`abcdefghijklmnopqrstuvwxyz|~");

    public static bool IsToken(ReadOnlySpan<char> input) =>
        input.IndexOfAnyExcept(s_tokenChars) < 0;
}

EDITED BY @MihaZupan on 10/24/2022 to add proposed API shape:

API Proposal

namespace System
{
    public static partial class MemoryExtensions
    {
        public static int IndexOfAny(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);
        public static int IndexOfAnyExcept(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);
        public static int LastIndexOfAny(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);
        public static int LastIndexOfAnyExcept(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);

        public static int IndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
        public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
        public static int LastIndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
        public static int LastIndexOfAnyExcept(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
        // ... repeat for Span overloads
    }
}

namespace System.Buffers.Text
{    
    public sealed class IndexOfAnyAsciiValues
    {
        public IndexOfAnyAsciiValues(ReadOnlySpan<char> asciiValues);
        // No other public members
    }
}

API Usage

internal static class HttpRuleParser
{
    private static readonly IndexOfAnyAsciiValues s_tokenChars =
        new("!#$%&'*+-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ^_`abcdefghijklmnopqrstuvwxyz|~");

    public static bool IsToken(ReadOnlySpan<byte> input) =>
        input.IndexOfAnyExcept(s_tokenChars) < 0;
}

Alternative Designs

Static members on Ascii instead of extension methods in MemoryExtensions. Based on feedback from API review on 2022-10-25.

namespace System.Buffers.Text;

public static class Ascii
{
    public static int IndexOfAny(ReadOnlySpan<byte> span, IndexOfAnyAsciiValues values);
    public static int IndexOfAny(ReadOnlySpan<char> span, IndexOfAnyAsciiValues values);
    // ...
}

public sealed class IndexOfAnyAsciiValues
{
    public IndexOfAnyAsciiValues(ReadOnlySpan<char> asciiValues);
}

Original post:

Earlier in the .NET 7 release, we combined the best of string.IndexOfAny and MemoryExtensions.IndexOfAny to provide a single implementation used by both. As part of this, whereas <= 5 characters are vectorized, > 5 characters use a "probabilistic map": https://github.com/dotnet/runtime/blob/d58efd1f78c5f7b8e1df60dd5e67fafeae9526ee/src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.cs#L1140-L1167 essentially a Bloom filter to quickly screen out characters that are definitely not in the set so that the more expensive check need only be done if it's likely the character is in the set. While faster than comparing every input character against every character in the set, we still have to walk the input character by character.

@MihaZupan's https://github.com/xoofx/markdig/commit/da756f4efe7b5ee6999ef753a2fd4fb9387172dd is an implementation of http://0x80.pl/articles/simd-byte-lookup.html#universal-algorithm, and highlights that it's possible to vectorize this operation instead. It would require startup overhead of determining whether all of the set characters are ASCII and generating a bitmap from them, but we're already doing such a loop in order to build the probabilistic map: https://github.com/dotnet/runtime/blob/d58efd1f78c5f7b8e1df60dd5e67fafeae9526ee/src/libraries/System.Private.CoreLib/src/System/ProbabilisticMap.cs#L31-L67

We could simply augment that to also build up the ASCII bitmap, and if after initialization determine that every character was indeed ASCII, take the vectorized path of using that ASCII bitmap, otherwise fall back to using the probabilistic map as we do today.

MihaZupan commented 1 year ago

If we decide we want it to be abstract so that internally we can have different implementations, then it obviously would need to remain a factory.

If we keep it a struct, that doesn't stop us from having different internal implementations right?

In my local testing with this, IndexOfAnyAsciiValues is just a place to stash the IndexOfAnyAsciiSearcher instance.

public readonly struct IndexOfAnyAsciiValues
{
    private readonly IndexOfAnyAsciiSearcher _searcher;

    internal IndexOfAnyAsciiValues(ReadOnlySpan<char> asciiValues) =>
        _searcher = new IndexOfAnyAsciiSearcher(asciiValues);

    internal readonly IndexOfAnyAsciiSearcher Searcher =>
        _searcher ?? ThrowForUninitializedValues();

    private static IndexOfAnyAsciiSearcher ThrowForUninitializedValues() =>
        throw new InvalidOperationException("");
}

public static class MemoryExtensions
{
    public static int LastIndexOfAny(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues values) =>
        values.Searcher.LastIndexOfAny(span);
}
stephentoub commented 1 year ago

If we keep it a struct, that doesn't stop us from having different internal implementations right?

Correct, as long as we do the right thing in the ref assembly; it'd be a breaking change, for example, to not have a ref type field in the ref assembly and then later try to add one, and also a bug to not have a reference type field in the ref assembly but have one in the implementation.

terrajobst commented 1 year ago

Video

namespace System;

public static partial class MemoryExtensions
{
    public static int IndexOfAny(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);
    public static int IndexOfAnyExcept(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);
    public static int LastIndexOfAny(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);
    public static int LastIndexOfAnyExcept(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);

    public static int IndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
    public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
    public static int LastIndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
    public static int LastIndexOfAnyExcept(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
    // ... repeat for Span overloads

    public readonly struct IndexOfAnyAsciiValues
    {
        public IndexOfAnyAsciiValues(ReadOnlySpan<char> asciiValues);
        // No other public members
    }
}
stephentoub commented 1 year ago

Should this be an iterator style API that returns something one can use to find all matches?

No. It's just like every other {Last}IndexOf{...} method we already have. And the most common use for an API like this isn't finding each one atomically but rather finding one, then processing the text starting at that location, and then potentially finding another but starting at a different position after some additionally consumed text.

stephentoub commented 1 year ago

We don't like extending ReadOnlySpan

Why? We already have IndexOfAny that operates on a ReadOnlySpan<byte> and that accepts a ReadOnlySpan<byte> as the values, which means you can do both chars.IndexOfAny("abcdef") and bytes.IndexOfAny("abcdef"u8)... this just allows you to do that more efficiently by caching all the necessary precomputed stuff and using it repeatedly.

MihaZupan commented 1 year ago

Trying to gather feedback related to extensions on ROS<byte>

Quoting Levi from the review video (timestamped to relevant discussion)

ROS<byte>s are binary data. MemoryExtensions should not attempt to think of them as text whatsoever. ... We are trying to keep MemoryExtensions as generic as possible. This is tying MemoryExtensions to something that is very specifically meant for one very concrete narrow scenario, which we have tried to avoid. ... Say I have a random byte[], I call "dot" on it, and now I am starting to see APIs that have nothing whatsoever to do with binary data. APIs that are custom tailored to ASCII scenarios. That is not really something that we want people to have to see in IntelliSense for instance, if they are operating with images in memory.

From offline discussion:

In the recording I said that if you extend the bitmap to support all 256 possible values I'm ok with it. It's weird, but I have no issues. Weird from the perspective that it's putting an extremely specialized API (intended primarily for networking scenarios) on a very heavily used API surface. But as long as it supports all possible 256 values, it doesn't violate the "it supports the full range of ROS<byte> this arguments" principle.

Keep pulling at this thread and you'll see that now you can't have a ROS<char> ctor on the helper type, since now you have to be opinionated as to how you convert those chars to bytes (latin1? utf8? something else?) - and you're back to the original problem of having an opinionated API as an extension method on a commonly used type. And that opinionated API only works for a select set of inputs. It's not universally applicable to all possible ROS<byte> input arguments.


From my understanding, the preferred API shape from the review would be closer to

namespace System.Buffers.Text;

public static class Ascii
{
    public static int IndexOfAny(ReadOnlySpan<byte> span, IndexOfAnyAsciiValues values);
    public static int IndexOfAny(ReadOnlySpan<char> span, IndexOfAnyAsciiValues values);
    // ...
}

public sealed class IndexOfAnyAsciiValues
{
    public IndexOfAnyAsciiValues(ReadOnlySpan<char> asciiValues);
}

Namely:

stephentoub commented 1 year ago

Move APIs to something like the Ascii type

Please no, especially if we have other APIs on Ascii that only support ASCII inputs, which is I believe most of them. Just like the existing IndexOfAny overloads, these new overloads support arbitrary inputs across the whole Unicode range... the only restriction is on the needle, not the haystack.

In general I don't understand all of this feedback:

Move APIs to something like the Ascii type Avoid these being extensions methods Prefer static methods on Ascii instead of instance methods on IndexOfAnyAsciiValues to keep the haystack, needle order

These new IndexOfAny* overloads are exactly that: they are overloads for the existing APIs that perform exactly the same operation, they just allow you to do so more efficiently by storing some of the computed information necessary for the operation and reusing it. So today where we have:

private static readonly char[] s_values = new[] { 'a', 'b', 'c', 'd', 'e', 'f' };
...
int i = span.IndexOfAny(s_values);

you'd instead be able to write:

private static readonly IndexOfAnyAsciiValues s_values = new IndexOfAnyAsciiValues(new [] { 'a', 'b', 'c', 'd', 'e', 'f' });
...
int i = span.IndexOfAny(s_values);

and it all "just works", looks exactly the same at the call site, but is more efficient in multiple ways.

Why deviate from that?

MihaZupan commented 1 year ago

So it seems like a few of us still prefer the extensions... so

?

stephentoub commented 1 year ago

Yup. I'm fine if we want to make it a sealed class instead of a struct, though I don't know that really helps in any meaningful way, and I'm fine moving it to some other namespace rather than being a nested class (though unless we put it in System which we probably don't want to do folks will need to import another namespace in order to use it); I'm not sure that's better than a nested type, but I don't have a strong stance on it.

MihaZupan commented 1 year ago

'm fine if we want to make it a sealed class instead of a struct, though I don't know that really helps in any meaningful way

Main point re: struct was that a class better implies to the user that the initialization is not "free" (it at least incurs a heap allocation) and that the object should be reused.

stephentoub commented 1 year ago

I've been thinking about this more...

I think we got so focused on the known vectorization opportunity for ASCII needles that we lost track of what this actually is: an optimization for the existing overloads that simply hoists out any computation that can be shared across multiple invocations. This is made evident from https://github.com/dotnet/runtime/pull/76740, which adds these optimizations into the existing overloads but has to pay for the shared computations on each invocation: this addition then just enables that shared computation to be precomputed into an object that can be reused.

When viewed in that way, this should look a bit different:

namespace System
{
    public class IndexOfAnyValues<T> where T : IEquatable<T>?
    {
        public static IndexOfAnyValues<T> Create(ReadOnlySpan<T> values);
        // no public/protected ctors
    }

    public static partial class MemoryExtensions
    {
        public static int IndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        public static int IndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        public static int LastIndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        public static int LastIndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        // ... repeat for Span overloads
    }
}

A few salient points:

MihaZupan commented 1 year ago

I like it.

Curious, could we have the values type be sealed with a regular ctor, and have it store the specialized type internally?

Or would the JIT give up in that case (detect type of field on sealed class vs detect the type of field)?

stephentoub commented 1 year ago

I don't believe the JIT can handle that today without PGO (and with PGO it would still require an extra check), though you could try it out to see. It would also be more expensive to have another layer of objects to be allocated. Factories are a well-established pattern and I think fit well here.

stephentoub commented 1 year ago

It's also worth noting that with this shape, we could choose to expose the core operation for https://github.com/dotnet/runtime/issues/62447 as simply one more Create overload:

public static IndexOfAnyValues<char> Create(ReadOnlySpan<string> values);

If we think we might want to do that, we should consider having a separate non-generic IndexOfAnyValues type with the static Create/Create<T> methods. The generic type would have no exposed members.

cc: @teo-tsirpanis

jkotas commented 1 year ago

With the common structure of caching the values in a readonly static, it should be trivial for the JIT with its existing optimizations to inline the MemoryExtensions method, see the concrete type of the object stored in the static field, and devirtualize the internal call

JIT is not able to do this optimization today. It can only observe types of direct readonly references statics. It is not able to observe types of references wrapped in readonly structs statics. (Of course, this optimization can be added.)

stephentoub commented 1 year ago

JIT is not able to do this optimization today. It can only observe types of direct readonly references statics.

That's what I'm talking about here; there's no struct in my outline. Have I misunderstood?

private static readonly IndexOfAnyValues<char> s_values = IndexOfAnyValues.Create("abcdefg");
...
int index = span.IndexOfAny(s_values);
gfoidl commented 1 year ago

The values type is not sealed but it doesn't have any public/protected constructors. It has a factory method that enables us to create derived implementations as needed

This bites me a bit, namely the "us", which is meant from within SPCoreLib. Assume a situation where any user of .NET (outside of SPCoreLib) comes up with a special implementation for specific T and the given values. Then it's not possible to "inject" a custom derived type into the IndexOfAny-methods.

Sure, in such a situation one could implement the whole IndexOfAny-equivalent one self. But for me it would be desireable to just provide the specialized version, and if conidtions are not met (vector not supported, too short, etc.) "fallback" to the .NET provided solution.

So should there be a protected ctor on IndexOfAnyValues? For standard-usage the entry would still be the Create-factory. For advanced usages the protected ctor is needed.

Although that could be enabled if there's actual desire for this. Adding a protected parameterless ctor isn't a breaking change.


As another implementation detail, the threshold for scalar / vectorized -- or in general: when the implementation given by IndexOfAnyValues should be used -- should be given as property on that type.

    public class IndexOfAnyValues<T> where T : IEquatable<T>?
    {
        public static IndexOfAnyValues<T> Create(ReadOnlySpan<T> values);
        // no public/protected ctors

        public int MinimumSearchSpaceLengthToEnable { get; }
    }

So in https://github.com/dotnet/runtime/blob/2ffdaf6010800b73073e3e2bff63bf2427b9aac3/src/libraries/System.Private.CoreLib/src/System/ProbabilisticMap.cs#L94 instead of hard-coding the 20 the value of that property could be used.


Besided that, thanks for stepping back and bringing up that great proposal 👍🏻.

stephentoub commented 1 year ago

This bites me a bit, namely the "us", which is meant from within SPCoreLib.

Why? It's an implementation detail, and never in this discussion was there an intention of somehow allowing external configuration/pluggability/extendability of the algorithms. The internal virtual method mentioned would also be the entire implementation of the public API, so there's little real benefit to making it pluggable externally, since the thing plugging in would need to supply the entire set of implementations, anyway.

stephentoub commented 1 year ago

As another implementation detail, the threshold for scalar / vectorized -- or in general: when the implementation given by IndexOfAnyValues should be used -- should be given as property on that type.

If it makes sense to do as an implementation detail, that could be an internal field/property on the IndexOfAnyValues type, sure.

gfoidl commented 1 year ago

Yep, so far there was no word about external extendability. The possibility of that come into my mind when looking at your last proposal.

For the "why" as advanced use case only if someone has a more suitable implementation for the specific given pattern, and if that is too niche to add to .NET. Without having an actual implementation in mind, I thought about using this API in searching within tensors (just a thought, didn't elaborate actually).

But as said, this could always be done later on if there's actual desire for extendability. This shouldn't hinder the current shape of the APIs.

MihaZupan commented 1 year ago

instead of hard-coding the 20 the value of that property could be used.

 if (searchSpaceLength < Vector128<short>.Count || (searchSpaceLength < 20 && searchSpaceLength < (valuesLength >> 1))) 

I don't think we would use such property. This check does an educated guess as to whether computing the bitmap is more expensive than just performing an O(n * m) operation. It's dependent on both the haystack and needle length, and specific to this init cost. I can make it a const in the ProbabilisticMap implementation, but I don't think it's relevant outside of it.

teo-tsirpanis commented 1 year ago

we could choose to expose the core operation for #62447 as simply one more Create overload

We shouldn't use it in source-generated regexes if we want to avoid the initialization cost. See https://github.com/dotnet/runtime/issues/69682#issuecomment-1189479894.

stephentoub commented 1 year ago

We shouldn't use it in source-generated regexes if we want to avoid the initialization cost. See https://github.com/dotnet/runtime/issues/69682#issuecomment-1189479894.

We wouldn't want to use it in source-generated regexes regardless of the initialization cost, as presumably we'd want to emit an implementation of the graph / trie / tables / whatever as code. But we could use it for RegexOptions.Compiled, right? And possibly the interpreter?

teo-tsirpanis commented 1 year ago

You said in https://github.com/dotnet/runtime/issues/69682#issuecomment-1171075614:

It might be counterintuitive, but I would like to avoid adding optimizations to the interpreter that aren't also in Compiled and the source generator, and similarly optimizations to Compiled that aren't in the source generator.

Judging from it the Compiled mode should not use it but the interpreted mode could, right?

stephentoub commented 1 year ago

Judging from it the Compiled mode should not use it but the interpreted mode could, right?

Assuming there's a performance improvement to be had and it doesn't too negatively impact first-use, "it" could apply to all of them, where "it" is using the algorithm. Source generator would use a source generated implementation to avoid all startup costs and optimize throughput, Compiled could potentially use an IL emitted implementation in order to optimize throughpuit but could also use a built-in IndexOfAny if that made more sense, and the interpreter could use the built-in IndexOfAny.

jkotas commented 1 year ago

If we are not going to use this API as an implementation detail for source generated regexes, we may consider dropping this API and double-down on efficient source-generated regex support with good usability instead.

It would reduce concept count and it would be similar to how some other language environments work. You just specify a regex and let the compiler worry about the details.

gfoidl commented 1 year ago

consider dropping this API

There will be usages outside of regex, too. E.g. in ASP.NET Core (https://github.com/dotnet/aspnetcore/pull/44041).

stephentoub commented 1 year ago

If we are not going to use this API as an implementation detail for source generated regexes, we may consider dropping this API and double-down on efficient source-generated regex support with good usability instead.

I think there's some confusion here. The API currently proposed will be used by regex and many other places. What I was saying won't be used by source gen regex is the hypothetical additional method for aho corasick, mentioned here as being a possible future extension, that @teo-tsirpanis is looking at.

jkotas commented 1 year ago

What are all Ts and other special cases that we expect to provide an optimized path for in IndexOfAnyValues.Create<T>? It is the kind of method that can turn into a choke point for trimming, keeping number of unreachable codepaths around unnecessarily.

// in the future, could add an `IndexOfAnyValues Create(ReadOnlySpan values) overload

Nit: This overload would be ambiguous with the existing method. Consider var mask = IndexOfAnyValues.Create(new ReadOnlySpan<string>(new string[] { "Hello", "World" }));. I think it would need to be called something else.

stephentoub commented 1 year ago

What are all Ts and other special cases that we expect to provide an optimized path for in IndexOfAnyValues.Create?

At least char and byte. If you think it's better to have explicit overloads for char and byte, we could do that, too. Though I'd hope the trimmer would be able to figure out (if not now, eventually) that a method M<T> only called as M<char> could have its byte-based code paths trimmed away.

And for char at least the cases where all inputs are ASCII (in which case it can use the vectorized implementation) or they aren't (in which case it can use the probabilistic map, or other such lookup tables). I'm hopeful in the future we'll discover additional ways to optimize as well, and that can all be done under the covers without additional APIs. We could also special-case inputs that would be satisfied by other public API, e.g. detect that non-ASCII inputs are all in a single range and then delegate to IndexOfAnyInRange, but for such cases I think we should start simple and just say callers should themselves use those APIs rather than having the API try to detect and "fix" it.

stephentoub commented 1 year ago

This overload would be ambiguous with the existing method. Consider var mask = IndexOfAnyValues.Create(new ReadOnlySpan(new string[] { "Hello", "World" }));. I think it would need to be called something else.

It wouldn't be ambiguous from the compiler's perspective: if you used IndexOfAnyValues.Create(new ReadOnlySpan<string>(new string[] { "Hello", "World" })), you'd get an IndexOfAnyValues<char>, and if you used IndexOfAnyValues.Create<string>(new ReadOnlySpan<string>(new string[] { "Hello", "World" })), you'd get an IndexOfAnyValues<string>. However, I agree the distinction is subtle and we might want to use a different name (if we even add such a method in the future; it's not part of this proposal). My main point was that the factory pattern lets us produce an IndexOfAnyValues<char> based on different inputs and that the resulting instance can be specialized for that input without incurring additional per-invocation overheads.

jkotas commented 1 year ago

At least char and byte. If you think it's better to have explicit overloads for char and byte, we could do that, too.

I keep thinking about this API primarily as a low-level regex helper that saves us from stamping the complex unsafe vectorized code in every source-generated regex. The API can be used directly too, but I expect only a small number of people are going to be successful in figuring when and how to correctly use it to their advantage.

Having the surface limited and specialized to just what regex and parsers need would be fine with me (ie byte and char only). I am not even sure we need methods for this on MemoryExtensions. I think having a method on IndexOfAnyValues<T> type would be good enough.

My main point was that the factory pattern lets us produce an IndexOfAnyValues based on different inputs and that the resulting instance can be specialized for that input without incurring additional per-invocation overheads.

How would the IndexOfAnyExcept work with the aho corasick? I am not sure what the sensible behavior would be. And if there is a sensible behavior, would it mean that we would need to keep multiple state machines around for different IndexOf variants?

stephentoub commented 1 year ago

I keep thinking about this API primarily as a low-level regex helper that saves us from stamping the complex unsafe vectorized code in every source-generated regex.

Why do you believe that's the case? It'll certainly be used by regex, but most places currently doing an IndexOfAny with more than 5 characters can benefit from this (and others that aren't using IndexOfAny because they could do something faster based on precomputing a bitmap). Some examples from a quick search:

expect only a small number of people are going to be successful in figuring when and how to correctly use it to their advantage.

I'm surprised by that. Plenty of folks figured out how to write:

private static readonly char[] s_chars = new char[] { ... };
...
int index = something.IndexOfAny(s_chars);

This follows the exact same form, just changing the line:

private static readonly char[] s_chars = new char[] { ... };

to instead be:

private static readonly IndexOfAnyValues<char> s_chars = IndexOfAnyValues.Create(new char[] { ... });

and as an overload of IndexOfAny, you're looking in IntelliSense through the overloads, you see one that takes an IndexOfAnyValues, you see it has a Create method, and you substitute an IndexOfAnyValues for your s_chars. We can even have an analyzer/fixer for it.

Having the surface limited and specialized to just what regex and parsers need would be fine with me (ie byte and char only). I am not even sure we need methods for this on MemoryExtensions. I think having a method on IndexOfAnyValues type would be good enough.

I think there's real value having these simply as overloads of the existing IndexOfAny: it's just one more way you can provide inputs to the existing functionality. I'd be fine replacing the generic Create<T>(ReadOnlySpan<T>) with Create(ReadOnlySpan<char>) and Create(ReadOnlySpan<byte>) if you believe that'll be better; I personally like it as a single generic method that accomodates any future optimizations we want for other types, but as I don't have any in mind or that are pressing, we could do without it for now, and I understand the trimming concerns (though I still think we should be thinking about imbuing the trimmer with more smarts around this kind of input/type tracking).

jkotas commented 1 year ago

Why do you believe that's the case?

I expect that most people will just use the existing IndexOfAny API that is the most straightforward one to use, and not bother with making their code less readable by outlining the pattern to search for.

private static readonly char[] s_chars = new char[] { ... };

int index = something.IndexOfAny(s_chars);

I expect only a fraction of read-world places to have outlined cache like this. I expect that majority are just going to allocate the array of chars to search for inline since that it is the most straightforward way to do it. Our examples show you to allocate the chars to search for inline too: https://learn.microsoft.com/en-us/dotnet/api/system.string.indexofany?view=net-7.0.

Yes, there are certainly going to be cases where this API will be used directly, especially in our own libraries. I see it as an API for advanced performance-oriented developers only.

stephentoub commented 1 year ago

Our examples show you to allocate the chars to search for inline too

Our examples show an entire program in a single snippet focused on highlighting what output you get for what input. They're not real-world.

I expect only a fraction of read-world places to have outlined cache like this.

I see a lot of it in searching around GitHub. But my searches may be skewed; I asked @marklio for help in getting some data.

Regardless, whether it's a majority or minority, there are lots of places that will be able to benefit from it (as outlined in https://github.com/dotnet/runtime/issues/68328#issuecomment-1293786696). We have multiple IndexOfAny overloads taking inputs in a variety of forms; this adds just one more, and I think it's the right way to expose this. I think trying to bury it elsewhere actually makes it more confusing, as it means there's now a set of IndexOfAny methods elsewhere that aren't obviously connected and thus force people to think about whether they need to consider the other one; as an overload, it's just another way of providing the same inputs.

terrajobst commented 1 year ago

Video

namespace System;

public static partial class MemoryExtensions
{
    public static int IndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
    public static int IndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
    public static int LastIndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
    public static int LastIndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
}
namespace System.Buffers;

public static class IndexOfAnyValues
{
    public static IndexOfAnyValues<byte> Create(ReadOnlySpan<byte> values);
    public static IndexOfAnyValues<char> Create(ReadOnlySpan<char> values);
}

public class IndexOfAnyValues<T> where T : IEquatable<T>?
{
}
gfoidl commented 1 year ago

This was an interesting journey. Thanks for all the beatiful discussion and thoughts on this!