Vectorize IndexOfAny on more than 5 chars

stephentoub commented 2 years ago

EDITED BY @stephentoub on 10/26/2022 to add revised API shape:

namespace System
{
    public static class IndexOfAnyValues
    {
        public static IndexOfAnyValues<T> Create<T>(ReadOnlySpan<T> values)  where T : IEquatable<T>?;
        // in the future, could add an `IndexOfAnyValues<char> Create(ReadOnlySpan<string> values) overload
    }

    public class IndexOfAnyValues<T> where T : IEquatable<T>?
    {
        // no public/protected members, including ctors
    }

    public static partial class MemoryExtensions
    {
        // These are identical to the existing overloads, just with IndexOfAnyValues<T> instead of ReadOnlySpan<T> for the values parameter type
        public static int IndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        public static int IndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        public static int LastIndexOfAny<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        public static int LastIndexOfAnyExcept<T>(this ReadOnlySpan<T> span, IndexOfAnyValues<T> values) where T : IEquatable<T>?;
        // ... repeat for Span overloads
    }
}

API Usage

internal static class HttpRuleParser
{
    private static readonly IndexOfAnyValues<char> s_tokenChars =
        IndexOfAnyValues.Create("!#$%&'*+-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ^_`abcdefghijklmnopqrstuvwxyz|~");

    public static bool IsToken(ReadOnlySpan<char> input) =>
        input.IndexOfAnyExcept(s_tokenChars) < 0;
}

EDITED BY @MihaZupan on 10/24/2022 to add proposed API shape:

API Proposal

namespace System
{
    public static partial class MemoryExtensions
    {
        public static int IndexOfAny(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);
        public static int IndexOfAnyExcept(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);
        public static int LastIndexOfAny(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);
        public static int LastIndexOfAnyExcept(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);

        public static int IndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
        public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
        public static int LastIndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
        public static int LastIndexOfAnyExcept(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues asciiValues);
        // ... repeat for Span overloads
    }
}

namespace System.Buffers.Text
{    
    public sealed class IndexOfAnyAsciiValues
    {
        public IndexOfAnyAsciiValues(ReadOnlySpan<char> asciiValues);
        // No other public members
    }
}

API Usage

internal static class HttpRuleParser
{
    private static readonly IndexOfAnyAsciiValues s_tokenChars =
        new("!#$%&'*+-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ^_`abcdefghijklmnopqrstuvwxyz|~");

    public static bool IsToken(ReadOnlySpan<byte> input) =>
        input.IndexOfAnyExcept(s_tokenChars) < 0;
}

Alternative Designs

Static members on Ascii instead of extension methods in MemoryExtensions. Based on feedback from API review on 2022-10-25.

namespace System.Buffers.Text;

public static class Ascii
{
    public static int IndexOfAny(ReadOnlySpan<byte> span, IndexOfAnyAsciiValues values);
    public static int IndexOfAny(ReadOnlySpan<char> span, IndexOfAnyAsciiValues values);
    // ...
}

public sealed class IndexOfAnyAsciiValues
{
    public IndexOfAnyAsciiValues(ReadOnlySpan<char> asciiValues);
}

Original post:

Earlier in the .NET 7 release, we combined the best of string.IndexOfAny and MemoryExtensions.IndexOfAny to provide a single implementation used by both. As part of this, whereas <= 5 characters are vectorized, > 5 characters use a "probabilistic map": https://github.com/dotnet/runtime/blob/d58efd1f78c5f7b8e1df60dd5e67fafeae9526ee/src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.cs#L1140-L1167 essentially a Bloom filter to quickly screen out characters that are definitely not in the set so that the more expensive check need only be done if it's likely the character is in the set. While faster than comparing every input character against every character in the set, we still have to walk the input character by character.

@MihaZupan's https://github.com/xoofx/markdig/commit/da756f4efe7b5ee6999ef753a2fd4fb9387172dd is an implementation of http://0x80.pl/articles/simd-byte-lookup.html#universal-algorithm, and highlights that it's possible to vectorize this operation instead. It would require startup overhead of determining whether all of the set characters are ASCII and generating a bitmap from them, but we're already doing such a loop in order to build the probabilistic map: https://github.com/dotnet/runtime/blob/d58efd1f78c5f7b8e1df60dd5e67fafeae9526ee/src/libraries/System.Private.CoreLib/src/System/ProbabilisticMap.cs#L31-L67

We could simply augment that to also build up the ASCII bitmap, and if after initialization determine that every character was indeed ASCII, take the vectorized path of using that ASCII bitmap, otherwise fall back to using the probabilistic map as we do today.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-memory See info in area-owners.md if you want to be subscribed.

Issue Details

Earlier in the .NET 7 release, we combined the best of string.IndexOfAny and MemoryExtensions.IndexOfAny to provide a single implementation used by both. As part of this, whereas <= 5 characters are vectorized, > 5 characters use a "probabilistic map": https://github.com/dotnet/runtime/blob/d58efd1f78c5f7b8e1df60dd5e67fafeae9526ee/src/libraries/System.Private.CoreLib/src/System/MemoryExtensions.cs#L1140-L1167 essentially a Bloom filter to quickly screen out characters that are definitely not in the set so that the more expensive check need only be done if it's likely the character is in the set. While faster than comparing every input character against every character in the set, we still have to walk the input character by character. @MihaZupan's https://github.com/xoofx/markdig/commit/da756f4efe7b5ee6999ef753a2fd4fb9387172dd is an implementation of http://0x80.pl/articles/simd-byte-lookup.html#universal-algorithm, and highlights that it's possible to vectorize this operation instead. It would require startup overhead of determining whether all of the set characters are ASCII and generating a bitmap from them, but we're already doing such a loop in order to build the probabilistic map: https://github.com/dotnet/runtime/blob/d58efd1f78c5f7b8e1df60dd5e67fafeae9526ee/src/libraries/System.Private.CoreLib/src/System/ProbabilisticMap.cs#L31-L67 We could simply augment that to also build up the ASCII bitmap, and if after initialization determine that every character was indeed ASCII, take the vectorized path of using that ASCII bitmap, otherwise fall back to using the probabilistic map as we do today.

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Memory`, `tenet-performance`, `up-for-grabs`
Milestone:	7.0.0

stephentoub commented 2 years ago

Instead of or in addition to this, we might also consider a new overload:

IndexOfAny(ReadOnlySpan<char>, UInt128)

(or potentially a ReadOnlySpan<byte> or Vector128<byte> instead of the UInt128). The existing IndexOfAny overload could use it, but it could also be used directly by code that could precompute the ASCII bitmap, reducing the per-call overhead, e.g. Regex could use this to vectorize set searches too big for the current IndexOfAny.

danmoseley commented 2 years ago

I’d like to take a look, but if someone can do this sooner, just let me know.

It looks like the algorithm cannot be immediately transliterated into Vector128 because (at least) it relies on Ssse3.Shuffle and the generic shuffle on Vector128 doesn’t have the same behavior. Therefore we would rely on software fallback for Arm64 and potentially improve later.

deeprobin commented 2 years ago

@stephentoub Is the constant 5 dependent on other parameters? Or was this constant only selected based on benchmarks?

In other words, in any case, if the length is > 5, is it guaranteed that the performance will be better than the probablistic map?

I was thinking in general about these constants that we have everywhere. Especially that we somehow evaluate them dynamically in the startup phase of the JIT, etc., because certainly in many cases parameters like microarchitecture or the use of NativeAOT or ReadyToRun make a crucial difference (i.e. a (lightweight): Runtime Benchmarking).

stephentoub commented 2 years ago

Is the constant 5 dependent on other parameters?

It's not just a constant. If you look at the implementation, you'll see there's work that has to be done specific to each additional character, each one getting its own vector that needs to be factored in. It's not like there's some number to just change and magically higher counts of chars are vectorized.

MihaZupan commented 2 years ago

we might also consider a new overload: IndexOfAny(ReadOnlySpan<char>, UInt128) ... it could also be used directly by code that could precompute the ASCII bitmap

The only problem I see with such an API is that the pre-computed bitmap used by the vectorized algorithm isn't the ASCII bitmap itself (described a bit under the universal algorithm description). My implementation is somewhat different from the "universal algorithm" as it only deals with the ASCII range. For each character, the nibbles are swapped and packed together (int position = (c >> 4) | ((c & 0x0F) << 3)).

   01234567
=>
    4567     ((c & 0xF) << 3)
&      0123  (c >> 4)
==
   X4567123  (the X bit is 0)

The 0th (highest) bit is assumed to be 0 (ASCII), so it can overlap with the 7th bit. With this trick, the whole bitmap fits into a single Vector128<byte>.

I don't know if it's possible to quickly turn an input bitmap representing the set of characters into this modified one. For the lowest overhead, the caller would want to precompute the modified bitmap, but that may not be obvious from the API signature.

As Dan pointed out, the above approach depends on the behavior of Ssse3's Shuffle. I'm not familiar with ARM instructions, but if there's no "4 bit => 4 bit lookup" equivalent, the optimal bitmap may be different.

stephentoub commented 2 years ago

For the lowest overhead, the caller would want to precompute the modified bitmap, but that may not be obvious from the API signature.

There could be a separate method that computes and returns the input to be used.

adamsitnik commented 2 years ago

I am moving it to 8 as I don't see us implementing it before we snap. Please feel free to prove me wrong by sending a PR ;)

gfoidl commented 1 year ago

Re: https://github.com/dotnet/aspnetcore/pull/33776#discussion_r973634682

Prelude and motivating example

Some context for the problem domain in ASP.NET Core: Kestrel needs to verify that authority, host, etc. is valid. This is done via the helpers in HttpCharacters. The validation is done via lookup-tables, which get initialized when kestrel starts, e.g. InitializeAuthority. The actual validation is done per plain for-loop. In https://github.com/dotnet/aspnetcore/pull/33776#discussion_r656939393 I thought about vectorizing these checks, which resulted after a prototype and long wait for xplat-intrinsics in https://github.com/dotnet/aspnetcore/pull/44041. @stephentoub then reminded me about this issue, and asked what IndexOfXyz methods in runtime could be needed to avoid the custom code in aspnetcore-repo. Thus I created another prototype for GeneratedIndexOfAny.

Source generator for IndexOfAny{Except}

https://github.com/gfoidl-Tests/SourceGenerator-Vectors is quickly written prototype for a Roslyn source generator* to create vectorized code for IndexOfAny and IndexOfExcept.

_{* I'm sure some pieces may be written in a better way, as I don't have too much experience with writing generators.}

The aforementioned validation of e.g. Authority in ASP.NET Core could be written as

private const string AlphaNumeric        = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
private const string AuthoritySpecific   = ":.-[]@";
private const string ValidAuthorityChars = AlphaNumeric + AuthoritySpecific;

// ...

public bool IsValidByGeneratedIndexOfAnyExcept() => IsAuthorityValid(this.Authority) < 0;

[GeneratedIndexOfAny(ValidAuthorityChars, FindAnyExcept = true)]
private static partial int IsAuthorityValid(ReadOnlySpan<char> host);

so basically as two-liner.

In a benchmark compared with the shipped IndexOfExcept it looks like

|                             Method |            Authority |      Mean | Ratio |
|----------------------------------- |--------------------- |----------:|------:|
|          IsValidByIndexOfAnyExcept |        hostname:8080 |  66.20 ns |  1.00 |
| IsValidByGeneratedIndexOfAnyExcept |        hostname:8080 |  12.28 ns |  0.18 |
|                                    |                      |           |       |
|          IsValidByIndexOfAnyExcept | www.t(...)e.com [71] | 339.29 ns |  1.00 |
| IsValidByGeneratedIndexOfAnyExcept | www.t(...)e.com [71] |  18.18 ns |  0.05 |

benchmark code

```c# using BenchmarkDotNet.Attributes; Bench bench = new(); Console.WriteLine(bench.Authority); Console.WriteLine(bench.IsValidByIndexOfAnyExcept()); Console.WriteLine(bench.IsValidByGeneratedIndexOfAnyExcept()); #if !DEBUG BenchmarkDotNet.Running.BenchmarkRunner.Run(); #endif public partial class Bench { private const string AlphaNumeric = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789"; private const string AuthoritySpecific = ":.-[]@"; private const string ValidAuthorityChars = AlphaNumeric + AuthoritySpecific; [Params("hostname:8080", "www.thelongestdomainnameintheworldandthensomeandthensomemoreandmore.com")] public string Authority { get; set; } = "hostname:8080"; [Benchmark(Baseline = true)] public bool IsValidByIndexOfAnyExcept() => this.Authority.AsSpan().IndexOfAnyExcept(ValidAuthorityChars) < 0; [Benchmark] public bool IsValidByGeneratedIndexOfAnyExcept() => IsAuthorityValid(this.Authority) < 0; [GeneratedIndexOfAny(ValidAuthorityChars, FindAnyExcept = true)] private static partial int IsAuthorityValid(ReadOnlySpan host); } ```

For code like

internal static partial class Demo1
{
    [GeneratedIndexOfAny("1234567890")]
    public static partial int FirstIndexOfNumber(ReadOnlySpan<char> value);

    [GeneratedIndexOfAny("drvxyz")]
    public static partial int FirstIndexOfSet0(ReadOnlySpan<char> value);

    [GeneratedIndexOfAny("12🌄34")]
    public static partial int FirstIndexOfNonAsciiSet(ReadOnlySpan<char> value);
}

namespace Demo.Internal
{
    internal static partial class Demo2
    {
        [GeneratedIndexOfAny("abcd", FindAnyExcept = true)]
        public static partial int FirstIndexOfNotSet0(ReadOnlySpan<char> value);

        [GeneratedIndexOfAny("abcdef", FindAnyExcept = true)]
        public static partial int FirstIndexOfNotSet1(ReadOnlySpan<char> value);
    }
}

the generator emits

// ...
partial class Demo1
{
    [GeneratedCode("Generator", "1.0.0.0")]
    [EditorBrowsable(EditorBrowsableState.Never)]
    public static partial int FirstIndexOfNumber(System.ReadOnlySpan<char> value)
    {
        ReadOnlySpan<bool> lookup = new bool[128] { false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, true, true, true, true, true, true, true, true, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false };
        Vector128<sbyte> mask     = Vector128.Create(0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x08, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00).AsSByte();

        return Vector128.IsHardwareAccelerated && value.Length >= Vector128<short>.Count
            ? Core.IndexOfMatchCharVectorized<Core.Negate>(value, mask)
            : Core.IndexOfMatchCharScalar<Core.DontNegate>(value, lookup);
    }

    [GeneratedCode("Generator", "1.0.0.0")]
    [EditorBrowsable(EditorBrowsableState.Never)]
    public static partial int FirstIndexOfSet0(System.ReadOnlySpan<char> value)
    {
        ReadOnlySpan<bool> lookup = new bool[128] { false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, false, false, false, false, false, false, false, false, false, false, false, false, false, true, false, false, false, true, false, true, true, true, false, false, false, false, false };
        Vector128<sbyte> mask     = Vector128.Create(0x00, 0x00, 0x80, 0x00, 0x40, 0x00, 0x80, 0x00, 0x80, 0x80, 0x80, 0x00, 0x00, 0x00, 0x00, 0x00).AsSByte();

        return Vector128.IsHardwareAccelerated && value.Length >= Vector128<short>.Count
            ? Core.IndexOfMatchCharVectorized<Core.Negate>(value, mask)
            : Core.IndexOfMatchCharScalar<Core.DontNegate>(value, lookup);
    }

    [GeneratedCode("Generator", "1.0.0.0")]
    [EditorBrowsable(EditorBrowsableState.Never)]
    public static partial int FirstIndexOfNonAsciiSet(System.ReadOnlySpan<char> value)
    {
        // Given SetChars consist not of all ASCII, can't handle vectorized, so use fallback
        return value.IndexOfAny("12🌄34");
    }
}

namespace Demo.Internal
{
    partial class Demo2
    {
        [GeneratedCode("Generator", "1.0.0.0")]
        [EditorBrowsable(EditorBrowsableState.Never)]
        public static partial int FirstIndexOfNotSet0(System.ReadOnlySpan<char> value)
        {
            // Given SetChars <= 5, so use specialized method from IndexOfAny{Except}
            return value.IndexOfAnyExcept("abcd");
        }

        [GeneratedCode("Generator", "1.0.0.0")]
        [EditorBrowsable(EditorBrowsableState.Never)]
        public static partial int FirstIndexOfNotSet1(System.ReadOnlySpan<char> value)
        {
            ReadOnlySpan<bool> lookup = new bool[128] { false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, true, true, true, true, true, true, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false };
            Vector128<sbyte> mask     = Vector128.Create(0x00, 0x40, 0x40, 0x40, 0x40, 0x40, 0x40, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00).AsSByte();

            return Vector128.IsHardwareAccelerated && value.Length >= Vector128<short>.Count
                ? Core.IndexOfMatchCharVectorized<Core.DontNegate>(value, mask)
                : Core.IndexOfMatchCharScalar<Core.Negate>(value, lookup);
        }
    }
}

// ... + helper-type with the actual workhorse-methods

emitted helper-method

```c# [MethodImpl(MethodImplOptions.AggressiveInlining)] public static int IndexOfMatchCharScalar(ReadOnlySpan value, ReadOnlySpan lookup) where TNegator : struct, INegator { for (int i = 0; i < value.Length; ++i) { char c = value[i]; bool match = TNegator.NegateIfNeeded(lookup[c]); if (match) { return i; } } return -1; } public static int IndexOfMatchCharVectorized(ReadOnlySpan value, Vector128 bitMaskLookup) where TNegator : struct, INegator { Debug.Assert(Vector128.IsHardwareAccelerated); Debug.Assert(value.Length >= Vector128.Count); // To check if a bit in a bitmask from the Bitmask is set, in a sequential code // we would do ((1 << bitIndex) & bitmask) != 0 // As there is no hardware instrinic for such a shift, we use a lookup that // stores the shifted bitpositions. // So (1 << bitIndex) becomes BitPosLook[bitIndex], which is simd-friendly. // // A bitmask from the bitMaskLookup is created only for values 0..7 (one byte), // so to avoid a explicit check for values outside 0..7, i.e. // high nibbles 8..F, we use a bitpos that always results in escaping. Vector128 bitPosLookup = Vector128.Create( 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, // high-nibble 0..7 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF // high-nibble 8..F ).AsSByte(); Vector128 nibbleMaskSByte = Vector128.Create((sbyte)0xF); Vector128 zeroMaskSByte = Vector128.Zero; nuint idx = 0; uint mask; ref short ptr = ref Unsafe.As(ref MemoryMarshal.GetReference(value)); if (value.Length >= 2 * Vector128.Count) { nuint end = (uint)(value.Length - 2 * Vector128.Count); do { Vector128 source0 = Vector128.LoadUnsafe(ref ptr, idx); Vector128 source1 = Vector128.LoadUnsafe(ref ptr, idx + 8); Vector128 values = NarrowWithSaturation(source0, source1); mask = CreateEscapingMask(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte); if (mask != 0) { goto Found; } idx += 2 * (uint)Vector128.Count; } while (idx <= end); } // Here we know that 8 to 15 chars are remaining. Process the first 8 chars. if (idx <= (uint)(value.Length - Vector128.Count)) { Vector128 source = Vector128.LoadUnsafe(ref ptr, idx); Vector128 values = NarrowWithSaturation(source, source); mask = CreateEscapingMask(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte); if (mask != 0) { goto Found; } idx += (uint)Vector128.Count; } // Here we know that < 8 chars are remaining. We shift the space around to process // another full vector. nuint remaining = (uint)value.Length - idx; if ((nint)remaining > 0) { remaining -= (uint)Vector128.Count; Vector128 source = Vector128.LoadUnsafe(ref ptr, idx + remaining); Vector128 values = NarrowWithSaturation(source, source); mask = CreateEscapingMask(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte); if (mask != 0) { idx += remaining; goto Found; } } goto NotFound; Found: idx += GetIndexOfFirstNeedToEscape(mask); return (int)idx; NotFound: return -1; } [MethodImpl(MethodImplOptions.AggressiveInlining)] private static uint GetIndexOfFirstNeedToEscape(uint mask) { // Found at least one byte that needs to be escaped, figure out the index of // the first one found that needs to be escaped within the 16 bytes. Debug.Assert(mask > 0 && mask <= 65_535); uint tzc = uint.TrailingZeroCount(mask); Debug.Assert(tzc < 16); return tzc; } [MethodImpl(MethodImplOptions.AggressiveInlining)] private static uint CreateEscapingMask( Vector128 values, Vector128 bitMaskLookup, Vector128 bitPosLookup, Vector128 nibbleMaskSByte, Vector128 nullMaskSByte) where TNegator : struct, INegator { // To check if an input byte matches or not, we use a bitmask-lookup. // Therefore we split the input byte into the low- and high-nibble, which will get // the row-/column-index in the bit-mask. // The bitmask-lookup looks like: // high-nibble // low-nibble 0 1 2 3 4 5 6 7 8 9 A B C D E F // 0 1 1 0 0 0 0 1 0 1 1 1 1 1 1 1 1 // 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // 2 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 // 3 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // 4 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // 5 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // 6 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 // 7 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 // 8 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // 9 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // A 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // B 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 // C 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1 1 // D 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // E 1 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 // F 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 // // where 1 denotes a matach, while 0 means no match. // For high-nibbles in the range 8..F every input needs to be escaped, so we // can omit them in the bit-mask, thus only high-nibbles in the range 0..7 need // to be considered, hence the entries in the bit-mask can be of type byte. // // In the bitmask-lookup for each row (= low-nibble) a bit-mask for the // high-nibbles (= columns) is created. Debug.Assert(Vector128.IsHardwareAccelerated); // Perf: the shift needs to be done as Int32, as there exists a hw-instruction and no sw-emulation needs to be done. // Cf. https://github.com/dotnet/runtime/issues/75770 // Due to the Int32-shift we need to mask out any remaining bits from the shifted nibble. Vector128 highNibbles = Vector128.ShiftRightLogical(values.AsInt32(), 4).AsSByte() & nibbleMaskSByte; Vector128 lowNibbles = values & nibbleMaskSByte; Vector128 bitMask = Shuffle(bitMaskLookup, lowNibbles); Vector128 bitPositions = Shuffle(bitPosLookup , highNibbles); Vector128 mask = bitPositions & bitMask; Vector128 comparison = Vector128.Equals(nullMaskSByte, mask); comparison = TNegator.NegateIfNeeded(comparison); return comparison.ExtractMostSignificantBits(); } [MethodImpl(MethodImplOptions.AggressiveInlining)] private static Vector128 NarrowWithSaturation(Vector128 v0, Vector128 v1) { Debug.Assert(Vector128.IsHardwareAccelerated); // TODO: https://github.com/dotnet/runtime/issues/75724 if (Sse2.IsSupported) { return Sse2.PackSignedSaturate(v0, v1); } else { // This is not the exact algorithm for saturation, but for the use-case // here it's does what it should do. I.e. eliminate non-ASCII chars in the // results. Vector128 v0HighNibbles = Vector128.ShiftRightLogical(v0, 8); Vector128 v1HighNibbles = Vector128.ShiftRightLogical(v1, 8); Vector128 ascii0 = Vector128.Equals(Vector128.Zero, v0HighNibbles); Vector128 ascii1 = Vector128.Equals(Vector128.Zero, v1HighNibbles); v0 &= ascii0; v1 &= ascii1; return Vector128.Narrow(v0, v1); } } [MethodImpl(MethodImplOptions.AggressiveInlining)] private static Vector128 Shuffle(Vector128 vector, Vector128 indices) { // Perf: Ssse3.Shuffle produces better code return Sse3.IsSupported ? Ssse3.Shuffle(vector, indices) : Vector128.Shuffle(vector, indices); } public interface INegator { static abstract bool NegateIfNeeded(bool equals); static abstract Vector128 NegateIfNeeded(Vector128 equals); } public struct DontNegate : INegator { public static bool NegateIfNeeded(bool equals) => equals; public static Vector128 NegateIfNeeded(Vector128 equals) => equals; } public struct Negate : INegator { public static bool NegateIfNeeded(bool equals) => !equals; public static Vector128 NegateIfNeeded(Vector128 equals) => ~equals; } ```

The behavior is as follows:

less than 5 chars in the set to search for --> fallback to vectorized implementation from MemoryExtensions
if the set to search for contains non-ASCII --> fallback to MemoryExtensions
otherwise: emit code that uses either vectorization or a table-based lookup

Right now only char is implemented, but it could be quite easily extended to other (primitive) types. From ASP.NET Core's use case at least byte in addition would be desireable.

Side note: for ASP.NET Core's validation there is an "extended" case which allows chars > 0x7F (non-ASCII) too. I don't know how hot this method is, and if it is worth to vectorize it, as it's only executed if opted into non-ASCII, which I believe is seldom (maybe wrong here).

Reflection of the generator approach

Preliminary word: due my work for https://github.com/dotnet/aspnetcore/pull/44041 I'm a bit biased towards the generator at the moment, so it may not be the most objective reflection.

Benefits

No startup and runtime overhead

The generator spits out code that is ready to use. All the validation, analysis is already done.

For the lowest overhead, the caller would want to precompute the modified bitmap, but that may not be obvious from the API signature.

There could be a separate method that computes and returns the input to be used.

Hm, here I see similar problems as with "Discoverability" below. At least it seems a bit unwieldy to me, especially if the result of that init-method is a data-structure which needs to be kept around (and alive).

Performance

The current state of the generator emits ROS<bool> which will be read from assembly's data section, and inline Vector128.Create(sbyte, ...) which results in best code-gen.

As the generator has enough time for analysis, more sophisticated searches could be applied for special cases or contiguous sets, which would add more overhead to a runtime-based version. Though one with init-method could amortized that.

Downsides

Discoverability

Not so intuitive and discoverable as just typing IndexOfAny / IndexOfAnyExcept. Remedy: there could be an analyzer to suggest the use of the generator iff > 5 chars and all are ASCII

Use in Regex generator

As alluded in previous comments the regex generator would be a principal beneficiary of faster IndexOfAnyXyz-variants.

In a quick test (.NET 7 daily build from today, VS 17.3.4) it didn't work that one generator emits code that needs another generator (which I hoped it worked). I'm not aware if this should be possible or if there are any measures to enable such a scenario -- though I didn't follow along on that topic.

If this is a show-stopper then the generator-based approach has to be abandoned.

Only for contant sets

I don't know if this is a real downside, as most if not all (?) sets to check are known at development-time.

More maintenance work

Updating the implementation in MemoryExtensions is cheaper than maintaining a source generator.

Could be provided w/o generator too

As cited above a init-method could be used to create a data-structe that contains the relevant pieces and that is handed to an inbox-implementation. The emitted code from the generator is no magick, more or less one entry-method (for each scalar and vectorized) to which the correct mask is handed over and that's it.

Summary

The generator maybe like taking a sledgehammer to crack a nut, but as it has more time to analyse the given "pattern" it can emit more ideal code. So (at the moment) a generator-based approach is what I'm leaning towards.

Major pain-point is on how to align with the use in the regex-generator.

Notes

The generator is far from ideal, and there are some TODOs left which (I) need to address.
The sets to search are given by a (constant) string, but I could images to extend that with a regex-like character-group/range. I.e. to write the example for validation the authority could be given as [A-Za-z0-9:.-[]@]. (although that may be overkill...)
The emitted helper-methods (see above under the details-toggle) could be provided inbox, so that they don't need to be emitted each time. But I think similar was discussed for the Regex-generator...same could / should apply here too.
Perf from the generator looks like, so the threshould of 5 in MemoryExtensions should be re-evaluated

PS: Now I'm pretty sure to have missed a valid point, as I didn't take notes while developing the generator...mea culpa.

stephentoub commented 1 year ago

Thanks for all your thinking / work on this!

Use in Regex generator

Yes, generators currently can't depend on other generators. I don't know if / when / how that will change, but it's definitely a known limitation.

As cited above a init-method could be used to create a data-structe that contains the relevant pieces and that is handed to an inbox-implementation.

If we were going to utilize a generator as part of a solution here, this is the path I'd want to take: a usable in-box implementation, and then a generator that makes it easier to produce the right inputs and call that helper but that doesn't require a generator to do so.

I don't know if this is a real downside, as most if not all (?) sets to check are known at development-time.

It is. For example, RegexOptions.Compiled should be able to use this as well.

Downsides

A huge downside for me is I'd really, really, really like to avoid spitting use of unsafe code / VectorXx into user assemblies. It's a large amount of code, likely processing untrusted input, and I think we should strongly err on the side of having our generators avoid such output whenever possible. In some situations it might be unavoidable, but I don't believe this is one of those cases.

gfoidl commented 1 year ago

Yep, thanks for your thoughts on this. So I'll try to create an inbox-solution (after a little break 😉).

gfoidl commented 1 year ago

Idea for pure inbox-solution with an init-method:

API

(please don't mind the naming, my brain isn't creative enough for this today)

public static class MemoryExtensions
{
    // ...
    public static IndexOfAnyInitData IndexOfAnyInitialize(ReadOnlySpan<char> values);

    public readonly struct IndexOfAnyInitData
    {
        public bool IsAllAscii       { get; }
        public UInt128 Lookup        { get; }
        public Vector128<sbyte> Mask { get; }

        public IndexOfAnyInitData(bool isAllAscii, UInt128 lookup, Vector128<sbyte> mask);
    }
}

Usage example

private const string AlphaNumeric        = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
private const string AuthoritySpecific   = ":.-[]@";
private const string ValidAuthorityChars = AlphaNumeric + AuthoritySpecific;

private static readonly MemoryExtension.IndexOfAnyInitData s_authorityInitData = MemoryExtension.IndexOfAnyInitialize(ValidAuthorityChars);

public bool IsAuthoryValid(ReadOnlySpan<char> authoritiy) => authority.IndexOfAnyExcept(ValidAuthorityChars, s_authorityInitData) < 0;

Perf-number are the same as above (it's the same worker-code).

Implementation

Workhorse-code is the same above emitted from the generator. Here for reference / completeness. ```c# using System.Diagnostics; using System.Runtime.CompilerServices; using System.Runtime.InteropServices; using System.Runtime.Intrinsics; using System.Runtime.Intrinsics.X86; namespace StrippedCoreLib; public static class MemoryExtensions { public readonly struct IndexOfAnyInitData { public bool IsAllAscii { get; } public UInt128 Lookup { get; } public Vector128 Mask { get; } public IndexOfAnyInitData(bool isAllAscii, UInt128 lookup, Vector128 mask) { this.IsAllAscii = isAllAscii; this.Lookup = lookup; this.Mask = mask; } } public static IndexOfAnyInitData IndexOfAnyInitialize(ReadOnlySpan values) { if (values.Length <= 5) { return default; } foreach (char c in values) { if (!char.IsAscii(c)) { return default; } } UInt128 lookup = 0; sbyte[] mask = new sbyte[128 / 8]; foreach (char c in values) { SetBitInMask(ref lookup, mask, c); } Vector128 vecMask = Vector128.LoadUnsafe(ref MemoryMarshal.GetArrayDataReference(mask)); return new IndexOfAnyInitData(true, lookup, vecMask); static void SetBitInMask(ref UInt128 lookup, sbyte[] mask, int c) { Debug.Assert(c < 128); lookup |= (1UL << c); int highNibble = c >> 4; int lowNibble = c & 0xF; mask[lowNibble] |= (sbyte)(1 << highNibble); } } public static int IndexOfAny(this ReadOnlySpan span, ReadOnlySpan values, IndexOfAnyInitData initData) { if (values.Length <= 5 || !initData.IsAllAscii) { return span.IndexOfAny(values); } return Vector128.IsHardwareAccelerated && span.Length >= Vector128.Count ? IndexOfMatchVectorized(span, initData.Mask) : IndexOfMatchScalar(span, initData.Lookup); } public static int IndexOfAnyExcept(this ReadOnlySpan span, ReadOnlySpan values, IndexOfAnyInitData initData) { if (values.Length <= 5 || !initData.IsAllAscii) { return span.IndexOfAnyExcept(values); } return Vector128.IsHardwareAccelerated && span.Length >= Vector128.Count ? IndexOfMatchVectorized(span, initData.Mask) : IndexOfMatchScalar(span, initData.Lookup); } [MethodImpl(MethodImplOptions.AggressiveInlining)] private static int IndexOfMatchScalar(ReadOnlySpan value, UInt128 lookup) where TNegator : struct, INegator { for (int i = 0; i < value.Length; ++i) { char c = value[i]; bool match = ((1UL << c) & lookup) != 0; match = TNegator.NegateIfNeeded(match); if (match) { return i; } } return -1; } private static int IndexOfMatchVectorized(ReadOnlySpan value, Vector128 bitMaskLookup) where TNegator : struct, INegator { Debug.Assert(Vector128.IsHardwareAccelerated); Debug.Assert(value.Length >= Vector128.Count); // To check if a bit in a bitmask from the Bitmask is set, in a sequential code // we would do ((1 << bitIndex) & bitmask) != 0 // As there is no hardware instrinic for such a shift, we use a lookup that // stores the shifted bitpositions. // So (1 << bitIndex) becomes BitPosLook[bitIndex], which is simd-friendly. // // A bitmask from the bitMaskLookup is created only for values 0..7 (one byte), // so to avoid a explicit check for values outside 0..7, i.e. // high nibbles 8..F, we use a bitpos that always results in escaping. Vector128 bitPosLookup = Vector128.Create( 0x01, 0x02, 0x04, 0x08, 0x10, 0x20, 0x40, 0x80, // high-nibble 0..7 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF // high-nibble 8..F ).AsSByte(); Vector128 nibbleMaskSByte = Vector128.Create((sbyte)0xF); Vector128 zeroMaskSByte = Vector128.Zero; nuint idx = 0; uint mask; ref short ptr = ref Unsafe.As(ref MemoryMarshal.GetReference(value)); if (value.Length >= 2 * Vector128.Count) { nuint end = (uint)(value.Length - 2 * Vector128.Count); do { Vector128 source0 = Vector128.LoadUnsafe(ref ptr, idx); Vector128 source1 = Vector128.LoadUnsafe(ref ptr, idx + 8); Vector128 values = NarrowWithSaturation(source0, source1); mask = CreateEscapingMask(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte); if (mask != 0) { goto Found; } idx += 2 * (uint)Vector128.Count; } while (idx <= end); } // Here we know that 8 to 15 chars are remaining. Process the first 8 chars. if (idx <= (uint)(value.Length - Vector128.Count)) { Vector128 source = Vector128.LoadUnsafe(ref ptr, idx); Vector128 values = NarrowWithSaturation(source, source); mask = CreateEscapingMask(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte); if (mask != 0) { goto Found; } idx += (uint)Vector128.Count; } // Here we know that < 8 chars are remaining. We shift the space around to process // another full vector. nuint remaining = (uint)value.Length - idx; if ((nint)remaining > 0) { remaining -= (uint)Vector128.Count; Vector128 source = Vector128.LoadUnsafe(ref ptr, idx + remaining); Vector128 values = NarrowWithSaturation(source, source); mask = CreateEscapingMask(values, bitMaskLookup, bitPosLookup, nibbleMaskSByte, zeroMaskSByte); if (mask != 0) { idx += remaining; goto Found; } } goto NotFound; Found: idx += GetIndexOfFirstNeedToEscape(mask); return (int)idx; NotFound: return -1; } [MethodImpl(MethodImplOptions.AggressiveInlining)] private static uint GetIndexOfFirstNeedToEscape(uint mask) { // Found at least one byte that needs to be escaped, figure out the index of // the first one found that needs to be escaped within the 16 bytes. Debug.Assert(mask > 0 && mask <= 65_535); uint tzc = uint.TrailingZeroCount(mask); Debug.Assert(tzc < 16); return tzc; } [MethodImpl(MethodImplOptions.AggressiveInlining)] private static uint CreateEscapingMask( Vector128 values, Vector128 bitMaskLookup, Vector128 bitPosLookup, Vector128 nibbleMaskSByte, Vector128 nullMaskSByte) where TNegator : struct, INegator { // To check if an input byte matches or not, we use a bitmask-lookup. // Therefore we split the input byte into the low- and high-nibble, which will get // the row-/column-index in the bit-mask. // The bitmask-lookup looks like: // high-nibble // low-nibble 0 1 2 3 4 5 6 7 8 9 A B C D E F // 0 1 1 0 0 0 0 1 0 1 1 1 1 1 1 1 1 // 1 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // 2 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 // 3 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // 4 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // 5 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // 6 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 // 7 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 // 8 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // 9 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // A 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // B 1 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 // C 1 1 0 1 0 1 0 0 1 1 1 1 1 1 1 1 // D 1 1 0 0 0 0 0 0 1 1 1 1 1 1 1 1 // E 1 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 // F 1 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 // // where 1 denotes a matach, while 0 means no match. // For high-nibbles in the range 8..F every input needs to be escaped, so we // can omit them in the bit-mask, thus only high-nibbles in the range 0..7 need // to be considered, hence the entries in the bit-mask can be of type byte. // // In the bitmask-lookup for each row (= low-nibble) a bit-mask for the // high-nibbles (= columns) is created. Debug.Assert(Vector128.IsHardwareAccelerated); // Perf: the shift needs to be done as Int32, as there exists a hw-instruction and no sw-emulation needs to be done. // Cf. https://github.com/dotnet/runtime/issues/75770 // Due to the Int32-shift we need to mask out any remaining bits from the shifted nibble. Vector128 highNibbles = Vector128.ShiftRightLogical(values.AsInt32(), 4).AsSByte() & nibbleMaskSByte; Vector128 lowNibbles = values & nibbleMaskSByte; Vector128 bitMask = Shuffle(bitMaskLookup, lowNibbles); Vector128 bitPositions = Shuffle(bitPosLookup, highNibbles); Vector128 mask = bitPositions & bitMask; Vector128 comparison = Vector128.Equals(nullMaskSByte, mask); comparison = TNegator.NegateIfNeeded(comparison); return comparison.ExtractMostSignificantBits(); } [MethodImpl(MethodImplOptions.AggressiveInlining)] private static Vector128 NarrowWithSaturation(Vector128 v0, Vector128 v1) { Debug.Assert(Vector128.IsHardwareAccelerated); // TODO: https://github.com/dotnet/runtime/issues/75724 if (Sse2.IsSupported) { return Sse2.PackSignedSaturate(v0, v1); } else { // This is not the exact algorithm for saturation, but for the use-case // here it's does what it should do. I.e. eliminate non-ASCII chars in the // results. Vector128 v0HighNibbles = Vector128.ShiftRightLogical(v0, 8); Vector128 v1HighNibbles = Vector128.ShiftRightLogical(v1, 8); Vector128 ascii0 = Vector128.Equals(Vector128.Zero, v0HighNibbles); Vector128 ascii1 = Vector128.Equals(Vector128.Zero, v1HighNibbles); v0 &= ascii0; v1 &= ascii1; return Vector128.Narrow(v0, v1); } } [MethodImpl(MethodImplOptions.AggressiveInlining)] private static Vector128 Shuffle(Vector128 vector, Vector128 indices) { // Perf: Ssse3.Shuffle produces better code return Sse3.IsSupported ? Ssse3 .Shuffle(vector, indices) : Vector128.Shuffle(vector, indices); } public interface INegator { static abstract bool NegateIfNeeded(bool equals); static abstract Vector128 NegateIfNeeded(Vector128 equals); } public struct DontNegate : INegator { public static bool NegateIfNeeded(bool equals) => equals; public static Vector128 NegateIfNeeded(Vector128 equals) => equals; } public struct Negate : INegator { public static bool NegateIfNeeded(bool equals) => !equals; public static Vector128 NegateIfNeeded(Vector128 equals) => ~equals; } } ```

Notes

If an init-method is used to "prepare" IndexOfAny{Except} so all necessary data should be returned. Thus instead of plain UInt128 a bit more info is in the IndexOfAnyInitData:

IsAllAscii needed check if the vectorized approach can be used or if a fallback to IndexOfAny is needed -- I think this flag should be passed to the fallback too, in order to avoid the try for the probabilistic map
Lookup for the table lookup if span is too short to vectorized -- maybe we don't need this as if span is too short to vectorize just fallback to scalar-scan there?
Mask the bitmask for the vectorized approach

stephentoub commented 1 year ago

Thanks, @gfoidl!

lookup |= (1UL << c);

Is that actually doing the "right thing" with UInt128? I'd have expected this to produce the ulong, which will overflow after 64 bits, and then the result of that is what'll be or'd in.

IsAllAscii

This feels like extra complication we could instead address just by naming, e.g. putting this on the new Ascii class, or having Ascii in the method name like IndexOfAnyAscii / IndexOfAnyExceptAscii. If you passed a non-ASCII value to the initialize method, it would throw. My expectation is the 99% use case for this method is a known set of input characters, and if you know any isn't ASCII, then don't use this method.

Vector128<sbyte>

Why sbyte rather than byte?

UInt128 Lookup

Does this significantly help? If the only thing you had was a Vector128<byte> produced by the init method, what would the implementation do differently? I'm trying to simplify this, and wondering if we could get to the point where the init method just returns a Vector128<byte>, IndexOfAnyAscii just takes a Vector128<byte>, and the scheme for populating that Vector128<byte> is documented so you wouldn't even need the InitializeIndexOfAnyAscii method if you didn't want it.

gfoidl commented 1 year ago

UInt128 Lookup

If the only thing you had was a Vector128<byte>

In the implementation (last comment) there is a difference between the scalar- and vectorized-path as perf-optimization.

Scalar is basically bit-test ((1 << c) & lookup) != 0. There's no intrinsic to do such a thing vectorized, so the approach there is to split the byte into nibbles, then use these nibbles as shuffle-mask for the lookup and bit-test.

The scalar-path could to the same nibble-splitting to get rid of the UInt128-mask, but I guess this decreases perf. The question is that it matter? Talking about {byte, char} vectorization kicks in when span.Length >= {16, 8}, so I think it matters but it should be measured to have numbers for this.

init method just returns a Vector128<byte>, IndexOfAnyAscii just takes a Vector128<byte>, and the scheme for populating that Vector128<byte> is documented

I think to understand your motivation for this, but I see these drawbacks:

Implementation tied to the API-surface Especially if the scheme for populating the mask is documented and part of the public API, then we can't change the implementation in a easy way.
Can't use more sophisticated algorithms Say we find a clever way to handle special set-chars in more performant way that doesn't use Vector128<byte> as mask but something different. That can't be expressed with such an API.

For this API I imagine that based on the shape of set-chars different algorithms could be applied. E.g. the set consists of a few numbers, or control-characters, etc. There may be more specific ways to handle that than the "generic" algorithm based on nibble-splitting and the lookup.

Thus for this iteration of my thinking about this the API should look like:

public static class MemoryExtensions
{
    public abstract class IndexOfAnyInitData { }

    public static IndexOfAnyInitData IndexOfAnyInitialize(ReadOnlySpan<char> values);
}

So we gain the freedom to return for IndexOfAnyInitData whatever seems to be the best fit for the given values.

Downside is that IndexOfAnyInitData needs to be allocated. For storing in a static readonly not really a problem, but maybe for RegexOptions.Compiled not as easy as passing a value-type around?!

An implementation could look like then:

private sealed class BitMaskIndexOfAnyInitData : IndexOfAnyInitData
{
    public UInt128 Lookup        { get; init; }
    public Vector128<sbyte> Mask { get; init; }
}

public static IndexOfAnyInitData IndexOfAnyInitialize(ReadOnlySpan<char> values)
{
    foreach (char c in values)
    {
        if (!char.IsAscii(c))
        {
            throw new InvalidOperationException("Non ASCII here");
        }
    }

    // Analyze the values here and determine the best algorithm to use.
    // Return the init-data specific to that algorithm.
    // E.g.
    return new BitMaskIndexOfAnyInitData { Lookup = lookup, Mask = vecMask };
}

public static int IndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyInitData initData)
{
    // Switch on the initData to select the proper algorithm

    if (initData is BitMaskIndexOfAnyInitData bitMaskIndexOfAnyInitData)
    {
        return Vector128.IsHardwareAccelerated && span.Length >= Vector128<sbyte>.Count
            ? BitMaskIndexOfMatchVectorized<Negate>(span, bitMaskIndexOfAnyInitData.Mask)
            : BitMaskIndexOfMatchScalar<DontNegate>(span, bitMaskIndexOfAnyInitData.Lookup);
    }

    throw new NotSupportedException("At the moment only the bitmask-appraoch is supported");
}

public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, IndexOfAnyInitData initData)
{
    // Analogous to above method
}

Why sbyte rather than byte?

The code only works for ASCII-sets, so sbyte as the >= 0 values are the ASCII-range. Though from an API point of view it doesn't really matter, as Vector128<byte> can be re-interpreted as Vector128<sbyte> easily. sbyte is mainly motivated from the implementation which works on Vector128<sbyte>. I have no strong stance here, and whatever seems better for the API I'm fine with.

If you passed a non-ASCII value to the initialize method, it would throw. My expectation is the 99% use case for this method is a known set of input characters, and if you know any isn't ASCII, then don't use this method.

:+1: I'd expect the same.

Now I also think this method shouldn't fallback to the current IndexOfAny-methods. In the implementation from last comment it would fallback if <= 5 set-chars or if not all are ASCII. For the ASCII I like what you suggested (just throw from init in case of non-ASCII encountered).

Don't fallback for the case of <= 5 set-chars provides another choice for the consumer which API to use which can be based on custom benchmarks. In any way we should reconsider the threshould of 5 set-chars. And / or maybe also move the probabilistic map initialization into the init-method here (with its own IndexOfAnyInitData derived-type).

lookup |= (1UL << c);

Oh, you are right. I though to have checked that (can't remember what I tested actually).

UInt128 a = 0;
UInt128 b = UInt128.MaxValue;
UInt128 c = (1UL << 127);
UInt128 d = (UInt128.One << 127);

Print(a);
Print(b);
Print(c);
Print(d);

static void Print(UInt128 value) => Console.WriteLine($"0x{value:X32}");

0x00000000000000000000000000000000
0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF

0x00000000000000008000000000000000
                  ^ (overflowed)
0x80000000000000000000000000000000
  ^ (correct)

stephentoub commented 1 year ago

UInt128 Lookup

If the only thing you had was a Vector128

In the implementation (last comment) there is a difference between the scalar- and vectorized-path as perf-optimization.

Sure, but the UInt128 and the Vector128<byte> are just two different representations of the same information. How expensive is it to convert from one to the other? My question/suggestion was basically whether we could get away with the method just taking one of them, and if the other was needed, convert it once in IndexOfAnyAscii. For example, just always take a UInt128 with the relevant bits set, and then create a Vector128<byte> from it if there are enough items being searched to warrant vectorization... what's the overhead of that? If it's significant, obviously we don't want to do that. If it's a few instructions, the simplification could be worthwhile, especially since at that point we probably wouldn't need the initialization method at all: anyone who wanted to call this could either hardcode the UInt128 value from having precomputed it, or write the two-line loop themselves:

UInt128 lookup = 0;
foreach (char c in "...") lookup |= (UInt128.One << c);

I'm mainly just trying to understand / explore options at this point. Note that I care about overheads for small inputs and also vectorized larger inputs when the match is found earlier; I care a lot less about overheads when Vector128 isn't hardware accelerated.

Downside is that IndexOfAnyInitData needs to be allocated. For storing in a static readonly not really a problem

It could be. Not that it's impossible to do, but it gets awkward. For example, I'd like to be able to use this in source generated regexes. If this were just relevant to finding the next place a regex could match, we'd have only one of them. But we use {Last}IndexOf{Any} in a bunch of places, e.g. if there's a set after a (lazy) loop, we'll use {Last}IndexOf{Any} to optimize the backtracking, so with this API we'd end up needing to store one of these per such construct.

I also have a visceral negative reaction to a source generator not being able to compute everything it needs at compile time and instead defer work to startup time. (Even if it's so cheap it's unfounded, and even if solutions like NativeAOT's ability to execute static cctors at compile time would obviate even those concerns.)

but maybe for RegexOptions.Compiled not as easy as passing a value-type around?!

Right. Compiled uses DynamicMethod: we can't add fields.

I have no strong stance here, and whatever seems better for the API I'm fine with.

I'd prefer byte.

So we gain the freedom to return for IndexOfAnyInitData whatever seems to be the best fit for the given values.

There are certainly benefits to keeping the data opaque such that the IndexOfAnyAscii and its associated initialization method can be evolved. I'd like to better understand the cost of converting the UInt128 simple lookup form into a Vector128<byte> form; if that's super cheap such that doing it on every IndexOfAnyAscii invocation that needs the vector would be barely measurable, I think that's still my preference. If it's more costly, the opaque approach is probably the right one, even though it complicates the source generator scenario.

gfoidl commented 1 year ago

How expensive is it to convert from one to the other?

Quick benchmark for any except of ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789:.-[]@ (authority from kestrel).

WithInitData is the approach outlined in https://github.com/dotnet/runtime/issues/68328#issuecomment-1253682987.

WithUInt128 is foreach (char c in "...") lookup |= (UInt128.One << c); with these APIs:

public static UInt128 IndexOfAnyInitialize(ReadOnlySpan<char> values);
public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, UInt128 initData);

so the vectorized version needs to convert the bit-array into the bit-mask needed at invocation.

WithVector is the nibble-splitting (for both vectorized + scalar) with these APIs:

public static Vector128<byte> IndexOfAnyInitialize(ReadOnlySpan<char> values);
public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, Vector128<byte> initData);

This yields interesting results, which I did not expect to be that way (especially that nibble-splitting is so fast).

|       Method |     Authority |       Mean | Ratio |
|------------- |-------------- |-----------:|------:|
| WithInitData |        [fe80] |  11.649 ns |  1.00 |
|  WithUInt128 |        [fe80] |  10.206 ns |  0.89 |
|   WithVector |        [fe80] |   6.602 ns |  0.57 |
|              |               |            |       |
| WithInitData | hostname:8080 |   5.035 ns |  1.00 |
|  WithUInt128 | hostname:8080 | 252.096 ns | 50.07 |
|   WithVector | hostname:8080 |   4.581 ns |  0.91 |

Explanation first set with authority [fe80]: WithInitData and WithUInt128 have the same algorithm, but WithUInt128 has less overhead, as no check for the correct type of init-data (see comment above). Shifting UInt128.One << c is more expensive than splitting into nibbles and doing simple bit-operations. Thus WithVector is way faster (also the loop-size is just about a third of the ones which shifts the UInt128).

Explanation second set with authority hostname:8080: WithInitData and WithVector have the same algorithm, but WithVector has less overhead, as no check for the correct type of init-data (see comment above). WithUInt128 is really slow, as the bit-mask needed for the vectorized approach needs to be constructed out of the lookup |= (UInt128.One << c) data.

So concluding I'd go with following approach:

public static class MemoryExtensions
{
    public static Vector128<byte> IndexOfAnyInitialize(ReadOnlySpan<char> values);

    public static int IndexOfAny(this ReadOnlySpan<char> span, Vector128<byte> initData);
    public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, Vector128<byte> initData);
}

Questions to this:

Should the the scheme for populating that Vector128<byte> be documented officially? This would become part of the API-contract, so any change is prohibited. Although when used from a generator is also like setting that schema into stone, as updates to the scheme won't be reflected in binaries that are created from generatored code. TBH I can't imagine that this scheme will change, as it's simple and performant (which is what I like).

Should the type for init-data be Vector128<byte> or UInt128? Perf-wise it doesn't matter, they can be reinterpreted to each other as they have the same bit-size. Stick to Vector128<byte> to put some emphasis that the code relies on vectorization? On the other hand there's no (?) API that exposes vector-types (expect the intrinsics ifself, of course), so it may be more natural to have UInt128.

There are certainly benefits to keeping the data opaque such that the IndexOfAnyAscii and its associated initialization method can be evolved.

In case we see some common patterns of sets that could use a better algorithms, then another overload with an opaque init-data could be added.

With this approach all our points should be satisfied? (Mine are :wink:)

stephentoub commented 1 year ago

Thanks!

Quick benchmark

Can you share the benchmark code?

Should the the scheme for populating that Vector128 be documented officially?

I think so. Whether or not we do, any changes to it are breaking, as the value can and will be persisted in binaries. This is why I was hoping the input could be a simple bitmap that the API would then transform as it needs. I'm surprised that transformation takes 250ns, that there's no way to reinterpret a UInt128 into a Vector128<byte> and then do manipulations on that vector to transform it from the simple bitmap into the required shape much faster than that.

Should the type for init-data be Vector128<byte> or UInt128?

I'd prefer UInt128. The code doesn't require vectorization (it'll function fine if Vector128.IsHardwareAccelerated is false and we take a fallback path), and Vector128<T> looks scary.

gfoidl commented 1 year ago

For code see https://github.com/gfoidl-Tests/SourceGenerator-Vectors/tree/runtime-inbox-solution

For bitmap to bitmask for vectors there may be other approaches too. The used algorithm is just simple enough, SIMD-friendly and performant. I can eloberate more tomorrow (being on mobile now, which is a bit unhandy for this).

stephentoub commented 1 year ago

Thanks for sharing.

I'm surprised that transformation takes 250ns, that there's no way to reinterpret a UInt128 into a Vector128 and then do manipulations on that vector to transform it from the simple bitmap into the required shape much faster than that.

For reference, I was thinking of solutions like https://stackoverflow.com/questions/54408726/whats-the-fastest-way-to-perform-an-arbitrary-128-256-512-bit-permutation-using, but without requiring AVX2. I tried an approach involving a lookup table, where for every byte of the UInt128 we had a lookup table of 256 entries each of which had the 16 bytes for the Vector128 it'd map to, and then doing the transform would or together the 16 vectors from the appropriate location in the lookup table. Beyond requiring the untenable 65K for storing that table, it also took ~9ns on my machine, which while way better than the 250ns in your example, is still too slow. Unless @tannergooding knows of some magic to do this in just a handful of instructions, we're probably still left with the two-method solution.

So concluding I'd go with following approach

Seems like it. Something like:

public static class MemoryExtensions
{
+    public static Vector128<byte> CreateIndexOfAnyAsciiVector(ReadOnlySpan<char> asciiValues);

+    public static int IndexOfAny(this ReadOnlySpan<byte> span, Vector128<byte> asciiVector);
+    public static int IndexOfAnyExcept(this ReadOnlySpan<byte> span, Vector128<byte> asciiVector);

+    public static int IndexOfAny(this ReadOnlySpan<char> span, Vector128<byte> asciiVector);
+    public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, Vector128<byte> asciiVector);
}

gfoidl commented 1 year ago

we're probably still left with the two-method solution.

For the generator use-case: the creation of the mask is just a couple of lines, and if this is tied to the API (cf. notes about documentation above) then at runtime it's just the call to the worker-method.

I thought you prefer UInt128? So my API proposal looks like

public static class MemoryExtensions
{
+    [CLSCompliant(false)]
+    public static UInt128 CreateIndexOfAnyAsciiBitData(ReadOnlySpan<char> asciiValues);

+    [CLSCompliant(false)]
+    public static int IndexOfAny(this ReadOnlySpan<byte> span, UInt128 asciiBitData);
+    [CLSCompliant(false)]
+    public static int IndexOfAnyExcept(this ReadOnlySpan<byte> span, UInt128 asciiBitData);

+    [CLSCompliant(false)]
+    public static int IndexOfAny(this ReadOnlySpan<char> span, UInt128 asciiBitData);
+    [CLSCompliant(false)]
+    public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, UInt128 asciiBitData);
}

dropped Vector from the name, use BitData instead
asciiValues seems to generic too me, there should be a mental link to the pre-computed bit data, so changed it to asciiBitData

Though I like UInt128 more to be in the signature, one thing to consider is the case when the bit data is computed at compile time (e.g. generator, high-perf orientated user) and can be baked in as constant. It's easier to specifiy a constant vector (inline create definition) than a "constant" UInt128 (static readonly field (not an option for DynamicMethod) or ad-hoc creation).

With the vector as argument, the Regex-generator could emit code like

int index = span.IndexOfAny(Vector128.Create(0x42, 0x...));

runtime-wise this has best perf too, as the vector constant is read from data-section. In contrast with UInt128:

int index = spa.IndexOfAny(new UInt128(0xC0, 0x01));

needs to be constructed on the stack, then reinterpreted as vector at runtime (if vectorized path is taken).

Side note for generator use if vector is used as argument: you can borrow from here (though it's really simple with vector).

gfoidl commented 1 year ago

For https://github.com/dotnet/runtime/issues/76106 and from https://github.com/dotnet/runtime/issues/68328#issuecomment-1251123533 (at the end)

The sets to search are given by a (constant) string, but I could images to extend that with a regex-like character-group/range. I.e. to write the example for validation the authority could be given as [A-Za-z0-9:.-[]@]

This was something I had in mind what the init-method could determine itself (coherent range), but that would need a (abstract) class as return type.

Having the explicit overloads as you proposed seems nicer and easier to use.

MihaZupan commented 1 year ago

@gfoidl when you were benchmarking different approaches, did you see what the perf difference is for the scalar path when using "any precomupted data" vs the "Vector128 asciiVector"?

I.e. is there a noticeable difference for the length < Vector.Count case if the only mask you have available is optimized for the vectorized path vs. if you have "IndexOfAnyInitData" that contains both the mask for vectorization and anything to speed up the scalar path (different mask / bool[128] / bloom filter etc.)?

For reference, the CharacterMap class in Markdig is effectively an IndexOfAny searcher and we store both the mask for vectorization and a bool[128] for the fallback.

gfoidl commented 1 year ago

For this I did just the benchmark from https://github.com/dotnet/runtime/issues/68328#issuecomment-1254038540 where Authority := [fe80].

The focus so far was mostly on exploring how the API may look like without introducing inherit perf-traps. Thus I didn't run benchmarks excessive.

Sure with specialized types for scalar and vectorized code-path one could fine-tune perf, but I think the pure "Vector128 asciiVector" is a good and fast compromise here. Would you prefer a split of the pathes? (as done in markdig and with the "IndexOfAnyInitData")

MihaZupan commented 1 year ago

I'm mainly just trying to avoid us locking ourselves into a single implementation that may not be as efficient as possible for all inputs in all cases.

For example, if the API was instead

public sealed class StringSearcher
{
    public int Search(ReadOnlySpan<char> input);
    public static StringSearcher IndexOfAny(ReadOnlySpan<char> asciiChars);
    public static StringSearcher IndexOfAnyExcept(ReadOnlySpan<char> asciiChars);
}

The usage for most users would be the same while leaving all implementation details hidden. The generated Regex code may be different (static readonly field instead of a "magic constant" at the call site), but I'm not sure that matters - you still have a shared helper.

I'll try to do some benchmarking this week too to see if there is any difference here.

With this sort of API, you may not even need #76106. The ranges could just be a detected special case in the implementation and there may not be any perf benefit from having explicit APIs.

stephentoub commented 1 year ago

The ranges could just be a detected special case in the implementation and there may not be any perf benefit from having explicit APIs.

There's the benefits called out of not having to do all the analysis at startup, of not having to maintain a separate caching field for the constructed StringSearcher, of keeping the code locally readable rather than having to trace the static field being used back to where it's defined to know what's being searched for, of efficiently supporting non-ASCII ranges, of working with types other than char, etc.

MihaZupan commented 1 year ago

Right, I'm not saying we shouldn't consider the range APIs explicitly - personally the "readability at the call site" is reason enough.

I'm just noting the sort of flexibility an alternative API shape gives you here - for most of the called out internal use-cases of range, there may not be a noticeable difference in searching throughput.

MihaZupan commented 1 year ago

The performance of the scalar path based on what sort of data you have precomputed:

X86:	Method	Length	Mean
VectorizedMask_Char	1	2.391 ns	0.0133 ns
VectorizedMask_Byte	1	2.051 ns	0.0065 ns
BitVector_Char	1	1.987 ns	0.0069 ns
BitVector_Byte	1	1.544 ns	0.0062 ns
VectorizedMask_Char	7	9.666 ns	0.0274 ns
VectorizedMask_Byte	7	9.417 ns	0.0314 ns
BitVector_Char	7	7.658 ns	0.0173 ns
BitVector_Byte	7	5.985 ns	0.0167 ns
VectorizedMask_Char	1000	1,227.1 ns	3.37 ns
VectorizedMask_Byte	1000	1,227.4 ns	3.39 ns
BitVector_Char	1000	952.6 ns	2.21 ns
BitVector_Byte	1000	719.9 ns	1.34 ns

ARM64:	Method	Length	Mean
VectorizedMask_Char	7	10.194 ns	0.0026 ns
VectorizedMask_Byte	7	9.844 ns	0.0003 ns
BitVector_Char	7	8.025 ns	0.0063 ns
BitVector_Byte	7	5.620 ns	0.0733 ns
VectorizedMask_Char	1000	1,562.273 ns	0.1715 ns
VectorizedMask_Byte	1000	1,508.891 ns	0.0858 ns
BitVector_Char	1000	1,258.821 ns	0.0466 ns
BitVector_Byte	1000	924.538 ns	0.0451 ns

Where BitVector is a uint[8] prepared ahead of time and VectorizedMask only has access to the mask optimized for the vectorized path.

The Length = 7 case is the worst-case scalar path before we switch to the vectorized path, and Length = 1000 is an estimate for how the API would perform on platforms without the vectorized path.

It's possible I missed some optimization that could be done in the VectorizedMask path - please let me know: Benchmark code.

If these numbers are correct, we may want to avoid an API that only uses Int128.

MihaZupan commented 1 year ago

Trying to move this along a bit, how about something like this? This lets us precompute other kinds of info to speed up the non-vectorized path as well - see https://github.com/dotnet/runtime/issues/68328#issuecomment-1263555067.

Example impl: dae17b40840ef294943ef6171d035802ba052d92

namespace System.Buffers.Text // Does this namespace make sense?
{
    public sealed class IndexOfAnyAsciiSearcher
    {
        public IndexOfAnyAsciiSearcher(ReadOnlySpan<char> asciiValues) { throw null; }

        public int IndexOfAny(ReadOnlySpan<byte> searchSpace) { throw null; }
        public int IndexOfAny(ReadOnlySpan<char> searchSpace) { throw null; }
        public int IndexOfAnyExcept(ReadOnlySpan<byte> searchSpace) { throw null; }
        public int IndexOfAnyExcept(ReadOnlySpan<char> searchSpace) { throw null; }
        public int LastIndexOfAny(ReadOnlySpan<byte> searchSpace) { throw null; }
        public int LastIndexOfAny(ReadOnlySpan<char> searchSpace) { throw null; }
        public int LastIndexOfAnyExcept(ReadOnlySpan<byte> searchSpace) { throw null; }
        public int LastIndexOfAnyExcept(ReadOnlySpan<char> searchSpace) { throw null; }

        // These are useful so you don't have to keep the 'asciiValues' input around for single-value queries
        // It's also much cheaper compared to 'asciiValues.Contains(c)`
        public bool Contains(byte b) { throw null; }
        public bool Contains(char c) { throw null; }
    }
}

Example usage:

internal static class HttpRuleParser
{
    private static readonly IndexOfAnyAsciiSearcher s_tokenChars =
        new("!#$%&'*+-.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ^_`abcdefghijklmnopqrstuvwxyz|~");

    public static bool IsToken(ReadOnlySpan<byte> input) =>
        s_tokenChars.IndexOfAnyExcept(input) < 0;

    public static bool IsTokenChar(char c) =>
        s_tokenChars.Contains(c);
}

gfoidl commented 1 year ago

With the IndexOfAnyAsciiSearcher as separate type we get the same downsides as discussed above for the regex-usecase.

IL-generated code can't use fields to cache the information
source generator can't pre-computes all the values by it's own and feed it into the worker-methods

Besides that I like this approach. Maybe we should extend that proposal with

namespace System.Buffers.Text // Does this namespace make sense?
{
    public sealed class IndexOfAnyAsciiSearcher
    {
        // Like proposed

+       public static int IndexOfAnyAscii(Vector128<byte> preComputedMask) { throw null; }
        // and the other variations as static methods
    }
}

so that generators, etc. can use the entry with the pre-computed mask and all other users can init the type and re-use that.

Downside is that the preComputedMask can't be updated in it's scheme easily. Also it's a bit "double the trouble" which I don't like.

MihaZupan commented 1 year ago

IL-generated code can't use fields to cache the information

Are you sure about this? AFAIK you can emit anything at runtime (even types & fields).

Even if you couldn't, can't you always cheat by storing the cached state with the regex compiler and reference it from generated code?

class RegexCompiler
{
    private static IndexOfAnyAsciiSearcher[] _searchers
}

Generated code:

// ...
var searcher = RegexCompiler._searchers[SomeConstant]; // Or some unsafe code to skip the length check

gfoidl commented 1 year ago

Sorry, I meant dynamic method -- as used by RegexOptions.Compiled. So there are two use cases for Regex: source generator and RegexOptions.Compiled.

MihaZupan commented 1 year ago

Right, the source generator should obviously work, but I was referring to the emitted code.

We're currently only emitting the methods, but I don't know that there's anything preventing us from emitting a dedicated class/fields as well.

With the workaround I was referring to the generated IL, AFAIK you can reference private fields from different classes like that.

jkotas commented 1 year ago

I don't know that there's anything preventing us from emitting a dedicated class/fields as well.

Dynamically emitted types are much more expensive than dynamically emitted methods.

MihaZupan commented 1 year ago

Here's my (untested) attempt at using this API in the Regex source generator and RegexCompiler: https://github.com/MihaZupan/runtime/commit/c9bb7368b14b378245b2831563bab8c3a1d648b3

With the generated IL essentially being (see changes in RegexCompiler):

int i = RegexCompiler.s_asciiSearchers[42].IndexOfAny(slice.Slice(offset));

Obviously, Stephen should chime in here, but this doesn't seem that bad to me to the point that we'd need additional APIs which have their own downsides.

stephentoub commented 1 year ago

AFAIK you can emit anything at runtime (even types & fields).

Not with DynamicMethod, which is what Regex uses.

int i = RegexCompiler.s_asciiSearchers[42].IndexOfAny(slice.Slice(offset));

This appears to suffer from a leak, as creating a new Regex with a not-yet-seen before combination of ASCII chars will expand the array. How is that addressed? There could in theory be up to 2^128 of them. DynamicMethods are collectible, so once a compiled Regex is no longer used / referenced, all the state associated with it today can go away, but this leaves some of it behind.

but this doesn't seem that bad to me to the point that we'd need additional APIs which have their own https://github.com/dotnet/runtime/issues/68328#issuecomment-1263555067.

I'm not following what that link is intending to show. That's not a comparison between an API that takes a searcher object and an API that takes an Int128; it's a comparison between APIs that take two different versions of a single Int128. If there was really a meaningful difference, the API could take two precomputed bitmaps, one for the scalar path and one for the vector path. Also, the benefit of a cited searcher object like this would be that its implementation could support varying representations, in which case there's likely a virtual dispatch cost that needs to be factored into a comparison.

Trying to move this along a bit, how about something like this?

I continue to have the same concerns about this that I previously shared: 1) works well with RegexOptions.Compiled and 2) the source generator is able to compute everything ahead of time. However, if we end up deciding to accept the cited downsides, we should still aim for whatever object that's created to simply be a wrapper for the computed data, and you'd pass that object to {Last}IndexOfAny{Except} overloads on MemoryExtensions/string, rather than having yet another place someone needs to look for these methods / another type that exposes such methods / etc. If it's useful as an implementation detail, that object can have internal methods invokable by the public {Last}IndexOfAny{Except} overloads on MemoryExtensions/string.

MihaZupan commented 1 year ago

That's not a comparison between an API that takes a searcher object and an API that takes an Int128; it's a comparison between APIs that take two different versions of a single Int128.

It's trying to showcase that an implementation that only accepted an Int128 optimized for the vector path will have suboptimal performance for short inputs (before the vectorized path can kick in).

I can update the benchmark to essentially have something like this at the start

if (length < 8) return VectorizedPath(input, _vectorizedBitmap);

for (...) // Existing loop

[NoInlining]
private void VectorizedPath(...) { }

Also, the benefit of a cited searcher object like this would be that its implementation could support varying representations, in which case there's likely a virtual dispatch cost that needs to be factored into a comparison.

What I had in mind was an object like this.

private readonly Vector128<byte> _bitmap;
private readonly bool _needleContainsZero;
private readonly BitVector256 _lookup; // A 32-byte struct

public int IndexOfAny(ReadOnlySpan<byte> searchSpace)
{
    if (searchSpace.Length >= Vector128<short>.Count)
        return IndexOfAnyVectorized(searchSpace, _bitmap);

    for (int i = 0; i < searchSpace.Length; i++)
        if (_lookup.Contains(searchSpace[i]))
            return i;

    return -1;
}

This appears to suffer from a leak, as creating a new Regex with a not-yet-seen before combination of ASCII chars will expand the array. How is that addressed?

I didn't realize compiled instances can get recycled and that the runtime is able to cleanup everything. Would storing such data on the CompiledRegexRunner instance be an option?

I see we already store _culture this way.

2) the source generator is able to compute everything ahead of time

Are you referring to the runtime cost of computing the bitmaps from the string literal (the IndexOfAnyAsciiSearcher ctor)?

we should still aim for whatever object that's created to simply be a wrapper for the computed data, and you'd pass that object to {Last}IndexOfAny{Except} overloads on MemoryExtensions/string, rather than having yet another place someone needs to look for these methods / another type that exposes such methods / etc. If it's useful as an implementation detail, that object can have internal methods invokable by the public {Last}IndexOfAny{Except} overloads on MemoryExtensions/string.

I'd be perfectly fine with that. I'm mostly pushing for giving ourselves the option to precompute whatever state we deem beneficial without locking ourselves in too much by the API choice.

One thing I like more this way is that the parameter order stays the same at the call site. That is, haystack.IndexOfAny(preparedNeedle) instead of preparedNeedle.IndexOfAny(haystack).

How about this? (doesn't really change anything for the Regex case ofc)

namespace System
{
    public static partial class MemoryExtensions
    {
        public static int IndexOfAny(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues values) { throw null; }
        public static int IndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues values) { throw null; }
        public static int IndexOfAnyExcept(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues values) { throw null; }
        public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues values) { throw null; }
        public static int LastIndexOfAny(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues values) { throw null; }
        public static int LastIndexOfAny(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues values) { throw null; }
        public static int LastIndexOfAnyExcept(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues values) { throw null; }
        public static int LastIndexOfAnyExcept(this ReadOnlySpan<char> span, IndexOfAnyAsciiValues values) { throw null; }
        // And repeat the overloads for Span
    }

    public sealed class IndexOfAnyAsciiValues
    {
        public IndexOfAnyAsciiValues(ReadOnlySpan<char> asciiValues) { throw null; }
    }
}

If there was really a meaningful difference, the API could take two precomputed bitmaps, one for the scalar path and one for the vector path.

From my experiments, it does make a meaningful difference.

For the byte inputs, it's even better if you can have a 32-byte lookup instead to save the c < 128 check. This is worth it for the 16 extra bytes if you have a dedicated class for this, but likely wouldn't be if you now have to pass more state on the stack.

stephentoub commented 1 year ago

Would storing such data on the CompiledRegexRunner instance be an option?

We might be able to do that. I'd be interested in understanding how much overhead would be added to short inputs / early match case for having to load the array from the field and index into it just to get the searcher object that is then usable for performing the search.

Are you referring to the runtime cost of computing the bitmaps from the string literal (the IndexOfAnyAsciiSearcher ctor)?

That (including any JIT'ing costs in support of the infrastructure to perform that computation, allocation costs, etc), and any potential benefits that could come from the bitmaps actually being a compile-time constant, either with the JIT at run-time or with Native AOT. Presumably the overheads here are small, such that unless you had a huge number of regexes, it would be unlikely to significantly impact startup time, but it'd be good to get numbers around that.

How about this?

Right, that's essentially what @gfoidl outlined in https://github.com/dotnet/runtime/issues/68328#issuecomment-1251375718. His approach just had the type as abstract instead of sealed, with a factory method instead of a constructor, so that the factory could create different concrete instantations based on the input provided, e.g. if you gave it inputs that were just a range, using IndexOfAnyInRange as the implementation. It could possibly also enable https://github.com/dotnet/runtime/issues/62447 just by adding a new overload of that factory method.

MihaZupan commented 1 year ago

Based on https://github.com/dotnet/runtime/issues/68328#issuecomment-1263555067, you'd be looking at a ~3 ns difference per IndexOfAny in the worst-case (length=7 inputs) if you only had the Int128. If you then cut into that by an extra if check and an array lookup ... I guess it doesn't make sense to complicate the API further.

I retract my objections over IndexOfAny(Span<char> span, Int128 someConstant). Thank you for keeping up with my spam :)

So we're back at

public static partial class MemoryExtensions
{
    public static Vector128<byte> CreateIndexOfAnyAsciiBitmap(ReadOnlySpan<char> asciiValues) { throw null; }

    public static int IndexOfAny(this ReadOnlySpan<byte> span, Vector128<byte> asciiBitmap) { throw null; }
    public static int IndexOfAnyExcept(this ReadOnlySpan<byte> span, Vector128<byte> asciiBitmap) { throw null; }
    public static int LastIndexOfAny(this ReadOnlySpan<byte> span, Vector128<byte> asciiBitmap) { throw null; }
    public static int LastIndexOfAnyExcept(this ReadOnlySpan<byte> span, Vector128<byte> asciiBitmap) { throw null; }

    public static int IndexOfAny(this ReadOnlySpan<char> span, Vector128<byte> asciiBitmap) { throw null; }
    public static int IndexOfAnyExcept(this ReadOnlySpan<char> span, Vector128<byte> asciiBitmap) { throw null; }
    public static int LastIndexOfAny(this ReadOnlySpan<char> span, Vector128<byte> asciiBitmap) { throw null; }
    public static int LastIndexOfAnyExcept(this ReadOnlySpan<char> span, Vector128<byte> asciiBitmap) { throw null; }
    // And repeat the overloads for Span
}

With open questions of

Vector128<byte> vs Int128
bitmap vs bitmask vs vector vs bitData naming?

gfoidl commented 1 year ago

Vector128<byte> vs Int128

Perf-wise Vector128<byte> as outlined in https://github.com/dotnet/runtime/issues/68328#issuecomment-1254668435 (below the horizontal rule) -- in short: it's easier to specify vector-constants (pre-computed, etc.) than Int128.

Further open question is how to handle non-ASCII inputs.

For CreateIndexOfAnyAsciiBitmap (or hower it will be named) we stated it should throw for non-ASCII. Which exception-type?
What happens for the {Last}IndexOfAny{Except} method if for span non-ASCII input is given? Throw or return some index, and for the latter case what should be "some index" that has consistent behavior between OfAny and OfAnyExcept?

stephentoub commented 1 year ago

What happens for the {Last}IndexOfAny{Except} method if for span non-ASCII input is given?

This must be supported. What's the concern?

MihaZupan commented 1 year ago

For CreateIndexOfAnyAsciiBitmap (or hower it will be named) we stated it should throw for non-ASCII. Which exception-type?

ArgumentOutOfRange?

What happens for the {Last}IndexOfAny{Except} method if for span non-ASCII input is given? Throw or return some index, and for the latter case what should be "some index" that has consistent behavior between OfAny and OfAnyExcept?

It should behave the same way IndexOfAny("lots of ascii chars") behaves right now.

Consider the reference implementations

static int IndexOfAny(ReadOnlySpan<char> span, ReadOnlySpan<char> values)
{
    for (int i = 0; i < span.Length; i++)
        if (values.Contains(span[i])) return i;

    return -1;
}
static int LastIndexOfAny(ReadOnlySpan<char> span, ReadOnlySpan<char> values)
{
    for (int i = span.Length - 1; i >= 0; i--)
        if (values.Contains(span[i])) return i;

    return -1;
}
static int IndexOfAnyExcept(ReadOnlySpan<char> span, ReadOnlySpan<char> values)
{
    for (int i = 0; i < span.Length; i++)
        if (!values.Contains(span[i])) return i;

    return -1;
}
static int LastIndexOfAnyExcept(ReadOnlySpan<char> span, ReadOnlySpan<char> values)
{
    for (int i = span.Length - 1; i >= 0; i--)
        if (!values.Contains(span[i])) return i;

    return -1;
}

gfoidl commented 1 year ago

What's the concern?

Now deceased. Before I forgot that there's a reference implementation and this should / must have the same behavior. Thanks for pointing out!

MihaZupan commented 1 year ago

Vector128 vs Int128 bitmap vs bitmask vs vector vs bitData naming

Should these questions be handled via API review, or do we have more ideas here?

gfoidl commented 1 year ago

Vector128<byte>
no idea, whatever seems the best name I'm happy with

stephentoub commented 1 year ago

I retract my objections over IndexOfAny(Span span, Int128 someConstant). Thank you for keeping up with my spam :)

I was actually coming around to the data object approach you were both pushing for :)

Essentially we have this:

public static partial class MemoryExtensions
{
    public static TValues CreateIndexOfAnyAsciiValues(ReadOnlySpan<char> asciiValues);

    public static int IndexOfAny(this ReadOnlySpan<byte> span, TValues asciiValues);
    // ... matrix of ReadOnlySpan vs Span, IndexOfAny vs LastIndexOfAny, and Any vs AnyExcept
}

and we're debating the type of TValues.

I do still really like the simplicity of taking a bitmap, either an Int128 or a Vector128<byte>, that can be precomputed and stored in the binary. However, there are a growing number of benefits to having some opaque data type instead:

We can compute any number of bitmaps to accelerate different portions of the operation (scalar vs vectorized, potentially different maps if Vector256<byte> could be supported, etc)
We can adapt the algorithm over time. Exposing the actual bitmap means we're either locked to an algorithm optimized to that bitmap representation or every invocation needs to spend at least some amount of cycles converting it to something else. And it seems likely that better algorithms will emerge, because folks think of new ways to do it and/or because new intrinsics become available that enable better approaches.
We can special-case certain subsets of inputs if we come up with better ways to handle them, e.g. if all the inputs are in two consecutive ranges, if all the inputs are just the ASCII letters and we could do an xor and then a single-range comparison, etc.

On top of that, in theory it should be possible for Native AOT to still precompute everything and bake it into the binary; it already has some such capabilities, and if it doesn't already extend to whatever type we might employ here, it could extended to in the future.

Basically, in exchange for the simplicity of Int128/Vector128<byte> and slightly better startup today, we're giving up some amount of throughput on every use, potentially more throughput in the future if better algorithms are devised, and we're effectively hardcoding the algorithm being used into the public API.

Given all this, if we can demonstrate that: a) It is feasible for RegexCompiler to use this without meaningful sacrifice, and b) There's negligible overhead for first use at run time that needs to compute the object, then I think we should go with the approach you were both espousing:

public static partial class MemoryExtensions
{
    public static IndexOfAnyAsciiValues CreateIndexOfAnyAsciiValues(ReadOnlySpan<char> asciiValues);

    public static int IndexOfAny(this ReadOnlySpan<byte> span, IndexOfAnyAsciiValues asciiValues);
    // ... matrix of ReadOnlySpan vs Span, IndexOfAny vs LastIndexOfAny, and Any vs AnyExcept

    public struct IndexOfAnyAsciiValues { } // no public members
}

We can debate whether the type is a struct (which could wrap an object) or whether it's a sealed or abstract class, whether that type is nested, and naming, but those are all also things that can be discussed in API review.

gfoidl commented 1 year ago

I need a bit more though, but as brain-storming idea based on @stephentoub last comment:

public struct IndexOfAnyAsciiValues
{
    internal ushort AlgorithmVersion { get; set; }
    internal Vector128<byte> Bitmap128 { get; set; }
}

then the (Edit: I mixed up the both) ~RegexCompiler~ SourceGenerator could pre-computed the values itself, and the IndexOfAnyAsciiValues is just a wrapper-struct with Alg-Version set to 1.

In the future the implementation could decide based on the Alg-Version on how to proceed. Vec256 can be added to the struct, etc.

stephentoub commented 1 year ago

then the RegexCompiler could pre-computed the values itself

RegexCompiler needs to compute everything at run time. It's the source generator that could benefit from precomputing things at source gen / build time.

Trying to version the data in this fashion is clever but might also be limiting, e.g. Miha had demonstrated some possible throughput wins to computing additional variations on the bitmap data, built binaries wouldn't benefit from changes in the implementation on framework upgrade, etc.

MihaZupan commented 1 year ago

I was actually coming around to the data object approach you were both pushing for :)

Damn, I gave in too soon 😄

a) It is feasible for RegexCompiler to use this without meaningful sacrifice

I tried this out and it's quite straightforward. Also fewer LOC changes compared to the IndexOfAnyAsciiSearcher I was experimenting with before as you keep the same spanSlice.IndexOf(needleArgs) order.

b) There's negligible overhead for first use at run time that needs to compute the object

Based on numbers from #76740, the JIT overhead would likely dominate, the compute cost of the bitmaps themselves is on the order of a 10s of ns per instance.

MihaZupan commented 1 year ago

I guess another question is whether the Create helper should just be a ctor?

internal static readonly IndexOfAnyAsciiValues Ascii_FF037E0000007E000000 =
    MemoryExtensions.CreateIndexOfAnyAsciiValues("abcdefABCDEF0123456789");

internal static readonly IndexOfAnyAsciiValues Ascii_FF037E0000007E000000 =
    new("abcdefABCDEF0123456789");

stephentoub commented 1 year ago

I guess another question is whether the Create helper should just be a ctor?

If the type remains a struct, I don't have a strong opinion. If we decide we want it to be abstract so that internally we can have different implementations, then it obviously would need to remain a factory.

dotnet / runtime