Light up IndexOfAnyAsciiSearcher for AVX512

stephentoub commented 9 months ago

The implementation that backs many IndexOfAny searches with SearchValues today only leverages Vector128/256: https://github.com/dotnet/runtime/blob/a5461cbc8ed20e0981fd4e846a180f35b07dcc0a/src/libraries/System.Private.CoreLib/src/System/SearchValues/IndexOfAnyAsciiSearcher.cs We should teach it about Vector512.

(@MihaZupan has a prototype.)

ghost commented 9 months ago

Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.

Issue Details

The implementation that backs many IndexOfAny searches with `SearchValues` today only leverages Vector128/256: https://github.com/dotnet/runtime/blob/a5461cbc8ed20e0981fd4e846a180f35b07dcc0a/src/libraries/System.Private.CoreLib/src/System/SearchValues/IndexOfAnyAsciiSearcher.cs We should teach it about Vector512.

Author:	stephentoub
Assignees:	-
Labels:	`area-System.Runtime`
Milestone:	9.0.0

MihaZupan commented 9 months ago

For reference, the implementation could look something like this: https://github.com/MihaZupan/runtime/compare/searchvalues-state...MihaZupan:runtime:searchvalues-avx512 (might need some cleanup after #91937).

With that implementation, I was seeing higher throughput for longer values as expected (about ~1.65x), but noticeably more overhead for early matches (regressed more regex cases than it improved).

Ruihan-Yin commented 6 months ago

Hi @stephentoub @MihaZupan This is @Ruihan-Yin, from the .net team at Intel, would like to follow up on this issue.

From previous conversation,

For reference, the implementation could look something like this: MihaZupan/runtime@searchvalues-state...MihaZupan:runtime:searchvalues-avx512 (might need some cleanup after #91937).

With that implementation, I was seeing higher throughput for longer values as expected (about ~1.65x), but noticeably more overhead for early matches (regressed more regex cases than it improved).

As there is already a prototype, I would like to know if there is any plan to submit a PR based on the existing works. Or, if no, will it be okay for us to take the prototype and continue the work to form a PR?

MihaZupan commented 6 months ago

Hey @Ruihan-Yin, I've pushed an updated version of my change that compiles again to https://github.com/MihaZupan/runtime/compare/main...MihaZupan:runtime:searchvalues-ascii-avx512.

Benchmark results

``` BenchmarkDotNet v0.13.11, Ubuntu 22.04.3 LTS (Jammy Jellyfish) Intel Xeon Platinum 8370C CPU 2.80GHz, 1 CPU, 8 logical and 4 physical cores .NET SDK 8.0.100-preview.5.23303.2 [Host] : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI Job-SVIKBC : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI Job-FCGAUB : .NET 9.0.0 (42.42.42.42424), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI IterationCount=100 LaunchCount=3 WarmupCount=15 ``` | Method | Toolchain | Length | MatchAtStart | Mean | Error | Ratio | |----------------- |---------- |------- |------------- |----------:|----------:|------:| | IndexOfAny | main | 33 | False | 4.351 ns | 0.0173 ns | 1.00 | | IndexOfAny | pr | 33 | False | 3.695 ns | 0.0002 ns | 0.85 | | | | | | | | | | IndexOfAnyExcept | main | 33 | False | 4.033 ns | 0.0303 ns | 1.00 | | IndexOfAnyExcept | pr | 33 | False | 3.928 ns | 0.0113 ns | 0.98 | | | | | | | | | | IndexOfAny | main | 33 | True | 4.313 ns | 0.2266 ns | 1.00 | | IndexOfAny | pr | 33 | True | 6.948 ns | 0.8691 ns | 1.77 | | | | | | | | | | IndexOfAnyExcept | main | 33 | True | 2.774 ns | 0.0114 ns | 1.00 | | IndexOfAnyExcept | pr | 33 | True | 4.419 ns | 0.0026 ns | 1.59 | | | | | | | | | | IndexOfAny | main | 64 | False | 4.326 ns | 0.0072 ns | 1.00 | | IndexOfAny | pr | 64 | False | 4.385 ns | 0.2552 ns | 1.01 | | | | | | | | | | IndexOfAnyExcept | main | 64 | False | 3.830 ns | 0.0040 ns | 1.00 | | IndexOfAnyExcept | pr | 64 | False | 3.894 ns | 0.0023 ns | 1.02 | | | | | | | | | | IndexOfAny | main | 64 | True | 3.611 ns | 0.0019 ns | 1.00 | | IndexOfAny | pr | 64 | True | 4.370 ns | 0.0140 ns | 1.21 | | | | | | | | | | IndexOfAnyExcept | main | 64 | True | 2.741 ns | 0.0008 ns | 1.00 | | IndexOfAnyExcept | pr | 64 | True | 4.412 ns | 0.0017 ns | 1.61 | | | | | | | | | | IndexOfAny | main | 128 | False | 7.623 ns | 0.2039 ns | 1.00 | | IndexOfAny | pr | 128 | False | 7.769 ns | 0.7353 ns | 0.99 | | | | | | | | | | IndexOfAnyExcept | main | 128 | False | 6.232 ns | 0.0022 ns | 1.00 | | IndexOfAnyExcept | pr | 128 | False | 5.690 ns | 0.1762 ns | 0.90 | | | | | | | | | | IndexOfAny | main | 128 | True | 3.729 ns | 0.0382 ns | 1.00 | | IndexOfAny | pr | 128 | True | 6.264 ns | 0.8090 ns | 1.63 | | | | | | | | | | IndexOfAnyExcept | main | 128 | True | 2.736 ns | 0.0009 ns | 1.00 | | IndexOfAnyExcept | pr | 128 | True | 4.086 ns | 0.1197 ns | 1.48 | | | | | | | | | | IndexOfAny | main | 1000 | False | 37.310 ns | 0.0296 ns | 1.00 | | IndexOfAny | pr | 1000 | False | 23.893 ns | 0.0971 ns | 0.64 | | | | | | | | | | IndexOfAnyExcept | main | 1000 | False | 40.196 ns | 0.0707 ns | 1.00 | | IndexOfAnyExcept | pr | 1000 | False | 27.997 ns | 0.0461 ns | 0.70 | | | | | | | | | | IndexOfAny | main | 1000 | True | 3.871 ns | 0.0502 ns | 1.00 | | IndexOfAny | pr | 1000 | True | 3.980 ns | 0.0115 ns | 1.03 | | | | | | | | | | IndexOfAnyExcept | main | 1000 | True | 2.740 ns | 0.0011 ns | 1.00 | | IndexOfAnyExcept | pr | 1000 | True | 4.197 ns | 0.1730 ns | 1.47 |

I avoided opening a PR as the regressions for early matches are quite noticeable. If you have any ideas on how to improve on this, please feel free to continue the work.

Ruihan-Yin commented 6 months ago

Hi @MihaZupan , thanks for the update! Just a few more questions to understand the data better:

In the presented numbers, what is the MatchAtStart/Early Match referring to?
Do we know any possible reason for the regression, like any red flag shown in profiler?

MihaZupan commented 6 months ago

The benchmark numbers above are for this:

public class IndexOfAnyAsciiBenchmark
{
    private static readonly SearchValues<char> s_chars = SearchValues.Create("abcABC123");
    private string _charBuffer;
    private string _charBufferExcept;

    [Params(8, 16, 32, 64, 128, 1000, 10_000)]
    public int Length;

    [Params(true, false)]
    public bool MatchAtStart;

    [GlobalSetup]
    public void Setup()
    {
        var charBuffer = new char[Length];
        var charBufferExcept = new char[Length];
        charBuffer.AsSpan().Fill('\n');
        charBufferExcept.AsSpan().Fill('b');

        if (MatchAtStart)
        {
            charBuffer[0] = 'b';
            charBufferExcept[0] = '\n';
        }

        _charBuffer = new string(charBuffer);
        _charBufferExcept = new string(charBufferExcept);
    }

    [Benchmark]
    public int IndexOfAny() => _charBuffer.AsSpan().IndexOfAny(s_chars);

    [Benchmark]
    public int IndexOfAnyExcept() => _charBufferExcept.AsSpan().IndexOfAnyExcept(s_chars);
}

Do we know any possible reason for the regression, like any red flag shown in profiler?

I haven't looked at this under a profiler, but I can get the codegen for you if that'd help. The set of operations performed in the early match case is essentially the same, just with a wider vector.

Ruihan-Yin commented 6 months ago

Thanks for sharing!

MihaZupan commented 6 months ago

Example diffs for the Length = 64, MatchAtStart = true case: https://www.diffchecker.com/fSyCBb1H/ A bit nicer codegen if I force the helpers to be inlined even in the cold block: https://www.diffchecker.com/tu3uIa4f/

Method	Toolchain	Length	MatchAtStart	Mean	Error	Ratio
IndexOfAnyExcept	main	64	True	2.741 ns	0.0008 ns	1.00
IndexOfAnyExcept	pr	64	True	4.412 ns	0.0017 ns	1.61

Part of the issue here is how the search logic is structured.

Essentially it looks like this in main

if (length < 8)
{
    // Scalar loop
    return NotFound;
}

if (length > 16)
{
    if (length > 32)
    {
        // Process the input in chunks of 32 characters (Vector256)
    }

    // Process the last [1-32] characters by loading two overlapping vectors (Vector256)
    return NotFound;
}

// Process the [8-16] characters by loading two overlapping vectors (Vector128)
return NotFound;

and like this in pr with AVX512

if (length < 8)
{
    // Scalar loop
    return NotFound;
}

if (length > 16)
{
    if (length > 32)
    {
        if (length > 64)
        {
            // Process the input in chunks of 64 characters (Vector512)
        }

        // Process the last [1-64] characters by loading two overlapping vectors (Vector512)
        return NotFound;
    }

    // Process the last [16-32] characters by loading two overlapping vectors (Vector256)
    return NotFound;
}

// Process the [8-16] characters by loading two overlapping vectors (Vector128)
return NotFound;

For the Length = 64 case, we go from finding a match in the regular search loop to finding a match in the overlapped fallback. Loading the overlapped input & calculating the result from an overlapped match is slightly more expensive, which is enough to show up in this case where we're talking about ns differences.

MihaZupan commented 6 months ago

(sorry about the spam)

With a small tweak, the results are much closer (varies quite a bit between runs, but still).

Method	Toolchain	Length	Mean	Error	Ratio
IndexOfAny_Char	main	64	3.641 ns	0.0034 ns	1.00
IndexOfAny_Char	pr	64	4.381 ns	0.0682 ns	1.20

IndexOfAnyExcept_Char	main	64	2.861 ns	0.1486 ns	1.00
IndexOfAnyExcept_Char	pr	64	3.685 ns	0.0031 ns	1.29

IndexOfAny_Char	main	128	4.160 ns	0.3170 ns	1.00
IndexOfAny_Char	pr	128	4.053 ns	0.0011 ns	0.99

IndexOfAnyExcept_Char	main	128	2.747 ns	0.0021 ns	1.00
IndexOfAnyExcept_Char	pr	128	3.159 ns	0.0097 ns	1.15

I'll test how this impacts the Regex benchmarks I initially mentioned.

MihaZupan commented 3 months ago

Reran the numbers with https://github.com/dotnet/runtime/compare/main...MihaZupan:runtime:searchvalues-ascii-avx512-4 Looks like a ~0.5 - 1 ns regression for early matches, and a speedup to ~1.5x in throughput for longer inputs. Regex is seeing more regressions than improvements.

Basic IndexOfAny benchmarks

| Method | Toolchain | Length | MatchAtStart | Mean | Error | Ratio | |-------------------------- |---------- |------- |------------- |-----------:|----------:|------:| | IndexOfAny_Char | main | 33 | True | 3.469 ns | 0.0609 ns | 1.00 | | IndexOfAny_Char | pr | 33 | True | 4.335 ns | 0.0757 ns | 1.26 | | | | | | | | | | IndexOfAnyExcept_Char | main | 33 | True | 2.475 ns | 0.0010 ns | 1.00 | | IndexOfAnyExcept_Char | pr | 33 | True | 3.715 ns | 0.0022 ns | 1.50 | | | | | | | | | | | | | | | | | | | | | | | | | | IndexOfAny_Char | main | 65 | True | 3.401 ns | 0.0504 ns | 1.00 | | IndexOfAny_Char | pr | 65 | True | 4.096 ns | 0.1228 ns | 1.20 | | | | | | | | | | IndexOfAnyExcept_Char | main | 65 | True | 2.475 ns | 0.0016 ns | 1.00 | | IndexOfAnyExcept_Char | pr | 65 | True | 2.872 ns | 0.0020 ns | 1.16 | | | | | | | | | | | | | | | | | | IndexOfAny_Char | main | 10000 | False | 328.756 ns | 0.1378 ns | 1.00 | | IndexOfAny_Char | pr | 10000 | False | 210.471 ns | 0.1152 ns | 0.64 | | | | | | | | | | IndexOfAnyExcept_Char | main | 10000 | False | 374.419 ns | 0.3652 ns | 1.00 | | IndexOfAnyExcept_Char | pr | 10000 | False | 260.572 ns | 0.0642 ns | 0.70 | | | | | | | | | | LastIndexOfAny_Char | main | 10000 | False | 327.438 ns | 0.1851 ns | 1.00 | | LastIndexOfAny_Char | pr | 10000 | False | 210.978 ns | 0.0945 ns | 0.64 | | | | | | | | | | LastIndexOfAnyExcept_Char | main | 10000 | False | 384.517 ns | 0.1233 ns | 1.00 | | LastIndexOfAnyExcept_Char | pr | 10000 | False | 261.597 ns | 0.0735 ns | 0.68 |

Sherlock Regex benchmarks

https://github.com/dotnet/performance/blob/main/src/benchmarks/micro/libraries/System.Text.RegularExpressions/Perf.Regex.Industry.cs#L129 | Toolchain | Pattern | Mean | Error | Ratio | |---------- |------------------------------------- |-----------------:|---------------:|------:| | main | (?i)Sher[a-z]+\|Hol[a-z]+ | 103,522.91 ns | 1,600.349 ns | 1.00 | | pr | (?i)Sher[a-z]+\|Hol[a-z]+ | 104,615.26 ns | 2,585.538 ns | 1.01 | | | | | | | | main | (?i)Sherlock\|Holmes\|Watson | 94,673.68 ns | 4,766.057 ns | 1.01 | | pr | (?i)Sherlock\|Holmes\|Watson | 95,724.15 ns | 4,285.540 ns | 1.02 | | | | | | | | main | (?i)Sherlock\|(...)er\|John\|Baker [49] | 289,495.97 ns | 4,839.174 ns | 1.00 | | pr | (?i)Sherlock\|(...)er\|John\|Baker [49] | 284,341.73 ns | 6,128.478 ns | 0.98 | | | | | | | | main | (?i)the | 378,116.92 ns | 1,339.989 ns | 1.00 | | pr | (?i)the | 379,105.12 ns | 620.316 ns | 1.00 | | | | | | | | main | (?m)^Sherlock(...)rlock Holmes$ [37] | 82,288.25 ns | 3,928.339 ns | 1.01 | | pr | (?m)^Sherlock(...)rlock Holmes$ [37] | 82,778.06 ns | 3,832.797 ns | 1.01 | | | | | | | | main | (?s).* | 53.28 ns | 0.135 ns | 1.00 | | pr | (?s).* | 52.75 ns | 0.128 ns | 0.99 | | | | | | | | main | [^\\n]* | 735,174.34 ns | 22,699.109 ns | 1.00 | | pr | [^\\n]* | 739,572.06 ns | 28,992.773 ns | 1.01 | | | | | | | | main | [a-q][^u-z]{13}x | 33,777.02 ns | 801.044 ns | 1.00 | | pr | [a-q][^u-z]{13}x | 33,651.23 ns | 744.776 ns | 1.00 | | | | | | | | main | [a-zA-Z]+ing | 4,645,316.04 ns | 16,973.212 ns | 1.00 | | pr | [a-zA-Z]+ing | 4,841,529.48 ns | 8,956.256 ns | 1.04 | | | | | | | | main | \\b\\w+n\\b | 9,815,880.28 ns | 33,565.510 ns | 1.00 | | pr | \\b\\w+n\\b | 11,018,320.07 ns | 286,473.844 ns | 1.12 | | | | | | | | main | \\p{L} | 10,517,664.75 ns | 8,946.311 ns | 1.00 | | pr | \\p{L} | 11,916,466.49 ns | 409,511.232 ns | 1.13 | | | | | | | | main | \\p{Ll} | 10,673,143.82 ns | 331,458.549 ns | 1.00 | | pr | \\p{Ll} | 10,787,613.37 ns | 13,621.084 ns | 1.01 | | | | | | | | main | \\p{Lu} | 554,702.13 ns | 13,089.124 ns | 1.00 | | pr | \\p{Lu} | 436,503.07 ns | 1,401.226 ns | 0.79 | | | | | | | | main | \\s[a-zA-Z]{0,12}ing\\s | 4,969,805.46 ns | 14,676.201 ns | 1.00 | | pr | \\s[a-zA-Z]{0,12}ing\\s | 5,237,948.60 ns | 41,554.425 ns | 1.05 | | | | | | | | main | \\w+ | 5,147,282.40 ns | 11,110.191 ns | 1.00 | | pr | \\w+ | 5,464,592.92 ns | 14,554.391 ns | 1.06 | | | | | | | | main | \\w+\\s+Holmes | 3,986,310.25 ns | 17,035.155 ns | 1.00 | | pr | \\w+\\s+Holmes | 4,256,361.76 ns | 22,193.135 ns | 1.07 | | | | | | | | main | \\w+\\s+Holmes\\s+\\w+ | 4,161,642.67 ns | 10,531.738 ns | 1.00 | | pr | \\w+\\s+Holmes\\s+\\w+ | 4,425,879.41 ns | 6,260.327 ns | 1.06 | | | | | | | | main | Holmes.{0,25}(...).{0,25}Holmes [39] | 68,435.58 ns | 445.729 ns | 1.00 | | pr | Holmes.{0,25}(...).{0,25}Holmes [39] | 69,696.18 ns | 437.339 ns | 1.02 | | | | | | | | main | Sher[a-z]+\|Hol[a-z]+ | 65,242.66 ns | 814.983 ns | 1.00 | | pr | Sher[a-z]+\|Hol[a-z]+ | 65,292.98 ns | 691.412 ns | 1.00 | | | | | | | | main | Sherlock\\s+Holmes | 82,699.64 ns | 3,829.205 ns | 1.01 | | pr | Sherlock\\s+Holmes | 86,351.23 ns | 352.296 ns | 1.05 | | | | | | | | main | Sherlock\|Holmes\|Watson | 74,601.68 ns | 510.511 ns | 1.00 | | pr | Sherlock\|Holmes\|Watson | 74,047.63 ns | 292.165 ns | 0.99 | | | | | | | | main | Sherlock\|Holm(...)er\|John\|Baker [45] | 202,258.67 ns | 1,713.216 ns | 1.00 | | pr | Sherlock\|Holm(...)er\|John\|Baker [45] | 212,293.86 ns | 670.855 ns | 1.05 | | | | | | | | main | the\\s+\\w+ | 458,843.06 ns | 5,086.162 ns | 1.00 | | pr | the\\s+\\w+ | 448,289.28 ns | 1,171.229 ns | 0.98 |

JulieLeeMSFT commented 1 month ago

@MihaZupan, we are reaching preview 7 snap soon. Do you plan to finsih this work for .NET 9?

MihaZupan commented 2 weeks ago

Change is being reverted in #104688

dotnet / runtime

Light up IndexOfAnyAsciiSearcher for AVX512 #93222