Open stephentoub opened 9 months ago
Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.
Author: | stephentoub |
---|---|
Assignees: | - |
Labels: | `area-System.Runtime` |
Milestone: | 9.0.0 |
For reference, the implementation could look something like this: https://github.com/MihaZupan/runtime/compare/searchvalues-state...MihaZupan:runtime:searchvalues-avx512 (might need some cleanup after #91937).
With that implementation, I was seeing higher throughput for longer values as expected (about ~1.65x), but noticeably more overhead for early matches (regressed more regex cases than it improved).
Hi @stephentoub @MihaZupan This is @Ruihan-Yin, from the .net team at Intel, would like to follow up on this issue.
From previous conversation,
For reference, the implementation could look something like this: MihaZupan/runtime@searchvalues-state...MihaZupan:runtime:searchvalues-avx512 (might need some cleanup after #91937).
With that implementation, I was seeing higher throughput for longer values as expected (about ~1.65x), but noticeably more overhead for early matches (regressed more regex cases than it improved).
As there is already a prototype, I would like to know if there is any plan to submit a PR based on the existing works. Or, if no, will it be okay for us to take the prototype and continue the work to form a PR?
Hey @Ruihan-Yin, I've pushed an updated version of my change that compiles again to https://github.com/MihaZupan/runtime/compare/main...MihaZupan:runtime:searchvalues-ascii-avx512.
I avoided opening a PR as the regressions for early matches are quite noticeable. If you have any ideas on how to improve on this, please feel free to continue the work.
Hi @MihaZupan , thanks for the update! Just a few more questions to understand the data better:
MatchAtStart
/Early Match
referring to?The benchmark numbers above are for this:
public class IndexOfAnyAsciiBenchmark
{
private static readonly SearchValues<char> s_chars = SearchValues.Create("abcABC123");
private string _charBuffer;
private string _charBufferExcept;
[Params(8, 16, 32, 64, 128, 1000, 10_000)]
public int Length;
[Params(true, false)]
public bool MatchAtStart;
[GlobalSetup]
public void Setup()
{
var charBuffer = new char[Length];
var charBufferExcept = new char[Length];
charBuffer.AsSpan().Fill('\n');
charBufferExcept.AsSpan().Fill('b');
if (MatchAtStart)
{
charBuffer[0] = 'b';
charBufferExcept[0] = '\n';
}
_charBuffer = new string(charBuffer);
_charBufferExcept = new string(charBufferExcept);
}
[Benchmark]
public int IndexOfAny() => _charBuffer.AsSpan().IndexOfAny(s_chars);
[Benchmark]
public int IndexOfAnyExcept() => _charBufferExcept.AsSpan().IndexOfAnyExcept(s_chars);
}
Do we know any possible reason for the regression, like any red flag shown in profiler?
I haven't looked at this under a profiler, but I can get the codegen for you if that'd help. The set of operations performed in the early match case is essentially the same, just with a wider vector.
Thanks for sharing!
Example diffs for the Length = 64, MatchAtStart = true
case: https://www.diffchecker.com/fSyCBb1H/
A bit nicer codegen if I force the helpers to be inlined even in the cold block: https://www.diffchecker.com/tu3uIa4f/
Method | Toolchain | Length | MatchAtStart | Mean | Error | Ratio |
---|---|---|---|---|---|---|
IndexOfAnyExcept | main | 64 | True | 2.741 ns | 0.0008 ns | 1.00 |
IndexOfAnyExcept | pr | 64 | True | 4.412 ns | 0.0017 ns | 1.61 |
Part of the issue here is how the search logic is structured.
Essentially it looks like this in main
if (length < 8)
{
// Scalar loop
return NotFound;
}
if (length > 16)
{
if (length > 32)
{
// Process the input in chunks of 32 characters (Vector256)
}
// Process the last [1-32] characters by loading two overlapping vectors (Vector256)
return NotFound;
}
// Process the [8-16] characters by loading two overlapping vectors (Vector128)
return NotFound;
and like this in pr
with AVX512
if (length < 8)
{
// Scalar loop
return NotFound;
}
if (length > 16)
{
if (length > 32)
{
if (length > 64)
{
// Process the input in chunks of 64 characters (Vector512)
}
// Process the last [1-64] characters by loading two overlapping vectors (Vector512)
return NotFound;
}
// Process the last [16-32] characters by loading two overlapping vectors (Vector256)
return NotFound;
}
// Process the [8-16] characters by loading two overlapping vectors (Vector128)
return NotFound;
For the Length = 64
case, we go from finding a match in the regular search loop to finding a match in the overlapped fallback.
Loading the overlapped input & calculating the result from an overlapped match is slightly more expensive, which is enough to show up in this case where we're talking about ns
differences.
(sorry about the spam)
With a small tweak, the results are much closer (varies quite a bit between runs, but still).
Method | Toolchain | Length | Mean | Error | Ratio |
---|---|---|---|---|---|
IndexOfAny_Char | main | 64 | 3.641 ns | 0.0034 ns | 1.00 |
IndexOfAny_Char | pr | 64 | 4.381 ns | 0.0682 ns | 1.20 |
IndexOfAnyExcept_Char | main | 64 | 2.861 ns | 0.1486 ns | 1.00 |
IndexOfAnyExcept_Char | pr | 64 | 3.685 ns | 0.0031 ns | 1.29 |
IndexOfAny_Char | main | 128 | 4.160 ns | 0.3170 ns | 1.00 |
IndexOfAny_Char | pr | 128 | 4.053 ns | 0.0011 ns | 0.99 |
IndexOfAnyExcept_Char | main | 128 | 2.747 ns | 0.0021 ns | 1.00 |
IndexOfAnyExcept_Char | pr | 128 | 3.159 ns | 0.0097 ns | 1.15 |
I'll test how this impacts the Regex benchmarks I initially mentioned.
Reran the numbers with https://github.com/dotnet/runtime/compare/main...MihaZupan:runtime:searchvalues-ascii-avx512-4 Looks like a ~0.5 - 1 ns regression for early matches, and a speedup to ~1.5x in throughput for longer inputs. Regex is seeing more regressions than improvements.
@MihaZupan, we are reaching preview 7 snap soon. Do you plan to finsih this work for .NET 9?
Change is being reverted in #104688
The implementation that backs many IndexOfAny searches with
SearchValues
today only leverages Vector128/256: https://github.com/dotnet/runtime/blob/a5461cbc8ed20e0981fd4e846a180f35b07dcc0a/src/libraries/System.Private.CoreLib/src/System/SearchValues/IndexOfAnyAsciiSearcher.cs We should teach it about Vector512.(@MihaZupan has a prototype.)