tiering makes regex-redux significantly slower

danmoseley commented 1 year ago

I noticed in the public numbers that source generated regexes (6.cs) are significantly slower than ref-emit compiled regexes (5.cs). This is even though source generated mode doesn't have to emit and compile any code at runtime:

This doesn't show up in Benchmark.NET, only on the command line. With default settings, generated is 16% slower,

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll gen
  Time (mean ± σ):      1.473 s ±  0.050 s    [User: 1.159 s, System: 0.228 s]
  Range (min … max):    1.383 s …  1.521 s    20 runs

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll reg"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll reg
  Time (mean ± σ):      1.273 s ±  0.014 s    [User: 0.967 s, System: 0.237 s]
  Range (min … max):    1.262 s …  1.318 s    20 runs

disabling tiered compilation makes the difference go away, both are 25% faster, and generated is slightly faster than compiled, as expected:

C:\proj\rr>set DOTNET_TieredCompilation=0

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll gen
  Time (mean ± σ):      1.064 s ±  0.019 s    [User: 0.733 s, System: 0.237 s]
  Range (min … max):    1.045 s …  1.119 s    20 runs

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll reg"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll reg
  Time (mean ± σ):      1.079 s ±  0.007 s    [User: 0.758 s, System: 0.236 s]
  Range (min … max):    1.068 s …  1.098 s    20 runs

Standalone repro, clone https://github.com/danmoseley/repro1.git and run repro.bat. hyperfine is there as a convenient way to benchmark an exe, feel free to use something else.

Is there any way to improve this, or is this just a limitation that shows up with a short lived app like this?

== what follows is not relevant to this issue but just for comparison == BTW, for what it's worth here's native AOT numbers. I guess the regular configuration here is actually forced to the interpreter.

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe gen"
Benchmark 1: C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe gen
  Time (mean ± σ):      1.094 s ±  0.037 s    [User: 0.533 s, System: 0.335 s]
  Range (min … max):    1.030 s …  1.161 s    20 runs

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe reg"
Benchmark 1: C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe reg
  Time (mean ± σ):      1.393 s ±  0.021 s    [User: 0.888 s, System: 0.204 s]
  Range (min … max):    1.363 s …  1.435 s    20 runs

and interpreter and nonbacktracking using nativeAOT. I don't know why the interpreter is slower than "compiled" if the latter is using the interpreter as well.

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe non"
Benchmark 1: C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe non
  Time (mean ± σ):      3.514 s ±  0.026 s    [User: 3.210 s, System: 0.207 s]
  Range (min … max):    3.476 s …  3.576 s    20 runs

C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe nbt"
Benchmark 1: C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe nbt
  Time (mean ± σ):      1.661 s ±  0.059 s    [User: 1.353 s, System: 0.214 s]
  Range (min … max):    1.606 s …  1.789 s    20 runs

ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Issue Details

I noticed in the [public numbers](https://programming-language-benchmarks.vercel.app/csharp) that source generated regexes (6.cs) are significantly slower than ref-emit compiled regexes (5.cs). This doesn't show up in Benchmark.NET, only on the command line. with default settings, generated is 16% slower, even though it doesn't have to emit and compile any code at runtime: ``` C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen" Benchmark 1: dotnet exec bin\release\net8.0\rr.dll gen Time (mean ± σ): 1.473 s ± 0.050 s [User: 1.159 s, System: 0.228 s] Range (min … max): 1.383 s … 1.521 s 20 runs C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll reg" Benchmark 1: dotnet exec bin\release\net8.0\rr.dll reg Time (mean ± σ): 1.273 s ± 0.014 s [User: 0.967 s, System: 0.237 s] Range (min … max): 1.262 s … 1.318 s 20 runs ``` disabling tiered compilation makes the difference go away, both are 25% faster, and generated is slightly faster than compiled, as expected: ``` C:\proj\rr>set DOTNET_TieredCompilation=0 C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen" Benchmark 1: dotnet exec bin\release\net8.0\rr.dll gen Time (mean ± σ): 1.064 s ± 0.019 s [User: 0.733 s, System: 0.237 s] Range (min … max): 1.045 s … 1.119 s 20 runs C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll reg" Benchmark 1: dotnet exec bin\release\net8.0\rr.dll reg Time (mean ± σ): 1.079 s ± 0.007 s [User: 0.758 s, System: 0.236 s] Range (min … max): 1.068 s … 1.098 s 20 runs ``` Standalone repro, clone https://github.com/danmoseley/repro1.git and run repro.bat. [hyperfine](https://github.com/sharkdp/hyperfine) is there as a convenient way to benchmark an exe, feel free to use something else. Is there any way to improve this, or is this just a limitation that shows up with a short lived app like this? BTW, for what it's worth here's native AOT numbers. I guess the regular configuration here is actually forced to the interpreter. ``` C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe gen" Benchmark 1: C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe gen Time (mean ± σ): 1.094 s ± 0.037 s [User: 0.533 s, System: 0.335 s] Range (min … max): 1.030 s … 1.161 s 20 runs C:\proj\rr>\t\hyperfine\hyperfine.exe --warmup 1 --min-runs 20 "C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe reg" Benchmark 1: C:\proj\rr\bin\release\net8.0\win-x64\publish\rr.exe reg Time (mean ± σ): 1.393 s ± 0.021 s [User: 0.888 s, System: 0.204 s] Range (min … max): 1.363 s … 1.435 s 20 runs ```

Author:	danmoseley
Assignees:	-
Labels:	`area-CodeGen-coreclr`, `untriaged`, `needs-area-label`
Milestone:	-

danmoseley commented 1 year ago

@stephentoub any idea why RegexOptions.Compiled under native AOT is a lot faster than RegexOptions.None (but slower than non AOT)

stephentoub commented 1 year ago

any idea why RegexOptions.Compiled under native AOT is a lot faster than RegexOptions.None

In a variety of places we assume "Compiled" isn't literally "the only thing that's different is emitting MSIL" but rather "you're asking us to take more time to optimize throughput", and as such there are optimizations performed when Compiled is set that aren't related to emitting MSIL, like spending more time analyzing sets to determine the most optimal thing to search for as part of finding a starting position. I'd bet if you were to debug through you'd find the RegexFindOptimizations is different when you set Compiled vs None.

AndyAyersMS commented 1 year ago

Tiering does not do well on a (hopefully smallish) category of apps that are jit intensive, compute intensive, and short running. Crossgen2 is like this, for instance.

Sometimes setting DOTNET_TC_CallCountingDelayMs=0 can help.

cc @EgorBo

danmoseley commented 1 year ago

In the benchmark game, we don't really have an opportunity to set environment variables (such as DOTNET_TieredCompilation=0). Is there anything that can be done in the code to discourage tiering?

AndyAyersMS commented 1 year ago

I think we are better off trying to fix tiering to handle these cases better. If we disable / discourage tiering then we end up losing perf if the app is not short lived. And I would guess it may be harder to determine if an app is going to be long running than to determine if the app would benefit from more aggressive tiering up.

There are various ideas floating about:

bypass tiering for subset of methods that won't benefit from tiering (eg most property get/set)
priority queue (or similar) for rejitting so we tier up the most important methods sooner
avoid some unnecessary rejittings (.NET 8 only) https://github.com/dotnet/runtime/issues/84517
(possibly) adaptive call counting delay -- initially set it lower and back off if app seems to still be in "startup mode"
adjustments to OSR thresholds

stephentoub commented 1 year ago

I seem to remember at one point that recompilation for tiering was done with work items queued to the thread pool. Was that / is that still the case? If that is, that's going to be even more problematic for this particular test case, which might end up saturating the pool with work items, such that the tiering work items won't get a chance to execute promptly.

JulieLeeMSFT commented 1 year ago

CC @mangod9 PTAL.

danmoseley commented 1 year ago

is the penalty here that

the counting / instrumentation cost is not getting amortized by later improved codegen
similar but the cost of repeating code gen, ie., tiering is not the problem per se, if the jitting didn't happen til later it would be better because it isn't getting amortized
the ref emit code while never tiered is better optimized than the tier-0 regular code, and the code doesn't get rejitted early enough to compete or possibly
tiering is making bad choices about what to rejit, thus deoptimizing code
rejitted code itself is a deoptimization in some cases

anyway, hopefully this standalone repro with relatively few methods involved is a convenient vehicle to investigate.

mangod9 commented 1 year ago

Is this a regression? @kouvel, dont believe Tiering workitems are queued to the regular ThreadPool, are they?

stephentoub commented 1 year ago

dont believe Tiering workitems are queued to the regular ThreadPool, are they?

If they're not then presumably this comment should be revised: https://github.com/dotnet/runtime/blob/8b25fd382260e8eafbfd77f64b0fe28dc7301c2e/src/coreclr/vm/tieredcompilation.cpp#L32-L34

EgorBo commented 1 year ago

@stephentoub will it make sense to sort of use non-compiled mode for Regex during first N seconds of an app start so we can make sure all the more important things are well handled? Afair we already have a similiar logic on C# level for Expression trees? where we have a sort of C#-level call counting.

The problem that our promotion mechanism looks like this: a timer with 100ms interval is fired and checks - were any new Tier0 compilations requested since the last 100ms interval? if there were any - delay the promotion (actually, call counting stub installation) till the next 100ms tick. So a single tier0 compilation may delay Tier1 promotion for e.g. 100000 methods (seen something like that in Bing) - it is something we want to improve but just don't know how exactly, e.g. the priority-based queue like Andy mentioned.

stephentoub commented 1 year ago

will it make sense to sort of use non-compiled mode for Regex during first N seconds of an app start so we can make sure all the more important things are well handled? Afair we already have a similiar logic on C# level for Expression trees? where we have a sort of C#-level call counting.

RegexOptions.Compiled works fine and doesn't go through tiering because it uses dynamic methods which don't tier.

The problem Dan is highlighting is with source generated regexes, at which point you're suggesting not using code in the developer's app, which would be speculating that the code there was in fact written by the source generator and is in fact identical in semantics to the interpreter, neither of which we can guarantee (nor would I feel comfortable doing such a substitution even if we could).

EgorBo commented 1 year ago

Presumably, the source-generator based code should be prejittable, right? In that case I assume it's not an issue - if user doesn't use R2R we probably can guess they don't care about start up that much?

Although, the short living benchmarks without R2R can suffer (maybe even with it actually) - the conservative promotion algorithm applies here as well.

AndyAyersMS commented 1 year ago

This is one of the benchmarks from Benchmarks Games so we can't change how it is run.

@EgorBo can you dig in when you have a chance and see if the various hypotheses above are correct? Until then we shouldn't speculate too much on what fixes we might need.

danmoseley commented 1 year ago

for what it's worth I was looking at the fork of benchmark games that is a bit more active and takes regular github PR's. that doesn't change the problem though. https://github.com/hanabi1224/Programming-Language-Benchmarks/pull/396

EgorBo commented 1 year ago

@danmoseley you mentioned in the description "repro.bat" - was it supposed to be in the repro repository?

danmoseley commented 1 year ago

Oops, added. But all it was doing was running the apps.

EgorBo commented 1 year ago

@AndyAyersMS

Default: 1.258 ms Default, PGO=0: 1.276 ms Default, TC=0: 870.2 ms Default, TC_CallCountingDelayMs=0: 843.3 ms Default, TC_CallCountingDelayMs=0, PGO=0: 870.1 ms

(also checked OSR but it being disabled didn't improve anything)

So yes, it's the tier1 promotion problem, exactly the same as https://github.com/dotnet/runtime/issues/83112

kouvel commented 1 year ago

dont believe Tiering workitems are queued to the regular ThreadPool, are they?

Tiering uses a separate background thread.

There is another config var that was intended for those benchmark-like cases that wouldn't play well with tiering, TC_AggressiveTiering=1. That setting includes TC_CallCountingDelayMs=0 and also adjusts call count thresholds to be minimal to tier up more quickly. If it would be useful, it could be made configurable through .runtimeconfig.json and the project.

danmoseley commented 1 year ago

if we'll recommend and support that, exposing it in runtime.config would make that clear. it would also possibly be acceptable to the benchmarks owners. finally, it would avoid flowing to any spawned child apps.

EgorBo commented 1 year ago

There are basically 3 scenarios: 1) We need to tier up as quick as possible - short living, compute-heavy apps like crossgen, benchmarks, might be also relevant to some web services too, e.g. some Azure Function with a compute-heavy task. 2) We need to start as quick as possible so background Tier1 jit events might negatively impact on it - most UI apps, WPF, possibly some microservices as well where we need to serve the 1st request as soon as possible. 3) We should delay the promotion even further compared to 2) so we can collect a better profile and optimize our long-living app/service better (where we don't care about the startup really) - it's especially important because we don't support context-sensitive profiling so if we profile & promote, say, Enumerable.Sum too early we might bake a profile that won't be valid for other future uses

The question is whether we can detect one of these scenarious without user's help. Possible heuristics:

is it a console app/wpf/aspnet
is gc server mode enabled
etc

Also, maybe we can leave a sort of a staticpgo-like hint for future sessions that the previous one was short/long living compute-intense, etc?

danmoseley commented 1 year ago

DOTNET_TC_AggressiveTiering=1 indeed gives a big improvement.

C:\proj\rr>hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll gen
  Time (mean ± σ):      1.374 s ±  0.033 s    [User: 1.055 s, System: 0.228 s]
  Range (min … max):    1.350 s …  1.471 s    20 runs

C:\proj\rr>hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll reg"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll reg
  Time (mean ± σ):      1.246 s ±  0.008 s    [User: 0.933 s, System: 0.230 s]
  Range (min … max):    1.236 s …  1.268 s    20 runs

C:\proj\rr>set DOTNET_TC_AggressiveTiering=1

C:\proj\rr>hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll gen"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll gen
  Time (mean ± σ):     944.1 ms ±  15.4 ms    [User: 680.9 ms, System: 225.9 ms]
  Range (min … max):   917.3 ms … 970.7 ms    20 runs

C:\proj\rr>hyperfine.exe --warmup 1 --min-runs 20 "dotnet exec bin\release\net8.0\rr.dll reg"
Benchmark 1: dotnet exec bin\release\net8.0\rr.dll reg
  Time (mean ± σ):      1.113 s ±  0.007 s    [User: 1.093 s, System: 0.214 s]
  Range (min … max):    1.103 s …  1.133 s    20 runs

EgorBo commented 1 year ago

Unfortunately, there is nothing we can do here as part of .NET 8.0 timeframe

danmoseley commented 1 year ago

Just to get clarity here, is the problem that the tier0 code is poor, or also that the tier1 compilation work is consuming resources?

Are there any "easy wins" where the tier0 code is particularly poor in this scenario and could reasonably be improved?

EgorBo commented 1 year ago

Just to get clarity here, is the problem that the tier0 code is poor, or also that the tier1 compilation work is consuming resources?

Are there any "easy wins" where the tier0 code is particularly poor in this scenario and could reasonably be improved?

Oops, didn't notice this question. We try to improve Tier0's perf if it doesn't hurt JIT's throughput, but overall, I think, it's reasonable to expect 2-10x slower perf from code execution in Tier0 compared to Tier1. For this particular case ("short-living, compute-heavy workload") we have only two options: 1) Expect some hint from the developer that an app needs no tiering or needs to reach top tier as fast as it can. E.g. via some runtimeconfig/msbuild property 2) Try to guess that somehow - this one is tricky because we have other types of scenarios.

dotnet / runtime

tiering makes regex-redux significantly slower #87753