Open ptr-null opened 2 years ago
Besides the codegen issue, you should write the method like this:
    public int Sum()
    {
        if (Interlocked.Read(ref _state) == 0)
        {
            return 0;
        }
        int sum = 0;
        int[] array = _array;
        for (int i = 0; i < array.Length; ++i)
            sum += array[i];
        return sum;
    }
It's faster then, and codegen will be as expected.
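For readers who want to compare the two shapes side by side, here is a minimal, self-contained sketch. The class name SumHolder and the method names SumCheckInside and SumEarlyReturn are hypothetical, and the "slow" variant is only an assumption about the general shape of the reported method (sum declared before the Interlocked check, array read through the field); the issue's exact snippet is not reproduced here.

```csharp
using System;
using System.Threading;

// Hypothetical container: all names here are illustrative, not taken from the issue.
public sealed class SumHolder
{
    private readonly int[] _array;
    private long _state;

    public SumHolder(int[] array, long state)
    {
        _array = array;
        _state = state;
    }

    // Assumed shape of the slow variant: sum is initialized before the
    // Interlocked check, so it stays live across the call.
    public int SumCheckInside()
    {
        int sum = 0;
        if (Interlocked.Read(ref _state) != 0)
        {
            for (int i = 0; i < _array.Length; ++i)
                sum += _array[i];
        }
        return sum;
    }

    // Suggested shape: return early, initialize sum only after the check,
    // and copy the array field into a local before the loop.
    public int SumEarlyReturn()
    {
        if (Interlocked.Read(ref _state) == 0)
        {
            return 0;
        }

        int sum = 0;
        int[] array = _array;
        for (int i = 0; i < array.Length; ++i)
            sum += array[i];
        return sum;
    }
}
```

Both methods return the same value for the same inputs; only the machine code generated for the loop is expected to differ.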
So the key points are: separate sum from the Interlocked check, as this causes the spilling --> return 0 early, and initialize sum after the check.

This issue has been marked needs-author-action and may be missing some important information.
@ptr-null please try what @gfoidl suggested and let us know if it removes the regression. cc @dotnet/jit-contrib.
This is an epic problem of LSRA including resolution moves in the middle of the loop, where the spill/reload happens. The code generated for gfoidl's suggestion is much better, but regardless, the LSRA problem should be handled in the runtime.
@gfoidl, the example I used here is a heavily reduced form of the real code (it still reproduces the issue). A workaround is not something to worry about (I have several for my real problem). Thanks for the array-trick hint, though. I've seen it before, but had happily forgotten it. Those range checks are especially annoying to see in my case, because I have the array as a readonly field of the instance (hello to https://github.com/dotnet/runtime/issues/11797). The other annoying thing I noticed is the this != null check being added (https://github.com/dotnet/runtime/issues/44087 once again).
cc @JulieLeeMSFT
This likely won't get time during .NET 8, but I will mark it as Pri3 for .NET 8.
This falls in the category of "resolution phase" of LSRA noted in https://github.com/dotnet/runtime/issues/47194.
Description
Consider the following method containing a tight loop:
The method is then modified by adding a short check before the loop, as follows:
It is reasonable to expect (given that the array is sufficiently long) that the added check should not affect the method's performance dramatically, right?
But it does:
The modified method becomes ~3x slower, which suggests that the loop itself runs slower.
Analysis
Indeed, it does. Inspecting the IL does not expose any difference (the loop bodies are identical in both cases). The code generated by RyuJIT for the loop is different, however.
In the first case, the sum is accumulated into the eax register:
Whereas in the second case it is accumulated into the r9d register, which is loaded from and saved to the stack on each iteration of the loop:
There is also one additional jump in the second case. See the complete generated code at the end.
Regression
As one may see from the benchmark results below, this is actually a regression:
Both methods were on par in .NET Framework 4.8 and .NET Core 2.2.
Generated code
.NET 6.0.6 (6.0.622.26707), X64 RyuJIT
category:cq theme:register-allocator skill-level:expert cost:large impact:large