Closed runfoapp[bot] closed 2 years ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
cc @LakshanF
Deleted duplicates in runfo, didn't mean to close this
The NativeAOT rolling runs are passing with #72108
Failed again in https://github.com/dotnet/runtime/pull/71485
Blocking the test in https://github.com/dotnet/runtime/pull/72437.
The failure looks non-determinstic. We've seen it fail on Windows and Linux x64 (I didn't go over all the above failures, but it might not be a problem on ARM64).
The test is using Vectors. I wonder if there's a problem with preservation of XMM registers or something like that.
Cc @VSadov as this might be in your area of interest. (This has been happening for a while so it's not from the Linux suspension PR that just merged.)
Have we seen this failure on win-x64?
I would suspect preserving of the xmm result in 2-register return case. Win-x64 would not be affected by that. Otherwise it is something else, but likely still related to preserving xmm results.
I've looked through a few failures and it looks like the failures are x64-specific. They were seen on Linux and Windows, but I have not seen a failure on ARM64
This is interesting because Windows and Linux have different specifics in terms of preserving xmm registers (i.e. in Linux all xmm are volatile, but not on Windows, on the other hand Linux can return two xmm registers from a call)
It is hard to see commonality here, that would also be specific to NativeAOT. The only thing that I can think of is that, I believe, NativeAOT disables use of AVX by default, so vector math may be emitted differently.
If you can get a local repro of the failure, you can test the AVX instruction set theory by adding <IlcInstructionSet>avx2</IlcInstructionSet>
to the project file and rebuilding. This will enable AVX2 on NativeAOT.
Similarly, setting set COMPlus_EnableAVX=0
will disable AVX on CoreCLR. Maybe it will repro there too then?
I would try extracting the test method into a standalone app that just runs this test on multiple threads, triggering GCs from time to time if you haven't tried that already.
@MichalStrehovsky - yes that is what I plan to do. Try enabling AVX on NativeAOT and see if that changes the outcome.
I am unable to reproduce this locally - either by running the test in a loop or by extracting the test code in a standalone app and running that in a loop on multiple threads while forcing frequent suspensions.
I tried on both Windows and Linux.
Also - I did try disabling AVX and running the repro sample on CoreCLR - nope.
Matrix4x4CreateFromAxisAngleTest failed in #74381.
C:\helix\work\workitem\e>System.Numerics.Vectors.Tests.exe -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing -xml testResults.xml
Running assembly:System.Numerics.Vectors.Tests, Version=7.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51
[FAIL] System.Numerics.Tests.Matrix4x4Tests.Matrix4x4CreateFromAxisAngleTest
Assert.True() Failure
Expected: True
Actual: False
at System.Numerics.Tests.Matrix4x4Tests.Matrix4x4CreateFromAxisAngleTest() + 0x4c4
at System.Numerics.Vectors.Tests!<BaseAddress>+0x63121c
at System.Reflection.DynamicInvokeInfo.Invoke(Object thisPtr, IntPtr methodToCall, Object[] parameters, BinderBundle binderBundle, Boolean wrapInTargetInvocationException) + 0xcf
Finished System.Numerics.Vectors.Tests, Version=7.0.0.0, Culture=neutral, PublicKeyToken=cc7b13ffcd2ddd51
I hoped it was fixed by some context/register related change. It is just less common now. At least it is on Windows+x64, which should be somewhat easier to deal with.
Frequency updated in top post -- somehow it happens very often in 7.0-preview7 branch, but main and 7.0 are also hit -- roughly 2x per week (in last 2 weeks).
When it fails in a Windows run, it always fails in 4 runs in a row for the same build - on Windows10, Windows11, Windows10server, Windows8.1
Are we running the same binary? Perhaps the failure is a corruption at compile time? Is there a way to get a binary from a failed run?
I'd be interested to see the failing exe artifacts\bin\System.Numerics.Vectors.Tests\Release\net7.0\publish\System.Numerics.Vectors.Tests.exe
From this build, for example
https://dev.azure.com/dnceng/public/_build/results?buildId=1971318&view=results
You can get it with runfo, instructions here: https://github.com/dotnet/runtime/blob/main/eng/testing/debug-dump-template.md
For this failed run: https://helix.dot.net/api/2019-06-17/jobs/67c3cd97-4086-489f-a0b2-6ed45d5196d5/workitems/System.Numerics.Vectors.Tests/console
This runfo command works for me: runfo get-helix-payload -j 67c3cd97-4086-489f-a0b2-6ed45d5196d5 -w System.Numerics.Vectors.Tests -o c:\blah
Thanks! I have obtained the binary. And surely enough it fails. So it is a compile-time bug not a run-time bug.
No wonder it was not reproducing via re-running the test.
now I can build the same commit locally and see what the diffs are in the bits or when executing/debugging
The difference start when the following line is executed:
expected = Matrix4x4.CreateFromQuaternion(Quaternion.CreateFromAxisAngle(Vector3.Normalize(Vector3.One), radians));
LOADS DIFFERENT DATA ==> 00007FF763CB7E5E movups xmm0,xmmword ptr [__readonlydata_System_Numerics_Vectors_Tests_System_Numerics_Tests_Matrix4x4Tests__Matrix4x4CreateFromAxisAngleTest+40h (07FF7647DF070h)]
00007FF763CB7E65 movups xmm1,xmmword ptr [__readonlydata_System_Numerics_Vectors_Tests_System_Numerics_Tests_Matrix4x4Tests__Matrix4x4CreateFromAxisAngleTest+50h (07FF7647DF080h)]
00007FF763CB7E6C andps xmm1,xmm0
00007FF763CB7E6F mulps xmm0,xmm1
00007FF763CB7E72 movaps xmm1,xmm0
00007FF763CB7E75 shufps xmm1,xmm0,0B1h
The first movups
loads some static value (I suspect it is Vector.One
?)
My locally built exe, that passes the test, loads
XMM0 = DDDDDDDD3F800000-3F8000003F800000
The failing exe from the lab build loads:
XMM0 = FFFFFFFF3F800000-3F8000003F800000
3f800000
is float32 1.0 , so we have three 1.0 floats. Not sure what the rest is. Garbage? it seems it makes a difference.
@MichalStrehovsky
I think someone else should take a look. I have only very vague idea where these __readonlydata_
things come from.
Vector3 is 12 bytes, but the readonly data that Vector3.One
is loaded from is 16 bytes. The JIT leaves the top 4 bytes of the readonly data uninitialized that leads to failures depending on what the uninitialized data happened to be. This looks like a JIT bug. @dotnet/jit-contrib Could you please take a look?
The questions to ask:
This is silent bad codegen. We should fix it for .NET 7.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Author: | runfoapp[bot] |
---|---|
Assignees: | VSadov |
Labels: | `area-CodeGen-coreclr`, `area-NativeAOT-coreclr` |
Milestone: | 7.0.0 |
cc @tannergooding
We try to promote 12b simd to 16b everywhere where possible so I guess the data section's value has to be 16 bytes with zeroed last 4 bytes indeed
Are the top 4 bytes of xmm register that holds Vector3 data required to be initialized to 0? Is the JIT expected to initialize the top 4 bytes of the Vector3 readonly data to 0?
We don't have any strict requirement here. For most operations (add
, and
, subtract
, etc) the actual functionality is "element-wise" and so the value of the top 4-bytes doesn't matter.
The few cases where it crosses lanes (such as DotProduct
or Sum
) we're supposed to ensure the upper bits are zeroed or ignored. In the case of DotProduct
we emit the encoding that says to only use the lowest 3 elements.
Given this is NAOT only and NAOT must target a lower baseline, its possible there is a bug in the dpps
emulation emitted for SSE2 only hardware that isn't correctly accounting for this.
I'll take a look tomorrow.
Given this is NAOT only and NAOT must target a lower baseline
If we don't have any other test coverage for the lower baselines, it might be a good quality week topic. With NAOT one at least has a deterministic repro once a bad binary has been produced. Running into this in a JIT configuration would be a lot more difficult to root cause.
This is not the first issue caused by this test hole - there were #72081 and #72158 just in the past month. NAOT covers the lowest denominator but there's a whole spectrum in between the lowest denominator and whatever machines we run the tests on that can happen in the real world and it looks like we don't test.
We do test, but only in the outer loop and only for the runtime tests. There has been a longstanding issue to also run the library tests under the various JIT Stress options for ISA enablement
Given this is NAOT only and NAOT must target a lower baseline, its possible there is a bug in the
dpps
emulation emitted for SSE2 only hardware that isn't correctly accounting for this.I'll take a look tomorrow.
Assigning to Tanner.
@tannergooding Did you have a chance to take a look?
Yes. but it doesn't look to be in the Vector3.Dot
emulation.
The relevant logic is here: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lowerxarch.cpp#L3350
We pick Sse.Multiply
, Sse3.HorizontalAdd
-or- Sse.Shuffle
, and Sse.Add
and then the very first thing we do is mask things so that the unused component is zero: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/lowerxarch.cpp#L3418
GT_CNS_VEC
, but that's an unrelated JIT throughput optimizationThe called Matrix and Quaternion methods aren't accelerated, just the Vector3.Normalize
call, the various field accesses, the UnitX/Y/Z
property accesses, and the Vector3
constructor call.
The test doesn't really use any special tricks or code paths either: https://github.com/dotnet/runtime/blob/4c240009fa1d955805b5e284b6f58f1e643d1ef1/src/libraries/System.Numerics.Vectors/tests/Matrix4x4Tests.cs#L582
Here is codegen for Vector3.Length
that I believe is the root cause of the problem:
[MethodImpl(MethodImplOptions.NoInlining)]
static float f(Vector3 v)
=> v.Length();
movss xmm0,dword ptr [rcx+8] ds:00000052`9e6ff828=3f800000
movsd xmm1,mmword ptr [rcx]
shufps xmm1,xmm0,44h
// xmm1 has Vector3. The top 4 bytes of xmm1 have undefined value. They happen to be well defined in this specific case, but the JIT assumes the general case where they are undefined.
movups xmm0,xmmword ptr [xxx!_readonlydata_xxx_Program____Main___g__f_0_0 (00007ff6`59060ee0)] // 0x0000 FFFF FFFF FFFF - mask for the top 4 bytes
andps xmm0,xmm1 // xmm0 has properly masked Vector3
mulps xmm0,xmm1 // We are multiplying properly masked Vector3 with unmasked Vector3!
movaps xmm1,xmm0
shufps xmm1,xmm0,0B1h
addps xmm1,xmm0
movaps xmm0,xmm1
shufps xmm0,xmm1,4Eh
addps xmm0,xmm1
sqrtss xmm0,xmm0
ret
I think mulps xmm0,xmm1
is wrong. It should be mulps xmm0,xmm0
so that we multiply the masked values. The bad code works most of the time (0 multiplied by undefined value is zero) except when the undefined value happens to be NaN that will turn the whole thing into NaN.
Does this analysis make sense? Can you take it from here?
Yes, that makes sense. It should definitely be multiplying tmp * tmp
, rather than tmp * original
.
I'll see about getting a fix up first thing tomorrow.
Frequency - last 30 days as of 8/27:
Note: main branch failures in last 30 days
Runfo report