dotnet / runtime


ARM64 SVE: GatherVectorWithByteOffsetFirstFaulting failures during stress testing #106621

Closed a74nh closed 3 months ago

a74nh commented 3 months ago

With today's HEAD (dc0432bcb6e), I sometimes get errors with GatherVectorWithByteOffsetFirstFaulting when running stress testing. The failures are intermittent: they do not occur on every run, and not always under the same stress mode.

Using HardwareIntrinsics_Arm_ro.dll

===================Running jitstress===================
------------------- {'JitMinOpts': '1'} -------------------
------------------- {'JitStress': '1'} -------------------
------------------- {'JitStress': '2'} -------------------
------------------- {'JitStress': '1', 'TieredCompilation': '1'} -------------------
------------------- {'JitStress': '2', 'TieredCompilation': '1'} -------------------
Test failed:
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
Sve.GatherVectorWithByteOffsetFirstFaulting<Single>(Single, Single, Int32): RunBasicScenario_LoadFirstFaulting failed:
       firstOp: (1E-45, 0, 1E-45, 1)
      secondOp: (0.6533055)
       thirdOp: (0, 237, 3, 12)
        result: (0.6533055, 0, 1.5155E-41, 0)
   faultResult: (<1, 1, 0, 0>)
..........................................
System.Exception: One or more scenarios did not complete as expected.
   at JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_GatherVectorWithByteOffsetFirstFaulting_float_int() in /home/alahay01/dotnet/runtime_sve/artifacts/tests/coreclr/obj/linux.arm64.Checked/Managed/JIT/HardwareIntrinsics/Arm/Sve/Sve_ro/Sve_ro/gen/Sve.GatherVectorWithByteOffsetFirstFaulting.float.int.cs:line 86
   at Program.<<Main>$>g__TestExecutor3308|0_3309(StreamWriter tempLogSw, StreamWriter statsCsvSw, <>c__DisplayClass0_0&) in /home/alahay01/dotnet/runtime_sve/artifacts/tests/coreclr/obj/linux.arm64.Checked/Managed/JIT/HardwareIntrinsics/HardwareIntrinsics_Arm_ro/generated/XUnitWrapperGenerator/XUnitWrapperGenerator.XUnitWrapperGenerator/FullRunner.g.cs:line 83607
------------------- {'TailcallStress': '1'} -------------------
------------------- {'ReadyToRun': '0'} -------------------
===================Running jitstressregs===================
------------------- {'JitStressRegs': '1'} -------------------
------------------- {'JitStressRegs': '2'} -------------------
------------------- {'JitStressRegs': '3'} -------------------
------------------- {'JitStressRegs': '4'} -------------------
------------------- {'JitStressRegs': '8'} -------------------
------------------- {'JitStressRegs': '0x10'} -------------------
------------------- {'JitStressRegs': '0x80'} -------------------
Test failed:
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
Sve.GatherVectorWithByteOffsetFirstFaulting<Int32>(Int32, Int32, Int32): RunBasicScenario_LoadFirstFaulting failed:
       firstOp: (1, 0, 1, 1)
      secondOp: (1532479874)
       thirdOp: (0, 155, 3, 12)
        result: (1532479874, 0, 9496155, 0)
   faultResult: (<1, 1, 0, 0>)
..........................................
System.Exception: One or more scenarios did not complete as expected.
   at JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_GatherVectorWithByteOffsetFirstFaulting_int() in /home/alahay01/dotnet/runtime_sve/artifacts/tests/coreclr/obj/linux.arm64.Checked/Managed/JIT/HardwareIntrinsics/Arm/Sve/Sve_ro/Sve_ro/gen/Sve.GatherVectorWithByteOffsetFirstFaulting.int.cs:line 86
   at Program.<<Main>$>g__TestExecutor3309|0_3310(StreamWriter tempLogSw, StreamWriter statsCsvSw, <>c__DisplayClass0_0&) in /home/alahay01/dotnet/runtime_sve/artifacts/tests/coreclr/obj/linux.arm64.Checked/Managed/JIT/HardwareIntrinsics/HardwareIntrinsics_Arm_ro/generated/XUnitWrapperGenerator/XUnitWrapperGenerator.XUnitWrapperGenerator/FullRunner.g.cs:line 83631
------------------- {'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStressRegs': '0x2000'} -------------------
===================Running jitstress2-jitstressregs===================
------------------- {'JitStress': '2', 'JitStressRegs': '1'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '2'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '3'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '4'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '8'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x10'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x80'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x1000'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '0x2000'} -------------------

On another run:

===================Running jitstress2-jitstressregs===================
------------------- {'JitStress': '2', 'JitStressRegs': '1'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '2'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '3'} -------------------
------------------- {'JitStress': '2', 'JitStressRegs': '4'} -------------------
Test failed:
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
..........................................
Sve.GatherVectorUInt32WithByteOffsetsZeroExtendFirstFaulting<Int32>(Int32, UInt32, Int32): RunBasicScenario_LoadFirstFaulting failed:
       firstOp: (0, 1, 1, 1)
      secondOp: (483312, 111609)
       thirdOp: (172, 0, 7, 12)
        result: (0, 483312, 0, 0)
   faultResult: (<1, 1, 0, 0>)
..........................................
System.Exception: One or more scenarios did not complete as expected.
   at JIT.HardwareIntrinsics.Arm._Sve.Program.Sve_GatherVectorUInt32WithByteOffsetsZeroExtendFirstFaulting_offsets_int_int() in /home/alahay01/dotnet/runtime_sve/artifacts/tests/coreclr/obj/linux.arm64.Checked/Managed/JIT/HardwareIntrinsics/Arm/Sve/Sve_ro/Sve_ro/gen/Sve.GatherVectorUInt32WithByteOffsetsZeroExtendFirstFaulting.offsets.int.int.cs:line 76
   at Program.<<Main>$>g__TestExecutor3284|0_3285(StreamWriter tempLogSw, StreamWriter statsCsvSw, <>c__DisplayClass0_0&) in /home/alahay01/dotnet/runtime_sve/artifacts/tests/coreclr/obj/linux.arm64.Checked/Managed/JIT/HardwareIntrinsics/HardwareIntrinsics_Arm_ro/generated/XUnitWrapperGenerator/XUnitWrapperGenerator.XUnitWrapperGenerator/FullRunner.g.cs:line 83031

dotnet-policy-service[bot] commented 3 months ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch
See info in area-owners.md if you want to be subscribed.

SwapnilGaikwad commented 3 months ago

Hi @amanasifkhalid, do you still see this issue? I couldn't reproduce it. However, I saw a different failure instead, which occurs rarely (about 1 in 10 runs). I'll open an issue for it, although I'm not sure how to reproduce it reliably.

SwapnilGaikwad commented 3 months ago

> I saw another failure instead and that occurs rarely, 1 out of 10 times. I'll open an issue for it

Opened #106815.

amanasifkhalid commented 3 months ago

> do you still see this issue. I couldn't reproduce this.

I was able to repro it after about a dozen stress test runs, though I haven't tried reproing with the latest HEAD. I'll take another look...

> However, I saw another failure instead and that occurs rarely, 1 out of 10 times. I'll open an issue for it. Although, not sure how to reproduce that.

Since the intrinsic failing in #106815 uses the same template as this intrinsic, I suspect these failures might be bugs in the template itself (which is better than it being a JIT bug). Something interesting to note is both of the failures Alan posted above were with variants operating on Int32/UInt32, and the failure you posted on #106815 is also working with 32-bit integers as the base type. It's possible that the test template isn't handling the bounded memory correctly, and this size of base type happens to hit the bug. Though if you manage to repro this failure on another base type, that would be a good challenge to my hypothesis.
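
A quick byte-level look at the Int32 failure logged above gives one data point for this hypothesis. This is only a sketch under assumptions: it takes the values straight from the log (secondOp 1532479874, lane 2 of result 9496155, lane 2 offset 3), assumes little-endian memory, and assumes the data array holds just the one element printed in secondOp. The variable names are made up for illustration.

```python
# Values copied from the Int32 failure in the log above.
data_value = 1532479874      # secondOp: the single logged data element
lane2_result = 9496155       # result lane 2
lane2_offset = 3             # thirdOp lane 2 (byte offset)

# Assumption: little-endian layout, 4-byte elements.
data_bytes = data_value.to_bytes(4, "little")
result_bytes = lane2_result.to_bytes(4, "little")

# Bytes of the data element that a 4-byte read starting at offset 3 can still see.
in_range = data_bytes[lane2_offset:]

# True if the loaded value begins with those in-range bytes.
matches = result_bytes[:len(in_range)] == in_range

print(in_range.hex(), result_bytes.hex(), matches)  # 5b 5be69000 True
```

Only the first byte of the loaded value comes from the data element. The remaining three bytes would have to come from whatever follows it, which is what a read that starts in range and runs past the end of the data would look like.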

amanasifkhalid commented 3 months ago

I've dug into this a bit, and this seems to be an issue with the test validation logic, particularly the bit in CheckGatherVectorWithByteOffsetFirstFaultingBehavior for calculating the expected FFR value. Right now, we determine that the load should have faulted if a given offset exceeds the length of the data, but we do not account for the fact that the load can walk off the end of the data array even if the starting offset is in-range. For example, if the data array is a single 32-bit integer, and the starting offset is 1, we will read bytes 1-4, which means the last byte read should cause a fault, but the current logic will expect it to not fault. The fix is straightforward: We just need to factor in the size of the load when determining if it will exceed the bounds of the data array.
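
To make the difference concrete, here is a minimal Python model of the two checks. It is not the actual C# code in CheckGatherVectorWithByteOffsetFirstFaultingBehavior. The names `load_faults` and `expected_ffr` are hypothetical, the "start offset only" branch is my reading of the current logic as described above, and the FFR handling is simplified (the first faulting active lane and every lane after it are cleared).

```python
def load_faults(offset, data_len, elem_size, size_aware=True):
    """Return True if a load of elem_size bytes at byte offset leaves the data.

    size_aware=False models the check described above (start offset only);
    size_aware=True also accounts for the width of the load.
    """
    if offset < 0:
        return True
    if size_aware:
        return offset + elem_size > data_len
    return offset >= data_len


def expected_ffr(mask, offsets, data_len, elem_size, size_aware=True):
    """Simplified first-faulting model: all lanes start valid, and the first
    active lane that faults clears itself and every later lane.

    Simplification: if the very first active lane faults, real hardware takes
    the fault instead of only clearing FFR bits; this model does not
    distinguish that case.
    """
    ffr = [1] * len(offsets)
    for lane, (active, offset) in enumerate(zip(mask, offsets)):
        if active and load_faults(offset, data_len, elem_size, size_aware):
            for later in range(lane, len(ffr)):
                ffr[later] = 0
            break
    return ffr


# The example from the comment: one 32-bit element, starting offset 1.
print(load_faults(1, 4, 4, size_aware=False))  # False
print(load_faults(1, 4, 4, size_aware=True))   # True

# Int32 failure from the log, assuming the data is the single element printed
# in secondOp (4 bytes).
mask, offsets = [1, 0, 1, 1], [0, 155, 3, 12]
print(expected_ffr(mask, offsets, 4, 4, size_aware=False))  # [1, 1, 1, 0]
print(expected_ffr(mask, offsets, 4, 4, size_aware=True))   # [1, 1, 0, 0]
```

Under those assumptions, the size-aware check gives the `<1, 1, 0, 0>` pattern that appears in the faultResult lines above, while the start-offset-only check treats lane 2 (offset 3) as in range.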