BruceForstall closed this issue 1 year ago
I built a coredistools.dll updated to the latest LLVM, 16.0.1, and this failure persists.
This looks to be an actual bug in the emitted codegen: note the emitted `BD` vs. the correct `FD` in the 3rd byte.

Emitted: `62D3BD483BF001`
Correct: `62D3FD483BF001`
The technical encoding is `EVEX.512.66.0F3A.W1 3B /r ib`, where the EVEX prefix is `0x62 P0 P1 P2`:
* P0: `R X B R' 0 0 m m`
* P1: `W v v v v 1 p p`
* P2: `z L' L b V' a a a`
We've encoded (where P1E is emitted, P1C is correct):
* P0: `1 1 0 1 0 0 1 1`
* P1E: `1 0 1 1 1 1 0 1`
* P1C: `1 1 1 1 1 1 0 1`
* P2: `0 1 0 0 1 0 0 0`

So we're simply misencoding the `VEX.vvvv` register here. It's meant to be stored in bit-inverted format and, combined with the `V'` bit, we've emitted `1_0111`, which is 8.
The error comes in that `vextract` doesn't use the `vvvv` field at all, so it must be `1_1111`. Instead, since the instruction is `ins reg/mem, reg, imm`, the ModRM field is used to store the destination.
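The bit analysis above can be checked mechanically. Here's a small Python sketch (illustrative only, not JIT code) that decodes the P1 and P2 payload bytes from the two byte sequences quoted earlier:

```python
# Decode the EVEX payload bytes P1 and P2 from the emitted (0xBD) and
# correct (0xFD) encodings, to show that only the vvvv field differs.

def decode_p1_p2(p1: int, p2: int):
    w      = (p1 >> 7) & 1        # P1 bit 7
    vvvv   = (p1 >> 3) & 0b1111   # P1 bits 6..3, stored bit-inverted
    pp     = p1 & 0b11            # P1 bits 1..0
    vprime = (p2 >> 3) & 1        # P2 "V'" bit, also stored bit-inverted
    # Combined 5-bit register selector, after un-inverting: ~V' . ~vvvv
    reg5 = ((vprime ^ 1) << 4) | (vvvv ^ 0b1111)
    return w, vvvv, pp, reg5

emitted = decode_p1_p2(0xBD, 0x48)   # from 62 D3 BD 48 3B F0 01
correct = decode_p1_p2(0xFD, 0x48)   # from 62 D3 FD 48 3B F0 01

print(emitted)  # vvvv=0b0111 -> selects register 8 (the bug)
print(correct)  # vvvv=0b1111 -> no vvvv register (all ones, as required)
```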
It's not immediately obvious where we're going wrong here, and we may actually have a related issue with the AVX `vextractf128` and `vextracti128` instructions.
We do document the format incorrectly as `IF_RRW_RRW_CNS` when it should be `IF_RWR_RRD_CNS`. This is a long-standing issue and something we have wrong for a couple of instructions, but one that doesn't cause any problems in practice.
The issue ends up being that the instruction is incorrectly tracking the `IsDstDstSrcAVXInstruction` flag, which means we double-encode what is supposed to be just `regFor012Bits` also in `regFor3456Bits`. This is unique to the AVX-512 extract instructions and isn't an issue for the AVX ones after all.
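As a hypothetical illustration (the names and structure here are mine, not the actual JIT emitter's), the double-encoding can be modeled like this: the operand register (id 8 in the encoding above) correctly lands in the ModRM/extension bits, but the flag also routes it into `EVEX.vvvv`, which this instruction doesn't use:

```python
# Hypothetical sketch of how treating vextracti64x4 as a "DstDstSrc"
# instruction corrupts P1. When the flag is (incorrectly) set, the
# register id is inverted into vvvv; when it's clear, vvvv stays all
# ones, meaning "no vvvv operand".

def encode_p1(w: int, reg: int, pp: int, is_dst_dst_src: bool) -> int:
    # vvvv is stored bit-inverted; an unused vvvv must be all ones
    vvvv = (reg ^ 0b1111) if is_dst_dst_src else 0b1111
    return (w << 7) | (vvvv << 3) | (1 << 2) | pp

# With the flag incorrectly set, register 8 leaks into vvvv:
print(hex(encode_p1(1, 8, 0b01, True)))   # 0xbd -- the bad byte
# With the flag removed (the proposed fix):
print(hex(encode_p1(1, 8, 0b01, False)))  # 0xfd -- the correct byte
```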
The fix should just be removing the flag. We should also probably audit the `IsDstDstSrc` and `IsDstSrcSrc` flags for all SIMD instructions. This is mostly needed for the new AVX-512 ones, but we've recently audited the other flags in general, so it wouldn't hurt to finish doing the same here.
It was confusing because both VS and windbg disassemble this instruction as we expect to see it. Presumably because they ignore the field that is incorrectly encoded?
However, when I run this program (JIT\HardwareIntrinsics\General\Vector512_1\Vector512_1_ro\Vector512_1_ro.dll) with `DOTNET_TieredCompilation=0`, I get an illegal instruction exception (on both AMD 7950X and Intel Xeon Platinum 8370C).
So is this test case actually running in the CI? It must be, if the test ended up in the SPMI `coreclr_tests.run` collection.
Is it running but not correctly reporting the error/failure?
> So is this test case actually running in the CI?
The encoding is only incorrect when we have `vextracti64x4 reg, reg`. If we instead contain it as part of a store and emit `vextracti64x4 [addr], reg`, we don't have any issues.
Up until yesterday we were also only running Vector512 tests in Pri1, so it's possible that we weren't seeing this in CI because of that, and that it will now be showing up in the outerloop jobs.
The latest outerloop job (from Apr 19, 2023 at 2:00 AM PDT) has a few failures in AVX tests, but not in this one:
However, when I look at the console log for the HardwareIntrinsics_General_ro test, I see the failure:
```
11:03:55.549 Running test: _Vector512_1_ro::JIT.HardwareIntrinsics.General._Vector512_1.Program.GetAndWithLowerAndUpperByte()
Beginning scenario: RunBasicScenario
Fatal error. System.Runtime.InteropServices.SEHException (0x80004005): External component has thrown an exception.
   at JIT.HardwareIntrinsics.General._Vector512_1.VectorGetAndWithLowerAndUpper__GetAndWithLowerAndUpperByte.RunBasicScenario()
   at JIT.HardwareIntrinsics.General._Vector512_1.Program.GetAndWithLowerAndUpperByte()
   at Program.<<Main>$>g__TestExecutor26|0_25(System.IO.StreamWriter, System.IO.StreamWriter, <>c__DisplayClass0_0 ByRef)
   at Program.<Main>$(System.String[])
App Exit Code: -1073741795
Expected: 100
Actual: -1073741795
END EXECUTION - FAILED
FAILED
[XUnitLogChecker]: Item 'HardwareIntrinsics_General_ro' did not finish running. Checking and fixing the log...
[XUnitLogChecker]: XUnit log file has been fixed!
1259/2363 tests run.
* 1259 tests passed.
* 0 tests failed.
* 0 tests skipped.
[XUnitLogChecker]: Checking for dumps...
[XUnitLogChecker]: No crash dumps found. Continuing...
[XUnitLogChecker]: Finished!
ERROR: The process "corerun.exe" not found.
2023-04-19T11:03:58.485Z INFO run.py run(48) main Beginning reading of test results.
2023-04-19T11:03:58.485Z INFO run.py __init__(42) read_results Searching 'C:\h\w\B51E0A11\w\AD88096E\e' for test results files
2023-04-19T11:03:58.485Z INFO run.py __init__(48) read_results Found results file C:\h\w\B51E0A11\w\AD88096E\e\JIT\HardwareIntrinsics\HardwareIntrinsics_General_ro\HardwareIntrinsics_General_ro.testResults.xml with format xunit
2023-04-19T11:03:58.501Z INFO run.py __init__(42) read_results Searching 'C:\h\w\B51E0A11\w\AD88096E\uploads' for test results files
2023-04-19T11:03:58.501Z INFO run.py packing_test_reporter(30) report_results Packing 1259 test reports to 'C:\h\w\B51E0A11\w\AD88096E\e\__test_report.json'
2023-04-19T11:03:58.501Z INFO run.py packing_test_reporter(33) report_results Packed 281920 bytes
['HardwareIntrinsics_General_ro' END OF WORK ITEM LOG: Command exited with 0]
```
BUT, notice that even though the test failed, the harness returns 0 -- no failure.
Possibly this is a XUnitLogChecker problem?
=======
btw, the tests are invoked with:
```
call JIT\HardwareIntrinsics\HardwareIntrinsics_General_ro\HardwareIntrinsics_General_ro.cmd -usewatcher
```
I don't have a `HardwareIntrinsics_General_ro.cmd` in my locally built artifacts, even though I built pri-1 tests. Why?
I DO have `artifacts\tests\coreclr\windows.x64.Checked\JIT\HardwareIntrinsics\General\Vector512_1\Vector512_1_ro\Vector512_1_ro.cmd`, for example.
> Possibly this is a XUnitLogChecker problem?
CC @trylek -- are there any known issues here? Could we be missing reported test failures in some cases?
Perhaps we are looking for `# failed` and not accounting for tests which didn't run, crashes, bad exit codes, or some timeouts?
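To illustrate the kind of check being suggested, here's a hypothetical sketch (function and parameter names are mine, not the actual harness or XUnitLogChecker API): a work item should count as failed not only when the XUnit log reports failures, but also when the runner exit code is bad or fewer tests ran than expected.

```python
# Hypothetical failure check for a merged test work item. A run like the
# one in the log above (1259 of 2363 tests ran, 0 reported failures,
# runner exited with 0) should still be treated as a failure.

def work_item_failed(exit_code: int, tests_expected: int,
                     tests_run: int, tests_failed: int) -> bool:
    if tests_failed > 0:
        return True
    if exit_code != 0:              # crash, illegal instruction, etc.
        return True
    if tests_run < tests_expected:  # the group died partway through
        return True
    return False

print(work_item_failed(0, 2363, 1259, 0))  # True: incomplete run
print(work_item_failed(0, 2363, 2363, 0))  # False: clean full run
```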
> I don't have a HardwareIntrinsics_General_ro.cmd in my locally built artifacts, even though I built pri-1 tests. Why?
Did you do anything related to `BuildAsStandalone`? I think that may stop the merged test wrappers from being built.
Also, @ivdiazsa -- it looks like XUnitLogChecker is ignoring the test run exit code.
> Did you do anything related to BuildAsStandalone? I think that may stop the merged test wrappers from being built.
Good point. I always build with `BuildAsStandalone=true`.
I opened https://github.com/dotnet/runtime/issues/85056 to track the CI infrastructure problem (of not reporting this test failure).
This is the case where the failure takes down the entire process. Presumably the test that you're looking for didn't run. We don't have the wrapper logic to attempt to rerun the rest of the test group after the failure. Mitigations are (1) disable (or mark as `RequiresProcessIsolation`) catastrophically failing tests asap to unblock other testing, and (2) keep merged test group sizes balanced (on the order of hundreds of tests) to keep the wins from merging but reduce the collateral damage from these failures.
I see the logic in src/tests/Directory.Build.targets that disables merged test groups under `BuildAsStandalone`. It's possible that this isn't necessary. Similar to `RequiresProcessIsolation`, this generates a `.cmd`/`.sh` local entry point, but a merged group can still execute it out-of-process. I guess often it wouldn't be useful, since you can run the test standalone.
> This is the case where the failure takes down the entire process
Shouldn't the harness try to detect this and mark itself as failing? It looks like we do capture that the test process had a bad exit code and that we didn't run everything...
There's an issue where one level needs to report as "success" to avoid Helix retrying the job. This isn't an area that I understand -- those you've already tagged (plus maybe @jkoritzinsky?) should know more.
> Presumably the test that you're looking for didn't run
@markples There are multiple things going on here, but the core problem is that the important test here, the one that took down the merged group, did not get reported as failing (and in fact the entire group did not get reported as failing).
Bruce is right. It's a bug with the log checker I'm already looking into. I have no idea why it's overriding the test's exit code. It might have to do with the contents of the HelixCommand script, as the log checker is the last thing to get called.
SuperPMI is reporting an asm diff in:
but no textual diff, for both x64 and x86. For x64, it is in `236d7997-3d71-45f9-b7d7-5241ad89a56f.windows.x64\coreclr_tests.run.windows.x64.checked.mch`.
The superpmi log shows:
The issue is that coredistools is failing to disassemble:
coredistools was last updated, based on LLVM 13.0.1, in March 2022 with https://github.com/dotnet/jitutils/pull/351 (and related). The most recent LLVM version is 16.0.0, released March 2023 (https://releases.llvm.org/). Does it have a fix for this?
I verified that this disassembles correctly in Visual Studio.
Are there other cases where `vextracti64x4` is properly decoded? That is, is it failing in only some cases, not all?
If we can't update coredistools with a fix, we'll have to do something so SuperPMI doesn't display spurious diffs. E.g., filter out the failing MCs.
@dotnet/avx512-contrib @dotnet/jit-contrib