dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.47k stars 4.76k forks source link

Assert failure: executionAborted in `GcInfoDecoder::EnumerateLiveSlots` #102370

Closed jakobbotsch closed 4 months ago

jakobbotsch commented 6 months ago

Build Information

Build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=678333&view=results Build error leg or test failing: Example console log: https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-102261-merge-e822cbb23a0f465186/LibraryImportGenerator.Unit.Tests/1/console.c3aa79dd.log?helixlogtype=result

/root/helix/work/workitem/e /root/helix/work/workitem/e
  Discovering: LibraryImportGenerator.Unit.Tests (method display = ClassAndMethod, method display options = None)
  Discovered:  LibraryImportGenerator.Unit.Tests (found 185 of 190 test cases)
  Starting:    LibraryImportGenerator.Unit.Tests (parallel test collections = on [2 threads], stop on fail = off)
    LibraryImportGenerator.UnitTests.Compiles.ValidateSnippetsWithMarshalType [SKIP]
      No current scenarios to test.

Assert failure(PID 28 [0x0000001c], Thread: 73 [0x0049]): executionAborted
    File: /__w/1/s/src/coreclr/vm/gcinfodecoder.cpp:801
    Image: /root/helix/work/correlation/dotnet

Maybe related to #101890?

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "executionAborted",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Known issue validation

Build: :mag_right: https://dev.azure.com/dnceng-public/public/_build/results?buildId=678333 Error message validated: [executionAborted] Result validation: :x: Known issue did not match with the provided build. Validation performed at: 5/17/2024 8:39:56 AM UTC

Report

Build Definition Test Pull Request
2503080 dotnet-runtime LibraryImportGenerator.Unit.Tests.WorkItemExecution
747517 dotnet/runtime System.Formats.Tar.Tests.WorkItemExecution dotnet/runtime#105099
2490036 dotnet-runtime System.Text.RegularExpressions.Tests.WorkItemExecution
2488046 dotnet-runtime LibraryImportGenerator.Unit.Tests.WorkItemExecution
2487307 dotnet-runtime LibraryImportGenerator.Unit.Tests.WorkItemExecution

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
1 1 5
jakobbotsch commented 6 months ago

Also cc @VSadov, I'm not familiar with the logic here, but wonder if it could be related to the new GC safe points.

dotnet-policy-service[bot] commented 6 months ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

VSadov commented 6 months ago

Also cc @VSadov, I'm not familiar with the logic here, but wonder if it could be related to the new GC safe points.

With interruptible GC safe points we can stress test each and every safe point. In the past we would skip a good portion of safe points - JIT helpers, direct calls, for example would not be stress tested.

So - the change to safe points could be involved here, but it could also expose some existing bug. (we did not see that when the change was merged, but there is a possibility of some nondeterministic bug)

Does this happen without the PR change?

jakobbotsch commented 6 months ago

Does this happen without the PR change?

I don't know, I haven't investigated. Happened again in the recent libraries-jitstress run: Pipeline run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=690710&view=results Job name: net9.0-linux-Release-arm-disabler2r Console log: https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-heads-main-705beed46fcf4015ad/LibraryImportGenerator.Unit.Tests/1/console.f4051e9b.log?helixlogtype=result

jakobbotsch commented 6 months ago

Saw one on linux-x64 as well in https://dev.azure.com/dnceng-public/public/_build/results?buildId=692730&view=results. Console log: https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-pull-102808-merge-ca406cfc9a8448aea7/LibraryImportGenerator.Unit.Tests/1/console.b56c6cd1.log?helixlogtype=result

davidwrighton commented 6 months ago

Happened again on linux-arm https://dev.azure.com/dnceng-public/public/_build/results?buildId=698579&view=ms.vss-test-web.build-test-results-tab&runId=17382094&resultId=208376&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

Console log https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-101580-merge-24599464203f4fedbf/LibraryImportGenerator.Unit.Tests/1/console.4e5f55dd.log?helixlogtype=result

VSadov commented 6 months ago

Always on arm32 and always in LibraryImportGenerator.Unit.Tests

The failure is strange though. It means that we try to initiate a stack walk in a fully-interruptible method, but the IP happens to not be in one of the interruptible ranges. It is hard to think of how this could happen as anything that leads to stack walks should at some point ask "is this IP interruptible?".

VSadov commented 6 months ago

The last failure is interesting as it is not in a JIT stress run.

jakobbotsch commented 4 months ago

The failure is strange though. It means that we try to initiate a stack walk in a fully-interruptible method, but the IP happens to not be in one of the interruptible ranges. It is hard to think of how this could happen as anything that leads to stack walks should at some point ask "is this IP interruptible?".

Here is 100% reliable repro on linux-x64: https://gist.github.com/jakobbotsch/b7b98e082d7be7a189d2edc23b375a89

I think the issue was exposed by #101761. It is similar in spirit to #102919, #103300, #104042, where the JIT is incorrectly handling the fact that we now have a managed helper call in a new location. In this particular case the tailcall logic turns off GC before it starts placing arguments to the tailcall in the incoming parameters area (the GC ness of those args may not match the parameters, hence we need to do that). However, there is a block copy calling into the bulk write barrier helper after the JIT disables GC. So I think the underlying problem is that the JIT turns off GC too early. We need similar reasoning as added by #103301.

I am looking into a fix.

VSadov commented 4 months ago

@jakobbotsch - Also the first instruction of an epilog should not have GC forbidden. It was fixed in https://github.com/dotnet/runtime/pull/104336/files#r1664973906 , but right now reverted.

In theory that also could cause Assert failure: executionAborted - if we need to start stackwalk in a leaf method after a call and epilog starts right after that call as well.

AndyAyersMS commented 4 months ago

@jakobbotsch - Also the first instruction of an epilog should not have GC forbidden. It was fixed in https://github.com/dotnet/runtime/pull/104336/files#r1664973906 , but right now reverted.

In theory that also could cause Assert failure: executionAborted - if we need to start stackwalk in a leaf method after a call and epilog starts right after that call as well.

Maybe you can PR that separately, if the fix for the reversion will take a while?

VSadov commented 4 months ago

Maybe you can PR that separately, if the fix for the reversion will take a while?

I am considering that, but need to be sure it was not the part of the PR that made JIT stress unhappy.

VSadov commented 4 months ago

The partial re-revert, that contains only JIT fixes has some GC stress failures (extra ones compared to a no-op GC stress off main). Suspiciously the failures are all on arm64 and with JITStress.

So, it is possible that it is the JIT changes for the No-GC range in the epilog that make JIT stress unhappy.

Maybe the "fix" was incomplete and there is one more thing, perhaps in in JIT stress itself, that indirectly relies on old behavior.

VSadov commented 4 months ago

I am considering that, but need to be sure it was not the part of the PR that made JIT stress unhappy.

Turns out it was the "fix" that exposed bad GC info on the instruction right after CORINFO_HELP_FAIL_FAST. It was always possible to have it incorrect, but as long as we treated the first instruction of epilog as uninterruptible, it was mitigating the issue.

Treating the first instruction of epilog as uninterruptible is not the right thing to do, in general, especially after calls that could return or could do GC.
CORINFO_HELP_FAIL_FAST is fairly special in that sense - it does not return and does not do GC, but also the way we emit this call very late for the GS cookie check often results in corrupted GC info on the next instruction.

The part that CORINFO_HELP_FAIL_FAST is a NoGC helper saves us in partially interruptible code since emitting the call does not result in a GC safe point. In Fully interruptible code, the bad GC info is observable (although only on a single instruction), thus we only see problems in rare GC-JIT stress combinations.

I have a few solutions in mind, but need to figure what actually works more naturally.
Re: https://github.com/dotnet/runtime/pull/105596#issuecomment-2259013925