dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

nativeaot/SmokeTests/Exceptions failing with `Assertion failed: (n_heaps <= heap_number) || !gc_t_join.joined()` #103839

Closed elinor-fung closed 6 days ago

elinor-fung commented 2 months ago
Assertion failed: (n_heaps <= heap_number) || !gc_t_join.joined(), file D:\a\_work\1\s\src\coreclr\gc\gc.cpp, line 6988

Return code:      1
Raw output file:      C:\h\w\B51009A0\w\B29A098A\uploads\Reports\nativeaot.SmokeTests\Exceptions\Exceptions\Exceptions.output.txt
Raw output:
BEGIN EXECUTION
call C:\h\w\B51009A0\p\nativeaottest.cmd C:\h\w\B51009A0\w\B29A098A\e\nativeaot\SmokeTests\Exceptions\Exceptions\ Exceptions.dll 
Exception caught!
Null reference exception in write barrier caught!
Null reference exception caught!
Test Stacktrace with exception on stack:
   at BringUpTest.FilterWithStackTrace(Exception) + 0x28
   at BringUpTest.Main() + 0x31c
   at System.Runtime.EH.FindFirstPassHandler(Object, UInt32, StackFrameIterator&, UInt32&, Byte*&) + 0x188
   at System.Runtime.EH.DispatchEx(StackFrameIterator&, EH.ExInfo&) + 0x161
   at System.Runtime.EH.RhThrowEx(Object, EH.ExInfo&) + 0x4b
   at BringUpTest.Main() + 0xaf

Exception caught via filter!
Expected: 100
Actual: 3
END EXECUTION - FAILED

Build Information

Build: https://dev.azure.com/dnceng-public/cbb18261-c48f-4abb-8651-8cdcb5474649/_build/results?buildId=715849
Build error leg or test failing: nativeaot\SmokeTests\Exceptions\Exceptions\Exceptions.cmd
Pull request: https://github.com/dotnet/runtime/pull/103821

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "Assertion failed: (n_heaps <= heap_number) || !gc_t_join.joined()",
  "ErrorPattern": "",
  "BuildRetry": false,
  "ExcludeConsoleLog": false
}

Report

| Build | Definition | Test | Pull Request |
| --- | --- | --- | --- |
| 782793 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | dotnet/runtime#106713 |
| 780945 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | dotnet/runtime#106662 |
| 779397 | dotnet/runtime | readytorun/GenericCycleDetection/Depth1Test/Depth1Test.cmd | dotnet/runtime#80154 |
| 777119 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | dotnet/runtime#106474 |
| 777004 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | dotnet/runtime#106419 |
| 776719 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | dotnet/runtime#105946 |
| 775455 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | dotnet/runtime#106309 |
| 770671 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | |
| 769702 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | dotnet/runtime#106130 |
| 768094 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | dotnet/runtime#106010 |
| 761651 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | dotnet/runtime#105757 |
| 757283 | dotnet/runtime | nativeaot.SmokeTests.WorkItemExecution | dotnet/runtime#105578 |

Summary

| 24-Hour Hit Count | 7-Day Hit Count | 1-Month Count |
| --- | --- | --- |
| 0 | 4 | 12 |

dotnet-policy-service[bot] commented 2 months ago

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas See info in area-owners.md if you want to be subscribed.

dotnet-policy-service[bot] commented 2 months ago

Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.

jkotas commented 2 months ago

Looks like a DATAS race condition. @dotnet/gc Could you please take a look?

Note that nativeaot\SmokeTests\Exceptions test is explicitly opted into server GC to get some coverage for server GC during default CI run.

mrsharm commented 1 month ago

Are there any dumps available? I can't seem to find them. Tried to repro locally to no avail. Seems like it's a low probability assertion failure (2 / month).

MichalStrehovsky commented 1 month ago

> Are there any dumps available? I can't seem to find them. Tried to repro locally to no avail. Seems like it's a low probability assertion failure (2 / month).

Yeah, it doesn't look like infra captured a dump for this.

There are 4 hits per month but we don't have any dedicated server GC testing. This is the one and only test we run with server GC enabled. We rely on CoreCLR testing to catch GC bugs right now (even this test is not really testing Server GC - it just tests that setting the csproj property to enable server GC actually enables the server GC).
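
For reference, opting a project into server GC is normally a single MSBuild property. The fragment below is a minimal sketch, assuming the documented `ServerGarbageCollector` property. The test's actual project file is not quoted in this thread, so the surrounding elements are hypothetical.

```xml
<!-- Hypothetical excerpt; the actual test project file is not quoted in this thread. -->
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <!-- Documented MSBuild property that selects server GC instead of workstation GC. -->
    <ServerGarbageCollector>true</ServerGarbageCollector>
  </PropertyGroup>
</Project>
```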

mangod9 commented 2 weeks ago

@mrsharm @MichalStrehovsky are any dumps available for this, or is there a local repro?

mrsharm commented 2 weeks ago

I couldn't repro this locally, nor could I get to any dumps. My one guess (a long shot) is that this is related to the other DATAS race condition we found via the Reliability Framework, where GetHeap races with change_heap_count while it is being invoked. Without a dump, that's difficult to validate.

mangod9 commented 2 weeks ago

The reliability framework issue was fixed, correct? Looks like this issue reproed today.

mrsharm commented 2 weeks ago

> The reliability framework issue was fixed, correct? Looks like this issue reproed today.

It wasn't - I think we were still working on a solution. CC: @Maoni0.

mangod9 commented 2 weeks ago

ah ok. We can tag it as such then, and see if the repro stops after that is fixed.

Maoni0 commented 1 week ago

I made a fix at https://github.com/dotnet/runtime/pull/106752.

mrsharm commented 6 days ago

@cshung, we should wait some time before confirming this issue is truly fixed - I am observing that the bot is still picking up the same failures.

cshung commented 6 days ago

@mrsharm, wouldn't the bot reopen it if it finds new failures? I was hoping to confirm the fix that way. The builds the bot found seem to be either from 9.0 or from two days ago.