dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

baseservices/exceptions/stackoverflow/stackoverflowtester/stackoverflowtester fails on Linux #46175

Closed sandreenko closed 1 month ago

sandreenko commented 3 years ago

The test was disabled, so we had not seen this failure. The log is:

  baseservices/exceptions/stackoverflow/stackoverflowtester/stackoverflowtester.sh [FAIL]

      Return code:      1
      Raw output file:      /root/helix/work/workitem/baseservices/exceptions/Reports/baseservices.exceptions/stackoverflow/stackoverflowtester/stackoverflowtester.output.txt
      Raw output:
      BEGIN EXECUTION
      /root/helix/work/correlation/corerun stackoverflowtester.dll ''
      Running stackoverflow test(smallframe main)
      "Stack overflow."
      "Repeat 174461 times:"
      "--------------------------------"
      "   at TestStackOverflow.Program.InfiniteRecursionC()"
      "   at TestStackOverflow.Program.InfiniteRecursionB()"
      "   at TestStackOverflow.Program.InfiniteRecursionA()"
      "--------------------------------"
      "   at TestStackOverflow.Program.Test(Boolean)"
      "   at TestStackOverflow.Program.Main(System.String[])"
      "apply_reg_state: ip and cfa unchanged; stopping here (ip=0x7fb3fc0f2c)"
      Gathering state for process 522 corerun
      Writing minidump with heap to file /home/helixbot/dotnetbuild/dumps/coredump.522.dmp
      Written 61616128 bytes (15043 pages) to core file
      Dump successfully written
      ""
      Missing "Main" method frame at the last line
      Expected: 100
      Actual: 101
      END EXECUTION - FAILED

Note that on some architectures it fails with a timeout.
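To illustrate the `Missing "Main" method frame at the last line` message, here is a minimal sketch of two ways a test could look for the `Main` frame in the captured output. It is not the test's actual code: the function names are hypothetical, and it assumes the check trips because the libunwind `apply_reg_state` message is printed after the `Main` frame, as the log above suggests.

```python
def last_line_is_main(lines):
    """Strict check: only the very last non-empty line may be the Main frame."""
    non_empty = [line.strip() for line in lines if line.strip()]
    if not non_empty:
        return False
    last = non_empty[-1]
    return last.startswith("at ") and ".Main(" in last


def last_frame_is_main(lines):
    """Tolerant check: ignore non-frame lines and inspect the last 'at ...' line."""
    frames = [line.strip() for line in lines if line.strip().startswith("at ")]
    return bool(frames) and ".Main(" in frames[-1]


# Output lines copied from the log above, without the surrounding quotes.
log = [
    "Stack overflow.",
    "Repeat 174461 times:",
    "--------------------------------",
    "   at TestStackOverflow.Program.InfiniteRecursionC()",
    "   at TestStackOverflow.Program.InfiniteRecursionB()",
    "   at TestStackOverflow.Program.InfiniteRecursionA()",
    "--------------------------------",
    "   at TestStackOverflow.Program.Test(Boolean)",
    "   at TestStackOverflow.Program.Main(System.String[])",
    "apply_reg_state: ip and cfa unchanged; stopping here (ip=0x7fb3fc0f2c)",
]

strict_result = last_line_is_main(log)     # False: the libunwind message is last
tolerant_result = last_frame_is_main(log)  # True: the last frame line is Main
print(strict_result, tolerant_result)
```

Under these assumptions, the strict variant rejects this log while the tolerant one accepts it.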

AzDo example.

sandreenko commented 3 years ago

PTAL @echesakovMSFT I believe you were working with this test.

echesakov commented 3 years ago

> PTAL @echesakovMSFT I believe you were working with this test.

No, @janvorli created this test

janvorli commented 3 years ago

I had no idea the test was disabled. @sandreenko where have you seen it failing with timeout?

sandreenko commented 3 years ago

@janvorli it was in the same job; here is the log: https://dev.azure.com/dnceng/public/_build/results?buildId=923829&view=ms.vss-test-web.build-test-results-tab&runId=29324470&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab&resultId=102790

Note: the test is disabled again; I disabled it in #46162.

janvorli commented 3 years ago

After I fixed the lookup for Main in the stack trace, the ARM64 legs are still failing because we don't have the probing helper change in yet. For large frames, the failure point is too far from the SP, so the failure is not recognized as a stack overflow, and the SIGSEGV alternate stack is not large enough to execute the full stack overflow reporting: the alternate stack is about two pages, while stack overflow reporting needs about 8 pages. We need to wait for the stack probing helper change before reenabling the tests.

The OSX / Linux x64 legs are failing due to timeouts, most likely because our test infra generates core dumps for the processes that the test launches and that are expected to fail with the stack overflow. I'll look into a way to prevent dump generation for the secondary processes.
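The ARM64 failure mode described above can be sketched with a little address arithmetic. This is only an illustration, not the runtime's code: the page size, the one-page recognition window, the 16-page frame, and all names are assumptions. Only the "about two pages" and "about 8 pages" figures come from the comment.

```python
PAGE_SIZE = 4096  # assumed page size; actual systems may use a different value


def recognized_as_stack_overflow(fault_addr, sp, window_pages=1):
    """Treat a fault as a stack overflow only if it lies within the window around SP."""
    return abs(fault_addr - sp) <= window_pages * PAGE_SIZE


def alt_stack_sufficient(alt_stack_pages, needed_pages):
    """Check whether the alternate signal stack can hold the reporting code's frames."""
    return alt_stack_pages >= needed_pages


sp = 0x7FFF00000000  # hypothetical stack pointer at the time of the fault

# Small frame: the first touch lands just below SP, inside the window.
small_frame_fault = sp - 64

# Large frame without probing: the first touch lands many pages below SP.
large_frame_fault = sp - 16 * PAGE_SIZE

small_ok = recognized_as_stack_overflow(small_frame_fault, sp)
large_ok = recognized_as_stack_overflow(large_frame_fault, sp)
enough_room = alt_stack_sufficient(alt_stack_pages=2, needed_pages=8)
print(small_ok, large_ok, enough_room)
```

With a stack probing helper, a large frame would be touched page by page, so the first fault would land close to the SP and be recognized like the small-frame case.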

mangod9 commented 3 years ago

Is the stack probing helper change noted above merged, or is it still pending?

echesakov commented 3 years ago

> Is the stack probing helper change noted above merged, or is it still pending?

The change was postponed to 7.0.0 - we need to fix https://github.com/dotnet/runtime/issues/47810 first. Otherwise, enabling the stack probing helper introduces regressions.

mangod9 commented 3 years ago

Ok thanks for the update. Moving this to 7 as well.

am11 commented 2 years ago

Another set of tests has started to fail on CoreCLR Pri0 Runtime Tests Run Linux arm64 checked.

logs: https://helix.dot.net/api/2019-06-17/jobs/70a35f2c-194c-4f0d-97e6-a693efb480e4/workitems/profiler.eventpipe/console

  Starting:    profiler.eventpipe.XUnitWrapper (parallel test collections = on, max threads = 4)
    profiler/eventpipe/eventpipe/eventpipe.sh [FAIL]
      Unhandled exception. System.Exception: Profilee returned exit code 255 instead of expected exit code 100.
         at Profiler.Tests.ProfilerTestRunner.FailFastWithMessage(String error)
         at Profiler.Tests.ProfilerTestRunner.Run(String profileePath, String testName, Guid profilerClsid, String profileeArguments, ProfileeOptions profileeOptions, Dictionary`2 envVars, String reverseServerName, Boolean loadAsNotification, Int32 notificationCopies)
         at EventPipeTests.EventPipe.Main(String[] args)
      apply_reg_state: ip and cfa unchanged; stopping here (ip=0x7fb6cd6024)
      /root/helix/work/workitem/e/profiler/eventpipe/eventpipe/eventpipe.sh: line 384:    47 Aborted                 (core dumped) $LAUNCHER $ExePath "${CLRTestExecutionArguments[@]}"

      Return code:      1

Should we disable all these tests until https://github.com/dotnet/runtime/issues/47810 and this issue are resolved?

echesakov commented 2 years ago

@am11 I am not sure I understand the connection between the failing profiler test and the issue with stack probing. Can you please elaborate?

am11 commented 2 years ago

@echesakovMSFT, ah ok. The error from libunwind is "apply_reg_state: ip and cfa unchanged;", so I thought this issue was tracking that, based on the logs in the top post. Is that error unrelated, and do we need to track it separately?

echesakov commented 2 years ago

@am11 Yes, it looks unrelated.

mangod9 commented 2 years ago

@JulieLeeMSFT, Egor had pointed to https://github.com/dotnet/runtime/issues/47810, which needs to be merged before rechecking whether this test would pass. Is it planned for 7 (it's currently marked as future)?

mangod9 commented 2 years ago

Moving this to 8.

mangod9 commented 1 year ago

Looks like https://github.com/dotnet/runtime/issues/47810 is still not merged. @JulieLeeMSFT @BruceForstall I assume this is not planned for 8?

BruceForstall commented 1 year ago

@mangod9 Note that this test is disabled for all Linux, as well as for win-x86 (https://github.com/dotnet/runtime/issues/84911). Issue #47810 is an optimization for arm64 only. The arm64 stack probing issue is https://github.com/dotnet/runtime/issues/13519. There is no current plan to implement it. (cc @kunalspathak)

But, as mentioned, that should only affect arm64. All the other failures of this test (non-arm64 Linux and win-x86) could be investigated independently.

mangod9 commented 1 year ago

@janvorli, would your recent exceptions work handle this case? If so, we can move this to 9.

mangod9 commented 1 month ago

Looks like the disabled test was enabled as part of JanV's fix. Closing now.

janvorli commented 1 month ago

@mangod9 my PR was closed, not merged, and the tests are still disabled. Based on @jkotas's feedback, I wanted to make the fix more bulletproof, but then it fell off my radar with all the EH work. I am reopening the issue. I'll try to get back to fixing it soon.

mangod9 commented 1 month ago

Oh sorry, I missed that the PR was closed before merging. Assuming we can enable it again in 9.