dotnet / arcade

Tools that provide common build infrastructure for multiple .NET Foundation projects.
MIT License

llvm-symbolizer not present in base queue #11631

Closed kunalspathak closed 3 months ago

kunalspathak commented 1 year ago

Build

https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-77578-merge-965165820fec43e19e/JIT.Stress/1/console.f7c5d70b.log?helixlogtype=result

https://dev.azure.com/dnceng-public/public/_build/results?buildId=82793&view=ms.vss-test-web.build-test-results-tab&runId=1731386&resultId=102137&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

Pull Request

https://github.com/dotnet/runtime/pull/77578

Action required for the engineering services team

Additional information about the issue reported

To triage this issue (First Responder / @dotnet/dnceng):

In https://github.com/dotnet/runtime/pull/77578, we are trying to generate the crash stack trace using llvm-symbolizer. While it is present in containers, the base Linux and macOS queues don't have it, and we see an error when using it. See the logs I referenced in the issue. Can we get it and lldb installed on the base image?

CC: @hoyosjs @JulieLeeMSFT

Release Note Category

Release Note Description

Add llvm and llvm-symbolizer to Ubuntu.1804.Amd64 and RedHat.7.Amd64

michellemcdaniel commented 1 year ago

Hi Kunal, we will get on this. @hoyosjs do you know if this just comes built in with llvm? lldb 3.9 is already being installed on the base ubuntu.1804 queues. Do you need a different version? This is the test queue, so I don't think it would be an issue to upgrade that to something newer, but I'd like to check before making any major changes.

hoyosjs commented 1 year ago

Do you know why 3.9? And llvm sounds good.

michellemcdaniel commented 1 year ago

I do not know why 3.9. Possibly historic reasons? @MattGal it looks like we set our lldb version to 3.9 back in 2020. Do you know why we're using that?

Edit: Oh, actually, we set this in 2019.

Edit: that is also a lie. I am still digging into how long ago we chose 3.9 and never updated it.

hoyosjs commented 1 year ago

Probably for diagnostics...

michellemcdaniel commented 1 year ago

Yeah. I think that's also what's on the docker images that y'all are using and upgrading to something more modern is also breaking things. I worry updating that will break y'all

MattGal commented 1 year ago

@kunalspathak we support several different linux distros, not all of which may have a usable version of llvm-symbolizer. Would it be acceptable if this were only added to Ubuntu Helix machines, or do you need it everywhere? Odds are it's not going to work with some of our more unusual linuxes.

kunalspathak commented 1 year ago

> @kunalspathak we support several different linux distros, not all of which may have a usable version of llvm-symbolizer. Would it be acceptable if this were only added to Ubuntu Helix machines, or do you need it everywhere? Odds are it's not going to work with some of our more unusual linuxes.

@hoyosjs - what do you think?

hoyosjs commented 1 year ago

Updating the queues the runtime uses directly would be the first priority:

  • Ubuntu.1804.Amd64.Open
  • RedHat.7.Amd64.Open
  • OSX.1200.ARM64

We'll have to evaluate the helix containers, but those are much easier to update and we've even built the toolset in some of the containers historically.

hoyosjs commented 1 year ago

@MattGal do you know where the symbolizer might not be available? cc: @jkoritzinsky since this might be interesting for your *SAN work

MattGal commented 1 year ago

> @MattGal do you know where the symbolizer might not be available? cc: @jkoritzinsky since this might be interesting for your *SAN work

Offhand I'd venture it might not be available on old SLES or Mariner. It's one of those things we don't know until we try.

hoyosjs commented 1 year ago

Those don't tend to impact our priority scenario - the PR analysis checks

michellemcdaniel commented 1 year ago

PR to add them to the two linux based queues: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/27535

I think for OSX, we're going to have to get ddfun involved

michellemcdaniel commented 1 year ago

Opened https://portal.microsofticm.com/imp/v3/incidents/details/349676322/home to get llvm added to the OSX queue.

michellemcdaniel commented 1 year ago

(Moved to tracking while we wait for DDFun to update the systems)

JulieLeeMSFT commented 1 year ago

> (Moved to tracking while we wait for DDFun to update the systems)

@michellemcdaniel do we have a time estimate for when DDFun will update the systems?

michellemcdaniel commented 1 year ago

I do not. I know it's been assigned, but I haven't seen any movement on it. I will ping the ICM

michellemcdaniel commented 1 year ago

In general, it takes 1-2 weeks to get this many systems updated (100ish machines), and next week is Thanksgiving, so it's likely going to be at the longer end of that estimate.

kunalspathak commented 1 year ago

> PR to add them to the two linux based queues: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/27535

Does this roll out llvm to our linux helix queues? I kicked off a run on #77578 that would consume it and still see a failure about llvm-symbolizer not being present. See https://dev.azure.com/dnceng-public/public/_build/results?buildId=94545&view=ms.vss-test-web.build-test-results-tab .

michellemcdaniel commented 1 year ago

We did not have a rollout last week due to the US holiday. The linux changes should roll out this week.

michellemcdaniel commented 1 year ago

Heads up: DDFun says the OSX queue has been updated to have llvm on it

kunalspathak commented 1 year ago

I tried this out, but it seems there is still an issue.

Test Infrastructure Failure: System.ComponentModel.Win32Exception (2): An error occurred trying to start process 'llvm-symbolizer' with working directory '/private/tmp/helix/working/ADD7099B/w/A75E0909/e'. No such file or directory
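
Editor's sketch: the `Win32Exception (2)` with "No such file or directory" is error code 2 (`ENOENT`), meaning the name `llvm-symbolizer` could not be resolved to an executable when the process was started. The runtime's test wrapper is C#; the Python analogue below only illustrates the same failure mode, plus a guard that reports which `PATH` was searched. The helper name `try_start` is hypothetical and not part of any project in this thread.

```python
import os
import subprocess


def try_start(tool, args=()):
    """Start `tool`; return (ok, message) instead of raising when it is missing."""
    try:
        result = subprocess.run([tool, *args], capture_output=True, text=True)
    except FileNotFoundError as exc:
        # errno 2 (ENOENT): the name could not be resolved to an executable.
        searched = os.environ.get("PATH", "")
        return False, f"{tool!r} not found (errno {exc.errno}); PATH searched: {searched}"
    return True, result.stdout
```

Calling `try_start("llvm-symbolizer", ["--version"])` on a machine without the tool would return `False` and a message that includes the effective `PATH`, which is the piece of information missing from the error above.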

kunalspathak commented 1 year ago

https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-77578-merge-7245de3e3bb44b4383/JIT.Stress/1/console.ba62542f.log?helixlogtype=result

ulisesh commented 1 year ago

@kunalspathak the job was executed in the queue osx.1200.amd64.open but the request was to install llvm in OSX.1200.ARM64 so it is expected for it to not be available in the amd64 queue. In which queue do you need it?

kunalspathak commented 1 year ago

> was executed in the queue osx.1200.amd64.open but the request was to install llvm in OSX.1200.ARM64 so it is expected for it to not be available in the amd64 queue. In which queue do you need it?

I just noticed this from @hoyosjs . I think we also need it for OSX x64, right @hoyosjs ?

> Updating the queues the runtime uses directly would be the first priority:
>
>   • Ubuntu.1804.Amd64.Open
>   • RedHat.7.Amd64.Open
>   • OSX.1200.ARM64

hoyosjs commented 1 year ago

Yes, sorry - it would be needed on osx.*.*.open

jonfortescue commented 1 year ago

Results of investigation into creating a brewless LLVM artifact:

LLVM distributes a tarball of binaries for ARM64 macOS but not amd64. The only idea I have is that we could produce our own tar.xz or even our own pkg installer of amd64 darwin binaries (either built from source or brew installed locally) but that would be a massive pain to keep up-to-date since I don't think the vendors have access to mac hardware and I don't know that it's reasonable to have an FTE with a mac build and/or install llvm every three months.

cc/ @Chrisboh

Chrisboh commented 1 year ago

Let's add the install of this as part of the work DDFun has to do manually to set up a machine. @hoyosjs / @kunalspathak do understand that any time we need to change / update this it will take a considerable amount of time to change. Do you think this is something that will need to change often?

hoyosjs commented 1 year ago

Barring format changes on apple's behalf, I don't expect this to change often at all.

jonfortescue commented 1 year ago

Created https://portal.microsofticm.com/imp/v3/incidents/details/358905819/home to have DDFun do this for all mac open queues.

MattGal commented 1 year ago

@jonfortescue should this be closed and/or superseded by @ulisesh 's FR work?

ulisesh commented 1 year ago

The IcM I created for #12495 was to install mono-libgdiplus in the machines where it was missing. I didn't do any work for llvm-symbolizer

MattGal commented 1 year ago

> The IcM I created for #12495 was to install mono-libgdiplus in the machines where it was missing. I didn't do any work for llvm-symbolizer

Understood. Since all three (llvm/lldb, openSSL, and GDI) are installed via Brew, I had assumed we'd check all of them. Either way, since this has been open for over a month and the IcM is "complete", I'm closing it; product teams can complain if it's missing somewhere.

jonfortescue commented 1 year ago

The IcM is nearing completion, but isn't technically complete. However, it's probably fine to keep this closed.

hoyosjs commented 9 months ago

@AlitzelMendez @missymessa this is the issue that's behind https://github.com/dotnet/runtime/issues/91975. It's missing in the following queues:

The query is

AzureDevOpsTests
| where Repository == 'dotnet/runtime' and RunCompleted > ago(10d)
| where Message contains "An error occurred trying to start process 'llvm-symbolizer' with working directory"
| extend QueueAndContainer = trim(' ', substring(TestRunName, indexof(TestRunName, '@') + 1))
| summarize count() by QueueAndContainer

JulieLeeMSFT commented 9 months ago

Ping @dotnet/dnceng to take care of this issue ASAP please.

missymessa commented 9 months ago

@hoyosjs What is this blocking for y'all right now?

hoyosjs commented 8 months ago

@missymessa These three queues represent the largest buckets that don't have crashes automatically symbolicated. This is due to a lack of llvm-symbolizer in the base queue, which we can't modify. Without this, native crashes don't have a good way of surfacing information for build analysis.

garath commented 8 months ago

Thanks @hoyosjs and @JulieLeeMSFT. To be clear, this isn't blocking builds or preventing releases, but it is making it hard to diagnose test failures. Is there anything else we should know to help set priority? (Unfortunately our Ops team has a rather large backlog right now and we need to be very crisp to be sure we're handling issues in the best order.)

hoyosjs commented 8 months ago

These three and https://github.com/dotnet/arcade/issues/11868 are queues where we can't enable blocking on build analysis for runtime easily, since no crash info will be available for those.

garath commented 8 months ago

Is it correct that the llvm package on, for example, Ubuntu would include llvm-symbolizer?

garath commented 8 months ago

Ah, I misunderstood. I see that Ubuntu.2204.Amd64.Open was not part of the original request, so it's a "new install" rather than "why are these missing" for that queue.

As for the state of the MacOS queues... I'll have to dig a bit deeper there.

JulieLeeMSFT commented 8 months ago

Starting 3/19, we are blocking all PR merges on red in dotnet/runtime. It will be a big pain for developers if they don't get traces to debug the failure and unblock themselves to merge on green. We have worked on this feature for almost 2 years, and this is the last piece that needs to be in place to ensure a smooth developer experience when we enforce merge on green on 3/19. We have been requesting this for many months, so please prioritize this support.

> Thanks @hoyosjs and @JulieLeeMSFT. To be clear, this isn't blocking builds or preventing releases, but it is making it hard to diagnose test failures. Is there anything else we should know to help set priority? (Unfortunately our Ops team has a rather large backlog right now and we need to be very crisp to be sure we're handling issues in the best order.)

hoyosjs commented 8 months ago

> Is it correct that the llvm package on, for example, Ubuntu would include llvm-symbolizer?

On ubuntu that's likely enough for now. But for macOS it's likely very different :)

garath commented 8 months ago

> AzureDevOpsTests
> | where Repository == 'dotnet/runtime' and RunCompleted > ago(10d)
> | where Message contains "An error occurred trying to start process 'llvm-symbolizer' with working directory"
> | extend QueueAndContainer = trim(' ', substring(TestRunName, indexof(TestRunName, '@') + 1))
> | summarize count() by QueueAndContainer

@hoyosjs I'm not seeing any results from this query. Should it still be working?
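
Editor's sketch: for readers less familiar with the query language, the Python below approximates what the quoted query computes. It keeps rows whose message contains the start-process error, takes the part of `TestRunName` after the first `@`, trims spaces, and counts rows per resulting name. The repository and time-window filters are left out, and the row shape (dicts with `Message` and `TestRunName` keys) is an assumption made for illustration, not the real table schema.

```python
from collections import Counter

ERROR_FRAGMENT = (
    "An error occurred trying to start process 'llvm-symbolizer' with working directory"
)


def queue_and_container(test_run_name):
    """Mirror trim(' ', substring(TestRunName, indexof(TestRunName, '@') + 1))."""
    # str.find returns -1 when '@' is absent, so the slice starts at 0 and the
    # whole name is kept, which matches indexof(...) + 1 in the query.
    return test_run_name[test_run_name.find("@") + 1:].strip(" ")


def count_failures_by_queue(rows):
    """rows: iterable of dicts with 'Message' and 'TestRunName' keys (assumed shape)."""
    counts = Counter()
    for row in rows:
        # The query's `contains` operator is case-insensitive, so compare lowercased text.
        if ERROR_FRAGMENT.lower() in row["Message"].lower():
            counts[queue_and_container(row["TestRunName"])] += 1
    return counts
```

An empty result from the real query means no row in the selected window matched the message filter, so widening the time window or loosening the message text are the first things to try.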

garath commented 8 months ago

I don't have bandwidth to take up this issue yet, but in an effort to speed things up a bit I've opened a request to DDFUN asking them to check on the MacOS systems in question. I'll follow up here with the results. -- ICM 479938683

garath commented 8 months ago

@hoyosjs DDFUN spot checked a few machines in the MacOS queues and have confirmed that llvm-symbolizer is installed and should be available on the path. I asked them for the specific path to the bins and they found these:

AMD64: /usr/local/opt/llvm/bin/llvm-symbolizer
ARM64: /opt/homebrew/Cellar/llvm/15.0.7_1/bin/llvm-symbolizer

Does this match what you're seeing in your builds?
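
Editor's sketch: one way to tell "not installed" apart from "installed but not on PATH" is to try the normal `PATH` lookup first and then probe the locations DDFUN reported. The Python below is a minimal illustration under that assumption. The function name `find_symbolizer` is hypothetical, and the ARM64 Cellar path is version-specific, so it would go stale after an llvm upgrade.

```python
import os
import shutil

# Locations reported by DDFUN in this thread; the Cellar path is version-specific.
REPORTED_LOCATIONS = (
    "/usr/local/opt/llvm/bin/llvm-symbolizer",                 # AMD64
    "/opt/homebrew/Cellar/llvm/15.0.7_1/bin/llvm-symbolizer",  # ARM64
)


def find_symbolizer(name="llvm-symbolizer", fallbacks=REPORTED_LOCATIONS, path=None):
    """Return the first usable executable path, or None if nothing is found."""
    # path=None makes shutil.which use the PATH of the current process.
    found = shutil.which(name, path=path)
    if found:
        return found
    for candidate in fallbacks:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

If this returns one of the fallback locations rather than a `PATH` hit, the binary exists on the machine but the process that launches it cannot see it.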

hoyosjs commented 8 months ago

Are these on the path? I still see hits on runs from today:

AzureDevOpsTests
| where Repository endswith('runtime') and RunCompleted > ago(10d)
| where Message contains "'llvm-symbolizer' with working directory"
| extend QueueAndContainer = trim(' ', substring(TestRunName, indexof(TestRunName, '@') + 1))

Processing /cores/coredump.96726.dmp.crashreport.json
Printing stacktrace from '/cores/coredump.96726.dmp.crashreport.json'
Invoking llvm-symbolizer --pretty-print
Errors while running llvm-symbolizer --pretty-print
System.ComponentModel.Win32Exception (2): An error occurred trying to start process 'llvm-symbolizer' with working directory '/private/tmp/helix/working/B2090961/w/A22F08D1/e/Interop/Interop'. No such file or directory
   at System.Diagnostics.Process.ForkAndExecProcess(ProcessStartInfo startInfo, String resolvedFilename, String[] argv, String[] envp, String cwd, Boolean setCredentials, UInt32 userId, UInt32 groupId, UInt32[] groups, Int32& stdinFd, Int32& stdoutFd, Int32& stderrFd, Boolean usesTerminal, Boolean throwOnNoExec) in /_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/Process.Unix.cs:line 496
   at System.Diagnostics.Process.StartCore(ProcessStartInfo startInfo) in /_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/Process.Unix.cs:line 456
   at CoreclrTestLib.CoreclrTestWrapperLib.TryPrintStackTraceFromCrashReport(String crashReportJsonFile, TextWriter outputWriter)

garath commented 8 months ago

> Are these on the path? I still see hits on runs from today:

They've confirmed the right path is listed in /etc/paths.
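
Editor's sketch: a directory being listed in `/etc/paths` does not guarantee it is in the `PATH` of a particular process. As far as I know, macOS builds `PATH` from that file through `path_helper` in login-shell startup files, so a process started some other way may not inherit it. The Python below is a small diagnostic under that assumption: it compares the content of a paths file with a `PATH` string and lists what is missing. Both function names are hypothetical.

```python
import os


def read_paths_file(text):
    """Parse /etc/paths-style content: one directory per line, blanks ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def missing_from_process_path(paths_file_text, process_path):
    """Directories listed in the paths file but absent from the given PATH string."""
    active = set(filter(None, process_path.split(os.pathsep)))
    return [d for d in read_paths_file(paths_file_text) if d not in active]
```

Run from inside the test process, with the content of `/etc/paths` and `os.environ["PATH"]` as inputs, a non-empty result would point at the launch environment rather than at the installation.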

I've extracted a random sample of failing machines and asked for those to be checked to rule out an inconsistent configuration.

Your query gives a good view of failing cases but I wonder if we can establish if there have been any successful cases. Do you know of a message that would be printed if it was successful?
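
Editor's sketch: the console logs quoted in this thread suggest markers that could separate successful runs from failed ones. The failing macOS log prints "Errors while running llvm-symbolizer" after the "Invoking llvm-symbolizer" line, while the Linux container log quoted later in the thread prints "Stack trace:" instead. The Python below classifies one log on that basis. Treating those strings as stable markers is an assumption, and the category names are made up for illustration.

```python
INVOKE_MARKER = "Invoking llvm-symbolizer"
ERROR_MARKER = "Errors while running llvm-symbolizer"
SUCCESS_MARKER = "Stack trace:"


def classify_symbolication(log_text):
    """Return 'not-invoked', 'failed', 'symbolized', or 'unknown' for one console log."""
    if INVOKE_MARKER not in log_text:
        return "not-invoked"
    # Only look at what was printed after the first invocation line.
    after_invoke = log_text.split(INVOKE_MARKER, 1)[1]
    if ERROR_MARKER in after_invoke:
        return "failed"
    if SUCCESS_MARKER in after_invoke:
        return "symbolized"
    return "unknown"
```

Counting the `symbolized` results per queue would give the view of successful cases that the failure-only query cannot provide.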

hoyosjs commented 8 months ago

I tried looking - I see no successful invocations of it on macOS. On linux containers it looks like:

Processing /home/helixbot/dotnetbuild/dumps/coredump.2203.dmp.crashreport.json
Printing stacktrace from '/home/helixbot/dotnetbuild/dumps/coredump.2203.dmp.crashreport.json'
Invoking llvm-symbolizer --pretty-print
Stack trace:
----------------------------------
Thread Id: 0x89b
      Child SP               IP Call Site
 0x7ffca315f5f0 0x7f8585cfb1d8 libclrjit.so!?? at ??:0:0
 0x7ffca315f710 0x7f8585ea7d39 libclrjit.so!Compiler::impImportBlockCode(BasicBlock*) at /__w/1/s/src/coreclr/jit/importer.cpp:7987:56
 0x7ffca315f8f0 0x7f8585d0448a libclrjit.so!insTupleTypeInfos at emitxarch.cpp:0:0
 0x7ffca315f9f0 0x7f8585cfb479 libclrjit.so!?? at ??:0:0
 0x7ffca315fb10 0x7f8585e1ecce libclrjit.so!Compiler::fgSwitchToOptimized(char const*) at /__w/1/s/src/coreclr/jit/flowgraph.cpp:473:5
 0x7ffca315fb80 0x7f8585f616fa libclrjit.so!Compiler::fgMorphExpandCast(GenTreeCast*) at /__w/1/s/src/coreclr/jit/morph.cpp:562:9
 0x7ffca315fbb0
...

janvorli commented 8 months ago

Instead of llvm-symbolizer, macOS has the atos tool. An old note from my personal OneNote has an example:

atos -o artifacts/bin/coreclr/OSX.arm64.Debug/libcoreclr.dylib.dwarf 0x7ac654
EEStartupHelper() (in libcoreclr.dylib.dwarf) (ceemain.cpp:1001)
(use the dwarf file to get the source line)

Or, with -fullPath to get the full source path:
atos -o artifacts/bin/coreclr/OSX.arm64.Debug/libcoreclr.dylib.dwarf 0x7ac654 -fullPath
EEStartupHelper() (in libcoreclr.dylib.dwarf) (/Users/janvorli/git/runtime/src/coreclr/vm/ceemain.cpp:1001)
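
Editor's sketch: if atos output were consumed by tooling, each line would need to be split into its parts. The Python below parses only the form shown in the two examples above, `symbol (in module) (file:line)`. I believe atos prints other forms as well, for example when no source information is available, and those are reported here as unparsed. The function name `parse_atos_line` is hypothetical.

```python
import re

# Matches only the "symbol (in module) (file:line)" form shown above.
ATOS_LINE = re.compile(
    r"^(?P<symbol>.+) \(in (?P<module>[^)]+)\) \((?P<file>.+):(?P<line>\d+)\)$"
)


def parse_atos_line(line):
    """Split one atos output line into symbol, module, file and line number."""
    match = ATOS_LINE.match(line.strip())
    if match is None:
        return None  # any other output form, e.g. an unresolved address
    fields = match.groupdict()
    fields["line"] = int(fields["line"])
    return fields
```

For the first example above this yields the symbol `EEStartupHelper()`, the module `libcoreclr.dylib.dwarf`, the file `ceemain.cpp` and the line number 1001.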