Closed kunalspathak closed 3 months ago
Hi Kunal, we will get on this. @hoyosjs do you know if this just comes built in with llvm? lldb 3.9 is already being installed on the base ubuntu.1804 queues. Do you need a different version? This is the test queue, so I don't think it would be an issue to upgrade that to something newer, but I'd like to check before making any major changes.
Do you know why 3.9? And llvm sounds good.
I do not know why 3.9. Possibly historic reasons? @MattGal it looks like we set our lldb version to 3.9 back in 2020. Do you know why we're using that?
Edit: Oh, actually, we set this in 2019.
Edit: that is also a lie. I am still digging into how long ago we chose 3.9 and never updated it.
Probably for diagnostics...
Yeah. I think that's also what's on the docker images that y'all are using and upgrading to something more modern is also breaking things. I worry updating that will break y'all
@kunalspathak we support several different linux distros, not all of which may have a usable version of llvm-symbolizer. Would it be acceptable if this were only added to Ubuntu Helix machines, or do you need it everywhere? Odds are it's not going to work with some of our more unusual linuxes.
@hoyosjs - what do you think?
Updating the queues the runtime uses directly would be the first priority:
We'll have to evaluate the helix containers, but those are much easier to update and we've even built the toolset in some of the containers historically.
@MattGal do you know where the symbolizer might not be available? cc: @jkoritzinsky since this might be interesting for your *SAN work
Offhand I'd venture it might not be available on old SLES or Mariner. It's one of those things we don't know until we try.
Those don't tend to impact our priority scenario - the PR analysis checks
PR to add them to the two linux based queues: https://dev.azure.com/dnceng/internal/_git/dotnet-helix-machines/pullrequest/27535
I think for OSX, we're going to have to get ddfun involved
Opened https://portal.microsofticm.com/imp/v3/incidents/details/349676322/home to get llvm added to the OSX queue.
(Moved to tracking while we wait for DDFun to update the systems)
@michellemcdaniel do we have a time estimate for when DDFun will update the systems?
I do not. I know it's been assigned, but I haven't seen any movement on it. I will ping the ICM
In general, it takes 1-2 weeks to get this many systems updated (100ish machines), and next week is Thanksgiving, so it's likely going to be at the longer end of that estimate.
Does this roll out llvm to our linux helix queues? I kicked off a run on #77578 that would consume it and still see a failure about llvm-symbolizer not being present. See https://dev.azure.com/dnceng-public/public/_build/results?buildId=94545&view=ms.vss-test-web.build-test-results-tab .
We did not have a rollout last week due to the US holiday. The linux changes should rollout this week.
Heads up: DDFun says the OSX queue has been updated to have llvm on them
I tried this out, but it seems there is still an issue.
Test Infrastructure Failure: System.ComponentModel.Win32Exception (2): An error occurred trying to start process 'llvm-symbolizer' with working directory '/private/tmp/helix/working/ADD7099B/w/A75E0909/e'. No such file or directory
@kunalspathak the job was executed in the queue osx.1200.amd64.open but the request was to install llvm in OSX.1200.ARM64 so it is expected for it to not be available in the amd64 queue. In which queue do you need it?
I just noticed this from @hoyosjs . I think we also need it for OSX x64, right @hoyosjs ?
Updating the queues the runtime uses directly would be the first priority:
- Ubuntu.1804.Amd64.Open
- RedHat.7.Amd64.Open
- OSX.1200.ARM64
Yes, sorry - it would be needed on osx.*.*.open
Results of investigation into creating a brewless LLVM artifact:
LLVM distributes a tarball of binaries for ARM64 macOS but not amd64. The only idea I have is that we could produce our own tar.xz or even our own pkg installer of amd64 darwin binaries (either built from source or brew installed locally) but that would be a massive pain to keep up-to-date since I don't think the vendors have access to mac hardware and I don't know that it's reasonable to have an FTE with a mac build and/or install llvm every three months.
cc/ @Chrisboh
Let's add the install of this as part of the work DDFun has to do manually to set up a machine. @hoyosjs / @kunalspathak please understand that any time we need to change or update this, it will take a considerable amount of time. Do you think this is something that will need to change often?
Barring format changes on apple's behalf, I don't expect this to change often at all.
Created https://portal.microsofticm.com/imp/v3/incidents/details/358905819/home to have DDFun do this for all mac open queues.
@jonfortescue should this be closed and/or superseded by @ulisesh 's FR work?
The IcM I created for #12495 was to install mono-libgdiplus in the machines where it was missing. I didn't do any work for llvm-symbolizer
Understood. Since all three (llvm/lldb, openSSL, and GDI) are installed via Brew, I had assumed we'd check all of them. Either way, since this has been open for over a month and the IcM is "complete", I'm closing it; product teams can complain if it's missing somewhere.
The IcM is nearing completion, but isn't technically complete. However, it's probably fine to keep this closed.
@AlitzelMendez @missymessa this is the issue that's behind https://github.com/dotnet/runtime/issues/91975. It's missing in the following queues:
The query is
AzureDevOpsTests
| where Repository == 'dotnet/runtime' and RunCompleted > ago(10d)
| where Message contains "An error occurred trying to start process 'llvm-symbolizer' with working directory"
| extend QueueAndContainer = trim(' ', substring(TestRunName, indexof(TestRunName, '@') + 1))
| summarize count() by QueueAndContainer
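For readers less familiar with Kusto, the last two steps of the query above (extracting the queue name after the `@` and counting hits per queue) can be sketched in Python. This is only an illustration: the helper names `queue_and_container` and `count_by_queue` are hypothetical, and the sketch assumes, as the query implies, that the queue name follows an `@` in the test run name.

```python
from collections import Counter


def queue_and_container(test_run_name: str) -> str:
    """Return the text after the first '@', with surrounding spaces trimmed."""
    # str.find returns -1 when '@' is absent, so start becomes 0 and the
    # whole name is kept in that case.
    start = test_run_name.find("@") + 1
    return test_run_name[start:].strip(" ")


def count_by_queue(test_run_names):
    """Count how many run names map to each queue/container string."""
    return Counter(queue_and_container(name) for name in test_run_names)
```

Feeding it a list of run names from failing tests would give a per-queue tally comparable to the `summarize count() by QueueAndContainer` result.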
Ping @dotnet/dnceng to take care of this issue ASAP please.
@hoyosjs What is this blocking for y'all right now?
@missymessa These three queues represent the largest buckets that don't have crashes automatically symbolicated. This is due to a lack of llvm-symbolizer in the base queue, which we can't modify. Without this, native crashes don't have a good way of surfacing information for build analysis.
Thanks @hoyosjs and @JulieLeeMSFT. To be clear, this isn't blocking builds or preventing releases, but it is making it hard to diagnose test failures. Is there anything else we should know to help set priority? (Unfortunately our Ops team has a rather large backlog right now and we need to be very crisp to be sure we're handling issues in the best order.)
These three and https://github.com/dotnet/arcade/issues/11868 are queues where we can't enable blocking on build analysis for runtime easily, since no crash info will be available for those.
Is it correct that the llvm package on, for example, Ubuntu would include llvm-symbolizer?
Ah, I misunderstood. I see that Ubuntu.2204.Amd64.Open was not part of the original request, so it's a "new install" rather than "why are these missing" for that queue.
As for the state of the MacOS queues... I'll have to dig a bit deeper there.
We are blocking all PR merges on red starting 3/19 in dotnet/runtime. It will be a big pain for developers if they don't get traces to debug the failure and unblock themselves to merge on green. We have worked on this feature for almost 2 years, and this is the last piece that needs to be in place to ensure a smooth developer experience when we enforce merge on green on 3/19. We have been requesting this feature for many months, so please prioritize this support.
On ubuntu that's likely enough for now. But for macOS it's likely very different :)
@hoyosjs I'm not seeing any results from this query. Should it still be working?
I don't have bandwidth to take up this issue yet, but in an effort to speed things up a bit I've opened a request to DDFUN asking them to check on the MacOS systems in question. I'll follow-up here with the results. -- ICM 479938683
@hoyosjs DDFUN spot checked a few machines in the MacOS queues and have confirmed that llvm-symbolizer is installed and should be available on the path. I asked them for the specific path to the bins and they found these:
AMD64: /usr/local/opt/llvm/bin/llvm-symbolizer
ARM64: /opt/homebrew/Cellar/llvm/15.0.7_1/bin/llvm-symbolizer
Does this match what you're seeing in your builds?
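One way to check this on a given machine is to ask whether `llvm-symbolizer` resolves by bare name, which is how the test wrapper starts it, and only then look at fixed locations. The sketch below is a hypothetical helper, not something Helix or the runtime tests use; the two fallback paths are simply the ones DDFun reported above and may differ on other machines.

```python
import os
import shutil

# Locations reported by DDFun in this thread; other machines may differ.
REPORTED_LOCATIONS = (
    "/usr/local/opt/llvm/bin/llvm-symbolizer",                 # AMD64
    "/opt/homebrew/Cellar/llvm/15.0.7_1/bin/llvm-symbolizer",  # ARM64
)


def find_symbolizer(name="llvm-symbolizer", path=None, fallbacks=REPORTED_LOCATIONS):
    """Return a usable path to the tool, or None if it cannot be found."""
    # First check whether an executable with this name is on PATH
    # (or on the explicit search path passed in by the caller).
    found = shutil.which(name, path=path)
    if found:
        return found
    # Otherwise try the fixed locations, accepting only executable files.
    for candidate in fallbacks:
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None
```

If this returns a fallback location rather than a PATH hit, the binary is installed but not visible to a process started by bare name, which would be consistent with the "No such file or directory" failures.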
Are these on the path? I still see hits on runs from today:
AzureDevOpsTests
| where Repository endswith('runtime') and RunCompleted > ago(10d)
| where Message contains "'llvm-symbolizer' with working directory"
| extend QueueAndContainer = trim(' ', substring(TestRunName, indexof(TestRunName, '@') + 1))
Processing /cores/coredump.96726.dmp.crashreport.json
Printing stacktrace from '/cores/coredump.96726.dmp.crashreport.json'
Invoking llvm-symbolizer --pretty-print
Errors while running llvm-symbolizer --pretty-print
System.ComponentModel.Win32Exception (2): An error occurred trying to start process 'llvm-symbolizer' with working directory '/private/tmp/helix/working/B2090961/w/A22F08D1/e/Interop/Interop'. No such file or directory
at System.Diagnostics.Process.ForkAndExecProcess(ProcessStartInfo startInfo, String resolvedFilename, String[] argv, String[] envp, String cwd, Boolean setCredentials, UInt32 userId, UInt32 groupId, UInt32[] groups, Int32& stdinFd, Int32& stdoutFd, Int32& stderrFd, Boolean usesTerminal, Boolean throwOnNoExec) in /_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/Process.Unix.cs:line 496
at System.Diagnostics.Process.StartCore(ProcessStartInfo startInfo) in /_/src/libraries/System.Diagnostics.Process/src/System/Diagnostics/Process.Unix.cs:line 456
at CoreclrTestLib.CoreclrTestWrapperLib.TryPrintStackTraceFromCrashReport(String crashReportJsonFile, TextWriter outputWriter)
They've confirmed the right path is listed in /etc/paths.
I've extracted a random sample of failing machines and asked for those to be checked to rule out an inconsistent configuration.
Your query gives a good view of failing cases but I wonder if we can establish if there have been any successful cases. Do you know of a message that would be printed if it was successful?
I tried looking - I see no successful invocations of it on macOS. On linux containers it looks like:
Processing /home/helixbot/dotnetbuild/dumps/coredump.2203.dmp.crashreport.json
Printing stacktrace from '/home/helixbot/dotnetbuild/dumps/coredump.2203.dmp.crashreport.json'
Invoking llvm-symbolizer --pretty-print
Stack trace:
----------------------------------
Thread Id: 0x89b
Child SP IP Call Site
0x7ffca315f5f0 0x7f8585cfb1d8 libclrjit.so!?? at ??:0:0
0x7ffca315f710 0x7f8585ea7d39 libclrjit.so!Compiler::impImportBlockCode(BasicBlock*) at /__w/1/s/src/coreclr/jit/importer.cpp:7987:56
0x7ffca315f8f0 0x7f8585d0448a libclrjit.so!insTupleTypeInfos at emitxarch.cpp:0:0
0x7ffca315f9f0 0x7f8585cfb479 libclrjit.so!?? at ??:0:0
0x7ffca315fb10 0x7f8585e1ecce libclrjit.so!Compiler::fgSwitchToOptimized(char const*) at /__w/1/s/src/coreclr/jit/flowgraph.cpp:473:5
0x7ffca315fb80 0x7f8585f616fa libclrjit.so!Compiler::fgMorphExpandCast(GenTreeCast*) at /__w/1/s/src/coreclr/jit/morph.cpp:562:9
0x7ffca315fbb0
...
Instead of a symbolizer, macOS has the atos tool. An old note from my personal OneNote has an example:
atos -o artifacts/bin/coreclr/OSX.arm64.Debug/libcoreclr.dylib.dwarf 0x7ac654
EEStartupHelper() (in libcoreclr.dylib.dwarf) (ceemain.cpp:1001)
(use the dwarf file to get the source line)
Or
atos -o artifacts/bin/coreclr/OSX.arm64.Debug/libcoreclr.dylib.dwarf 0x7ac654 -fullPath
EEStartupHelper() (in libcoreclr.dylib.dwarf) (/Users/janvorli/git/runtime/src/coreclr/vm/ceemain.cpp:1001)
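If the test wrapper were to shell out to atos on macOS, the two invocations above could be assembled roughly as follows. This is a sketch under assumptions: `atos_command` is a hypothetical helper, and it uses only the `-o` and `-fullPath` options that appear in the note above.

```python
import shlex


def atos_command(dwarf_file, address, full_path=False):
    """Build the argument list for the atos invocations shown above."""
    # hex() renders the address in the 0x... form used in the note.
    argv = ["atos", "-o", dwarf_file, hex(address)]
    if full_path:
        argv.append("-fullPath")
    return argv


cmd = atos_command(
    "artifacts/bin/coreclr/OSX.arm64.Debug/libcoreclr.dylib.dwarf",
    0x7AC654,
    full_path=True,
)
print(shlex.join(cmd))
```

Returning a list rather than a single string keeps the arguments separate, so the result could be passed to a process launcher without extra quoting.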
Build
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-77578-merge-965165820fec43e19e/JIT.Stress/1/console.f7c5d70b.log?helixlogtype=result
https://dev.azure.com/dnceng-public/public/_build/results?buildId=82793&view=ms.vss-test-web.build-test-results-tab&runId=1731386&resultId=102137&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab
Pull Request
https://github.com/dotnet/runtime/pull/77578
Action required for the engineering services team
Additional information about the issue reported
To triage this issue (First Responder / @dotnet/dnceng):
In https://github.com/dotnet/runtime/pull/77578, we are trying to generate the crash stacktrace using llvm-symbolizer. While it is present in containers, the base Linux and macOS queues don't have it, and we see an error when using it. See the logs I referenced in the issue. Can we get it and lldb installed on the base image? CC: @hoyosjs @JulieLeeMSFT
Release Note Category
Release Note Description
Add llvm and llvm-symbolizer to Ubuntu.1804.Amd64 and RedHat.7.Amd64