Closed: peter-glotfelty closed this issue 1 year ago.
Tagging subscribers to this area: @tommcdon. See info in area-owners.md if you want to be subscribed.
Author: peter-glotfelty
Assignees: -
Labels: `area-Diagnostics-coreclr`, `untriaged`
Milestone: -
Is there any particular user you're using for this? Also, what's the CRI implementation and OCI runtime? Is it containerd/runc? Under regular Docker it works fine, so my first hunch is some access issue - potentially because the user isn't PTRACE enabled or because the seccomp profile disallows it.
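If it's the seccomp profile, one quick way to test that hunch (just a sketch, and it loosens security, so only for a throwaway test pod) is to run the pod unconfined:
# Pod-level securityContext in a test pod spec (sketch only)
securityContext:
  seccompProfile:
    type: Unconfined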
Hello @hoyosjs, the app runs as the default user in the mcr image which I assume is root. We're using containerd + runc:
$ /usr/bin/containerd --version
containerd github.com/containerd/containerd 1.5.11+azure-1 3df54a852345ae127d1fa3092b95168e4a88e2f8
$ /usr/bin/runc --version
runc version 1.0.3
commit: f46b6ba2c9314cfc8caae24a32ec5fe9ef1059fe
spec: 1.0.2-dev
go: go1.16.12
libseccomp: 2.5.1
Did a little more investigating of this issue:
Adding PTRACE to the container spec doesn't change anything as best I can tell:
securityContext:
  capabilities:
    add:
    - SYS_PTRACE
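For context, that sits at the container level of the pod spec, roughly like the sketch below (container and image names here are placeholders, not our actual spec):
containers:
- name: <application>
  image: <registry>/<application>:<tag>
  securityContext:
    capabilities:
      add:
      - SYS_PTRACE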
I notice that if dotnet is not running in a shell, it looks like it runs createdump twice (we don't get either though 😔):
[createdump] Gathering state for process 1 dotnet
[createdump] Crashing thread 00000001 signal 00000006
[createdump] Writing full dump to file /watson/cores/<process-name>-1
[createdump] Written 121438208 bytes (29648 pages) to core file
[createdump] Dump successfully written
[createdump] Gathering state for process 1 dotnet
[createdump] Crashing thread 00000001 signal 0000000b
[createdump] Writing full dump to file /watson/cores/<process-name>-1
[createdump] Written 121438208 bytes (29648 pages) to core file
[createdump] Dump successfully written
However, if it crashes under a start.sh script, that sequence is only printed once.
[createdump] Gathering state for process 7 dotnet
[createdump] Crashing thread 00000007 signal 00000006
[createdump] Writing full dump to file /watson/cores/<process-name>-7
[createdump] Written 121442304 bytes (29649 pages) to core file
[createdump] Dump successfully written
/app/start.sh: line 19: 7 Aborted (core dumped) dotnet <process>.dll
This might be a logging issue and nothing else, but it seems like it's maybe worth mentioning.
Are you exporting COMPlus_DbgMiniDumpName? Is /watson/cores/<process-name>-1 the actual core dump file path? Or was it edited to remove the actual process name?
Not sure why createdump is being run twice for the same process.
FYI, the /app/start.sh: line 19: 7 Aborted (core dumped) dotnet <process>.dll message comes from system core dumps being enabled.
This issue has been marked needs-author-action and may be missing some important information.
We are setting COMPlus_DbgMiniDumpName in the pod spec, and yes, I did format the above lines to remove the actual application names.
The full name includes the container and pod name as requested by the Azure Watson folks in this document:
/watson/cores/$(CONTAINER_NAME)_$(POD_NAME)()$(CONTAINER_NAME)-%d
Note: the empty () is intentional and suggested in the aforementioned document.
The reason I'm asking is that maybe the coredump isn't being written because of some invalid char in the name or path, but Linux has very few invalid file name chars, and an invalid name should just make the open fail. I didn't see where that doc recommended the () in the name. Could you try a simpler file name (no ())? You may want to use %e for the process id instead of %d (which is only supported for backwards compatibility).
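For example, a simpler variant of the existing setting might look like the sketch below (same directory, just without the (); the %d specifier is carried over from the current setup, so treat the exact format specifiers as an assumption to verify for your runtime version):
- name: COMPlus_DbgMiniDumpName
  value: "/watson/cores/$(CONTAINER_NAME)_$(POD_NAME)_$(CONTAINER_NAME)-%d"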
I did see this:
- name: COMPlus_DbgMiniDumpName
  value: "/cores/$(CONTAINER_POD_NAME_KEY)$(CONTAINER_NAME_SEPARATOR)<application name>-%d"
Oops, I didn't see that CONTAINER_NAME_SEPARATOR was defined as ().
We can definitely switch to %e as part of general housekeeping. I think we onboarded when most of our services were still on 3.1, and the guidance may have been different.
Just to check nothing weird was happening with the names, I changed () to __, and I still see the issue, so I don't think it's the naming convention.
If you are still on 3.1 (I assumed 6.0), then you should continue using %d.
Later .NET Core versions do have better diagnostic logging. The only thing I can come up with on this containerd/runc issue is that the createdump process doesn't have sufficient permissions to write the dump to the target directory, even though the dump open/writes don't fail. I'm grasping at straws here because I don't know this container stuff.
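One thing that might help rule that out is double-checking how /watson/cores gets into the container and whether the directory is writable from inside it. A sketch of what I'd expect the mount to look like (the volume name and hostPath below are assumptions on my part, not taken from your pod spec):
# container level (sketch)
volumeMounts:
- name: watson-cores
  mountPath: /watson/cores
# pod level (sketch)
volumes:
- name: watson-cores
  hostPath:
    path: /watson/cores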
We are on 6.0 now. (Sorry for the confusion)
I'm also not sure what else we can add. I added COMPlus_CreateDumpDiagnostics and created a few more dumps; I'll attach the logs below, but I don't think there's a smoking gun there either. Interesting-ish snippets:
// Broken Dumps
// .... lots of stuff before
[createdump] MODULE: 00007f283daa0000 dyn 0 inmem 0 file 0 pe 000056389bf21260 pdb 0000000000000000
[createdump] MODULE: timestamp bc25072e size 00058c00 869c333324504c78a0e7cd8cde34b6ac /usr/share/dotnet/shared/Microsoft.NETCore.App/6.0.7/System.Memory.dll
[createdump] MODULE: 00007f28b2f2d000 dyn 0 inmem 0 file 1 pe 000056389bf2e010 pdb 0000000000000000
[createdump] MODULE: timestamp 81f69e3c size 00008000 0ddc11c76170457a97cb32b070521b69 /usr/share/dotnet/shared/Microsoft.NETCore.App/6.0.7/System.Text.Encoding.Extensions.dll
[createdump] EnumerateManagedModules: Module enumeration FINISHED
[createdump] Unwind: thread 0001
[createdump] GetMemoryRegionFlags: FAILED
[createdump] Unwind: managed frames
[createdump] Unwind: found managed exception
[createdump] Unwind: exception object 0x7f28140096a8 exception hresult 80131500
[createdump] Unwind: exception type System.Exception
[createdump] GetMemoryRegionFlags: FAILED
[createdump] Unwind: thread 0008
[createdump] Unwind: thread 0009
[createdump] Unwind: thread 000a
[createdump] Unwind: thread 000b
[createdump] Unwind: thread 000c
[createdump] Unwind: managed frames
[createdump] Unwind: thread 000e
[createdump] Unwind: managed frames
[createdump] CombineMemoryRegions: STARTED
[createdump] CombineMemoryRegions: FINISHED
[createdump] Writing full dump to file /watson/cores/telemetry-test-driver_telemetry-test-driver-599b8d654b-g8fjh()telemetry-test-driver-1
[createdump] Writing memory region headers to core file
[createdump] Writing process information to core file
[createdump] Writing 20 auxv entries to core file
[createdump] Writing 141 NT_FILE entries to core file
[createdump] Writing 7 thread entries to core file
[createdump] Writing 169 memory regions to core file
[createdump] Written 121458688 bytes (29653 pages) to core file
[createdump] Dump successfully written
[createdump] Gathering state for process 1 dotnet // <---- Goes immediately into taking a second dump.
[createdump] Crashing thread 00000001 signal 0000000b
[createdump] Thread 0001 RIP 00007f28b6f12207 RSP 00007f28b73654f0
[createdump] Thread 0008 RIP 00007f28b6f3a3ff RSP 00007f28b6637da0
// .... lots more stuff
[createdump] MODULE: 00007f283daa0000 dyn 0 inmem 0 file 0 pe 000056389bf21260 pdb 0000000000000000
[createdump] MODULE: timestamp bc25072e size 00058c00 869c333324504c78a0e7cd8cde34b6ac /usr/share/dotnet/shared/Microsoft.NETCore.App/6.0.7/System.Memory.dll
[createdump] MODULE: 00007f28b2f2d000 dyn 0 inmem 0 file 1 pe 000056389bf2e010 pdb 0000000000000000
[createdump] MODULE: timestamp 81f69e3c size 00008000 0ddc11c76170457a97cb32b070521b69 /usr/share/dotnet/shared/Microsoft.NETCore.App/6.0.7/System.Text.Encoding.Extensions.dll
[createdump] EnumerateManagedModules: Module enumeration FINISHED
[createdump] Unwind: thread 0001
[createdump] Unwind: managed frames // <--- Skips "GetMemoryRegionFlags" the second time
[createdump] Unwind: found managed exception
[createdump] Unwind: exception object 0x7f28140096a8 exception hresult 80131500
[createdump] Unwind: exception type System.Exception
[createdump] Unwind: thread 0008
[createdump] Unwind: thread 0009
[createdump] Unwind: thread 000a
[createdump] Unwind: thread 000b
[createdump] Unwind: thread 000c
[createdump] Unwind: managed frames
[createdump] Unwind: thread 000e
[createdump] Unwind: managed frames
[createdump] CombineMemoryRegions: STARTED
[createdump] CombineMemoryRegions: FINISHED
[createdump] Writing full dump to file /watson/cores/telemetry-test-driver_telemetry-test-driver-599b8d654b-g8fjh()telemetry-test-driver-1
[createdump] Writing memory region headers to core file
[createdump] Writing process information to core file
[createdump] Writing 20 auxv entries to core file
[createdump] Writing 141 NT_FILE entries to core file
[createdump] Writing 7 thread entries to core file
[createdump] Writing 169 memory regions to core file
[createdump] Written 121458688 bytes (29653 pages) to core file
[createdump] Dump successfully written
createdumplogs-without-errors.txt createdumplogs-with-errors.txt
Hi @peter-glotfelty sorry for the delay on this issue. It seems there are two separate issues here:
- No dump is generated by createdump (presumably this is because the right env variables are not set?)
- When the container is running with a shell, createdump is executed twice. As far as we know there is no code in the runtime that could explain this behavior. Would it be possible to share the start.sh script?
Not quite, I only see this behavior when the container is running without a shell. Basically, it's another symptom we see in addition to your first bullet point so I suspect they are related.
Our startup script is pretty simple:
#!/bin/bash
_term() {
  kill "$child"
}
# We need to make sure that when k8s terminates the pod, we stop
# the child process
trap _term TERM
# Original Entrypoint.
dotnet TelemetryTestService.dll &
child="$!"
wait "$!"
@peter-glotfelty
I tried reproing this under AKS on an Ubuntu node with the following settings:
Container recipe:
FROM mcr.microsoft.com/dotnet/runtime:6.0-bullseye-slim
ARG source=./
WORKDIR /app
# This is just the result of `dotnet publish`
COPY $source .
ENTRYPOINT ["dotnet", "MyApplication.dll"]
Container spec:
containers:
- name: consoleapp
  image: hoyosjs.azurecr.io/dotnetdump/container-shell:entry
  env:
  - name: COMPlus_DbgEnableMiniDump
    value: "1"
And I get:
juhoyosa@TARDIS-DEV::publish> kubectl logs consoleapp-deployment-5b685b47cf-bqd26
Unhandled exception. System.Exception: Crashed on startup
at MyApplication.Program.Main(String[] args) in /home/mikem/builds/dockertest/Program.cs:line 10
[createdump] Gathering state for process 1 dotnet
[createdump] Crashing thread 00000001 signal 00000006
[createdump] Writing minidump with heap to file /tmp/coredump.1
[createdump] Written 62087168 bytes (15158 pages) to core file
[createdump] Target process is alive
[createdump] Dump successfully written
That's with a dotnet entrypoint, so something seems to be different. I couldn't use Mariner since I'd have to use a DaemonSet to work around https://github.com/dotnet/diagnostics/issues/3423. The AKS team is working on deploying the containerd fix now. Do you think it could be related to that?
No longer repro.
Description
We're an MSFT internal team using dotnet on Linux in AKS. We're onboarded to Azure Watson, and some of our teams are looking to migrate to distroless containers for our applications. We have noticed that collecting core dumps on crashes does not work correctly unless we start dotnet inside a shell script: running dotnet directly as the container entrypoint doesn't generate the correct core dumps, however starting it via start.sh does, where start.sh is a small wrapper that forwards signals and calls dotnet itself. We're expecting 2 core dumps to be taken.
However, if we don't run with a shell, we typically see 0 dumps when the app crashes (we do occasionally see 1 dump, which is weird).
Reproduction Steps
The app code seems to be irrelevant; anything that crashes the process seems to work:
Then a Dockerfile like this one:
We then deploy it to AKS in a pod with COMPlus_DbgEnableMiniDump=1 and COMPlus_DbgMiniDumpType=4, and the azure-watson agent running on the node.
Expected behavior
createdump takes a dump of the managed heap and the watson agent finds it and uploads it to the portal.
Actual behavior
Generally, no dump appears. Occasionally, a dump will show up without symbols. One thing that is a little notable is that these dumps are typically in a SIGSEGV bucket, whereas dumps that come from containers running from a shell are almost always in a SIGABRT bucket.
Regression?
n/a
Known Workarounds
As mentioned above, the issue only comes up if we are running dotnet as the entrypoint in our container. If the entrypoint is a bash script that starts dotnet, everything works as expected, and our core dumps are properly taken and uploaded. We haven't tried other shells or init executables.
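For anyone wanting to try the wrapper workaround without rebuilding the image, a hedged sketch of the same idea expressed as a pod-spec command override (names and paths are illustrative; in practice we bake start.sh into the image as the ENTRYPOINT):
containers:
- name: <application>
  image: <registry>/<application>:<tag>
  # Overrides the image ENTRYPOINT so dotnet runs as a child of bash instead of as PID 1
  command: ["/bin/bash", "/app/start.sh"]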
Configuration
We're using mcr.microsoft.com/dotnet/runtime:6.0-bullseye-slim and I believe we've seen this with the aspnet version as well.
Host OS: Ubuntu 18.04.6 LTS
Kernel version: 5.4.0-1078-azure
Arch: x64
Other information
No response