Closed: aullom1 closed this issue 2 years ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
We observe the exact same behaviour, even for a minimal console application printing nothing but "Hello World" to the console.
Edit: this might be related: https://giters.com/Azure/azure-functions-dotnet-worker/issues/727
Interesting. As of 12 days ago the image for function apps was fixed, but no other images have been fixed.
We will try to repro this issue. Does this repro even if you configure with 1000+ millicores. Wonder if something under a 1000 is causing an issue.
I tried it again with req/limit of 1000/3000 with the same result. I wasn't too hopeful since the nonprod env has no trouble with 100/300.
I am not able to repro this on a simple console app. Is it possible to capture a dump using dotnet-dump so we could investigate further?
This is the config I tried:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: testalloc
  labels:
    app: k8stest
spec:
  replicas: 1
  selector:
    matchLabels:
      app: k8stest
  template:
    metadata:
      name: k8stest
      labels:
        app: k8stest
    spec:
      containers:
      - name: k8stest
        image: <youracr>.azurecr.io/k8stest:v7
        resources:
          limits:
            cpu: 3000m
            memory: 7Gi
          requests:
            cpu: 1000m
            memory: 7Gi
        ports:
        - containerPort: 80
```
We also have a sample app which has been ported over to use .NET 6: https://github.com/dotnet-architecture/eShopOnContainers
When I log in to the pod and try to install dotnet-dump (https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-dump) using `dotnet tool install --global dotnet-dump`, I get the same problem :-) Nothing seems to happen: there is no output and the process never finishes...
I get that behavior for any `dotnet` command that isn't related to getting help or printing info about the runtime. It's like something is messed up with certain configurations of AKS.
I pulled some more info on various clusters. I'm wondering if the problem is with the node image version. None of my clusters have the same node image version.
Working

Kubernetes | Node image version | Container runtime version
---|---|---
1.21.7 | AKSUbuntu-1804gen2containerd-2021.12.07 | containerd://1.4.9+azure
1.21.7 | AKSUbuntu-1804gen2containerd-2022.01.19 | containerd://1.4.9+azure
1.20.9 | AKSUbuntu-1804gen2containerd-2021.10.02 | containerd://1.4.9+azure

Not working

Kubernetes | Node image version | Container runtime version
---|---|---
1.21.7 | AKSUbuntu-1804gen2containerd-2022.01.07 | containerd://1.4.9+azure
1.20.9 | AKSUbuntu-1804gen2containerd-2021.10.19 | containerd://1.4.9+azure
Yeah, it would be interesting to check if there is any correlation where the issue occurs. @CaveSven could you please check if your node versions match any from @aullom1?
We do not see such a clear distinction, meaning we have combinations of node version/ container runtime version that seem to succeed and fail:
Working

Kubernetes: 1.13 | Node image version: Ubuntu 16.04.7 LTS 4.15.0-1113-azure | Container runtime: docker://19.3.14

Not working

Kubernetes: 1.13 | Node image version: Ubuntu 16.04.6 LTS 4.15.0-1077-azure | Container runtime: docker://3.0.10+azure
Kubernetes: 1.13 | Node image version: Ubuntu 16.04.7 LTS 4.15.0-1113-azure | Container runtime: docker://19.3.14 (notice: same as "working")
Kubernetes: 1.13 | Node image version: Ubuntu 16.04.6 LTS 4.15.0-1113-azure | Container runtime: docker://3.0.10+azure
Edit: Upon checking a second time, I do get different results, see https://github.com/dotnet/runtime/issues/64760#issuecomment-1047581461
Related: https://stackoverflow.com/questions/70973535/net-6-container-not-starting-properly-in-aks
That was me :)
I also have a support request open with Microsoft. My last resort is to ask Batman for help.
Adding @tommcdon @mikem8361 in case they have ideas on how to get dumps in an AKS container, since `dotnet tool install` isn't working due to the same issue. Perhaps you can copy `createdump` manually to the container?
`createdump` is part of the .NET runtime install, not the tool install. It can be executed directly with `sudo createdump <pid>`. See https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/botr/xplat-minidump-generation.md#configurationpolicy ("Command Line Usage") for more information.
My container is locked down, and I can't run anything with `sudo`. And `ps` is not installed, so I couldn't get the pid. I will need to work on deploying a new container, but I'm guessing my container security solution is going to block anything that is excessively permissive.
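As an aside, even without `ps`, process ids and command lines are readable straight from `/proc` on any Linux container. A minimal sketch (just standard shell, nothing from the original thread):

```shell
# ps is often absent from slim images, but PIDs and command lines are
# always readable straight from /proc on Linux.
for pid in /proc/[0-9]*; do
  cmd=$(tr '\0' ' ' < "$pid/cmdline" 2>/dev/null)
  if [ -n "$cmd" ]; then
    echo "${pid#/proc/}: $cmd"
  fi
done
```

This prints one line per process, e.g. the pod's main dotnet process, which is enough to feed a pid to createdump or dotnet-dump.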
If the process is throwing a managed unhandled exception at startup: https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dumps#collect-dumps-on-crash
Environment variable | Description | Default value |
---|---|---|
COMPlus_DbgEnableMiniDump | If set to 1, enable core dump generation. | 0 |
COMPlus_DbgMiniDumpType | Type of dump to be collected. For more information, see the table below | 2 (MiniDumpWithPrivateReadWriteMemory) |
COMPlus_DbgMiniDumpName | Path to a file to write the dump to. | /tmp/coredump. |
COMPlus_CreateDumpDiagnostics | If set to 1, enable diagnostic logging of dump process. | 0 |
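For Kubernetes specifically, these variables would go into the container spec. A minimal sketch, assuming the deployment from earlier in the thread; the `/dumps` path and the `dumps` volume are illustrative assumptions, not from the original thread:

```yaml
# Illustrative fragment for the container spec; an emptyDir volume named
# "dumps" would also need to be declared under the pod spec so the dump
# survives a container restart.
env:
- name: COMPlus_DbgEnableMiniDump
  value: "1"
- name: COMPlus_DbgMiniDumpType
  value: "4"                 # 4 = full dump
- name: COMPlus_DbgMiniDumpName
  value: /dumps/coredump
- name: COMPlus_CreateDumpDiagnostics
  value: "1"
volumeMounts:
- name: dumps
  mountPath: /dumps
```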
@tommcdon I edited my configmap to include those settings, but I didn't get a coredump log. I also tried running the app from within the container to see if that would give me anything. The `dotnet` command just hung, and I still didn't get a coredump log.
What do you mean by "coredump log"? The actual coredump file? Or the logging output from the createdump coredump generation? For the latter, the console output of the app process needs to be captured.
There was no console output, and there was no file generated at /tmp/coredump.
I'm assuming this means the app never even attempted to start.
@aullom1 Thanks - I wasn't sure if this was a crash or hang. Can you try running `dotnet-dump collect --process-id <pid>` inside the pod to collect a dump of the process? The documentation page contains single-file framework-dependent download links. The runtime can be installed using dotnet-install, and as long as it is on the path, `dotnet-dump collect` should hopefully be able to collect a dump.
@aullom1 this could point towards a configuration issue inside the pod. Do you see any useful system log output using journalctl?
@tommcdon I'll need to work with someone else to get an updated image pushed to our ACR. It probably won't be until next week that it happens.
For anyone following along who wants to add `dotnet-dump` to their final image, you might need to do something like this from the SDK image.
```dockerfile
FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS base
...
FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build
...
FROM build AS publish
# Get the dotnet-dump command
RUN wget -O /bin/dotnet-dump https://aka.ms/dotnet-dump/linux-x64 && \
    chmod +x /bin/dotnet-dump
...
FROM base AS final
COPY --from=publish /bin/dotnet-dump /bin/dotnet-dump
```
@tommcdon It seems I can't run dotnet-dump in my container. I either get kicked out, or the process gets killed. I tried it in another cluster where .NET 6 works, and I had no problem.
This was my attempt in the non-functional cluster.
```
AtlasAdmin@atlastestpage-6795df47b6-jb2n4:/app$ dotnet-dump ps
command terminated with exit code 137
aullom1@WMDW10750825:/c/git/AaronTest/AaronTest.Api$ k -n atlastestpage exec --tty --stdin atlastestpage-6795df47b6-jb2n4 -- /bin/bash
AtlasAdmin@atlastestpage-6795df47b6-jb2n4:/app$ dotnet-dump --help
Killed
```
Do you happen to know what version of the kernel and what seccomp/apparmor profiles have been used for deployment? dotnet-dump requires ptrace, and it tries to use a couple of read calls using `process_vm_readv` that decay to `pread64` if the call is not allowed in the seccomp profile.
I don't think we're using apparmor. I should see an annotation for something like this, right? I don't see one in the pod spec.
container.apparmor.security.beta.kubernetes.io/v5: runtime/default
RE: dotnet-dump Do you know if you modify the seccomp profile at all? I need to check what AKS's OCR does for the defaults (I can't recall what containerd does here). Also, you mentioned that your test environment works just fine. Does dotnet-dump work there? And is this test environment also in AKS? If so, what's the difference between that and prod? Mostly just PrismaCloud and such?
As for your startup: The table you posted seems to hint that same version of K8's + same OCR works on some and fails on others, right? What are the differences between the container images?
As far as I know we are not modifying seccomp. I don't really know what this is/does, so I checked with the team that defined the AKS infra, and they are of the opinion that we don't mess with it.
In the test environment (also AKS) dotnet-dump works just as expected. I have compared the ARM between a working and not working AKS environment, and the only differences are with names, IDs, and other unique fields. None of the meaningful configuration is different. I repeated the process with various manifests from Kubernetes (app deployments, ingresses, services, configmaps, pod identities), and again there are no differences.
As for PrismaCloud, nothing stays the same in IT, and we have now migrated to Aqua. Our test and prod environments are covered by the same policies in Aqua, and we see nothing in Aqua to indicate the image or container were blocked. The image itself resides in our private ACR, and it is pulled from that ACR for both test and prod, so there should be no differences between container images.
I see messages like this from `dmesg` in pods that don't work. I can't find anything similar in pods that do work.
```
Memory cgroup out of memory: Killed process 9378 (dotnet) total-vm:5225372kB, anon-rss:296300kB, file-rss:73236kB, shmem-rss:48kB, UID:101 pgtables:1272kB oom_score_adj:990
```
I don't understand how it could be out of memory. The pod is using less than 300Mi of the 500Mi requested.
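The arithmetic in that dmesg line backs this up. A quick sketch that parses the quoted line and converts its kB fields to MiB (plain shell over the text above, nothing cluster-specific):

```shell
# Parse the dmesg OOM line quoted above and convert its kB fields to MiB,
# to compare what was actually resident against the configured limits.
line='Memory cgroup out of memory: Killed process 9378 (dotnet) total-vm:5225372kB, anon-rss:296300kB, file-rss:73236kB, shmem-rss:48kB, UID:101 pgtables:1272kB oom_score_adj:990'
for field in total-vm anon-rss file-rss; do
  kb=$(printf '%s\n' "$line" | grep -o "${field}:[0-9]*" | cut -d: -f2)
  echo "$field = $((kb / 1024)) MiB"
done
```

This prints total-vm = 5102 MiB, anon-rss = 289 MiB, file-rss = 71 MiB: the resident set really is under 300 MiB, so the kill must be against whatever cgroup limit was in effect at that moment, not against the values visible in the pod spec.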
One theory to try is that something makes GC think there is a lot of resources on this machine/container so it tries to grab too much memory upfront and gets OOM-killed.
Is there a memory limit where the app is able to run?
Not in the prod environment. I bumped up to more than 1GB, and it still behaves the same.
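If the upfront-reservation theory holds, one low-effort experiment is to cap the GC explicitly so it sizes itself against the container limit. A sketch using documented .NET runtime knobs; the 75% figure is an arbitrary illustrative choice, and note these values are parsed as hexadecimal:

```yaml
# Illustrative: force workstation GC and cap the heap at 75% (hex 4B)
# of the container memory limit.
env:
- name: DOTNET_gcServer
  value: "0"
- name: DOTNET_GCHeapHardLimitPercent
  value: "4B"
```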
And just to confirm, that you are able to run .NET 5 apps on the same container correct?
Yes
Ok thanks, is it possible to get `lldb` installed on the container? That might provide some diagnosability.
cc @janvorli if there are other ideas to investigate these failures.
I might need some help with this. Is it going to work with my published artifacts, or do I need to use an alternate build process? I'm not familiar with lldb, so reading up on it, but a "do this" would help.
I have been digging more into how OOM Killer works, and it appears that it is possible for Kubernetes to OOM Kill a process within a container without killing the actual container. This can happen if the process being killed is not the main process of the container. It seems like there is a bit of disagreement in how/what should be OOM Killed.
I'm not sure what to make of all this, and I'm not sure this is even what's going on with my issue (but it seems likely). Why is the dotnet process not the main process of the container? Why does it behave differently in different clusters?
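On the "why is dotnet not the main process" question, one common cause (an assumption here, not confirmed in this thread) is a shell-form ENTRYPOINT: `/bin/sh` becomes PID 1 and dotnet runs as a child, which the OOM killer can reap while the container itself keeps "running". A sketch of the two forms, using the `MyApp.Api.dll` name from elsewhere in this thread:

```dockerfile
# Shell form: /bin/sh is PID 1, dotnet is a child process.
ENTRYPOINT dotnet MyApp.Api.dll

# Exec form: dotnet itself is PID 1 and receives signals directly.
ENTRYPOINT ["dotnet", "MyApp.Api.dll"]
```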
References
Another github issue I found: https://github.com/kubernetes/kubernetes/issues/100483#issuecomment-1026743616
It appears that this issue is bigger than just .NET 6. But that still doesn't explain why the problem only exists in specific environments.
I don't expect LLDB to work properly unless you can run the app under LLDB with a manual invocation of the sort `lldb -- <commandline>`. Attaching to a running process requires `sudo`. Launching requires you to be able to use ptrace, just like dotnet-dump, and requires you to have it continue on signals that get captured. For the 1GB memory limit case in prod, do you still see the message? Is the message present before triggering dotnet-dump? Once you start dotnet-dump, do you still get it? As for your app being the main one, I'm not sure. It should be executing directly given how you invoke it.
I'm not sure why, but I am no longer seeing references to dotnet from dmesg. It was very consistent last week, and this week I can't reproduce it. I tried a rollout restart of the deployment, manually starting the application, and running dotnet-dump, but nothing is producing the OOM message. And the app still isn't starting up, and there are no logs. WTH?
Another thing to try. Is there any profiling/monitoring enabled? Like Dynatrace or something like that. If that is the case, can it be turned off? - just to see if that makes a difference.
For what it's worth: I rechecked the pods and have a cleaner signature now.
I installed the SDK according to https://docs.microsoft.com/en-us/dotnet/core/install/linux-ubuntu#install-the-sdk and ran `dotnet foobar`. In two cases, this led to the error signature (in the other cases I would just get the 'unknown command' message).
k8s version | OS image | Kernel version | Container runtime | .NET 6 running
---|---|---|---|---
v1.13.12 | Ubuntu 16.04.7 LTS | 4.15.0-1113-azure | docker://19.3.14 | yes
v1.13.12 | Ubuntu 16.04.6 LTS | 4.15.0-1077-azure | docker://3.0.10+azure | no
v1.13.12 | Ubuntu 16.04.7 LTS | 4.15.0-1113-azure | docker://19.3.14 | yes
v1.13.12 | Ubuntu 16.04.6 LTS | 4.15.0-1113-azure | docker://3.0.10+azure | no
v1.13.12 | Ubuntu 16.04.7 LTS | 4.15.0-1113-azure | docker://19.3.14 | yes
v1.13.12 | Ubuntu 16.04.7 LTS | 4.15.0-1113-azure | docker://19.3.14 | yes
@CaveSven / @aullom1, are you guys running Dynatrace in these clusters?
I got our Dynatrace admins to exclude a couple namespaces, and what I found is that as long as Dynatrace is not injecting into a .NET 6 pod, the pod will start up and work as expected. What I am confused about is that Dynatrace is on all of the clusters (even the ones where .NET 6 always worked), but Dynatrace only needs to be told to ignore the namespace on some of the clusters.
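For anyone wanting to check whether an agent is being injected into their processes: one common injection mechanism on Linux is `/etc/ld.so.preload` (Dynatrace's OneAgent uses a preloaded library; the exact filename varies by version, so treat this as a sketch, not a definitive check):

```shell
# Check the two usual library-injection channels for an APM agent.
if [ -s /etc/ld.so.preload ]; then
  echo "ld.so.preload entries:"; cat /etc/ld.so.preload
else
  echo "no ld.so.preload entries"
fi
# Is LD_PRELOAD set in the current environment?
echo "LD_PRELOAD=${LD_PRELOAD:-<unset>}"
```

Running this inside a pod where the app hangs versus one where it works would show whether an agent library is present in only one of them.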
Saw in an email from Microsoft that another customer was facing issues with a specific version of Dynatrace.
Hmm, interesting. If you find any info about the root cause from Dynatrace please share it here.
From what we have seen so far, it depends on the Dynatrace version. Can you check the versions between clusters to see if they are different?
Not working dynatrace version: 1.219.134.20210624-153812
Working dynatrace version: 1.231.245.20220112-085307
This jives perfectly with what we've seen as well.
Not Working: 1.215.192.20210519-150820 Working: 1.231.288.20220218-170442
We are also running Dynatrace. I am trying to get some information on the version.
Description
I have a .NET 6 API that I would like to run in AKS. Locally it works just fine. In AKS the pod will start up, but the container produces no logs and refuses all connections. Exec'ing into the container and manually running `dotnet MyApp.Api.dll` will just hang and produce no output.

Reproduction Steps
Generate a minimal .NET 6 webapi (e.g. `dotnet new webapi -minimal -o MyApp.Api`).
Create a Dockerfile and build and push to a repo.
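Such a Dockerfile might follow the usual multi-stage pattern; this is a sketch (project name matches the `dotnet new` example above, image tags match those used elsewhere in this thread):

```dockerfile
FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build
WORKDIR /src
COPY . .
RUN dotnet publish MyApp.Api.csproj -c Release -o /app/publish

FROM mcr.microsoft.com/dotnet/aspnet:6.0
WORKDIR /app
COPY --from=build /app/publish .
ENTRYPOINT ["dotnet", "MyApp.Api.dll"]
```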
Create an AKS cluster. Obviously there can be a ton of variability here. I am using:
Deploy your app. The important bits about the manifest are:
Expected behavior
The pod should start up, and the logs should show the typical startup output (e.g. Now listening on port...).
Actual behavior
The pod reports that it is running, but there are no logs, and all connections are refused.
Regression?
Prior versions of .NET Core do not have this problem.
Known Workarounds
No response
Configuration
No response
Other information
No response