dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

.NET 6 with specific Dynatrace versions does not start up in AKS, and produces no logs #64760

Closed aullom1 closed 2 years ago

aullom1 commented 2 years ago

Description

I have a .NET 6 API that I would like to run in AKS. Locally it works just fine. In AKS the pod will start up, but the container produces no logs and refuses all connections. I can exec into the container and run dotnet MyApp.Api.dll but the command will just hang and not write any output.


Reproduction Steps

Generate a minimal .NET 6 webapi (e.g. dotnet new webapi -minimal -o MyApp.Api).

Create a Dockerfile, then build the image and push it to a repository.

FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS base
WORKDIR /app

# Switch the default port from 80 to 5000
ENV ASPNETCORE_URLS=http://+:5000 \
    # Enable detection of running in a container
    DOTNET_RUNNING_IN_CONTAINER=true

EXPOSE 5000

# Create a group and user to run the ASP.NET Core application
RUN addgroup --system dotnetappgroup && \
    adduser --system --ingroup dotnetappgroup --shell /bin/sh dotnetappuser && \
    mkdir -p /local/ && \
    chown dotnetappuser:dotnetappgroup /local && \
    chown dotnetappuser:dotnetappgroup /app

# Run all subsequent commands as the non-root dotnetappuser.
USER dotnetappuser

FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build
ARG NuGetConfig=NuGet.Config
WORKDIR /src
COPY . .
RUN dotnet restore "MyApp.Api/MyApp.Api.csproj"
WORKDIR "/src/MyApp.Api"
RUN dotnet build "MyApp.Api.csproj" -c Release -o /app/build --no-restore

FROM build AS publish
RUN dotnet publish "MyApp.Api.csproj" -c Release -o /app/publish --no-restore

FROM base AS final
WORKDIR /app
COPY --from=publish /app/publish .
ENTRYPOINT ["dotnet", "MyApp.Api.dll"]

Create an AKS cluster. Obviously there can be a ton of variability here. I am using:

Kubernetes version 1.20.9
Node image version AKSUbuntu-1804gen2containerd-2021.10.19
Container runtime version containerd://1.4.9+azure

Deploy your app. The important bits about the manifest are:

image: <YOUR_IMAGE_REFERENCE>
imagePullPolicy: Always
name: myapp
ports:
- containerPort: 5000
  protocol: TCP
resources:
  limits:
    cpu: 200m
    memory: 500Mi
  requests:
    cpu: 100m
    memory: 200Mi

Expected behavior

The pod should start up, and the logs should show the typical startup output (e.g. Now listening on port...).

Actual behavior

The pod reports that it is running, but there are no logs, and all connections are refused.

Regression?

Prior versions of .NET Core do not have this problem.

Known Workarounds

No response

Configuration

No response

Other information

No response

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

CaveSven commented 2 years ago

We observe the exact same behaviour, even for a minimal console application printing nothing but "Hello World" to the console.

Edit: this might be related: https://giters.com/Azure/azure-functions-dotnet-worker/issues/727

aullom1 commented 2 years ago

We observe the exact same behaviour, even for a minimal console application printing nothing but "Hello World" to the console.

Edit: this might be related: https://giters.com/Azure/azure-functions-dotnet-worker/issues/727

Interesting. As of 12 days ago the image for function apps was fixed, but no other images have been fixed.

mangod9 commented 2 years ago

We will try to repro this issue. Does this repro even if you configure with 1000+ millicores? Wondering if something under 1000 is causing an issue.

aullom1 commented 2 years ago

We will try to repro this issue. Does this repro even if you configure with 1000+ millicores? Wondering if something under 1000 is causing an issue.

I tried it again with req/limit of 1000/3000 with the same result. I wasn't too hopeful since the nonprod env has no trouble with 100/300.

mangod9 commented 2 years ago

I am not able to repro this on a simple console app. Is it possible to capture a dump using dotnet-dump so we could investigate further?

This is the config I tried:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: testalloc
  labels:
    app: k8stest
spec:
  replicas: 1
  template:
    metadata:
      name: k8stest
      labels:
        app: k8stest
    spec:
      containers:
      - name: k8stest
        image: <youracr>.azurecr.io/k8stest:v7
        resources:
          limits:
            cpu: 3000m
            memory: 7Gi
          requests:
            cpu: 1000m
            memory: 7Gi
        ports:
          - containerPort: 80
  selector:
    matchLabels:
      app: k8stest

We also have a sample app which has been ported over to use .NET 6: https://github.com/dotnet-architecture/eShopOnContainers

CaveSven commented 2 years ago

I am not able to repro this on a simple console app. Is it possible to capture a dump using dotnet-dump so we could investigate further?

This is the config I tried:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: testalloc
  labels:
    app: k8stest
spec:
  replicas: 1
  template:
    metadata:
      name: k8stest
      labels:
        app: k8stest
    spec:
      containers:
      - name: k8stest
        image: <youracr>.azurecr.io/k8stest:v7
        resources:
          limits:
            cpu: 3000m
            memory: 7Gi
          requests:
            cpu: 1000m
            memory: 7Gi
        ports:
          - containerPort: 80
  selector:
    matchLabels:
      app: k8stest

We also have a sample app which has been ported over to use .NET 6: https://github.com/dotnet-architecture/eShopOnContainers

When I log in to the pod and try to install dotnet-dump (https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dotnet-dump) using dotnet tool install --global dotnet-dump, I get the same problem :-) Nothing seems to happen: there is no output and the process never finishes...

aullom1 commented 2 years ago

I get that behavior for any dotnet command that isn't related to getting help or printing info about the runtime. It's like something is messed up with certain configurations of AKS.

aullom1 commented 2 years ago

I pulled some more info on various clusters. I'm wondering if the problem is with the node image version. None of my clusters have the same node image version.

Working

Kubernetes 1.21.7
Node image version AKSUbuntu-1804gen2containerd-2021.12.07
Container runtime version containerd://1.4.9+azure
Kubernetes 1.21.7
Node image version AKSUbuntu-1804gen2containerd-2022.01.19
Container runtime version containerd://1.4.9+azure
Kubernetes 1.20.9
Node image version AKSUbuntu-1804gen2containerd-2021.10.02
Container runtime version containerd://1.4.9+azure

Not working

Kubernetes 1.21.7
Node image version AKSUbuntu-1804gen2containerd-2022.01.07
Container runtime version containerd://1.4.9+azure
Kubernetes 1.20.9
Node image version AKSUbuntu-1804gen2containerd-2021.10.19
Container runtime version containerd://1.4.9+azure
mangod9 commented 2 years ago

Yeah, it would be interesting to check if there is any correlation with where the issue occurs. @CaveSven could you please check if your node versions match any from @aullom1?

CaveSven commented 2 years ago

We do not see such a clear distinction, meaning we have combinations of node version/ container runtime version that seem to succeed and fail:

Working

Kubernetes: 1.13 Node image version: Ubuntu 16.04.7 LTS 4.15.0-1113-azure Container Runtime: docker://19.3.14

Not Working

Kubernetes: 1.13 Node image version: Ubuntu 16.04.6 LTS 4.15.0-1077-azure Container Runtime: docker://3.0.10+azure

Kubernetes: 1.13 Node image version: Ubuntu 16.04.7 LTS 4.15.0-1113-azure Container Runtime: docker://19.3.14 --> Notice: same as "working"

Kubernetes: 1.13 Node image version: Ubuntu 16.04.6 LTS 4.15.0-1113-azure Container Runtime: docker://3.0.10+azure

Edit: Upon checking a second time, I do get different results, see https://github.com/dotnet/runtime/issues/64760#issuecomment-1047581461

CaveSven commented 2 years ago

Related: https://stackoverflow.com/questions/70973535/net-6-container-not-starting-properly-in-aks

aullom1 commented 2 years ago

Related: https://stackoverflow.com/questions/70973535/net-6-container-not-starting-properly-in-aks

That was me :)

I also have a support request open with Microsoft. My last resort is to ask Batman for help.

mangod9 commented 2 years ago

Adding @tommcdon @mikem8361 in case they have ideas for getting dumps in an AKS container, since dotnet tool install isn't working due to the same issue. Perhaps you can copy createdump manually to the container?

tommcdon commented 2 years ago
mikem8361 commented 2 years ago

createdump is part of the .NET runtime install, not the tool install. It can be executed directly with sudo createdump <pid>. See the "Command Line Usage" section of https://github.com/dotnet/runtime/blob/main/docs/design/coreclr/botr/xplat-minidump-generation.md#configurationpolicy for more information.

aullom1 commented 2 years ago

My container is locked down, and I can't run anything with sudo. And ps is not installed, so I couldn't get the pid. Will need to work on deploying a new container, but I'm guessing my container security solution is going to block anything that is excessively permissive.

tommcdon commented 2 years ago

If the process is throwing a managed unhandled exception at startup: https://docs.microsoft.com/en-us/dotnet/core/diagnostics/dumps#collect-dumps-on-crash

Environment variable           Description                                                    Default value
COMPlus_DbgEnableMiniDump      If set to 1, enables core dump generation.                     0
COMPlus_DbgMiniDumpType        Type of dump to collect. For more information, see the table   2 (MiniDumpWithPrivateReadWriteMemory)
                               in the linked documentation.
COMPlus_DbgMiniDumpName        Path to a file to write the dump to.                           /tmp/coredump.
COMPlus_CreateDumpDiagnostics  If set to 1, enables diagnostic logging of the dump process.   0

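
These variables can be set directly on the container in the Kubernetes manifest. A minimal sketch (the dump path below is an assumption and must point at a location writable by the app user in a locked-down container):

```yaml
# Sketch: enabling crash-dump generation for the app container.
# The values mirror the COMPlus_* variables listed above.
env:
  - name: COMPlus_DbgEnableMiniDump
    value: "1"
  - name: COMPlus_DbgMiniDumpName
    value: "/tmp/coredump"   # must be writable by the (non-root) app user
  - name: COMPlus_CreateDumpDiagnostics
    value: "1"               # also log the dump process itself
```
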
aullom1 commented 2 years ago

@tommcdon I edited my configmap to include those settings, but I didn't get a coredump log. I also tried running the app from within the container to see if that would give me anything. The dotnet command just hung, and I still didn't get a coredump log.

mikem8361 commented 2 years ago

What do you mean by "coredump log"? The actual coredump file, or the logging output from createdump's coredump generation? For the latter, the console output of the app process needs to be captured.

aullom1 commented 2 years ago

There was no console output, and there was no file generated at /tmp/coredump.

aullom1 commented 2 years ago

I'm assuming this means the app never even attempted to start.

tommcdon commented 2 years ago

@tommcdon I edited my configmap to include those settings, but I didn't get a coredump log. I also tried running the app from within the container to see if that would give me anything. The dotnet command just hung, and I still didn't get a coredump log.

@aullom1 Thanks - I wasn't sure if this was a crash or hang. Can you try running dotnet-dump collect --process-id pid inside the pod to collect a dump of the process? The documentation page contains single-file framework dependent download links . The runtime can be installed using dotnet-install and as long as it is on the path, dotnet-dump collect should hopefully be able to collect a dump.

I'm assuming this means the app never even attempted to start.

@aullom1 this could point towards a configuration issue inside the pod. Do you see any useful system log output using journalctl?

aullom1 commented 2 years ago

@tommcdon I'll need to work with someone else to get an updated image pushed to our ACR. It probably won't be until next week that it happens.

For anyone following along who wants to add dotnet-dump to their final image, you might need to do something like this from the SDK image.

FROM mcr.microsoft.com/dotnet/aspnet:6.0 AS base
...
FROM mcr.microsoft.com/dotnet/sdk:6.0 AS build
...
FROM build AS publish
# Get the dotnet-dump command
RUN wget -O /bin/dotnet-dump  https://aka.ms/dotnet-dump/linux-x64 && \
    chmod +x /bin/dotnet-dump
...
FROM base AS final
COPY --from=publish /bin/dotnet-dump /bin/dotnet-dump
aullom1 commented 2 years ago

@tommcdon It seems I can't run dotnet-dump in my container. I either get kicked out, or the process gets killed. I tried it in another cluster where .NET 6 works, and I had no problem.

This was my attempt in the non-functional cluster.

AtlasAdmin@atlastestpage-6795df47b6-jb2n4:/app$ dotnet-dump ps
command terminated with exit code 137
aullom1@WMDW10750825:/c/git/AaronTest/AaronTest.Api$ k -n atlastestpage exec --tty --stdin atlastestpage-6795df47b6-jb2n4 -- /bin/bash
AtlasAdmin@atlastestpage-6795df47b6-jb2n4:/app$ dotnet-dump --help
Killed
hoyosjs commented 2 years ago

Do you happen to know what version of the kernel and what seccomp/apparmor profiles have been used for deployment? dotnet-dump requires ptrace, and it tries to use a couple of read calls using process_vm_readv that decay to pread64 if the call is not allowed in the seccomp profile.
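
If it helps anyone check this: a quick way to see whether a seccomp filter or AppArmor is applied inside the container, using only standard Linux proc/sys files (a sketch; paths are standard but the AppArmor module file may be absent on some kernels):

```shell
# Seccomp field: 0 = disabled, 1 = strict, 2 = filter (a seccomp profile is active)
grep Seccomp /proc/self/status || echo "no Seccomp field in /proc/self/status"

# AppArmor: prints Y/N if the module is loaded; the file is absent otherwise
cat /sys/module/apparmor/parameters/enabled 2>/dev/null || echo "AppArmor module not present"
```

Run this inside the affected pod (kubectl exec) and compare against a working cluster.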

aullom1 commented 2 years ago

I don't think we're using apparmor. I should see an annotation for something like this, right? I don't see one in the pod spec.

container.apparmor.security.beta.kubernetes.io/v5: runtime/default

hoyosjs commented 2 years ago

RE: dotnet-dump - do you know if you modify the seccomp profile at all? I need to check what AKS's OCI runtime does for the defaults (I can't recall what containerd does here). Also, you mentioned that your test environment works just fine. Does dotnet-dump work there? And is this test environment also in AKS? If so, what's the difference between that and prod? Mostly just PrismaCloud and such?

As for your startup: the table you posted seems to hint that the same version of K8s + the same OCI runtime works on some clusters and fails on others, right? What are the differences between the container images?

aullom1 commented 2 years ago

As far as I know we are not modifying seccomp. I don't really know what this is/does, so I checked with the team that defined the AKS infra, and they are of the opinion that we don't mess with it.

In the test environment (also AKS) dotnet-dump works just as expected. I have compared the ARM between a working and not working AKS environment, and the only differences are with names, IDs, and other unique fields. None of the meaningful configuration is different. I repeated the process with various manifests from Kubernetes (app deployments, ingresses, services, configmaps, pod identities), and again there are no differences.

As for PrismaCloud, nothing stays the same in IT, and we have now migrated to Aqua. Our test and prod environments are covered by the same policies in Aqua, and we see nothing in Aqua to indicate the image or container were blocked. The image itself resides in our private ACR, and it is pulled from that ACR for both test and prod, so there should be no differences between container images.

aullom1 commented 2 years ago

I see messages like this from dmesg in pods that don't work. I can't find anything similar in pods that do work.

Memory cgroup out of memory: Killed process 9378 (dotnet) total-vm:5225372kB, anon-rss:296300kB, file-rss:73236kB, shmem-rss:48kB, UID:101 pgtables:1272kB oom_score_adj:990

I don't understand how it could be out of memory. The pod is using less than 300Mi of the 500Mi limit.
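
One way to sanity-check what limit the kernel is actually enforcing is to read the cgroup memory limit from inside the container. The path differs between cgroup v1 and v2, so this sketch tries both:

```shell
# Print the effective cgroup memory limit (cgroup v2 first, then v1).
if [ -f /sys/fs/cgroup/memory.max ]; then
    echo "cgroup v2 memory limit: $(cat /sys/fs/cgroup/memory.max)"
elif [ -f /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then
    echo "cgroup v1 memory limit: $(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)"
else
    echo "no cgroup memory limit file found"
fi
```

If the value printed here does not match the manifest's 500Mi (524288000 bytes), something between Kubernetes and the runtime is mis-applying the limit.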

VSadov commented 2 years ago

One theory to try is that something makes the GC think there are a lot of resources on this machine/container, so it tries to grab too much memory up front and gets OOM-killed.

Is there a memory limit where the app is able to run?
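
If anyone wants to test this theory without rebuilding the image, the GC can be constrained through runtime configuration environment variables. A sketch of the relevant deployment env entries (note that, per the GC configuration docs, numeric COMPlus_* environment variable values are parsed as hexadecimal):

```yaml
# Sketch: constraining the GC so it cannot over-reserve relative to the container.
env:
  - name: COMPlus_gcServer              # 0 = workstation GC, which reserves far
    value: "0"                          # less memory up front than server GC
  - name: COMPlus_GCHeapHardLimitPercent
    value: "4B"                         # hexadecimal: 0x4B = 75% of the container limit
```

If the app starts with these set, that would support the theory that GC sizing at startup is what triggers the OOM kill.
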

aullom1 commented 2 years ago

Is there a memory limit where the app is able to run?

Not in the prod environment. I bumped up to more than 1GB, and it still behaves the same.

mangod9 commented 2 years ago

And just to confirm, that you are able to run .NET 5 apps on the same container correct?

aullom1 commented 2 years ago

And just to confirm, that you are able to run .NET 5 apps on the same container correct?

Yes

mangod9 commented 2 years ago

Ok thanks, is it possible to get lldb installed on the container? That might provide some diagnosability.

cc @janvorli if there are other ideas to investigate these failures.

aullom1 commented 2 years ago

I might need some help with this. Is it going to work with my published artifacts, or do I need to use an alternate build process? I'm not familiar with lldb, so reading up on it, but a "do this" would help.

aullom1 commented 2 years ago

I have been digging more into how the OOM killer works, and it appears it is possible for Kubernetes to OOM-kill a process within a container without killing the container itself. This can happen if the process being killed is not the main process of the container. It seems there is some disagreement about how/what should be OOM-killed.

I'm not sure what to make of all this, and I'm not sure this is even what's going on with my issue (but it seems likely). Why is the dotnet process not the main process of the container? Why does it behave differently in different clusters?
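
The oom_score_adj:990 in the dmesg line above fits this: Kubernetes assigns each container process an OOM score adjustment based on the pod's QoS class, and a Burstable pod (requests below limits, as in the manifest here) gets a high value, so the kernel prefers to kill its processes first under memory pressure. The value can be read from inside the container:

```shell
# Range is -1000..1000; higher means the OOM killer prefers this process.
# Roughly: Guaranteed pods ~ -997, Burstable between 2 and 999, BestEffort 1000.
cat /proc/self/oom_score_adj
```

Comparing this value between a working and a non-working cluster might show whether the pods land in different QoS classes.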

References

aullom1 commented 2 years ago

Another github issue I found: https://github.com/kubernetes/kubernetes/issues/100483#issuecomment-1026743616

It appears that this issue is bigger than just .NET 6. But that still doesn't explain why the problem only exists in specific environments.

hoyosjs commented 2 years ago

I don't expect LLDB to work properly unless you run the app under LLDB with a manual invocation of the sort lldb -- <commandline>. Attaching to a running process requires sudo, and launching requires the ability to use ptrace, just like dotnet-dump, plus having it continue on signals that get captured.

For the 1GB memory limit case in prod, do you still see the message? Does the message appear before triggering dotnet-dump? Once you start dotnet-dump, do you still get it?

As for your app being the main process: I'm not sure. It should be executing directly given how you invoke it.

aullom1 commented 2 years ago

I'm not sure why, but I am no longer seeing references to dotnet from dmesg. It was very consistent last week, and this week I can't reproduce it. I tried a rollout restart of the deployment, manually starting the application, and running dotnet-dump, but nothing is producing the OOM message. And the app still isn't starting up, and there are no logs. WTH?

VSadov commented 2 years ago

Another thing to try. Is there any profiling/monitoring enabled? Like Dynatrace or something like that. If that is the case, can it be turned off? - just to see if that makes a difference.

CaveSven commented 2 years ago

For what it's worth: I rechecked the pods and have a cleaner signature now.

I installed the SDK according to https://docs.microsoft.com/en-us/dotnet/core/install/linux-ubuntu#install-the-sdk and ran dotnet foobar. In two cases this led to the error signature (in the other cases I would just get the 'unknown command' message).

k8s version  OS image            Kernel version     Container runtime      .NET 6 running
v1.13.12     Ubuntu 16.04.7 LTS  4.15.0-1113-azure  docker://19.3.14       yes
v1.13.12     Ubuntu 16.04.6 LTS  4.15.0-1077-azure  docker://3.0.10+azure  no
v1.13.12     Ubuntu 16.04.7 LTS  4.15.0-1113-azure  docker://19.3.14       yes
v1.13.12     Ubuntu 16.04.6 LTS  4.15.0-1113-azure  docker://3.0.10+azure  no
v1.13.12     Ubuntu 16.04.7 LTS  4.15.0-1113-azure  docker://19.3.14       yes
v1.13.12     Ubuntu 16.04.7 LTS  4.15.0-1113-azure  docker://19.3.14       yes
lynn-orrell commented 2 years ago

@CaveSven / @aullom1, are you guys running Dynatrace in these clusters?

aullom1 commented 2 years ago

I got our Dynatrace admins to exclude a couple namespaces, and what I found is that as long as Dynatrace is not injecting into a .NET 6 pod, the pod will start up and work as expected. What I am confused about is that Dynatrace is on all of the clusters (even the ones where .NET 6 always worked), but Dynatrace only needs to be told to ignore the namespace on some of the clusters.
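
For others hitting this: rather than asking admins to exclude a whole namespace, the Dynatrace Operator also supports opting individual pods out of OneAgent injection. The following is a sketch based on the Operator's injection annotation; the exact key is an assumption here and should be verified against your Dynatrace deployment mode and version:

```yaml
# Hypothetical pod template metadata disabling OneAgent code-module injection
# for a single workload (verify the annotation key for your Dynatrace setup).
metadata:
  annotations:
    oneagent.dynatrace.com/inject: "false"
```
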

aullom1 commented 2 years ago

Saw in an email from Microsoft that another customer was facing issues with a specific version of Dynatrace.

mangod9 commented 2 years ago

Hmm, interesting. If you find any info about the root cause from Dynatrace please share it here.

lynn-orrell commented 2 years ago

I got our Dynatrace admins to exclude a couple namespaces, and what I found is that as long as Dynatrace is not injecting into a .NET 6 pod, the pod will start up and work as expected. What I am confused about is that Dynatrace is on all of the clusters (even the ones where .NET 6 always worked), but Dynatrace only needs to be told to ignore the namespace on some of the clusters.

From what we have seen so far, it depends on the Dynatrace version. Can you check the versions between clusters to see if they are different?

aullom1 commented 2 years ago

Not working dynatrace version: 1.219.134.20210624-153812

Working dynatrace version: 1.231.245.20220112-085307

lynn-orrell commented 2 years ago

This jives perfectly with what we've seen as well.

Not Working: 1.215.192.20210519-150820 Working: 1.231.288.20220218-170442

CaveSven commented 2 years ago

We are also running Dynatrace. I am trying to get some information on the version.