dotnet / docs

This repository contains .NET Documentation.
https://learn.microsoft.com/dotnet
Creative Commons Attribution 4.0 International

Unable to run Perfcollector .NET 8 core #38951

Open samyaghosh2 opened 10 months ago

samyaghosh2 commented 10 months ago

Type of issue

Missing information

Description

To enable perfcollect in a Linux container on Kubernetes (AKS Linux Gen05 nodes) for .NET 8, we followed the steps in this guide (https://learn.microsoft.com/en-us/dotnet/core/diagnostics/trace-perfcollect-lttng), but we still get the following error:

Running "/usr/bin/lttng add-context --kernel -t pid -t procname"
Error: pid: Kernel tracer not available
Error: procname: Kernel tracer not available

Here are the steps for reference:

  1. curl -OL https://aka.ms/perfcollect
  2. apt updat
  3. apt update
  4. apt install curl
  5. curl -OL https://aka.ms/perfcollect
  6. chmod +x perfcollect
  7. sudo ./perfcollect install
  8. dotnet tool install --global dotnet-symbol
  9. mkdir mySymbolsNet8
  10. ~/.dotnet/tools/dotnet-symbol --output ./mySymbolsNet8/ /usr/share/dotnet/shared/Microsoft.NETCore.App/8.0.0/lib.so
  11. cp ./mySymbolsNet8/ /usr/share/dotnet/shared/Microsoft.NETCore.App/8.0.0/
  12. apt-get install linux-tools-5.15.0-1041-azure -y
  13. cp /usr/lib/linux-tools/5.15.0-1041-azure/perf /usr/bin/perf
  14. ps -ef | grep -i dotnet
  15. export DOTNET_EnableWriteXorExecute=0
  16. sudo ./perfcollect collect a -pid 166 perfcollector.log

Please provide the missing or updated steps.

Page URL

https://learn.microsoft.com/en-us/dotnet/core/diagnostics/trace-perfcollect-lttng

Content source URL

https://github.com/dotnet/docs/blob/main/docs/core/diagnostics/trace-perfcollect-lttng.md

Document Version Independent Id

ac6cc555-a033-766d-0597-bd9bda98820f

Article author

@tommcdon


tommcdon commented 10 months ago

Hi @samyaghosh2, please see https://learn.microsoft.com/en-us/dotnet/core/diagnostics/trace-perfcollect-lttng#collect-in-a-docker-container for information on collecting diagnostics in a container. Please let me know if this resolves the issue. cc @brianrob

samyaghosh2 commented 10 months ago

Hi @tommcdon cc @brianrob ,

We tried the exact steps in the document (https://learn.microsoft.com/en-us/dotnet/core/diagnostics/trace-perfcollect-lttng#collect-in-a-docker-container), but we still get the same error as mentioned earlier.

One more piece of information: we are using "containerd" as the runtime in our AKS cluster.

tommcdon commented 9 months ago

@samyaghosh2 can you verify:

  1. The container has the SYS_ADMIN capability
  2. DOTNET_PerfMapEnabled is set to 1 for the target process
  3. If the above is all correct, please try adding the -nolttng argument to the perfcollect script.
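
The first two checks can be scripted from inside the container. The following is a minimal sketch, not part of perfcollect: the function names are hypothetical, and it assumes bash and a Linux /proc filesystem. CAP_SYS_ADMIN is capability number 21, so the check tests bit 21 of the effective capability mask (CapEff). Note that /proc/<pid>/environ shows the environment the process was started with, so a variable exported in another shell after the target process started will not appear there.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for checking the container setup (not part of perfcollect).

# CAP_SYS_ADMIN is capability number 21 in linux/capability.h.
CAP_SYS_ADMIN_BIT=21

# Succeeds if a hex capability mask (for example a CapEff value) has CAP_SYS_ADMIN set.
mask_has_sys_admin() {
  local mask="$1"
  (( (0x$mask >> CAP_SYS_ADMIN_BIT) & 1 ))
}

# Prints the effective capability mask of a process (default: the calling process).
effective_caps() {
  local pid="${1:-self}"
  awk '/^CapEff:/ {print $2}' "/proc/$pid/status"
}

# Prints the value of one environment variable as seen by a running process.
# Reading /proc/<pid>/environ requires the same user as the process, or root.
proc_env() {
  local pid="$1" name="$2"
  tr '\0' '\n' < "/proc/$pid/environ" | sed -n "s/^${name}=//p"
}
```

For example, `mask_has_sys_admin "$(effective_caps)" && echo "CAP_SYS_ADMIN present"` checks the current shell, and `proc_env 166 DOTNET_PerfMapEnabled` prints the value for the target process (166 is the PID used earlier in this thread).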

ankishagarwal commented 9 months ago

Tried with the flag -nolttng

Yes, the container has the SYS_ADMIN capability. All four of the following environment variables are also set:

export DOTNET_PerfMapEnabled=1
export DOTNET_EnableEventLog=1
export DOTNET_EnableWriteXorExecute=0
export DOTNET_ZapDisable=1

We got the following error:

Running "/usr/lib/linux-tools/5.15.0-1053-azure/perf record -k 1 -g -a -F 1000 -e cpu-clock"
Error: Access to performance monitoring and observability operations is limited.
Consider adjusting /proc/sys/kernel/perf_event_paranoid setting to open
access to performance monitoring and observability operations for processes
without CAP_PERFMON, CAP_SYS_PTRACE or CAP_SYS_ADMIN Linux capability.
More information can be found at 'Perf events and tool security' document:
https://www.kernel.org/doc/html/latest/admin-guide/perf-security.html
perf_event_paranoid setting is 4:
  -1: Allow use of (almost) all events by all users
      Ignore mlock limit after perf_event_mlock_kb without CAP_IPC_LOCK
>= 0: Disallow raw and ftrace function tracepoint access
sudo ./perfcollect collect sampleTrace
>= 1: Disallow CPU event access
>= 2: Disallow kernel profiling
To make the adjusted perf_event_paranoid setting permanent preserve it
in /etc/sysctl.conf (e.g. kernel.perf_event_paranoid = <setting>)
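
To make the levels in this message easier to read, the following minimal sketch maps a perf_event_paranoid value to the restriction that perf lists for it. The function name is hypothetical and bash is assumed. Only the levels quoted in the message above are described. Values above 2, such as the 4 reported here, are distribution-specific and are deliberately left unexplained.

```shell
#!/usr/bin/env bash
# Hypothetical helper: describe a perf_event_paranoid level using only the
# levels listed in perf's own error message. Per that message, the
# restrictions apply to processes without CAP_PERFMON, CAP_SYS_PTRACE,
# or CAP_SYS_ADMIN.
describe_paranoid() {
  local level="$1"
  if   (( level <= -1 )); then echo "allow use of (almost) all events by all users"
  elif (( level == 0 ));  then echo "disallow raw and ftrace function tracepoint access"
  elif (( level == 1 ));  then echo "also disallow CPU event access"
  elif (( level == 2 ));  then echo "also disallow kernel profiling"
  else echo "stricter than the levels listed in the perf message (distribution-specific)"
  fi
}
```

On a Linux host or container, `describe_paranoid "$(cat /proc/sys/kernel/perf_event_paranoid)"` describes the current setting.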

brianrob commented 9 months ago

If I recall correctly, the behavior you're seeing is expected because the kernel module cannot be built and installed inside the container. If you need kernel-level data, then you will need to capture on the node instead of inside the container.

tommcdon commented 9 months ago

@brianrob should we update the docs to clearly state that perfcollect is not supported in containers or are there other caveats to be aware of?

brianrob commented 9 months ago

@tommcdon, no we should not need to do that - it is supported inside containers. You just don't get some of the context information because the kernel module isn't loaded:

Running "/usr/bin/lttng add-context --kernel -t pid -t procname"
Error: pid: Kernel tracer not available
Error: procname: Kernel tracer not available

It might be worth noting that this error is expected if collecting inside of containers. Perfcollect will still run. You'll just see this if you look in the log.
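
Because two different errors appear in this thread, it can help to tell them apart when reading a log. The sketch below is a hypothetical helper that classifies a log file by the two messages quoted above: the LTTng "Kernel tracer not available" message, which is described here as expected inside containers, and perf's access error, which stops CPU sampling. The function name and the output labels are assumptions, and the log file path is whatever file the perfcollect output was saved to.

```shell
#!/usr/bin/env bash
# Hypothetical helper: classify a saved perfcollect log by the two error
# messages quoted in this thread. The labels it prints are made up for
# illustration and are not produced by perfcollect itself.
classify_perfcollect_log() {
  local log_file="$1"
  if grep -q "Access to performance monitoring and observability operations is limited" "$log_file"; then
    # perf itself was denied access, so no CPU samples are collected.
    echo "perf-access-denied"
  elif grep -q "Kernel tracer not available" "$log_file"; then
    # LTTng kernel context is missing; described above as expected in containers.
    echo "lttng-kernel-tracer-missing"
  else
    echo "no-known-message"
  fi
}
```

The perf access error is checked first because it is the one that blocks collection, and a log may contain both messages.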

tommcdon commented 9 months ago

@brianrob please see https://github.com/dotnet/docs/issues/38951#issuecomment-1883037526. The perf tool seems to be failing to collect CPU samples. If the reason is "you will need to capture on the node instead of inside the container" then we should document that.

brianrob commented 9 months ago

Ah, sorry, I got stuck on the stuff up above. Collecting inside of a container is supported. I suspect that this has something to do with the configuration of the container, and what perf requires in order to profile. Can you please try capturing with a container with all permissions? I'd like to try process of elimination to address this.

As a test, I just tried this in a new container with all privileges, and it worked:

docker run -it --privileged --security-opt seccomp=unconfined ubuntu:latest

samyaghosh2 commented 9 months ago

@brianrob, thanks for your input. However, I am wondering why it runs only in privileged mode. According to the perf security document, perf_events scope and access control for unprivileged processes is governed by the perf_event_paranoid setting (https://docs.kernel.org/admin-guide/perf-security.html#id33). Ideally, it should run in unprivileged mode: a privileged container has access to all host system resources, including kernel features and devices, which could lead to unintended consequences for the host.

brianrob commented 9 months ago

Agreed that this level of access may not be the best choice. However, perfcollect doesn't choose the configuration that perf requires. Instead, perfcollect is just showing the error from perf. The error that you're seeing here is from perf, which means the container is configured without enough privileges to capture a perf trace.

Going forward, I would recommend starting from a point where things work, and then dropping permissions until it no longer works, in order to find the minimal set of permissions required. Generally speaking, the guidance provided here has been sufficient, but it's possible that something has changed.
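
The process of elimination described here can be sketched as an ordered list of container configurations to try, from most to least privileged. This is a minimal sketch for a plain Docker setup, not AKS. The function name is hypothetical, the first command is the one used in the test earlier in this thread, and the intermediate step is an assumption about a reasonable order in which to drop permissions. The script only prints the commands and does not run them.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print candidate `docker run` commands from most to
# least privileged. Try them in order and note the last one where
# `perfcollect collect` still produces a trace.
candidate_runs() {
  local image="${1:-ubuntu:latest}"
  # 1. Everything allowed (the configuration tested earlier in this thread).
  printf 'docker run -it %s %s\n' "--privileged --security-opt seccomp=unconfined" "$image"
  # 2. Assumed intermediate step: only CAP_SYS_ADMIN, seccomp filtering still off.
  printf 'docker run -it %s %s\n' "--cap-add=SYS_ADMIN --security-opt seccomp=unconfined" "$image"
  # 3. Only CAP_SYS_ADMIN, with the default seccomp profile.
  printf 'docker run -it %s %s\n' "--cap-add=SYS_ADMIN" "$image"
}

candidate_runs "$@"
```

On Kubernetes the same idea applies, but the settings go in the pod's security context rather than on a `docker run` command line.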

samyaghosh2 commented 9 months ago

@brianrob, right, that's exactly the point. We request that you update the guide, which I believe other teams reference as well, to detail the minimum set of permissions required to make this work. Currently nothing runs unless the container is in privileged mode.

brianrob commented 9 months ago

Doing a bit more investigation, I am able to capture a trace within a container by adding just CAP_SYS_ADMIN. I do get a warning that kernel symbols might not be resolved because of permissions. However, I do get a trace.

It sounds like the environment that you're in has additional security measures enabled, though I don't know what they are, and so you may need to run with more privileges to capture within the container, or capture directly on the node. It seems reasonable to want some guidance specific to AKS, but I wouldn't change the baseline guidance, because it does apply to non-AKS containerized environments.

ankishagarwal commented 9 months ago

@brianrob and @tommcdon another experiment we did.

Base image aspnet:6.0, dotnet version 6.0.418 --> Perf collection worked when the container was started in privileged mode.
Base image aspnet:8.0, dotnet version 8.0.1 --> Perf collection didn't work when the container was started in privileged mode.

We also set the 3 env variables: DOTNET_PerfMapEnabled=1, DOTNET_EnableEventLog=1, DOTNET_EnableWriteXorExecute=0

Please help with resolving this for Dotnet 8. Thanks.

Here is the deployment.yaml used for dotnet-6:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: simplerest6
spec:
  replicas: 1
  selector:
    matchLabels:
      app: simplerest6
  template:
    metadata:
      labels:
        app: simplerest6
    spec:
      containers:

adegeo commented 1 month ago

@tommcdon Where do you want to go with this issue?