Sysinternals / ProcDump-for-Linux

A Linux version of the ProcDump Sysinternals tool
MIT License

Procdump -c does not work in k8s #240

Open ximi522 opened 6 months ago

ximi522 commented 6 months ago

Expected behavior

In a Kubernetes environment, when using procdump with the command 'procdump -c 10 -s 1 -w XXX', it doesn't generate a dump file when the CPU usage of the pod exceeds 10%. This might be because procdump monitors the CPU usage of the host machine instead of the pod itself. Could you consider adding monitoring for the pod's CPU and memory usage in future versions? It would greatly assist in troubleshooting .NET applications in Kubernetes.
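To make the suspected mismatch concrete, here is a minimal sketch. It is not procdump's actual code: the `cpu_percent` helper, the 8-core host, and the 500m pod limit are all assumptions chosen for illustration. It shows how the same amount of consumed CPU time yields very different percentages depending on which capacity it is measured against.

```shell
#!/bin/bash
# Hypothetical helper, not part of procdump.
# Percentage of a reference CPU capacity consumed over a sampling interval.
#   $1 = CPU ticks consumed during the interval (utime + stime delta)
#   $2 = ticks per second (getconf CLK_TCK, typically 100)
#   $3 = interval length in seconds
#   $4 = reference capacity in millicores (1000 = one full core)
cpu_percent() {
    local delta=$1 hz=$2 secs=$3 millicores=$4
    echo $(( delta * 100 * 1000 / (hz * secs * millicores) ))
}

# Assumed sample: 25 ticks in 1 second at 100 ticks/s (a quarter of a core).
echo "relative to an 8-core host:   $(cpu_percent 25 100 1 8000)%"  # prints 3%
echo "relative to a 500m pod limit: $(cpu_percent 25 100 1 500)%"   # prints 50%
```

With these assumed numbers, a `-c 10` threshold would trip if usage were measured against the pod limit, but not if it were measured against the host.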

System information (e.g., distro, kernel version, etc.)

pod docker image based on mcr.microsoft.com/dotnet/aspnet:7.0-bullseye-slim-amd64.

MarioHewardt commented 6 months ago

Hi - thanks for the feedback. I wrote a post on this a while back. Let me know if that helps answer your question and if not, please don't hesitate to reach back out.

https://medium.com/@marioh_78322/sysinternals-procdump-for-linux-and-cloud-native-applications-404d0351f1ea

ximi522 commented 6 months ago

I deployed my pod following the instructions in this post (https://medium.com/@marioh_78322/sysinternals-procdump-for-linux-and-cloud-native-applications-404d0351f1ea) and monitored my process using procdump -c 10 -m 200 -s 1 -w GMTools /dump-data. However, when I pushed the CPU load above 10%, procdump did not generate the expected dump. The article does not try a CPU threshold (-c) inside a pod, so I suspect there is an issue with obtaining the correct CPU load in a Docker environment. The relevant part of my Dockerfile:

RUN apt-get update && \
    apt-get install -y wget
RUN wget -q https://packages.microsoft.com/config/ubuntu/22.04/packages-microsoft-prod.deb -O packages-microsoft-prod.deb
RUN dpkg -i packages-microsoft-prod.deb
RUN apt-get update && \
    apt-get install -y procdump && \
    apt-get clean

WORKDIR /app

ENTRYPOINT ["./start.sh"]

start.sh:

#!/bin/bash
procdump -c 1 -m 200 -s 1 -w GMTools /dump-data

MarioHewardt commented 6 months ago

Thanks for the detailed information. Could you add the -log switch to the procdump command line? This will send extended logging to syslog. Please share the procdump related log entries (there can be quite a few).

ximi522 commented 5 months ago

While reading the code, I discovered that CPU usage is calculated from /proc/[pid]/stat. In a Docker environment, however, the value obtained this way is relative to the CPUs of the actual host machine, which is not very meaningful for monitoring a containerized program. We would rather obtain CPU usage relative to the Docker container itself. I found a method for doing this in this article (https://chengdol.github.io/2021/09/19/k8s-container-mem-cpu/) and have written a shell script based on it for reference.

#!/bin/bash
while true; do
    # get dotnet process id
    pid=$DOTNET_PID
    # get dotnet process cgroup path
    cgroup_path=/proc/$pid/root/sys/fs/cgroup
    # check if cgroup path exists
    if [ ! -d $cgroup_path ]; then
        sleep 1
        continue
    fi
    # cpu, cpuacct dir are softlinks (cgroup v1 layout)
    # cpuacct.stat:
    # Reports the total CPU time, in USER_HZ ticks,
    # spent in user and system mode by all tasks in the cgroup.
    utime_start=$(awk '$1 == "user" {print $2}' "$cgroup_path/cpu,cpuacct/cpuacct.stat")
    stime_start=$(awk '$1 == "system" {print $2}' "$cgroup_path/cpu,cpuacct/cpuacct.stat")
    sleep 1
    utime_end=$(awk '$1 == "user" {print $2}' "$cgroup_path/cpu,cpuacct/cpuacct.stat")
    stime_end=$(awk '$1 == "system" {print $2}' "$cgroup_path/cpu,cpuacct/cpuacct.stat")
    # getconf CLK_TCK aka sysconf(_SC_CLK_TCK) returns USER_HZ,
    # the tick unit used by cpuacct.stat, which is typically
    # 100 independent of the kernel configuration.
    HZ=$(getconf CLK_TCK)

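    # get cpu limit from the CFS quota (quota is -1 when no limit is set)
    cfs_quota_us=$(cat "$cgroup_path/cpu/cpu.cfs_quota_us")
    cfs_period_us=$(cat "$cgroup_path/cpu/cpu.cfs_period_us")
    if [ "$cfs_quota_us" -le 0 ]; then
        # no limit: fall back to the number of available CPUs
        cfs_quota_us=$(( $(nproc) * cfs_period_us ))
    fi

    # get container cpu usage as a percentage of its cpu limit,
    # on top of user/system cpu time over the 1-second sample.
    # Multiplying by the period before dividing by the quota keeps
    # fractional limits (e.g. 500m) from truncating to zero cores.
    cpu_percent=$(( (utime_end+stime_end-utime_start-stime_start)*100*cfs_period_us/(HZ*cfs_quota_us) ))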

    # memory in MiB: used - inactive(cache)
    used=$(cat "$cgroup_path/memory/memory.usage_in_bytes")
    inactive=$(awk '$1 == "inactive_file" {print $2}' "$cgroup_path/memory/memory.stat")
    # numfmt: readable format

    mem_usage=$(cat $cgroup_path/memory/memory.usage_in_bytes)
    total_mem=$(cat $cgroup_path/memory/memory.limit_in_bytes)
    # local memory info
    local_mem_usage=$(cat /sys/fs/cgroup/memory/memory.usage_in_bytes)
    local_total_mem=$(cat /sys/fs/cgroup/memory/memory.limit_in_bytes)
    mem_percent=$(echo "scale=2; ($mem_usage + $local_mem_usage) * 100 / ($total_mem + $local_total_mem)" | bc)

    if (( $(echo "$cpu_percent > $CPU_THRESHOLD" | bc -l) )) || (( $(echo "$mem_percent > $MEM_THRESHOLD" | bc -l) )); then
        if [ ! -f "/app/create_dump.lock" ];then
            echo $cpu_percent $mem_percent
            echo $(($used)) | numfmt --to=iec
            echo $(($total_mem)) | numfmt --to=iec
            ./procdump -pgid $pid /app/dump
            touch /app/create_dump.lock
        fi
    fi

done
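
The script above reads cgroup v1 files (cpuacct.stat, cpu.cfs_quota_us, cpu.cfs_period_us). On nodes that use cgroup v2, those files are absent: CPU time is reported as usage_usec in cpu.stat and the limit as "<quota> <period>" (or "max <period>") in cpu.max. The following is a rough sketch of the equivalent calculation, assuming cgroup v2. The function names are hypothetical and the sample values are made up, so treat it as a starting point rather than a tested replacement.

```shell
#!/bin/bash
# Hypothetical helpers for cgroup v2; not part of procdump.

# Convert the contents of cpu.max ("<quota> <period>" or "max <period>")
# into a limit in millicores. Prints 0 when no limit is set.
cpu_max_to_millicores() {
    local quota period
    read -r quota period <<< "$1"
    if [ "$quota" = "max" ]; then
        echo 0
    else
        echo $(( quota * 1000 / period ))
    fi
}

# Extract usage_usec (total CPU time in microseconds) from cpu.stat text.
cpu_stat_usage_usec() {
    awk '$1 == "usage_usec" { print $2 }' <<< "$1"
}

# Percentage of the limit used.
#   $1 = usage_usec delta, $2 = interval in seconds, $3 = limit in millicores
cgroup2_cpu_percent() {
    local delta_usec=$1 secs=$2 millicores=$3
    echo $(( delta_usec * 100 / (secs * 1000 * millicores) ))
}

# Made-up samples taken 1 second apart, with a 500m limit.
stat_before=$'usage_usec 1000000\nuser_usec 800000\nsystem_usec 200000'
stat_after=$'usage_usec 1250000\nuser_usec 1000000\nsystem_usec 250000'
limit=$(cpu_max_to_millicores "50000 100000")
delta=$(( $(cpu_stat_usage_usec "$stat_after") - $(cpu_stat_usage_usec "$stat_before") ))
echo "$(cgroup2_cpu_percent "$delta" 1 "$limit")%"   # prints 50%
```

On a real node, the sample strings would be replaced with the contents of cpu.stat and cpu.max under the target's cgroup directory, and a limit of 0 (no quota) would need a fallback such as the number of available CPUs.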