wget zombie processes

somera commented 9 months ago

Current Behavior

Today, after logging in to my system, I saw this message:

=> There are 2 zombie processes.

This is new. I normally don't have any zombie processes on my system.

ps shows

$ ps aux | grep 'Z'
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
xxxxxx      6717  0.0  0.0      0     0 ?        Z    20:17   0:00 [wget] <defunct>
xxxxxx      7857  0.0  0.0      0     0 ?        Z    20:18   0:00 [wget] <defunct>
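
Note that `grep 'Z'` matches any line containing a capital Z, which is why the `VSZ` header line shows up as well. A minimal sketch that filters on the STAT column instead and also shows each zombie's parent PID; `filter_zombies` is a hypothetical helper name, not something from the original report:

```shell
# Keep the header line plus every row whose STAT column starts with Z.
# Expects `ps -eo pid,ppid,stat,comm` style input on stdin.
filter_zombies() {
  awk 'NR == 1 || $3 ~ /^Z/'
}

# Usage on a live system:
#   ps -eo pid,ppid,stat,comm | filter_zombies
```

The PPID column tells you which process is failing to reap the zombie.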

Next, I looked at the process tree:

           ├─containerd-shim(3136)─┬─java(3940)─┬─wget(6717)
           │                       │            ├─wget(7857)
           │                       │            ├─{java}(4781)
           │                       │            ├─{java}(4812)
           │                       │            ├─{java}(4884)
           │                       │            ├─{java}(4893)

And

$ ps aux | grep 3940
xxxxxx      3940 19.9  2.6 15333688 849660 ?     Ssl  20:16   1:49 java -XX:+UseParallelGC -XX:MaxRAMPercentage=90.0 --add-opens java.base/java.util.concurrent=ALL-UNNAMED -Dlogback.configurationFile=logback.xml -DdependencyTrack.logging.level=DEBUG -jar dependency-track-apiserver.jar -context /
xxxxxx     16571  0.0  0.0   6608  2432 pts/0    S+   20:26   0:00 grep --color=auto 3940

After I stopped the DT container, the zombie processes were gone. DT was idle the whole time. After rebooting my system I saw the same behavior.

What is happening here? Why are there zombie processes?

My docker-compose:

version: '3.7'

services:
  dtrack-apiserver:
    image: dependencytrack/apiserver
    environment:
      - TZ=Europe/Berlin
      # Database Properties
      - ALPINE_DATABASE_MODE=external
      - ALPINE_DATABASE_URL=jdbc:postgresql://xx.xx.xx.xx:5432/dtrack
      - ALPINE_DATABASE_DRIVER=org.postgresql.Driver
      - ALPINE_DATABASE_USERNAME=dtrack
      - ALPINE_DATABASE_PASSWORD=xxx
      - ALPINE_DATABASE_POOL_ENABLED=true
      - ALPINE_DATABASE_POOL_MAX_SIZE=10
      - ALPINE_DATABASE_POOL_MIN_IDLE=2
      - ALPINE_DATABASE_POOL_IDLE_TIMEOUT=300000
      - ALPINE_DATABASE_POOL_MAX_LIFETIME=600000

      - LOGGING_LEVEL=DEBUG
    deploy:
      resources:
        limits:
          memory: 12288m
        reservations:
          memory: 8192m
      restart_policy:
        condition: on-failure
    ports:
      - '7071:8080'
    volumes:
      - "/data-files/data/docker/dependency-track:/data"
      - "/etc/timezone:/etc/timezone:ro"
      - "/etc/localtime:/etc/localtime:ro"
    restart: unless-stopped

  dtrack-frontend:
    image: dependencytrack/frontend
    depends_on:
      - dtrack-apiserver
    environment:
      - TZ=Europe/Berlin
      - API_BASE_URL=http://xx.xx.xx.xx:7071
    volumes:
      - "/etc/timezone:/etc/timezone:ro"
      - "/etc/localtime:/etc/localtime:ro"
    ports:
      - "7070:8080"
    restart: unless-stopped

Steps to Reproduce

Just start DT in docker and check the system.

Expected Behavior

No zombie processes.

Dependency-Track Version

4.9.1

Dependency-Track Distribution

Container Image

Database Server

PostgreSQL

Database Server Version

16

Browser

Google Chrome

Checklist

[X] I have read and understand the contributing guidelines
[X] I have checked the existing issues for whether this defect was already reported

somera commented 9 months ago

The first zombie process was 10 hours old. After a reboot I had two zombie processes.

Now there is only one, after I started the DT Docker container again.

$ ps aux | grep 'Z'
USER         PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
xxxxxx     48069  0.0  0.0      0     0 ?        Z    21:03   0:00 [wget] <defunct>

somera commented 9 months ago

Inside the dependencytrack/apiserver container, more /proc/<pid>/status shows:

Name:   wget
State:  Z (zombie)
Tgid:   65
Ngid:   0
Pid:    65
PPid:   1
TracerPid:      0
Uid:    1000    1000    1000    1000
Gid:    1000    1000    1000    1000
FDSize: 0
Groups: 1000
NStgid: 65
NSpid:  65
NSpgid: 59
NSsid:  59
Threads:        1
SigQ:   0/126315
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000001000
SigCgt: 0000000008000201
CapInh: 0000000000000000
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        2
Seccomp_filters:        1
Speculation_Store_Bypass:       thread vulnerable
SpeculationIndirectBranch:      conditional enabled
Cpus_allowed:   f
Cpus_allowed_list:      0-3
Mems_allowed:   00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000000,000000
00,00000000,00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        2
nonvoluntary_ctxt_switches:     6
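
Of the fields above, `State` and `PPid` matter most for this report. `PPid: 1` means the zombie's parent is PID 1 of the container's PID namespace, presumably the java process seen in the process tree earlier, and a zombie stays in the process table until its parent reaps it. A small sketch for pulling out just those fields; `status_summary` is a hypothetical helper name:

```shell
# Print only the Name, State and PPid fields of a /proc/<pid>/status file.
# $1: path to the status file, e.g. /proc/65/status
status_summary() {
  awk -F':[ \t]*' '$1 == "Name" || $1 == "State" || $1 == "PPid" { print $1 "=" $2 }' "$1"
}

# Usage inside the container:
#   status_summary /proc/65/status
```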

nscuro commented 9 months ago

wget is used for the container's health check: https://github.com/DependencyTrack/dependency-track/blob/d8464427fa993404961563e97bdd5f2564b4f7ce/src/main/docker/Dockerfile#L76

somera commented 9 months ago

Thanks. But it looks like the health check never finishes. And in the end it should not show up as a zombie process.

Right?

nscuro commented 9 months ago

So, any idea why this would happen? I'd expect the container runtime to ensure that health check processes are properly terminated. The command we use has a timeout of 3s, there is nothing in it that should cause it to stay around for longer than that.

somera commented 9 months ago

Hm ... perhaps add a timeout to wget too?

wget --connect-timeout=5 htt

or add

HEALTHCHECK --interval=30s --timeout=3s --start-period=XXs

or switch to curl

HEALTHCHECK --interval=30s --timeout=3s --start-period=15s CMD curl --fail localhost:8080/health || exit 1

Do you see this zombie process in your environment too?

somera commented 9 months ago

I think I found the problem. On my NUC, startup takes longer.

I started the container and then called this:

$ time wget http://192.168.178.30:7071/health
--2023-11-27 22:07:45--  http://192.168.178.30:7071/health
Connecting to 192.168.178.30:7071... connected.
HTTP request sent, awaiting response... 200 OK
Length: 124 [application/json]
Saving to: ‘health’

health                                                               100%[=====================>]     124  --.-KB/s    in 0s

2023-11-27 22:08:21 (8,88 MB/s) - ‘health’ saved [124/124]

real    0m36,389s
user    0m0,001s
sys     0m0,007s

I called this from outside the docker container. How can I see in the log, that the health endpoint is available?
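
The 36 s response measured here is far above the 3 s health check timeout mentioned earlier in the thread, so a probe started during startup cannot finish within that limit. A minimal sketch of bounding a probe yourself, assuming coreutils `timeout` is available; `probe` is a hypothetical helper name, and the wget flags in the usage comment are the ones quoted for the image's health check later in this thread:

```shell
# Run a command under a hard time limit.
# $1: limit in seconds; remaining arguments: the command to run.
# Exit status 124 means the limit was hit (coreutils `timeout` convention).
probe() {
  limit="$1"
  shift
  timeout "$limit" "$@"
}

# Usage:
#   probe 3 wget --no-proxy -q -O /dev/null http://127.0.0.1:8080/health
```

A status of 124 separates "too slow" from an ordinary failure such as a refused connection.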

nscuro commented 9 months ago

Do you see this zombie process in your environment too?

Nope. My production systems run in k8s which invokes health checks from outside the container. I am also not seeing this locally on my laptop.

How can I see in the log, that the health endpoint is available?

All HTTP endpoints including /health will be available once this is logged:

INFO [AlpineServlet] Dependency-Track is ready
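
As a sketch of checking for that line from a script: the helper below succeeds once the readiness message appears in the log text it reads on stdin. `log_shows_ready` and the `$container` variable are hypothetical names for illustration; the matched string is the log line quoted above.

```shell
# Succeeds (exit 0) if the log text on stdin contains the readiness message.
log_shows_ready() {
  grep -q 'Dependency-Track is ready'
}

# Usage, with $container set to the API server container's name or ID:
#   docker logs "$container" 2>&1 | log_shows_ready && echo "ready"
```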

I think adding a timeout to the wget command itself, and adding --start-period 60s are valid additions though.

somera commented 9 months ago

Thx. You should increase the values or make this configurable in docker-compose.

Start time

Finish start

Not everyone has a Threadripper. ;)

nscuro commented 9 months ago

You should increase the values or make this configurable in docker-compose.

It already is configurable: https://docs.docker.com/compose/compose-file/compose-file-v3/#healthcheck

Not everyone has a Threadripper. ;)

Neither do I :)

somera commented 9 months ago

Thx. I didn't know that. Now I added

    healthcheck:
      #disable: true
      test: wget --no-verbose --tries=1 --spider http://127.0.0.1:8080/health || exit 1
      interval: 2m
      timeout: 3s
      retries: 3
      start_period: 15s

It looks better now. I no longer see the zombie process.

gray380 commented 9 months ago

I've faced the same issue with zombies after updating from 4.8.1 to 4.9.1. The difference in the healthchecks:

4.8.1
HEALTHCHECK &{["CMD-SHELL" "wget --no-proxy -q -O /dev/null http://127.0.0.1:8080${CONTEXT}health || exit 1"] "30s" "3s" "0s" '\x00'}

4.9.1
HEALTHCHECK &{["CMD-SHELL" "wget --no-proxy -q -O /dev/null http://127.0.0.1:8080${CONTEXT}health || exit 1"] "30s" "3s" "0s" "0s" '\x00'}

somera commented 9 months ago

I updated today to 4.9.1 and I removed my fix to test https://github.com/DependencyTrack/dependency-track/pull/3245 and I see the zombie wget process again. The fix is not working.

DependencyTrack / dependency-track

wget zombie processes #3243
