caddyserver / caddy-docker

Source for the official Caddy v2 Docker Image
https://hub.docker.com/_/caddy
Apache License 2.0

Caddy generating bunches of ssl_client zombie processes #276

Closed joshavant closed 1 year ago

joshavant commented 1 year ago

After launching a Caddy deployment that's exposed to the public internet, it seems that Caddy is leaving failed ssl_client processes open and not cleaning them up.

When I run ps -ax on my Docker Host, I see thousands of these:

  73220 ?        Z      0:00 [ssl_client] <defunct>
  73240 ?        Z      0:00 [ssl_client] <defunct>
  73257 ?        Z      0:00 [ssl_client] <defunct>
  73276 ?        Z      0:00 [ssl_client] <defunct>
  73294 ?        Z      0:00 [ssl_client] <defunct>
  73312 ?        Z      0:00 [ssl_client] <defunct>
  73330 ?        Z      0:00 [ssl_client] <defunct>
  73350 ?        Z      0:00 [ssl_client] <defunct>
  73368 ?        Z      0:00 [ssl_client] <defunct>
  73388 ?        Z      0:00 [ssl_client] <defunct>
  73409 ?        Z      0:00 [ssl_client] <defunct>
  73425 ?        Z      0:00 [ssl_client] <defunct>
  73442 ?        Z      0:00 [ssl_client] <defunct>
  73461 ?        Z      0:00 [ssl_client] <defunct>
  73477 ?        Z      0:00 [ssl_client] <defunct>
  73495 ?        Z      0:00 [ssl_client] <defunct>
  73513 ?        Z      0:00 [ssl_client] <defunct>

When I check the parent process of all of these, it's Caddy:

ubuntu@server:~$ cat /proc/73220/status | grep PPid
PPid:   1371
ubuntu@server:~$ cat /proc/1371/status
Name:   caddy

Also, my MOTD on SSH login tells me: => There are 3821 zombie processes.

These thousands of zombie processes are causing my host server to use lots of memory and, eventually, grind to a halt, requiring a restart.

System + Version Information

Relevant docker-compose.yml section:

services:
  caddy:
    image: caddy:2.6.2
    restart: always
    environment:
      - INTERNET_HOST
      - DOCKER_HOST
      - DOCKER_PORT
    ports:
      - 80:80
      - 443:443
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
    healthcheck:
      test: [ "CMD", "wget", "--spider", "http://localhost/caddy-health-check" ]
      interval: 10s
      timeout: 5s
      retries: 5

Caddyfile

{$INTERNET_HOST} {
    reverse_proxy {$DOCKER_HOST}:{$DOCKER_PORT}
    respond /caddy-health-check 200
}

Any thoughts on how I could stop this?

mholt commented 1 year ago

It seems that Caddy is leaving these failed SSH processes open and not cleaning them up.

What evidence do you have that Caddy is starting SSH processes?

So far I'm not convinced this is an issue with Caddy or has anything to do with Caddy at all, as Caddy doesn't invoke or use SSH.

joshavant commented 1 year ago

@mholt When I check the parent process of all of these, it's Caddy:

ubuntu@server:~$ cat /proc/73220/status | grep PPid
PPid:   1371
ubuntu@server:~$ cat /proc/1371/status
Name:   caddy
mholt commented 1 year ago

But where are you seeing SSH processes?

joshavant commented 1 year ago

When I run ps -ax on my Docker Host, I see thousands of these:

  73220 ?        Z      0:00 [ssl_client] <defunct>
  73240 ?        Z      0:00 [ssl_client] <defunct>
  73257 ?        Z      0:00 [ssl_client] <defunct>
  73276 ?        Z      0:00 [ssl_client] <defunct>
  73294 ?        Z      0:00 [ssl_client] <defunct>
  73312 ?        Z      0:00 [ssl_client] <defunct>
  73330 ?        Z      0:00 [ssl_client] <defunct>
  73350 ?        Z      0:00 [ssl_client] <defunct>
  73368 ?        Z      0:00 [ssl_client] <defunct>
  73388 ?        Z      0:00 [ssl_client] <defunct>
  73409 ?        Z      0:00 [ssl_client] <defunct>
  73425 ?        Z      0:00 [ssl_client] <defunct>
  73442 ?        Z      0:00 [ssl_client] <defunct>
  73461 ?        Z      0:00 [ssl_client] <defunct>
  73477 ?        Z      0:00 [ssl_client] <defunct>
  73495 ?        Z      0:00 [ssl_client] <defunct>
  73513 ?        Z      0:00 [ssl_client] <defunct>
mholt commented 1 year ago

But where do you see SSH? Sorry, I don't understand.

joshavant commented 1 year ago

Ack, sorry. I have horribly misread that log. Updating my report. 😅

mholt commented 1 year ago

Ah...

Well, I don't know what the ssl_client process is. Caddy doesn't invoke external processes like that directly.

A Google search suggests it might have something to do with healthchecks? I'm not a Docker user though. Check your config, perhaps: https://github.com/authelia/authelia/issues/1605

francislavoie commented 1 year ago

I concur with Matt, that's not evidence that it's SSH. I can guarantee that with vanilla Caddy, no SSH is happening.

Also FYI, you're missing volumes for /data and /config, so you're at risk of data loss when recreating the Caddy container. That would mean your managed certs and keys have to be reissued. That's not good.

And you should add - 443:443/udp to your port mappings to allow UDP traffic for HTTP/3.

And I recommend using the unless-stopped restart policy instead of always, so that you can manually stop Caddy when you need intentional downtime to fix something. It's just more flexible in general.
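Putting those suggestions together, a sketch of what the service could look like (the named volumes caddy_data and caddy_config are just illustrative; bind mounts work too):

services:
  caddy:
    image: caddy:2.6.2
    restart: unless-stopped
    environment:
      - INTERNET_HOST
      - DOCKER_HOST
      - DOCKER_PORT
    ports:
      - 80:80
      - 443:443
      - 443:443/udp
    volumes:
      - ./Caddyfile:/etc/caddy/Caddyfile
      - caddy_data:/data
      - caddy_config:/config

volumes:
  caddy_data:
  caddy_config: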

joshavant commented 1 year ago

@mholt These zombie processes are all owned by the Caddy process. See the following actual bash output:

Note the command in the middle of the following output (cat /proc/104115/status | grep PPid): it checks the parent process of the ssl_client zombie process and returns process ID 1371, which, when checked, is caddy.

ubuntu@server:~$ ps -ax
<thousands of duplicate lines truncated for readability>
 104115 ?        Z      0:00 [ssl_client] <defunct>
ubuntu@server:~$ cat /proc/104115/status
Name:   ssl_client
State:  Z (zombie)
Tgid:   104115
Ngid:   0
Pid:    104115
PPid:   1371
TracerPid:  0
Uid:    0   0   0   0
Gid:    0   0   0   0
FDSize: 0
Groups: 0 1 2 3 4 6 10 11 20 26 27 
NStgid: 104115  37411
NSpid:  104115  37411
NSpgid: 104109  37405
NSsid:  104109  37405
Threads:    1
SigQ:   0/7241
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp:    2
Seccomp_filters:    1
Speculation_Store_Bypass:   thread vulnerable
SpeculationIndirectBranch:  unknown
Cpus_allowed:   3
Cpus_allowed_list:  0-1
Mems_allowed:   00000000,00000001
Mems_allowed_list:  0
voluntary_ctxt_switches:    2
nonvoluntary_ctxt_switches: 7
ubuntu@server:~$ cat /proc/104115/status | grep PPid
PPid:   1371
ubuntu@server:~$ cat /proc/1371/status
Name:   caddy
Umask:  0022
State:  S (sleeping)
Tgid:   1371
Ngid:   0
Pid:    1371
PPid:   1295
TracerPid:  0
Uid:    0   0   0   0
Gid:    0   0   0   0
FDSize: 64
Groups: 0 1 2 3 4 6 10 11 20 26 27 
NStgid: 1371    1
NSpid:  1371    1
NSpgid: 1371    1
NSsid:  1371    1
VmPeak:   751968 kB
VmSize:   751968 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:     37004 kB
VmRSS:     20148 kB
RssAnon:       12052 kB
RssFile:        8096 kB
RssShmem:          0 kB
VmData:    53924 kB
VmStk:       132 kB
VmExe:     18052 kB
VmLib:         4 kB
VmPTE:       172 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
CoreDumping:    0
THP_enabled:    1
Threads:    8
SigQ:   0/7241
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: fffffffd7fc1feff
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp:    2
Seccomp_filters:    1
Speculation_Store_Bypass:   thread vulnerable
SpeculationIndirectBranch:  unknown
Cpus_allowed:   3
Cpus_allowed_list:  0-1
Mems_allowed:   00000000,00000001
Mems_allowed_list:  0
voluntary_ctxt_switches:    126778
nonvoluntary_ctxt_switches: 14678
ubuntu@server:~$ 

@francislavoie - Thanks for the helpful reply. Re: /data and /config - I use custom env variables for those, so I edited them out for clarity. Re: HTTP/3 traffic - Thanks! Is it possible that's related to this issue? Re: unless-stopped - Thanks. Still learning some Docker Compose things.

francislavoie commented 1 year ago

Please don't "clean up" or omit anything when asking for support. It might be relevant without you realizing it. We've seen that happen way too often. You'll tell yourself "well it can't be this", but... it often ends up being that.

Turn on the debug global option in your Caddyfile, then reload Caddy. Show us what's in your logs. Without seeing what Caddy is actually doing, we can't suggest anything else.

Also please show us what the values of your environment variables look like, because it has an effect on the generated JSON config Caddy actually runs with.
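For reference, debug goes in a global options block at the top of the Caddyfile; a minimal sketch based on the config above:

{
    debug
}

{$INTERNET_HOST} {
    reverse_proxy {$DOCKER_HOST}:{$DOCKER_PORT}
    respond /caddy-health-check 200
}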

mholt commented 1 year ago

Did the advice in the issue I linked above really not have any effect on things?

joshavant commented 1 year ago

I will try disabling the health check and report back.

Or, if that fails, I'll turn on the debug global option, collect data, and then report back with that.

mholt commented 1 year ago

Great, thank you!

jjlin commented 1 year ago

Your healthcheck command doesn't actually work. If you run it from inside the container, you probably get something like

/srv # wget --spider http://localhost/caddy-health-check
Connecting to localhost (127.0.0.1:80)
Connecting to localhost (127.0.0.1:443)
140586838326088:error:14094438:SSL routines:ssl3_read_bytes:tlsv1 alert internal error:ssl/record/rec_layer_s3.c:1543:SSL alert number 80
ssl_client: SSL_connect
wget: error getting response: Connection reset by peer

which explains where the ssl_client comes from.

Orthogonal to the correctness of the healthcheck, you should be able to "fix" the zombie issue by adding init: true to your Compose file.
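That's a one-line addition under the service; a minimal sketch based on the Compose file above:

services:
  caddy:
    image: caddy:2.6.2
    init: true
    # ...rest of the service unchanged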

francislavoie commented 1 year ago

Thanks @jjlin, that makes sense -- making a request to http://localhost will only work if you actually have a site block http://localhost in your Caddyfile config. If you don't, then Caddy will redirect it to HTTPS, which is an "SSL" connection (although SSL is a deprecated term, it's called TLS now; pet peeve of mine), and the connection fails because Caddy doesn't have a certificate ready for localhost.
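So if you do want to keep the healthcheck, one way to make it work is to add a plain-HTTP localhost site that serves the health endpoint, so wget never hits the HTTPS redirect; a sketch reusing the Caddyfile from above:

{$INTERNET_HOST} {
    reverse_proxy {$DOCKER_HOST}:{$DOCKER_PORT}
    respond /caddy-health-check 200
}

http://localhost {
    respond /caddy-health-check 200
}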

Shouldn't init: true only be needed if a non-default command is used for running the container? The default command is caddy run and it handles signals properly on its own.

jjlin commented 1 year ago

I haven't looked into what Caddy does, but I doubt it handles everything that a typical init replacement like tini or dumb-init would. I don't think Caddy would fork-and-exec anything normally, though, so it shouldn't really be an issue anyway.

joshavant commented 1 year ago

After removing the health check, this issue was resolved.

Thanks to everyone for the helpful attention on this!