Closed joshavant closed 1 year ago
It seems that Caddy is leaving these failed SSH processes open and not cleaning them up.
What evidence do you have that Caddy is starting SSH processes?
So far I'm not convinced this is an issue with Caddy or has anything to do with Caddy at all, as Caddy doesn't invoke or use SSH.
@mholt When I check what the parent process of all of these is, it's Caddy:
ubuntu@server:~$ cat /proc/73220/status | grep PPid
PPid: 1371
ubuntu@server:~$ cat /proc/1371/status
Name: caddy
But where are you seeing SSH processes?
When I run ps -ax on my Docker host, I see thousands of these:
73220 ? Z 0:00 [ssl_client] <defunct>
73240 ? Z 0:00 [ssl_client] <defunct>
73257 ? Z 0:00 [ssl_client] <defunct>
73276 ? Z 0:00 [ssl_client] <defunct>
73294 ? Z 0:00 [ssl_client] <defunct>
73312 ? Z 0:00 [ssl_client] <defunct>
73330 ? Z 0:00 [ssl_client] <defunct>
73350 ? Z 0:00 [ssl_client] <defunct>
73368 ? Z 0:00 [ssl_client] <defunct>
73388 ? Z 0:00 [ssl_client] <defunct>
73409 ? Z 0:00 [ssl_client] <defunct>
73425 ? Z 0:00 [ssl_client] <defunct>
73442 ? Z 0:00 [ssl_client] <defunct>
73461 ? Z 0:00 [ssl_client] <defunct>
73477 ? Z 0:00 [ssl_client] <defunct>
73495 ? Z 0:00 [ssl_client] <defunct>
73513 ? Z 0:00 [ssl_client] <defunct>
But where do you see SSH? Sorry, I don't understand.
Ack, sorry. I have horribly misread that log. Updating my report. 😅
Ah...
Well, I don't know what the ssl_client process is. Caddy doesn't invoke external processes like that directly.
A Google search suggests it's maybe something to do with healthchecks? I'm not a Docker user though. Check your config, perhaps: https://github.com/authelia/authelia/issues/1605
I concur with Matt, that's not evidence that it's SSH. I can guarantee that with vanilla Caddy, no SSH is happening.
Also FYI, you're missing volumes for /data and /config, so you're at risk of data loss when recreating Caddy's containers. That means your managed certs and keys would have to be reissued. That's not good.
And you should add - 443:443/udp to your port mappings to allow UDP traffic for HTTP/3.
And I recommend using the unless-stopped restart policy instead of always, so that you can manually stop Caddy when you need to fix something and want intentional downtime. It's just more flexible in general.
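Taken together, those suggestions look roughly like this in a Compose file (a sketch only; the image tag, service name, and volume names are illustrative, not taken from the reporter's config):

```yaml
services:
  caddy:
    image: caddy:2            # illustrative tag
    restart: unless-stopped   # allows manual, intentional stops
    ports:
      - 80:80
      - 443:443
      - 443:443/udp           # UDP for HTTP/3
    volumes:
      - caddy_data:/data      # certs and keys survive container recreation
      - caddy_config:/config

volumes:
  caddy_data:
  caddy_config:
```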
@mholt These zombie processes are all owned by the Caddy process. See the following actual bash output:
Note the command in the middle of the following output (cat /proc/104115/status | grep PPid), which checks the parent process of the ssl_client zombie process: it returns process ID 1371, which, when checked, is caddy.
ubuntu@server:~$ ps -ax
<thousands of duplicate lines truncated for readability>
104115 ? Z 0:00 [ssl_client] <defunct>
ubuntu@server:~$ cat /proc/104115/status
Name: ssl_client
State: Z (zombie)
Tgid: 104115
Ngid: 0
Pid: 104115
PPid: 1371
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 0
Groups: 0 1 2 3 4 6 10 11 20 26 27
NStgid: 104115 37411
NSpid: 104115 37411
NSpgid: 104109 37405
NSsid: 104109 37405
Threads: 1
SigQ: 0/7241
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 2
Seccomp_filters: 1
Speculation_Store_Bypass: thread vulnerable
SpeculationIndirectBranch: unknown
Cpus_allowed: 3
Cpus_allowed_list: 0-1
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 2
nonvoluntary_ctxt_switches: 7
ubuntu@server:~$ cat /proc/104115/status | grep PPid
PPid: 1371
ubuntu@server:~$ cat /proc/1371/status
Name: caddy
Umask: 0022
State: S (sleeping)
Tgid: 1371
Ngid: 0
Pid: 1371
PPid: 1295
TracerPid: 0
Uid: 0 0 0 0
Gid: 0 0 0 0
FDSize: 64
Groups: 0 1 2 3 4 6 10 11 20 26 27
NStgid: 1371 1
NSpid: 1371 1
NSpgid: 1371 1
NSsid: 1371 1
VmPeak: 751968 kB
VmSize: 751968 kB
VmLck: 0 kB
VmPin: 0 kB
VmHWM: 37004 kB
VmRSS: 20148 kB
RssAnon: 12052 kB
RssFile: 8096 kB
RssShmem: 0 kB
VmData: 53924 kB
VmStk: 132 kB
VmExe: 18052 kB
VmLib: 4 kB
VmPTE: 172 kB
VmSwap: 0 kB
HugetlbPages: 0 kB
CoreDumping: 0
THP_enabled: 1
Threads: 8
SigQ: 0/7241
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: fffffffd7fc1feff
CapInh: 0000000000000000
CapPrm: 00000000a80425fb
CapEff: 00000000a80425fb
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
NoNewPrivs: 0
Seccomp: 2
Seccomp_filters: 1
Speculation_Store_Bypass: thread vulnerable
SpeculationIndirectBranch: unknown
Cpus_allowed: 3
Cpus_allowed_list: 0-1
Mems_allowed: 00000000,00000001
Mems_allowed_list: 0
voluntary_ctxt_switches: 126778
nonvoluntary_ctxt_switches: 14678
ubuntu@server:~$
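The /proc inspection above can be automated. A minimal sketch that tallies zombie processes grouped by parent PID, assuming a procps-style ps and awk are available:

```shell
# List zombie (state Z) processes, grouped and counted by parent PID.
# "ppid=,stat=" suppresses column headers; awk filters on the state column.
ps -eo ppid=,stat= | awk '$2 ~ /^Z/ {n[$1]++} END {for (p in n) print n[p], p}' | sort -rn
```

On the host above, this should print a single line pairing the zombie count with PPid 1371, the Caddy process.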
@francislavoie - Thanks for the helpful reply.
Re: /data and /config - I use custom env variables for that, so I edited those out for clarity.
Re: HTTP/3 traffic - Thanks! Is it possible that's related to this issue?
Re: unless-stopped - Thanks. Still learning some Docker Compose things.
Please don't "clean up" or omit anything when asking for support. It might be relevant without you realizing it. We've seen that happen way too often. You'll tell yourself "well it can't be this", but... it often ends up being that.
Turn on the debug global option in your Caddyfile, then reload Caddy. Show us what's in your logs. Without seeing what Caddy is actually doing, we can't suggest anything else.
Also please show us what the values of your environment variables look like, because it has an effect on the generated JSON config Caddy actually runs with.
Did the advice in the issue I linked above really not have any effect on things?
I will try disabling the health check and report back.
Or, if that fails, I'll turn on the debug global option, collect data, and then report back with that.
Great, thank you!
Your healthcheck command doesn't actually work. If you run it from inside the container, you probably get something like
/srv # wget --spider http://localhost/caddy-health-check
Connecting to localhost (127.0.0.1:80)
Connecting to localhost (127.0.0.1:443)
140586838326088:error:14094438:SSL routines:ssl3_read_bytes:tlsv1 alert internal error:ssl/record/rec_layer_s3.c:1543:SSL alert number 80
ssl_client: SSL_connect
wget: error getting response: Connection reset by peer
which explains where the ssl_client comes from.
Orthogonal to the correctness of the healthcheck, you should be able to "fix" the zombie issue by adding init: true to your Compose file.
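For reference, that's a one-line addition per service (a sketch; the image tag is illustrative):

```yaml
services:
  caddy:
    image: caddy:2   # illustrative tag
    init: true       # run an init process as PID 1 that reaps zombie children
```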
Thanks @jjlin, that makes sense -- making a request to http://localhost will only work if you actually have a site block for http://localhost in your Caddyfile config. If you don't, then Caddy will redirect it to HTTPS, which is an "SSL" connection (although SSL is a deprecated term, it's called TLS now; pet peeve of mine), and it will fail to connect because Caddy doesn't have a certificate ready for localhost.
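A minimal sketch of a Caddyfile site block that would let the healthcheck's plain-HTTP request succeed (the /caddy-health-check path comes from the healthcheck command above; the status code is illustrative):

```
http://localhost {
	respond /caddy-health-check 200
}
```

With an explicit http:// site block, Caddy serves the request over plain HTTP instead of redirecting to HTTPS, so wget --spider never triggers ssl_client.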
Shouldn't init: true only be needed if a non-default command is used for running the container? The default command is caddy run and it handles signals properly on its own.
I haven't looked into what Caddy does, but I doubt it handles everything that a typical init replacement like tini or dumb-init would. I don't think Caddy would fork-and-exec anything normally, though, so it shouldn't really be an issue anyway.
After removing the health check, this issue was resolved.
Thanks to everyone for the helpful attention on this!
After launching a Caddy deployment that's exposed to the public internet, it seems that Caddy is leaving failed ssl_client processes open and not cleaning them up. When I run ps -ax on my Docker host, I see thousands of these. When I check what the parent process of all of these is, it's Caddy.
As well, my motd tells me on SSH login that:
=> There are 3821 zombie processes.
These thousands of zombie processes are causing my host server to use lots of memory and, eventually, grind to a halt, requiring a restart.
System + Version Information
Relevant docker-compose.yml section:
Caddyfile:
Any thoughts how I could stop this?