krm3 opened this issue 5 months ago
Hi,
On Wed, Jan 17, 2024 at 03:16:28PM -0800, krm3 wrote:
We use the openvpn packages for Debian bookworm from https://build.openvpn.net/debian/openvpn/. As we ran into #449 with 2.6.7 (thanks @patcable for reporting), we noticed that in the systemd unit file for openvpn, KillMode is set to 'process' and not 'control-group'. Therefore, after every segfault there were zombies left over, and when TasksMax was reached (which is set to 10) the openvpn service could not start again. That's why the segfault behaviour led to a complete openvpn service outage for us.
This does not sound like what should happen. If OpenVPN crashes while it has child processes (such as for an auth plugin, or anything else), these should be re-parented to systemd, and no zombies should ever appear.
Zombie processes happen if the parent process is still alive but not properly calling wait() on its child processes - but if the parent process dies (SIGSEGV), this scenario cannot happen.
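For what it's worth, the distinction is easy to demonstrate. Here is a small Python sketch (Linux-specific, since it reads /proc) showing what a true zombie looks like - an exited child whose parent has not yet called wait() - and how reaping removes it:

```python
import os
import time

def proc_state(pid):
    # The third field of /proc/<pid>/stat is the one-letter process state.
    with open(f"/proc/{pid}/stat") as f:
        return f.read().split()[2]

pid = os.fork()
if pid == 0:
    os._exit(0)                 # child exits immediately

time.sleep(0.2)                 # parent deliberately does NOT wait() yet
state = proc_state(pid)
print(state)                    # 'Z': a true zombie -- exited but unreaped

os.waitpid(pid, 0)              # reaping the exit status removes the zombie
gone = not os.path.exists(f"/proc/{pid}")
print(gone)                     # True: the process entry is gone
```

If the parent itself died instead of sleeping, the child would be re-parented to PID 1 (or a subreaper such as systemd), which reaps it - which is exactly why a crashed OpenVPN should not leave zombies behind.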
My question is: what is the reason that KillMode is set to 'process' here? The systemd manual page says: "Note that it is not recommended to set KillMode= to process or even none, as this allows processes to escape the service manager's lifecycle and resource management, and to remain running even while their service is considered stopped and is assumed to not consume any resources."
No process OpenVPN starts is expected to live for a long time or even beyond OpenVPN ending, so it's somewhat moot whether the primary process or everything is signalled.
gert
-- "If was one thing all people took for granted, was conviction that if you feed honest figures into a computer, honest figures come out. Never doubted it myself till I met a computer with a sense of humor." Robert A. Heinlein, The Moon is a Harsh Mistress
Gert Doering - Munich, Germany @.***
This sounds not like what should happen. If OpenVPN crashes, and has current child processes (like for auth plugin, or anything else), these should be re-parented to systemd, and no zombies should ever happen. Zombie processes happen if the parent process is still there and is not properly calling wait() on its child processes - but if the parent process dies (SIGSEGV), this scenario can not happen.
I think "zombie" was the wrong term. I think the processes were still parented to systemd. I will try to reproduce this with 2.6.7 and investigate further and then come back.
Klara
We have four openvpn services on one node (udp/ipv6, udp/ipv4, and the same pair pushing split routes instead of a default route). On 2024-01-26 01:31 I installed 2.6.7 again and started the services. Soon after, segfaults must have happened (though I see nothing in the logs). When I looked at the services about 7 hours later, the timestamps showed that the services had restarted, and one service already had 7 tasks (the normal state is 2). I deactivated the node in the loadbalancer so no new sessions could be established. Recent excerpt from systemctl status:
● openvpn@tun6u.service - OpenVPN connection to tun6u
Active: active (running) since Fri 2024-01-26 11:53:39 CET; 3 days ago
Tasks: 3 (limit: 10)
● openvpn@tun4u.service - OpenVPN connection to tun4u
Active: active (running) since Fri 2024-01-26 08:16:37 CET; 3 days ago
Tasks: 7 (limit: 10)
● openvpn@tun6us.service - OpenVPN connection to tun6us
Active: active (running) since Fri 2024-01-26 01:31:13 CET; 3 days ago
Tasks: 2 (limit: 10)
● openvpn@tun4us.service - OpenVPN connection to tun4us
Active: active (running) since Fri 2024-01-26 01:31:13 CET; 3 days ago
Tasks: 2 (limit: 10)
Whole output for openvpn@tun4u.service:
● openvpn@tun4u.service - OpenVPN connection to tun4u
Loaded: loaded (/lib/systemd/system/openvpn@.service; enabled; preset: enabled)
Active: active (running) since Fri 2024-01-26 08:16:37 CET; 3 days ago
Docs: man:openvpn(8)
https://community.openvpn.net/openvpn/wiki/Openvpn24ManPage
https://community.openvpn.net/openvpn/wiki/HOWTO
Main PID: 11345 (openvpn)
Status: "Initialization Sequence Completed"
Tasks: 7 (limit: 10)
Memory: 11.6M
CPU: 15min 45.190s
CGroup: /system.slice/system-openvpn.slice/openvpn@tun4u.service
├─ 1338 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
├─ 7956 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
├─ 8327 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
├─10482 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
├─11117 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
├─11345 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
└─11347 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
I think this is not what should happen. When the Tasks limit is reached, the service cannot start again. This is what happened to us after we upgraded to 2.6.7.
Are you using plugins in your configs? If yes, which plugin is used?
The fact that you have "Tasks: 2" in steady state is unusual, but is normal when using plugin-auth-pam (for example), because that one forks to keep root privileges (and do deferred auth).
So I guess there is a plugin bug involved - the plugin not noticing that OpenVPN died, and thus not exiting. So, not a zombie in the Unix sense ("a process that has already exited, with no parent calling wait() to reap the status"), because a zombie wouldn't have a command line visible anymore.
So we should see if this plugin bug can be fixed (and of course make sure that OpenVPN won't SIGSEGV again...) - but that said, it does make sense for systemd to kill all child processes as well in this case.
Depending on the source of the debian unit file, it won't be on us (upstream) to fix it... I'll ping the debian maintainer for his opinion.
Yes, we are using plugin-auth-pam. Thank you very much, that makes sense and sounds good to me.
Debian Maintainer here. You are using openvpn@.service, which is a unit shipped only by Debian, but the upstream-provided openvpn-server@.service has the same issue. I agree that we should probably just change the KillMode. However, I'm not sure why the processes are stuck here at all. I have only seen that with DCO when the kernel module hung, and in that case changing the KillMode will probably not help you (the processes are unkillable).
Can you kill the processes manually by PID? Does it help to locally override KillMode=control-group (the default)?
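For anyone wanting to test this locally: a minimal drop-in override might look like the following (the path is the standard systemd drop-in location; `systemctl edit openvpn@.service` creates it for you):

```ini
# /etc/systemd/system/openvpn@.service.d/override.conf
# (created e.g. via: systemctl edit openvpn@.service)
[Service]
KillMode=control-group
```

followed by `systemctl daemon-reload` and a restart of the affected instances, so that the override takes effect.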
I'm fairly sure that this is a bug / misfeature in plugin-auth-pam - it forks, and both parts talk via a socketpair, but I'm not sure the client side ever notices if the parent goes away. So it should be killable just fine, but I'll look into fixing this.
I do wonder if there is a possible drawback to changing the KillMode in the general case, like a sub-process failing to clean up "something" when being killed by systemd instead of being signalled by OpenVPN. I do not know of anything concrete, though.
Can you kill the processes manually by PID?
Yes, it works:
root@ovpn-l3-mgmt-110:~# ps waux | grep openvpn | grep tun4u | grep -v tun4us
root 1338 0.0 0.0 13344 4264 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 7956 0.0 0.0 13344 4160 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 8327 0.0 0.0 13344 4232 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 10482 0.0 0.0 13344 4244 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 11117 0.0 0.0 13344 4212 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
openvpn 11345 0.2 0.0 15560 11448 ? Ss Jan26 17:34 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 11347 0.0 0.0 13344 4312 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root@ovpn-l3-mgmt-110:~# kill 1338
root@ovpn-l3-mgmt-110:~# ps waux | grep openvpn | grep tun4u | grep -v tun4us
root 7956 0.0 0.0 13344 4160 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 8327 0.0 0.0 13344 4232 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 10482 0.0 0.0 13344 4244 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 11117 0.0 0.0 13344 4212 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
openvpn 11345 0.2 0.0 15560 11448 ? Ss Jan26 17:35 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 11347 0.0 0.0 13344 4312 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
Does it help to locally override KillMode=control-group (the default)?
Just implemented this on another node. We will see if the number of tasks remains 2.
Seems to help. The number of tasks is still 2 for all services, although the services have obviously been restarted, i.e. segfaults have occurred.