krm3 opened this issue 5 months ago
Hi,
On Wed, Jan 17, 2024 at 03:16:28PM -0800, krm3 wrote:
We use the openvpn packages for Debian bookworm from https://build.openvpn.net/debian/openvpn/. As we ran into #449 with 2.6.7 (thanks @patcable for reporting), we noticed that in the systemd unit file for openvpn, KillMode is set to 'process' and not 'control-group'. Therefore, after every segfault there were zombies left over, and when TasksMax was reached (which is set to 10) the openvpn service could not start again. That's why the segfault behaviour led to a complete openvpn service outage for us.
This does not sound like what should happen. If OpenVPN crashes while it has child processes (such as for an auth plugin, or anything else), these should be re-parented to systemd, and no zombies should ever appear.
Zombie processes happen if the parent process is still alive but not properly calling wait() on its child processes - but if the parent process dies (SIGSEGV), this scenario cannot happen.
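For what it's worth, the distinction is easy to demonstrate. Here is a small Python sketch (Linux-specific, since it reads /proc) showing what a true zombie looks like - an exited child whose parent has not yet called wait() - and how reaping removes it:

```python
import os
import time

def proc_state(pid):
    # The third field of /proc/<pid>/stat is the one-letter process state.
    with open(f"/proc/{pid}/stat") as f:
        return f.read().split()[2]

pid = os.fork()
if pid == 0:
    os._exit(0)                 # child exits immediately

time.sleep(0.2)                 # parent deliberately does NOT wait() yet
state = proc_state(pid)
print(state)                    # 'Z': a true zombie -- exited but unreaped

os.waitpid(pid, 0)              # reaping the exit status removes the zombie
gone = not os.path.exists(f"/proc/{pid}")
print(gone)                     # True: the process entry is gone
```

If the parent itself died instead of sleeping, the child would be re-parented to PID 1 (or a subreaper such as systemd), which reaps it - which is exactly why a crashed OpenVPN should not leave zombies behind.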
My question is: what is the reason that KillMode is set to 'process' here? The systemd manual page says: "Note that it is not recommended to set KillMode= to process or even none, as this allows processes to escape the service manager's lifecycle and resource management, and to remain running even while their service is considered stopped and is assumed to not consume any resources."
No process OpenVPN starts is expected to live for a long time or even beyond OpenVPN ending, so it's somewhat moot whether the primary process or everything is signalled.
gert
-- "If was one thing all people took for granted, was conviction that if you feed honest figures into a computer, honest figures come out. Never doubted it myself till I met a computer with a sense of humor." Robert A. Heinlein, The Moon is a Harsh Mistress
Gert Doering - Munich, Germany @.***
This sounds not like what should happen. If OpenVPN crashes, and has current child processes (like for auth plugin, or anything else), these should be re-parented to systemd, and no zombies should ever happen. Zombie processes happen if the parent process is still there and is not properly calling wait() on its child processes - but if the parent process dies (SIGSEGV), this scenario can not happen.
I think "zombie" was the wrong term. I think the processes were still parented to systemd. I will try to reproduce this with 2.6.7 and investigate further and then come back.
Klara
We have four openvpn services on one node (udp/ipv6, udp/ipv4, and the same pair pushing split routes instead of a default route). On 2024-01-26 01:31 I installed 2.6.7 again and started the services. Soon after, segfaults must have happened (though I see nothing in the logs). When I looked at the services about 7 hours later, the timestamps showed that the services had restarted, and one service already had 7 tasks (the normal state is 2). I deactivated the node in the loadbalancer so no new sessions could be established. Recent excerpt from systemctl status:
● openvpn@tun6u.service - OpenVPN connection to tun6u
Active: active (running) since Fri 2024-01-26 11:53:39 CET; 3 days ago
Tasks: 3 (limit: 10)
● openvpn@tun4u.service - OpenVPN connection to tun4u
Active: active (running) since Fri 2024-01-26 08:16:37 CET; 3 days ago
Tasks: 7 (limit: 10)
● openvpn@tun6us.service - OpenVPN connection to tun6us
Active: active (running) since Fri 2024-01-26 01:31:13 CET; 3 days ago
Tasks: 2 (limit: 10)
● openvpn@tun4us.service - OpenVPN connection to tun4us
Active: active (running) since Fri 2024-01-26 01:31:13 CET; 3 days ago
Tasks: 2 (limit: 10)
Whole output for openvpn@tun4u.service:
● openvpn@tun4u.service - OpenVPN connection to tun4u
Loaded: loaded (/lib/systemd/system/openvpn@.service; enabled; preset: enabled)
Active: active (running) since Fri 2024-01-26 08:16:37 CET; 3 days ago
Docs: man:openvpn(8)
https://community.openvpn.net/openvpn/wiki/Openvpn24ManPage
https://community.openvpn.net/openvpn/wiki/HOWTO
Main PID: 11345 (openvpn)
Status: "Initialization Sequence Completed"
Tasks: 7 (limit: 10)
Memory: 11.6M
CPU: 15min 45.190s
CGroup: /system.slice/system-openvpn.slice/openvpn@tun4u.service
├─ 1338 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
├─ 7956 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
├─ 8327 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
├─10482 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
├─11117 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
├─11345 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
└─11347 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
I think this is not what should happen. When the Tasks limit is reached, the service cannot start again. This is what happened to us after we upgraded to 2.6.7.
Are you using plugins in your configs? If yes, which plugin is used?
The fact that you have "Tasks: 2" in steady state is unusual, but is normal when using plugin-auth-pam (for example), because that one forks to keep root privileges (and do deferred auth).
So I guess there is a plugin bug involved - the plugin not noticing that OpenVPN died, and thus not exiting. So, not a zombie in the Unix sense ("a process that has already exited, with no parent calling wait() to reap the status"), because a zombie wouldn't have a command line visible anymore.
So we should see if this plugin bug can be fixed (and of course make sure that OpenVPN won't SIGSEGV again...) - but that said, it does make sense for systemd to kill all child processes as well in this case.
Depending on the source of the debian unit file, it won't be on us (upstream) to fix it... I'll ping the debian maintainer for his opinion.
Yes, we are using plugin-auth-pam. Thank you very much, that makes sense and sounds good to me.
Debian Maintainer here. You are using openvpn@.service, which is a unit shipped only by Debian, but the upstream-provided openvpn-server@.service has the same issue. I agree that we should probably just change the KillMode. However, I'm not sure why the processes are stuck here at all. I have only seen that with DCO when the kernel module hung, and in that case changing the KillMode will probably not help you (the processes are unkillable).
Can you kill the processes manually by PID? Does it help to locally override KillMode=control-group (the default)?
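For anyone wanting to test this locally: a minimal drop-in override might look like the following (the path is the standard systemd drop-in location; `systemctl edit openvpn@.service` creates it for you):

```ini
# /etc/systemd/system/openvpn@.service.d/override.conf
# (created e.g. via: systemctl edit openvpn@.service)
[Service]
KillMode=control-group
```

followed by `systemctl daemon-reload` and a restart of the affected instances, so that the override takes effect.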
I'm fairly sure that this is a bug / misfeature in plugin-auth-pam - it forks, and both parts talk via a socketpair, but I'm not sure the client side ever notices if the parent goes away. So it should be killable just fine, but I'll look into fixing this.
I do wonder if there is a possible drawback to changing the KillMode in the general case, like a sub-process failing to clean up "something" when being killed by systemd instead of being signalled by OpenVPN. I do not know of anything concrete, though.
Can you kill the processes manually by PID?
Yes, it works:
root@ovpn-l3-mgmt-110:~# ps waux | grep openvpn | grep tun4u | grep -v tun4us
root 1338 0.0 0.0 13344 4264 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 7956 0.0 0.0 13344 4160 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 8327 0.0 0.0 13344 4232 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 10482 0.0 0.0 13344 4244 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 11117 0.0 0.0 13344 4212 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
openvpn 11345 0.2 0.0 15560 11448 ? Ss Jan26 17:34 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 11347 0.0 0.0 13344 4312 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root@ovpn-l3-mgmt-110:~# kill 1338
root@ovpn-l3-mgmt-110:~# ps waux | grep openvpn | grep tun4u | grep -v tun4us
root 7956 0.0 0.0 13344 4160 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 8327 0.0 0.0 13344 4232 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 10482 0.0 0.0 13344 4244 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 11117 0.0 0.0 13344 4212 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
openvpn 11345 0.2 0.0 15560 11448 ? Ss Jan26 17:35 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
root 11347 0.0 0.0 13344 4312 ? S Jan26 0:00 /usr/sbin/openvpn --daemon ovpn-tun4u --status /run/openvpn/tun4u.status 10 --cd /etc/openvpn --config /etc/openvpn/tun4u.conf --writepid /run/openvpn/tun4u.pid
Does it help to locally override KillMode=control-group (the default)?
Just implemented this on another node. We will see if the number of tasks remains 2.
Seems to help. The number of tasks is still 2 for all services, although the services have obviously been restarted, i.e. segfaults have occurred.