balena-os / meta-balena

A collection of Yocto layers used to build balenaOS images
https://www.balena.io/os

openvpn deescalates privileges which causes a hard failure on reconnect to different endpoint #1779

Open xginn8 opened 4 years ago

xginn8 commented 4 years ago

In https://github.com/balena-os/meta-balena/commit/12205bf2ae99efdb6dd96ad4f76e1c28001aff7a we configured openvpn to de-escalate privileges and run as the openvpn user & group.

As noted in the openvpn docs, this de-escalation causes a hard failure (https://openvpn.net/community-resources/reference-manual-for-openvpn-2-0/):

Note the following corner case: If you use multiple --remote options, AND you are dropping root privileges on the client with --user and/or --group, AND the client is running a non-Windows OS, if the client needs to switch to a different server, and that server pushes back different TUN/TAP or route settings, the client may lack the necessary privileges to close and reopen the TUN/TAP interface. This could cause the client to exit with a fatal error.

The daemon crashes with the following error after it receives a different address PUSHed to it from the openvpn server:

Dec 17 23:15:15 XXXXXX openvpn[946]: Tue Dec 17 23:15:15 2019 ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)

Specifically, this issue is related to https://github.com/balena-os/meta-balena/issues/1776 as upon the ping-restart described in the ticket, the remote PUSHes a new address which the daemon cannot apply.

cc @wrboyce @afitzek

balena-ci commented 4 years ago

[xginn8] This issue has attached support thread https://jel.ly.fish/#/support-thread~b16dd5c1-4ef9-4972-8dac-ce50ea9dcaf6

jellyfish-bot commented 4 years ago

[saintaardvark] This issue has attached support thread https://jel.ly.fish/38ab2df4-55c2-4171-8ac0-485ba067438d

markcorbinuk commented 4 years ago

This issue occurred overnight last night on my RPI3 - strangely enough it was at 41 minutes past the hour as previously reported. Log file here --> openvpn-unit-journal.txt

Don't know why the network is dropping, but managed to recreate the openvpn behaviour by temporarily blocking incoming traffic from port 443 using an iptables rule.

jellyfish-bot commented 3 years ago

[xginn8] This issue has attached support thread https://jel.ly.fish/50a38ebb-7b5d-4708-ae46-5230910ec13e

alexgg commented 3 years ago

This should have been fixed with the merge of #2014 in v2.60. Please only re-open if this same problem is reported above that version.

markcorbinuk commented 3 years ago

The merge of #2014 will only fix this for near-instantaneous reboots, where the outage time is <60 seconds.

jellyfish-bot commented 2 years ago

[majorz] This issue has attached support thread https://jel.ly.fish/1b57a2f7-e2b2-4658-94ef-0a35bef04f4b

jellyfish-bot commented 2 years ago

[majorz] This issue has attached support thread https://jel.ly.fish/78547810-74ac-4ae3-b854-60727ac077c0

majorz commented 2 years ago

I investigated a couple of instances of the ioctl TUNSETIFF error. It happens quite frequently when the VPN connection between the device and our servers drops and the OpenVPN client tries to reconnect. Once the client re-establishes the connection to our servers, the error occurs and the client restarts itself; on the next attempt everything is fine. So it is not a severe issue, but it still has to be solved, since that restart should not happen.

20k-ultra commented 2 years ago

Found a rpi4 with Host OS version balenaOS 2.95.8 exhibiting this issue

Apr 06 19:50:50 xxxxxxx openvpn[1806517]: Wed Apr  6 19:50:50 2022 ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)
Apr 06 19:50:50 xxxxxxx openvpn[1806517]: Wed Apr  6 19:50:50 2022 Exiting due to fatal error
Apr 06 19:50:50 xxxxxxx systemd[1]: openvpn.service: Main process exited, code=exited, status=1/FAILURE
Apr 06 19:50:50 xxxxxxx systemd[1]: openvpn.service: Failed with result 'exit-code'.

Restarting the VPN service does not resolve this issue.

jellyfish-bot commented 2 years ago

[rhampt] This issue has attached support thread https://jel.ly.fish/c550cc88-af96-4e61-b5e8-a81dd0f47f07

majorz commented 1 year ago

Note that this issue is probably causing https://github.com/balena-io/open-balena-vpn/issues/313, so it is more severe than initially thought.

OpenVPN starts as the root user. After it initializes and connects to the server, it drops privileges and runs as the openvpn user.

If for some reason the VPN connection goes stale and no server ping is received for 60 seconds, the client pauses for 5 seconds and then tries to re-establish the connection to the server:

[vpn.balena-cloud.com] Inactivity timeout (--ping-restart), restarting
TCP/UDP: Closing socket
SIGUSR1[soft,ping-restart] received, process restarting
Restart pause, 5 second(s)
Re-using SSL/TLS context
...

Because the process is not actually restarted and never exits, it is no longer running as root at that point and fails to recreate the tun interface:

...
PUSH: Received control message: 'PUSH_REPLY,sndbuf 0,rcvbuf 0,route 52.4.252.97,ping 10,ping-restart 60,socket-flags TCP_NODELAY,ifconfig 10.242.26.155 52.4.252.97,peer-id 0,cipher AES-128-GCM'
...
NOTE: Pulled options changed on restart, will need to close and reopen TUN/TAP device.
...
Closing TUN/TAP interface
...
ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)
Exiting due to fatal error
Main process exited, code=exited, status=1/FAILURE
openvpn.service: Failed with result 'exit-code'.
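To check how often a device hits this sequence, the journal can be scanned for a soft SIGUSR1 restart that is followed by the TUNSETIFF failure in the same openvpn process. The following is only a rough sketch: the function name is made up for illustration, and it assumes journal-formatted lines that carry the `openvpn[PID]` prefix, as in the other excerpts in this thread.

```python
import re
from typing import Dict, Iterable, List, Tuple

# Patterns taken from the log excerpts quoted in this thread.
PID = re.compile(r"openvpn\[(\d+)\]")
SOFT_RESTART = re.compile(r"SIGUSR1\[soft,([^\]]+)\] received, process restarting")
TUNSETIFF_EPERM = re.compile(r"ERROR: Cannot ioctl TUNSETIFF \S+: Operation not permitted")


def find_failed_soft_restarts(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (pid, restart_reason) for every soft restart that was
    followed by a TUNSETIFF 'Operation not permitted' error in the same process."""
    pending: Dict[str, str] = {}  # pid -> reason of the most recent soft restart
    failures: List[Tuple[str, str]] = []
    for line in lines:
        pid_match = PID.search(line)
        if not pid_match:
            continue  # not an openvpn journal line
        pid = pid_match.group(1)
        restart = SOFT_RESTART.search(line)
        if restart:
            pending[pid] = restart.group(1)
        elif TUNSETIFF_EPERM.search(line) and pid in pending:
            failures.append((pid, pending.pop(pid)))
    return failures
```

It could be fed the lines of `journalctl -u openvpn.service` output; each returned pair names the process and the restart reason (for example `ping-restart` or `connection-reset`) that preceded the fatal error.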

The --ping-restart option is pushed by the server to the client (see the PUSH_REPLY line in the log above). If the server pushed ping-exit instead, the process would simply terminate and be restarted by systemd. It would then start as root again and would not run into the same problem.
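For illustration only, a server-side change of this kind might look like the fragment below. It is a hypothetical excerpt, not taken from the open-balena-vpn configuration. The values mirror the `ping 10,ping-restart 60` seen in the PUSH_REPLY, and the fragment assumes the server currently uses `keepalive 10 60`, which in server mode expands to the local `ping`/`ping-restart` settings plus the two pushes.

```
# Hypothetical OpenVPN server config excerpt (assumption, not the deployed file).
# Replaces "keepalive 10 60", which would push ping-restart to clients.
ping 10
ping-restart 120
push "ping 10"
# Push ping-exit instead of ping-restart, so the client exits on timeout
# and is started again (as root) by its service manager.
push "ping-exit 60"
```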

If such a change cannot currently be made on the server side, the alternative is to remap the SIGUSR1 signal to SIGTERM by adding --remap-usr1 SIGTERM to the client arguments in openvpn.service. In that case the process exits instead of trying to re-establish the connection:

[vpn.balena-cloud.com] Inactivity timeout (--ping-restart), restarting
...
Closing TUN/TAP interface
...
SIGTERM[soft,ping-restart] received, process exiting
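As a rough sketch of the client-side variant, the flag could be added through a systemd drop-in like the one below. Everything except `--remap-usr1 SIGTERM` is a placeholder: the drop-in path, the binary path and the other arguments are assumptions and would have to be copied from the openvpn.service actually shipped in the OS. The `Restart=` line is likewise an assumption about how to make sure systemd starts the process again after it exits.

```ini
# Hypothetical drop-in, e.g. /etc/systemd/system/openvpn.service.d/remap-usr1.conf
[Service]
# Clear the inherited command before redefining it.
ExecStart=
# Placeholder command line; only --remap-usr1 SIGTERM comes from this thread.
ExecStart=/usr/sbin/openvpn --config /etc/openvpn/openvpn.conf --remap-usr1 SIGTERM
# Assumption: have systemd restart the service whenever the process exits.
Restart=always
```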

If the change is made on the server side, currently deployed devices will no longer incorrectly report heartbeat-only mode. If the change is made on the OS side, the problem will be solved only for devices running a newer OS version.

Other methods for addressing this problem also exist (https://community.openvpn.net/openvpn/wiki/UnprivilegedUser), but they require much more substantial changes on both the client and server side, including adjusting openvpn.conf on the client. Since openvpn.conf is currently retrieved online by os-config, this would be even more difficult, as we have to preserve backwards compatibility with it.

jellyfish-bot commented 1 year ago

[thgreasi] This has attached https://jel.ly.fish/e7abfa7a-59e7-4326-ae22-1d5c77ef7348

klutchell commented 1 year ago

Resolved by https://github.com/balena-io/open-balena-vpn/pull/314

open-balena-vpn v11.19.0 is now in balenaCloud production

majorz commented 1 year ago

We have seen a new instance of this error. It is not as severe this time, but it will have to be fixed on the OS side.

The VPN connection was reset for an unknown reason (in the earlier cases, the cause was that ping messages from the server were not received).

Mar 01 14:39:37 b65a222 openvpn[2640]: Wed Mar  1 14:39:37 2023 Connection reset, restarting [0]
Mar 01 14:39:37 b65a222 openvpn[2640]: Wed Mar  1 14:39:37 2023 /etc/openvpn-misc/downscript.sh resin-vpn 1500 1555 10.246.107.185 52.4.252.97 restart
Mar 01 14:39:37 b65a222 openvpn[2640]: Wed Mar  1 14:39:37 2023 SIGUSR1[soft,connection-reset] received, process restarting

The fix for this scenario was explained in the previous message: remap the SIGUSR1 signal to SIGTERM by passing --remap-usr1 SIGTERM in the client arguments in openvpn.service.

majorz commented 1 year ago

Encountered another instance of this, this time leading to VPN unavailability, so I will give fixing it higher priority:

Mar 07 09:41:49 f450cbf openvpn[6998]: Tue Mar  7 09:41:49 2023 Connection reset, restarting [0]
Mar 07 09:41:49 f450cbf openvpn[6998]: Tue Mar  7 09:41:49 2023 /etc/openvpn-misc/downscript.sh resin-vpn 1500 1555 10.241.127.118 52.4.252.97 restart
Mar 07 09:41:49 f450cbf openvpn[6998]: Tue Mar  7 09:41:49 2023 SIGUSR1[soft,connection-reset] received, process restarting
Mar 07 09:41:49 f450cbf openvpn[6998]: Tue Mar  7 09:41:49 2023 Restart pause, 5 second(s)
...
Mar 07 09:41:58 f450cbf openvpn[6998]: Tue Mar  7 09:41:58 2023 ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)
Mar 07 09:41:58 f450cbf openvpn[6998]: Tue Mar  7 09:41:58 2023 Exiting due to fatal error
Mar 07 09:41:58 f450cbf systemd[1]: openvpn.service: Main process exited, code=exited, status=1/FAILURE
Mar 07 09:41:58 f450cbf systemd[1]: openvpn.service: Failed with result 'exit-code'.

jellyfish-bot commented 1 year ago

[majorz] This has attached https://jel.ly.fish/3b115ca6-f2ab-4ffc-a11c-54811547eb15

majorz commented 1 year ago

We may attempt to do a push "remap-usr1 SIGTERM" on the server side, similarly to how we handled ping-exit.

majorz commented 1 year ago

Remapping this may have possible side-effects, so may or may not be a good solution. Probably not.