Open xginn8 opened 4 years ago
[xginn8] This issue has attached support thread https://jel.ly.fish/#/support-thread~b16dd5c1-4ef9-4972-8dac-ce50ea9dcaf6
[saintaardvark] This issue has attached support thread https://jel.ly.fish/38ab2df4-55c2-4171-8ac0-485ba067438d
This issue occurred overnight on my RPi3. Strangely enough, it was at 41 minutes past the hour, as previously reported. Log file here --> openvpn-unit-journal.txt
I don't know why the network is dropping, but I managed to recreate the OpenVPN behaviour by temporarily blocking incoming traffic from port 443 with an iptables rule.
[xginn8] This issue has attached support thread https://jel.ly.fish/50a38ebb-7b5d-4708-ae46-5230910ec13e
This should have been fixed with the merge of #2014 in v2.60. Please only re-open if this same problem is reported above that version.
The merge of #2014 will only fix this for instantaneous reboots where the outage time is less than 60 seconds.
[majorz] This issue has attached support thread https://jel.ly.fish/1b57a2f7-e2b2-4658-94ef-0a35bef04f4b
[majorz] This issue has attached support thread https://jel.ly.fish/78547810-74ac-4ae3-b854-60727ac077c0
I investigated a couple of instances of the ioctl TUNSETIFF error. It happens quite frequently when the VPN connection between the device and our servers is dropped and the OpenVPN client tries to reconnect. When the client succeeds in reestablishing the connection to our servers, the error occurs, the client restarts itself, and on the next attempt everything is fine. So it is not a severe issue, but it still has to be solved, since that restart should not happen.
Found an RPi4 running host OS balenaOS 2.95.8 that exhibits this issue:
Apr 06 19:50:50 xxxxxxx openvpn[1806517]: Wed Apr 6 19:50:50 2022 ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)
Apr 06 19:50:50 xxxxxxx openvpn[1806517]: Wed Apr 6 19:50:50 2022 Exiting due to fatal error
Apr 06 19:50:50 xxxxxxx systemd[1]: openvpn.service: Main process exited, code=exited, status=1/FAILURE
Apr 06 19:50:50 xxxxxxx systemd[1]: openvpn.service: Failed with result 'exit-code'.
Restarting the VPN service does not resolve this issue.
[rhampt] This issue has attached support thread https://jel.ly.fish/c550cc88-af96-4e61-b5e8-a81dd0f47f07
Note that this issue is probably causing https://github.com/balena-io/open-balena-vpn/issues/313, so it is more severe than initially thought.
When OpenVPN is started, it runs as the root user. After it initializes and connects to the server, it drops privileges and runs as the openvpn user.
If for some reason the VPN connection goes stale and no server ping is received for 60 seconds, the client tries to reestablish the connection to the server after a 5-second pause:
[vpn.balena-cloud.com] Inactivity timeout (--ping-restart), restarting
TCP/UDP: Closing socket
SIGUSR1[soft,ping-restart] received, process restarting
Restart pause, 5 second(s)
Re-using SSL/TLS context
...
Because the process is not actually restarted and does not exit, it fails to recreate the tun interface, since it is no longer running as root at that point:
...
PUSH: Received control message: 'PUSH_REPLY,sndbuf 0,rcvbuf 0,route 52.4.252.97,ping 10,ping-restart 60,socket-flags TCP_NODELAY,ifconfig 10.242.26.155 52.4.252.97,peer-id 0,cipher AES-128-GCM'
...
NOTE: Pulled options changed on restart, will need to close and reopen TUN/TAP device.
...
Closing TUN/TAP interface
...
ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)
Exiting due to fatal error
Main process exited, code=exited, status=1/FAILURE
openvpn.service: Failed with result 'exit-code'.
The --ping-restart option is pushed by the server to the client (see the PUSH_REPLY line above). If the server pushes ping-exit instead, the process will simply terminate and be restarted by systemd. When it starts again, it starts as root and does not run into the same problem.
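To check which of these options a server actually pushes, the PUSH_REPLY line from the client log can be split into its individual options. The following is only a minimal sketch: parse_push_reply is a hypothetical helper written for this comment, not part of OpenVPN or balenaOS, and it assumes the log line has the same shape as the one quoted above.

```python
def parse_push_reply(line: str) -> dict[str, str]:
    """Split an OpenVPN PUSH_REPLY log line into {option: arguments}.

    Simplification: if an option is pushed more than once (for example
    several "route" entries), only the last occurrence is kept.
    """
    start = line.index("PUSH_REPLY")
    payload = line[start:].rstrip().rstrip("'")
    options = {}
    for item in payload.split(",")[1:]:  # skip the leading "PUSH_REPLY" token
        name, _, args = item.strip().partition(" ")
        options[name] = args
    return options


# The control message quoted earlier in this thread.
line = ("PUSH: Received control message: 'PUSH_REPLY,sndbuf 0,rcvbuf 0,"
        "route 52.4.252.97,ping 10,ping-restart 60,socket-flags TCP_NODELAY,"
        "ifconfig 10.242.26.155 52.4.252.97,peer-id 0,cipher AES-128-GCM'")
opts = parse_push_reply(line)
for key in ("ping", "ping-restart", "ping-exit"):
    print(key, "=", opts.get(key, "<not pushed>"))
```

For the line above this reports ping-restart as 60 and ping-exit as not pushed, which matches the soft restart seen in the log.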
If such a change cannot currently be made on the server side, the alternative is to remap the SIGUSR1 signal to SIGTERM by adding --remap-usr1 SIGTERM to the client arguments in openvpn.service. In that case the process exits instead of trying to reestablish the connection:
[vpn.balena-cloud.com] Inactivity timeout (--ping-restart), restarting
...
Closing TUN/TAP interface
...
SIGTERM[soft,ping-restart] received, process exiting
If the change is made on the server side, currently deployed devices will no longer incorrectly report heartbeat-only mode. If the change is made on the OS side, the problem will be solved for devices running a newer OS version.
Other methods for addressing this problem also exist (https://community.openvpn.net/openvpn/wiki/UnprivilegedUser), but they require much more substantial changes on both the client and the server side, including adjustments to openvpn.conf on the client side. Since openvpn.conf is currently retrieved online by os-config, this becomes even more difficult, as we have to preserve backwards compatibility with it.
[thgreasi] This has attached https://jel.ly.fish/e7abfa7a-59e7-4326-ae22-1d5c77ef7348
Resolved by https://github.com/balena-io/open-balena-vpn/pull/314
open-balena-vpn v11.19.0 is now in balenaCloud production
We have seen a new instance of this error. It is not as severe, but this time we will have to fix it on the OS side.
The VPN connection was reset for some unknown reason (previously we did not receive ping messages from the server).
Mar 01 14:39:37 b65a222 openvpn[2640]: Wed Mar 1 14:39:37 2023 Connection reset, restarting [0]
Mar 01 14:39:37 b65a222 openvpn[2640]: Wed Mar 1 14:39:37 2023 /etc/openvpn-misc/downscript.sh resin-vpn 1500 1555 10.246.107.185 52.4.252.97 restart
Mar 01 14:39:37 b65a222 openvpn[2640]: Wed Mar 1 14:39:37 2023 SIGUSR1[soft,connection-reset] received, process restarting
The solution for this scenario was explained in the previous message: remap the SIGUSR1 signal to SIGTERM by passing --remap-usr1 SIGTERM to the client arguments in openvpn.service.
Encountered another instance of this, but this time leading to VPN unavailability, so I will look into fixing this with higher priority:
Mar 07 09:41:49 f450cbf openvpn[6998]: Tue Mar 7 09:41:49 2023 Connection reset, restarting [0]
Mar 07 09:41:49 f450cbf openvpn[6998]: Tue Mar 7 09:41:49 2023 /etc/openvpn-misc/downscript.sh resin-vpn 1500 1555 10.241.127.118 52.4.252.97 restart
Mar 07 09:41:49 f450cbf openvpn[6998]: Tue Mar 7 09:41:49 2023 SIGUSR1[soft,connection-reset] received, process restarting
Mar 07 09:41:49 f450cbf openvpn[6998]: Tue Mar 7 09:41:49 2023 Restart pause, 5 second(s)
...
Mar 07 09:41:58 f450cbf openvpn[6998]: Tue Mar 7 09:41:58 2023 ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)
Mar 07 09:41:58 f450cbf openvpn[6998]: Tue Mar 7 09:41:58 2023 Exiting due to fatal error
Mar 07 09:41:58 f450cbf systemd[1]: openvpn.service: Main process exited, code=exited, status=1/FAILURE
Mar 07 09:41:58 f450cbf systemd[1]: openvpn.service: Failed with result 'exit-code'.
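For triage, the failure signature in these logs (a soft SIGUSR1 restart followed by the TUNSETIFF permission error) can be searched for in a journal dump. This is a rough sketch rather than an existing tool: find_failed_soft_restarts is a hypothetical name, and the patterns assume the message wording shown in the excerpts in this thread.

```python
import re

# Patterns taken from the log excerpts in this thread.
RESTART = re.compile(r"SIGUSR1\[soft,([\w-]+)\] received, process restarting")
TUN_EPERM = re.compile(r"ERROR: Cannot ioctl TUNSETIFF \S+: Operation not permitted")
# OpenVPN logs this message once a (re)connection is fully established.
CONNECTED = "Initialization Sequence Completed"


def find_failed_soft_restarts(lines):
    """Return the reasons of soft restarts that ended in the TUNSETIFF error."""
    failures = []
    pending = None  # reason of the latest soft restart that has not resolved yet
    for line in lines:
        match = RESTART.search(line)
        if match:
            pending = match.group(1)
        elif CONNECTED in line:
            pending = None  # the restart succeeded, nothing to report
        elif pending is not None and TUN_EPERM.search(line):
            failures.append(pending)
            pending = None
    return failures


journal = [
    "Mar 07 09:41:49 f450cbf openvpn[6998]: Tue Mar 7 09:41:49 2023 SIGUSR1[soft,connection-reset] received, process restarting",
    "Mar 07 09:41:49 f450cbf openvpn[6998]: Tue Mar 7 09:41:49 2023 Restart pause, 5 second(s)",
    "Mar 07 09:41:58 f450cbf openvpn[6998]: Tue Mar 7 09:41:58 2023 ERROR: Cannot ioctl TUNSETIFF resin-vpn: Operation not permitted (errno=1)",
    "Mar 07 09:41:58 f450cbf openvpn[6998]: Tue Mar 7 09:41:58 2023 Exiting due to fatal error",
]
print(find_failed_soft_restarts(journal))  # prints ['connection-reset']
```

The returned reason distinguishes the earlier ping-restart case from the connection-reset case seen here; both go through the same SIGUSR1 path.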
[majorz] This has attached https://jel.ly.fish/3b115ca6-f2ab-4ffc-a11c-54811547eb15
We may attempt to push "remap-usr1 SIGTERM" on the server side, similarly to how we handled ping-exit.
Remapping this may have side effects, so it may or may not be a good solution. Probably not.
In https://github.com/balena-os/meta-balena/commit/12205bf2ae99efdb6dd96ad4f76e1c28001aff7a we configured openvpn to de-escalate privileges and run as the openvpn user & group. As noted in the openvpn docs, this de-escalation causes a hard failure (https://openvpn.net/community-resources/reference-manual-for-openvpn-2-0/):
The daemon crashes with the following error after it receives a different address PUSHed to it from the openvpn server:
Specifically, this issue is related to https://github.com/balena-os/meta-balena/issues/1776 as upon the ping-restart described in the ticket, the remote PUSHes a new address which the daemon cannot apply.
cc @wrboyce @afitzek