From what I can see, it may or may not be a NM problem. I haven't used the VPN for a very long (12hr+) session myself. I could just add something in the configuration file to allow toggling it on or off if people have odd drop-out issues. I wouldn't want to change the way PIA is set up to work out of the box without user interaction, or unless something is seriously wrong with the configurations -- in which case the information would need to be passed on to PIA.
I could just add something in the configuration file to allow toggling it on or off if people have odd drop-out issues
This sounds exactly like what I had in mind. I'll try the modified configuration files for a bit on my system now and let you know if it improves the situation.
Let me know what you find out. I'm in the process of switching jobs/moving, so it wouldn't be a bad idea to wait and see whether the change in options helps you.
Will do, and good luck!
The issue still exists:
16.06.16 19:07 nm-openvpn [Private Internet Access] Inactivity timeout (--ping-restart), restarting
16.06.16 19:07 nm-openvpn SIGUSR1[soft,ping-restart] received, process restarting
16.06.16 19:07 nm-openvpn NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
16.06.16 19:07 nm-openvpn UDPv4 link local: [undef]
16.06.16 19:07 nm-openvpn UDPv4 link remote: [AF_INET]46.166.188.193:1194
16.06.16 19:08 nm-openvpn TLS Error: TLS key negotiation failed to occur within 60 seconds (check your network connectivity)
16.06.16 19:08 nm-openvpn TLS Error: TLS handshake failed
16.06.16 19:08 nm-openvpn SIGUSR1[soft,tls-error] received, process restarting
16.06.16 19:08 nm-openvpn NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
I'm at a loss as to what to do, honestly. The output changed slightly: it at least attempts the handshake now, but because all traffic is still being routed over tun0 (which is down), it will never succeed.
I can't believe I'm the only one with this issue, especially because my setup isn't really special; however, I have a hard time finding similar bug reports. This whole thing would actually fail even one step earlier if I weren't using a local DNS resolver (Unbound), because NetworkManager would attempt to resolve the VPN domain from the configuration file over the broken tunnel...
EDIT: Regarding the DNS resolution, adding this:
--persist-remote-ip Preserve most recently authenticated remote IP address and port number across SIGUSR1 or --ping-restart restarts.
to the client configuration should help; however, the other issue (actually establishing a connection to the VPN IP) still exists.
EDIT 2: I created an issue with NetworkManager here, which goes into a bit more detail.
EDIT 3: After some more investigation, it turns out NetworkManager already adds a direct route to the VPN IP via the default gateway:
46.166.188.241 via 192.168.0.1 dev enp7s0 proto static metric 100
The only thing I can imagine happening right now that would prevent reconnecting to the server is that the DNS resolver returns a different A record for the VPN domain at the time of reconnection -- one for which no extra route is configured -- so the client tries to reach it over the (broken) VPN tunnel instead of through the previous default gateway.
If this is true, adding persist-remote-ip should magically "fix" everything. I'll have to wait and see. Even if it works, it would only be a workaround -- ideally NetworkManager would add a direct route via the previous default gateway for every A record of the VPN's domain. I could easily write a script that takes care of adding those routes manually, along the lines of the sketch below.
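A minimal sketch of that idea (hypothetical and untested; the domain, gateway and interface are just the values appearing in this thread, and it needs root to modify the routing table):

#!/usr/bin/env python
# Hypothetical sketch: add a bypass route for every A record of the VPN domain,
# so a reconnect to any of those addresses never gets routed into the dead tunnel.
import socket
import subprocess

VPN_DOMAIN = 'nl.privateinternetaccess.com'  # assumption: the remote used here
DEFAULT_GATEWAY = '192.168.0.1'              # assumption: taken from the route above
DEFAULT_INTERFACE = 'enp7s0'                 # assumption: likewise

def resolve_all(domain):
    # Collect every IPv4 address the domain currently resolves to.
    infos = socket.getaddrinfo(domain, None, socket.AF_INET)
    return sorted({info[4][0] for info in infos})

for ip in resolve_all(VPN_DOMAIN):
    # The same kind of route NetworkManager adds for the single IP it picked,
    # but for all of them; 'replace' keeps this idempotent.
    subprocess.call(['ip', 'route', 'replace', ip,
                     'via', DEFAULT_GATEWAY, 'dev', DEFAULT_INTERFACE])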
Have you checked the static route it adds to see whether it differs between the VPN being up and the VPN being disconnected?
No, I haven't had a chance to test it yet after adding persist-remote-ip, but if I'm understanding you correctly, that's exactly the issue I think I found: the IP used in the static route that is added to the routing table during the initial connection to the VPN differs from the IP used when OpenVPN attempts to reconnect.
This happens because the configured DNS resolver returns the A records for the VPN domain in random order as rudimentary load balancing. So there is no guarantee that the IP in the added static route is the one used to reconnect to the VPN -- a different one (for which no such route exists) might be chosen, and the reconnect attempt is then routed through the broken VPN tunnel.
So specifying a static IP in the configuration file is one solution; another is specifying persist-remote-ip so that the initially resolved IP is kept across ping-restarts; and a third (what I think should actually happen) is adding static routes for all A records of the VPN domain.

For now I'll try persist-remote-ip to make sure this really is the issue. If it is, I'll probably write a Python script that adds static routes for all A records of the used VPN server, or hope that the bug I filed with NetworkManager gets fixed. For reference, the relevant configuration lines are sketched below.
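A rough idea of what the relevant client configuration lines would look like (a sketch only; the host and port are simply the values appearing in this thread):

remote nl.privateinternetaccess.com 1194
persist-remote-ip
# or, alternatively, pin a single server and sidestep the DNS round-robin entirely:
# remote 46.166.190.129 1194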
Adding persist-remote-ip didn't work:
19.06.16 15:24 nm-openvpn OpenVPN 2.3.11 x86_64-unknown-linux-gnu [SSL (OpenSSL)] [LZO] [EPOLL] [MH] [IPv6] built on May 12 2016
19.06.16 15:24 nm-openvpn library versions: OpenSSL 1.0.2h 3 May 2016, LZO 2.09
19.06.16 15:24 nm-openvpn NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
19.06.16 15:24 nm-openvpn NOTE: UID/GID downgrade will be delayed because of --client, --pull, or --up-delay
19.06.16 15:24 nm-openvpn UDPv4 link local: [undef]
19.06.16 15:24 nm-openvpn UDPv4 link remote: [AF_INET]46.166.190.129:1194
19.06.16 15:24 nm-openvpn [Private Internet Access] Peer Connection Initiated with [AF_INET]46.166.190.129:1194
19.06.16 15:24 nm-openvpn TUN/TAP device tun0 opened
19.06.16 15:24 nm-openvpn /usr/lib/networkmanager/nm-openvpn-service-openvpn-helper --bus-name org.freedesktop.NetworkManager.openvpn.Connection_13 --tun -- tun0 1500 1542 10.126.1.6 10.126.1.5 init
19.06.16 15:24 nm-openvpn GID set to nm-openvpn
19.06.16 15:24 nm-openvpn UID set to nm-openvpn
19.06.16 15:24 nm-openvpn Initialization Sequence Completed
19.06.16 19:08 nm-openvpn [Private Internet Access] Inactivity timeout (--ping-restart), restarting
19.06.16 19:08 nm-openvpn SIGUSR1[soft,ping-restart] received, process restarting
19.06.16 19:08 nm-openvpn NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
19.06.16 19:08 nm-openvpn UDPv4 link local: [undef]
19.06.16 19:08 nm-openvpn UDPv4 link remote: [AF_INET]109.201.154.199:1194
19.06.16 19:09 nm-openvpn TLS Error: TLS key negotiation failed to occur within 60 seconds (check your network connectivity)
19.06.16 19:09 nm-openvpn TLS Error: TLS handshake failed
Compare the remote IP of the initial connection (46.166.190.129) with the one used for the reconnect attempt (109.201.154.199).
And this is after I added persist-remote-ip to /etc/openvpn/Netherlands.conf. So either persist-remote-ip doesn't do what I expect it to, or it simply doesn't work here. I also tried manually adding remote nl.privateinternetaccess.com, but only a single route is added. So now I am writing custom up/down scripts in Python.
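Roughly what I have in mind (a sketch, not the final script; route_net_gateway is one of the environment variables OpenVPN exports to --up scripts, and the script would be hooked in via script-security 2 plus an up directive in the client configuration):

#!/usr/bin/env python
# Sketch of an OpenVPN --up hook: add a bypass route for every A record of the
# VPN domain via the pre-VPN default gateway, so later reconnects to a different
# server IP are not routed through the broken tunnel.
import os
import socket
import subprocess

DOMAIN = 'nl.privateinternetaccess.com'    # assumption: the remote used here
gateway = os.environ['route_net_gateway']  # pre-existing default gateway, set by OpenVPN

for ip in {info[4][0] for info in socket.getaddrinfo(DOMAIN, None, socket.AF_INET)}:
    subprocess.call(['ip', 'route', 'replace', ip, 'via', gateway])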
Calling such a script via up/down directives in the client configuration works when OpenVPN is used from the command line, since the script is properly executed. When the same configuration file is used through NetworkManager-OpenVPN, however, the script directives are completely ignored! So now I have to find a different way to make this work with NetworkManager.
Slowly getting sick of this...
Alright. So dispatcher scripts for NetworkManager receive only a fraction of the information that OpenVPN passes to its own scripts through environment variables, which makes running this script from a dispatcher hook basically impossible.
When using OpenVPN directly, only a single, initially resolved IP address is added to the routes. I am not sure to what extent using OpenVPN directly changes things here (maybe persist-remote-ip is respected?), but it probably works without my script, since I seem to be one of the few people having issues with VPN reconnection -- unless I'm just the only one complaining about it publicly. The issue seems to be specific to the NetworkManager OpenVPN plugin.
What I ended up doing was using the parts from here prefixed with vpn-whitelist-domains. This consists of a script which creates direct routes through the default gateway for certain domains, thus bypassing the VPN entirely; a systemd service that runs this script and whitelists the domains listed in /etc/vpn-whitelist-domains/domains; and a NetworkManager pre-up dispatcher script which runs the systemd service. To the whitelisted domains I simply added nl.privateinternetaccess.com, so that it gets whitelisted whenever the network connection is established.
I am going to try running NetworkManager+OpenVPN for a while once again and see if it works. But yes, as far as I can tell this definitely is a bug in NetworkManager+OpenVPN, and upstream just doesn't seem to care -- if it weren't, I'm sure OpenVPN itself would have fixed it long ago, since it is presumably used directly far more often.
Sorry for all the spam, I just needed to vent my frustration a bit and explain how I figured this out.
I ended up writing fix-networkmanager-openvpn, which actually worked -- however, because networkmanager-openvpn drops privileges, it can no longer modify the tunnel later when it needs to. I might be able to work around this by giving the nm-openvpn group access to /dev/net/tun. Or I could just uninstall NetworkManager-OpenVPN and use OpenVPN via its systemd service, which is exactly what I did.
Let it be noted that I think the OpenVPN plugin for NetworkManager is horrible.
Do you think you could do a write-up on exactly how you fixed it? I might as well add that to the page on the Arch Wiki (or you can yourself).
I solved part of the problem, but discovered another problem with NetworkManager-OpenVPN in the process: fix-networkmanager-openvpn is started as a systemd service on boot and monitors the nm-openvpn journal. As soon as an IP address for a remote link (i.e. the chosen IP address of the VPN server) appears, a route is added that directs all traffic for this IP over the default gateway. The script is basically doing what the OpenVPN plugin should be doing itself.
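In essence it boils down to something like this (a heavily simplified sketch, not the actual script; gateway detection is reduced to a hard-coded assumption, and it needs root for both the journal and the routing table):

#!/usr/bin/env python
# Simplified sketch of the journal-monitoring idea: watch nm-openvpn's output
# for the server IP it is about to contact and add a bypass route for it.
import re
import subprocess

GATEWAY = '192.168.0.1'  # assumption: the physical default gateway
LINK_REMOTE = re.compile(r'link remote: \[AF_INET\](\d+\.\d+\.\d+\.\d+):\d+')

journal = subprocess.Popen(['journalctl', '-f', '-n', '0', '-t', 'nm-openvpn'],
                           stdout=subprocess.PIPE, universal_newlines=True)
for line in journal.stdout:
    match = LINK_REMOTE.search(line)
    if match:
        # 'replace' keeps this idempotent across repeated reconnect attempts.
        subprocess.call(['ip', 'route', 'replace', match.group(1), 'via', GATEWAY])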
However, after fixing this and with the connection finally established, NetworkManager-OpenVPN runs into a problem: it downgraded its UID and GID after creating the tunnel, preventing read/write access to /dev/net/tun and possibly the execution of other related processes when it needs to reconnect to the server.
I "solved" the problem by not using the -- honestly bad and unmaintained -- plugin anymore, and instead simply enabling OpenVPN at boot using systemd: systemctl enable openvpn@name, where name is the filename of a client configuration file in /etc/openvpn without the ".conf" extension.
It turns out that using systemctl enable openvpn@configuration runs into similar issues sooner or later, and the same set of scripts is needed to successfully work with a domain in your client configuration file. Meaning: you probably need a local DNS resolver, or at least a cache; something to keep those DNS entries regularly cached ("warm"), if your resolver or cache respects the domain's TTL and discards the records after that time; and something that monitors the openvpn@configuration log and automatically adds missing routes for newly resolved VPN IPs.
I adapted the last script to do this here. Alternatively, adding persist-remote-ip to the client configuration file might work now, but I haven't tested it so far -- and I honestly have to say that I don't really like that solution.
Here is an excerpt from a reddit post I made about this obvious problem:
If the TTL of the VPN's domain is low (it usually is, ~300 seconds) and you don't attempt to re-resolve the domain within that interval to keep it cached, [openvpn] will still fail to resolve the domain to an address, because your DNS resolver has invalidated the entry and subsequently tries to resolve it by contacting external servers (root DNS servers or whatever forward-zones you configured) -- which it can't reach over the broken tunnel.
To work around this I added this to my unbound configuration:
cache-min-ttl: 3600
prefetch: yes
This forces unbound to cache all A records for at least one hour and to prefetch them whenever a domain is resolved with less than 10% of its TTL remaining (360 seconds, i.e. 6 minutes, in this case). In addition to this, I run a small set of scripts to keep my DNS cache "warmed up".
I didn't make these just for this problem, but because I wanted domains to always resolve as fast as possible -- however, they are useful as a workaround here. You can see in the service file that the domains listed in ~/.warm-up-dns-resolver-domains of the user executing the service are resolved; for me, this file contains the various VPN remote addresses of my VPN provider. The service is executed every 6 minutes (10% of the TTL) through the timer unit, so in theory my unbound instance will always have the address of the VPN cached and be able to resolve it, even without accessing the internet. This works flawlessly.
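For reference, the resolver part of that service boils down to something like this (a sketch, not the actual script; it simply resolves every domain listed in the file so unbound's prefetch logic keeps the records warm, and it is meant to be triggered by the systemd timer mentioned above):

#!/usr/bin/env python
# Sketch of the DNS warm-up step: resolve every domain in the list so the local
# resolver keeps (and prefetches) their records.
import os
import socket

domains_file = os.path.expanduser('~/.warm-up-dns-resolver-domains')
with open(domains_file) as handle:
    for domain in (line.strip() for line in handle):
        if not domain:
            continue
        try:
            socket.getaddrinfo(domain, None)
        except socket.gaierror:
            pass  # ignore domains that currently fail to resolve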
So the configured warm-up-dns-resolver together with the configuration changes to unbound is needed, and the new fix-openvpn systemd unit must be enabled and started for the chosen OpenVPN configuration. If you really want to add this to the Arch Wiki, feel free to do so -- however, the script probably isn't entirely universal and won't work for all configurations. Additionally, I still really hope that I'm doing something inherently wrong, because this whole thing just sucks.
Umm... maybe we can work on this. We'd probably need to use a chat app.
My Google Talk address is cryzed@googlemail.com. IRC is also possible, somewhere on Freenode maybe? Whatever you want.
Wow, I just realized you responded. Sorry. =( I'll drop you a line this weekend.
I have since removed the heavy plumbum dependency from the script. I also made sure that my current OpenVPN configuration really only recovers from disconnects while the script is running: it does -- it's not a coincidence.
You can see exactly what is happening here. My ISP regularly drops my connections every few hours; that is what you see here, resulting in the inactivity timeout. OpenVPN then attempts to reconnect, but fails with the TLS error below it because the IP can't be reached -- no surprise there: at the beginning it was 46.166.190.237 and then 109.201.152.25, and there is no explicit route forwarding traffic for the new IP through the default gateway.
You can then see another connection attempt failing with the same error. At this point I manually run # fix-openvpn openvpn@Netherlands, resulting in a successful connection. Later, OpenVPN attempts to add the new route itself (which was already added by fix-openvpn), but fails because it already exists.
So the problem seems obvious: the route is added too late. It would have to be added before the connection attempt is made, not afterwards. This problem only occurs if you actually use a domain with several A records, i.e. multiple IP addresses, in your VPN configuration; with a static IP you will never run into it -- which is the only way I can explain why this bug still exists, or exists in the first place. I would wager this is the source of most reports you can find on the internet claiming that OpenVPN doesn't automatically reconnect after a disconnect.
The script also theoretically works with NetworkManager: # fix-openvpn nm-openvpn. However, because NetworkManager drops root privileges after the first successfully established connection (at least in its standard configuration on Arch Linux), when the connection is successfully re-established after a disconnect (using fix-openvpn!) it fatally crashes with a write permission error when accessing /dev/net/tun.
This, by the way, removes all routes added by OpenVPN from the routing table and allows all subsequent traffic through the previous default gateway -- a huge security issue if you haven't set up a firewall limiting traffic to the VPN while it is active. This is why I recently claimed to you that "NetworkManager is still shit"; there is probably some way to prevent NetworkManager from dropping root privileges, or to re-acquire them, which I haven't looked into yet. The root problem requiring fix-openvpn exists in both cases, however: when using OpenVPN's systemd openvpn@.service and when using the NetworkManager-openvpn plugin.
I'm not sure if I'm the only one with this issue, but when using NetworkManager + OpenVPN and the configuration files provided by this package, I get regular disconnects every few hours, after which OpenVPN doesn't manage to reconnect.
I suspect this is because NetworkManager thinks the VPN tunnel device is still connected, so when OpenVPN attempts to reconnect to the VPN server it fails. This can happen for two reasons:
I am pretty sure this is a NetworkManager-only issue. It might be worth having the pia Python script automatically remove or comment out the "persist-tun" setting when NetworkManager is in use on the system it is running on.
To elaborate, "persist-tun" does the following:
(Source: https://community.openvpn.net/openvpn/wiki/Openvpn23ManPage)
However, closing and reopening the VPN tunnel would almost certainly work around this issue in NetworkManager. An alternative would be a custom script which monitors the log and somehow tells NetworkManager over D-Bus to drop the tunnel connection, although I'm not sure whether that is really possible (see the rough sketch at the end of this post).
Alternatively, one might consider not using "dev tun" at all and letting NetworkManager+OpenVPN configure the routing accordingly without the VPN tunnel device (I haven't tested this at all!).
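Regarding the log-monitoring idea above, such a watcher could look roughly like this (hypothetical and untested; it shells out to nmcli instead of talking to D-Bus directly, and the connection name is a placeholder):

#!/usr/bin/env python
# Hypothetical sketch: watch the journal for the inactivity timeout and bounce
# the VPN connection through NetworkManager, so the tun device and routes are
# recreated instead of letting the persist-tun restart fail indefinitely.
import subprocess

VPN_CONNECTION = 'Private Internet Access'  # placeholder: the NM connection name

journal = subprocess.Popen(['journalctl', '-f', '-n', '0', '-t', 'nm-openvpn'],
                           stdout=subprocess.PIPE, universal_newlines=True)
for line in journal.stdout:
    if 'Inactivity timeout (--ping-restart)' in line:
        subprocess.call(['nmcli', 'connection', 'down', VPN_CONNECTION])
        subprocess.call(['nmcli', 'connection', 'up', VPN_CONNECTION])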