OpenVPN / openvpn

OpenVPN is an open source VPN daemon
http://openvpn.net
Other
10.74k stars 2.99k forks source link

Avoid unlimited reconnects with failing client connections #525

Closed mika closed 6 months ago

mika commented 6 months ago

Describe the bug

A VPN client (OpenVPN v2.6.9 with Tunnelblick v4) fails to connect to a OpenVPN server, due to:

OPTIONS ERROR: failed to negotiate cipher with server.  Add the server's cipher ('AES-128-CBC') to --data-ciphers (currently 'AES-256-GCM:AES-128-GCM:CHACHA20-POLY1305') if you want to connect to this server.

But it then tries to reconnect again and again, showing up 30 times per minute, never stopping those connection attempts, and also without any rate limiting or exponential backoff or alike (in the end this might cause a potential overload of the system running the OpenVPN server, as has been actually observed in the relevant infrastructure where I had to debug this issue).

To Reproduce

I suppose the easiest option is by using a modern OpenVPN client with modern data-ciphers and an older/incompatible OpenVPN server (see above).

Expected behavior

If a connection fails, then the client can and should retry, but not by reconnecting every one to two seconds without any limits.

Version information (please complete the following information):

FTR, when using Tunnelblick v4.0.1 with OpenVPN v2.5.9 the cipher issue wasn't present, only when upgrading to more recent OpenVPN this behavior showed up and causing this misbehavior with the server side then.

Additional context

Corresponding client log (anonymized) in verb 4 mode for such a failing connection:

tunnelblick.log

Happy to provide any further information, do further client side tests/configuration changes/….

(Disclaimer: this are not my own server nor client systems, but I was involved in this incident as a troubleshooter and have access to the according folks behind those systems.)

Thanks for OpenVPN! :)

selvanair commented 6 months ago

We have a restart delay and a back-off mechanism to avoid such rapid reconnection attempts. The start and max values of delay can be customized by arguments to connect-retry, but the default values are reasonable.

However, this mechanism is not kicking in here which I think is due to the following: When a management client (i.e., a UI such as Tunnelblick) is present and has set management-hold, the delaying action is delegated to that UI. The UI will get the delay time in the message >HOLD:Waiting for hold release. Its up to the UI to enforce it to or do something else like exit or release right away -- it seems tunnelblick promptly releases the hold ignoring the suggested delay, causing an immediate reconnection attempt.

In Windows GUI also we do not process this "wait-time" argument, but there we do not keep hold on once the process has started. In such cases OpenVPN core handles the delay.

If a UI wants to keep hold on active, I think it should also process the delay time during hold release. Or, for finer control, implement its own back-off logic depending on the reason for the connection failure.

This is documented in management-notes.

cron2 commented 6 months ago

Oh, that is interesting. @selvanair so you say "openvpn core is doing the right thing, if left alone" but this is an unfortunate interaction with Tunnelblick overriding built-in delays? Copying in @jkbullard...

selvanair commented 6 months ago

but this is an unfortunate interaction with Tunnelblick overriding built-in delays?

Yes, the core will not enforce the delay by itself when management has set "hold on". Instead will pass the suggested delay time to Tunnelblick which can decide to release the hold right away or after a delay. @jkbullard may have some reason to keep the hold on, but overlooked the delay argument passed in the "waiting for hold release" message.

cron2 commented 6 months ago

Is this HOLD:Waiting for hold release:<nnn> message not logged by the core? I can only find the STATE updates

2024-03-20 10:13:47.274979 SIGUSR1[soft,process-push-msg-failed] received, process restarting
2024-03-20 10:13:47.274999 MANAGEMENT: >STATE:1710926027,RECONNECTING,process-push-msg-failed,,,,,
2024-03-20 10:13:47.275817 MANAGEMENT: CMD 'hold release'

need to have a look at the code, it seems :-)

cron2 commented 6 months ago

... ok it seems it is logged, but with M_CLIENT (which seems to only send it to the management interface), while the MANAGEMENT: >State message is logged with D_MANAGEMENT... this could use an extra msg() call...

selvanair commented 6 months ago

Would it be useful to impose the delay in the core to handle UI's that do not play nice? Something like, set an expiry time and wait until then if management hold returns too early. If the UI decides to exit, that can still get processed.

jkbullard commented 6 months ago

Thanks, @selvanair and @cron2. I have developed a fix for this Tunnelblick problem.

@mika and anyone who would like to test it may email developers@tunnelblick.net for a link to download a pre-beta version of Tunnelblick that I believe fixes the problem.

mika commented 6 months ago

Thanks @jkbullard for your fast response and the pre-beta version. :bow: One of the affected users gave it a try already. Sadly it doesn't seem to change the behavior, on the serve side we see 14 failing attempts within ~25 seconds (FTR, anonymized log):

2024:03:22-09:09:16 fw-1 openvpn[25919]: id="2202" severity="info" sys="SecureNet" sub="vpn" event="Connection terminated" username="username" variant="ssl" srcip="203.0.113.44" virtual_ip="10.242.2.16" rx="3824" tx="3939"
2024:03:22-09:09:19 fw-1 openvpn[25919]: id="2202" severity="info" sys="SecureNet" sub="vpn" event="Connection terminated" username="username" variant="ssl" srcip="203.0.113.44" virtual_ip="10.242.2.16" rx="3824" tx="3939"
2024:03:22-09:09:20 fw-1 openvpn[25919]: id="2202" severity="info" sys="SecureNet" sub="vpn" event="Connection terminated" username="username" variant="ssl" srcip="203.0.113.44" virtual_ip="10.242.2.16" rx="3824" tx="3939"
2024:03:22-09:09:21 fw-1 openvpn[25919]: id="2202" severity="info" sys="SecureNet" sub="vpn" event="Connection terminated" username="username" variant="ssl" srcip="203.0.113.44" virtual_ip="10.242.2.16" rx="3824" tx="3939"
2024:03:22-09:09:22 fw-1 openvpn[25919]: id="2202" severity="info" sys="SecureNet" sub="vpn" event="Connection terminated" username="username" variant="ssl" srcip="203.0.113.44" virtual_ip="10.242.2.16" rx="3824" tx="3939"
2024:03:22-09:09:24 fw-1 openvpn[25919]: id="2202" severity="info" sys="SecureNet" sub="vpn" event="Connection terminated" username="username" variant="ssl" srcip="203.0.113.44" virtual_ip="10.242.2.16" rx="3824" tx="3939"
2024:03:22-09:09:26 fw-1 openvpn[25919]: id="2202" severity="info" sys="SecureNet" sub="vpn" event="Connection terminated" username="username" variant="ssl" srcip="203.0.113.44" virtual_ip="10.242.2.16" rx="3824" tx="3939"
2024:03:22-09:09:32 fw-1 openvpn[25919]: id="2202" severity="info" sys="SecureNet" sub="vpn" event="Connection terminated" username="username" variant="ssl" srcip="203.0.113.44" virtual_ip="10.242.2.16" rx="3904" tx="3963"
2024:03:22-09:09:34 fw-1 openvpn[25919]: id="2202" severity="info" sys="SecureNet" sub="vpn" event="Connection terminated" username="username" variant="ssl" srcip="203.0.113.44" virtual_ip="10.242.2.16" rx="3824" tx="3939"
2024:03:22-09:09:36 fw-1 openvpn[25919]: id="2202" severity="info" sys="SecureNet" sub="vpn" event="Connection terminated" username="username" variant="ssl" srcip="203.0.113.44" virtual_ip="10.242.2.16" rx="3824" tx="3939"
2024:03:22-09:09:37 fw-1 openvpn[25919]: id="2202" severity="info" sys="SecureNet" sub="vpn" event="Connection terminated" username="username" variant="ssl" srcip="203.0.113.44" virtual_ip="10.242.2.16" rx="3824" tx="3939"
2024:03:22-09:09:39 fw-1 openvpn[25919]: id="2202" severity="info" sys="SecureNet" sub="vpn" event="Connection terminated" username="username" variant="ssl" srcip="203.0.113.44" virtual_ip="10.242.2.16" rx="3824" tx="3939"
2024:03:22-09:09:41 fw-1 openvpn[25919]: id="2202" severity="info" sys="SecureNet" sub="vpn" event="Connection terminated" username="username" variant="ssl" srcip="203.0.113.44" virtual_ip="10.242.2.16" rx="3824" tx="3939"
2024:03:22-09:09:41 fw-1 openvpn[25919]: id="2202" severity="info" sys="SecureNet" sub="vpn" event="Connection terminated" username="username" variant="ssl" srcip="203.0.113.44" virtual_ip="10.242.2.16" rx="3744" tx="3307"   

We therefore had to manually interrupt the connection to avoid overload of server side.

Corresponding client log (anonymized) in verb 4 mode for this connection:

tunnelblick.log

cron2 commented 6 months ago

Thanks for the good log, again.

The log now shows that Tunnelblick parses the extra argument and wants to do the right thing, but the timestamps show it's not working yet

2024-03-22 09:09:19.052620 MANAGEMENT: >STATE:1711094959,RECONNECTING,process-push-msg-failed,,,,,
2024-03-22 09:09:19.197020 *Tunnelblick: Delaying HOLD release for 1.000 seconds; main thread = YES
2024-03-22 09:09:19.197024 MANAGEMENT: CMD 'hold release'

actually I see new occurances of hold release in the log, in places where they shouldn't happen:

2024-03-22 09:09:19.532913 TLS: tls_multi_process: initial untrusted session promoted to trusted
2024-03-22 09:09:20.198884 MANAGEMENT: CMD 'hold release'
2024-03-22 09:09:20.198926 MANAGEMENT: >STATE:1711094960,GET_CONFIG,,,,,,

this is close to exactly 1 second after the first one - so my guess is that the code that triggers an immediate hold release is still happening, and Tunnelblick sets up a timer that does another(!) hold release after the specified delay... (this is becoming more prominent further down the log with 2s/4s delays, where the 2s/4s delayed hold releases show up 2 connection attempts later).

selvanair commented 6 months ago

Interesting observation about two hold release messages. Looking back at the original log, there also one sees this behaviour, but separated by a very short time.

2024-03-20 10:13:47.274999 MANAGEMENT: >STATE:1710926027,RECONNECTING,process-push-msg-failed,,,,,
2024-03-20 10:13:47.275817 MANAGEMENT: CMD 'hold release'
2024-03-20 10:13:47.275857 NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
2024-03-20 10:13:47.275874 Re-using SSL/TLS context
...
2024-03-20 10:13:47.318408 TCP connection established with [AF_INET]203.0.113.42:443
2024-03-20 10:13:47.318460 TCPv4_CLIENT link local: (not bound)
2024-03-20 10:13:47.318484 TCPv4_CLIENT link remote: [AF_INET]203.0.113.42:443
2024-03-20 10:13:47.318540 MANAGEMENT: >STATE:1710926027,WAIT,,,,,,
2024-03-20 10:13:47.318754 MANAGEMENT: CMD 'hold release'

Looks like tunnelblick sends "hold release" when state changes to RECONNECTING as well as when prompted with "Waiting for hold release". Only one of them should be there.

@mika : you may want to report to tunnelblick support list for faster response from Jon.

jkbullard commented 6 months ago

I'm sorry to be debugging a Tunnelblick problem in an OpenVPN Issue, but it's very helpful to have the comments from the OpenVPN developers!

@cron2 wrote

my guess is that the code that triggers an immediate hold release is still happening, and Tunnelblick sets up a timer that does another(!) hold release after the specified delay

Thanks. That's a reasonable guess, but the problem is actually that Tunnelblick sends a hold release whenever it sees a RECONNECTING status! I apologize – it never occurred to me that Tunnelblick would do such a thing, so I didn't look for it! (Maybe it was needed many years ago?)


After fixing that, however, the delays in >HOLD:Waiting for hold release: are not backing off the way they are described to do on the OpenVPN 2.6 man page:

--connect-retry args Wait n seconds between connection attempts (default 1). Repeated reconnection attempts are slowed down after 5 retries per remote by doubling the wait time after each unsuccessful attempt.

What I see in the log below is that the delays in >HOLD:Waiting for hold release: aren't five 1 second delays and then doubling on every unsuccessful retry attempt, it is eleven 1 second delays, then two 2 second delays, then two 4 second delays, etc., doubling on every second unsuccessful attempt, not every unsuccessful attempt.

Is Tunnelblick still doing something wrong, or is there something else going on?

retry-log.txt

selvanair commented 6 months ago

I think tunnelblick is now doing the right thing.

As for this test, the unusual double attempts with same delay time may be related to the way this test is being done though I am not entirely sure how.

I see that the connection succeeds at first attempt and gets logged as

2024-03-22 12:25:19.543180 Initialization Sequence Completed
...
2024-03-22 12:25:52.770371 [Server] Inactivity timeout (--ping-restart), restarting
2024-03-22 12:25:52.771527 SIGUSR1[soft,ping-restart] received, process restarting

all with the instance name as "Server". Subsequent ping restarts will always have 1 second delay for ever if the reconnection succeeds each time but ping fails -- no backoff is expected.

But in this case, subsequent ping-restarts fail without reconnecting.

2024-03-22 12:26:23.051449 [UNDEF] Inactivity timeout (--ping-restart), restarting

Instance name "UNDEF".

How is this test done -- Is the server made unreachable by firewall or server is shutdown or something else?

jkbullard commented 6 months ago

@selvanair - Thanks for all your help!

How is this test done -- Is the server made unreachable by firewall or server is shutdown or something else?

After connecting, I disable the "network service" in the (virtual machine) Mac's system settings. It's the same as or similar to disconnecting the Ethernet plug.

selvanair commented 6 months ago

After connecting, I disable the "network service" in the (virtual machine) Mac's system settings. It's the same as or similar to disconnecting the Ethernet plug.

I could not reproduce this on Linux. I see delays of 2, 4, 8, 16 seconds... starting the 6th attempt, as expected.

mika commented 6 months ago

The latest OpenVPN beta build that @jkbullard provided to me was reported to be behaving as expected by the affected user, it no longer reconnects with same speed but slows down as expected. :partying_face:

If you want logfiles let me know so I'll ask for them.

cron2 commented 6 months ago

The latest OpenVPN beta build that @jkbullard provided to me was reported to be behaving as expected by the affected user, it no longer reconnects with same speed but slows down as expected. 🥳

Great news. Thanks, @jkbullard

If you want logfiles let me know so I'll ask for them.

I would find it interesting to see what the team of core+tunnelblick does in this situation now :-)

jkbullard commented 6 months ago

Thanks to all: @mika and his user, @selvanair, @cron2 -- I couldn't have fixed it without you!

I don't need log files, but thanks for offering.

As far as I'm concerned this Issue can be closed, because it's fixed but also because it wasn't an OpenVPN problem!

mika commented 6 months ago

Re @cron2:

I would find it interesting to see what the team of core+tunnelblick does in this situation now :-)

Sure, there you go:

tunnelblick.log

Re @jkbullard:

Thanks to all: @mika and his user, @selvanair, @cron2 -- I couldn't have fixed it without you!

You are more than welcome! Many thanks to you (@jkbullard) as well as @cron2 and @selvanair also from my side for the productive, welcoming and responsive issue handling and bug fix, this is highly appreciated! Many thanks!

Feel free to close this bug report also from my side, leaving it to the project maintainer though. :)

selvanair commented 6 months ago

Thanks for the logs: looks perfect with the expected 1, 1, 1, 1, 2, 4, 8,... seconds of delays between retries. Only four 1 sec delays will be seen as the count includes the first attempt which was also unsuccessful.

cron2 commented 6 months ago

So much positive vibe, I like that ;-)