OpenVPN / openvpn

OpenVPN is an open source VPN daemon
http://openvpn.net
Other
10.73k stars 2.98k forks source link

Suspected (very likely) bug in handling of socks-proxy setting when using UDP, results in connection loss and improper connection re-initialization (long connectivity loss periods) #379

Open MixyMisus opened 1 year ago

MixyMisus commented 1 year ago

Describe the bug

I am running OpenVPN through a SOCKS proxy (UDP mode). The connection is established normally and runs okay for minutes to hours. Periodically the connection tears down with timeout (I am fully confident it is not a network issue) and then OpenVPN attempts restart at which point it hangs for minutes. After running it with high verbosity logs I found out that while the "hang" part consists of two stages

stage 1: Logs do not show anomalies however the software communicating via OpenVPN's tun0 loses connectivity (timeout)

stage 2

For some reason, OpenVPN attempts to reconnect via TCP to the remote OpenVPN server (NOT THE SOCKS PROXY WHICH IS RUNNING ON LOCALHOST)

At this point it can spend minutes (unsuccessfully) trying to reach the remote server directly, finally concludes that network is unreachable and does full restart, which successfully restores connectivity

Here are the captured logs (exclamation marks added by me, interesting part also reproduced in github issue text below)


2023-08-02 20:50:17 us=37872 [se-mma-ovpn-018.mullvad.net] Inactivity timeout (--ping-restart), restarting
2023-08-02 20:50:17 us=50012 TCP/UDP: Closing socket
2023-08-02 20:50:17 us=50413 net_route_v4_del: 141.98.255.88/32 via 10.137.5.1 dev [NULL] table 0 metric -1
2023-08-02 20:50:17 us=50774 net_route_v4_del: 0.0.0.0/1 via 10.16.0.1 dev [NULL] table 0 metric -1
2023-08-02 20:50:17 us=50816 net_route_v4_del: 128.0.0.0/1 via 10.16.0.1 dev [NULL] table 0 metric -1
2023-08-02 20:50:17 us=50917 delete_route_ipv6(::/2)
2023-08-02 20:50:17 us=50929 net_route_v6_del: ::/2 via :: dev tun0 table 0 metric -1
2023-08-02 20:50:17 us=51010 delete_route_ipv6(4000::/2)
2023-08-02 20:50:17 us=51020 net_route_v6_del: 4000::/2 via :: dev tun0 table 0 metric -1
2023-08-02 20:50:17 us=51056 delete_route_ipv6(8000::/2)
2023-08-02 20:50:17 us=51067 net_route_v6_del: 8000::/2 via :: dev tun0 table 0 metric -1
2023-08-02 20:50:17 us=51103 delete_route_ipv6(c000::/2)
2023-08-02 20:50:17 us=51111 net_route_v6_del: c000::/2 via :: dev tun0 table 0 metric -1
2023-08-02 20:50:17 us=51149 Closing TUN/TAP interface
2023-08-02 20:50:17 us=51166 net_addr_v4_del: 10.16.0.3 dev tun0
2023-08-02 20:50:17 us=51544 net_addr_v6_del: fdda:d0d0:cafe:1302::1001/64 dev tun0
2023-08-02 20:50:17 us=57855 SIGUSR1[soft,ping-restart] received, process restarting
2023-08-02 20:50:17 us=59705 Restart pause, 1 second(s)
2023-08-02 20:50:18 us=62325 Re-using SSL/TLS context
2023-08-02 20:50:18 us=65480 Control Channel MTU parms [ mss_fix:0 max_frag:0 tun_mtu:1250 tun_max_mtu:0 headroom:126 payload:1600 tailroom:126 ET:0 ]
2023-08-02 20:50:18 us=65513 Data Channel MTU parms [ mss_fix:0 max_frag:0 tun_mtu:1450 tun_max_mtu:1600 headroom:136 payload:1800 tailroom:568 ET:32 ]
2023-08-02 20:50:18 us=65814 TCP/UDP: Preserving recently used remote address: [AF_INET]141.98.255.88:1302
2023-08-02 20:50:18 us=65856 Socket Buffers: R=[212992->425984] S=[212992->425984]
!!!!!!!!!!!!!!!
2023-08-02 20:55:11 us=65866 Attempting to establish TCP connection with [AF_INET]141.98.255.88:1302
!!!!!!!!!!!!
2023-08-02 20:55:11 us=66077 TCP: connect to [AF_INET]141.98.255.88:1302 failed: Network is unreachable
2023-08-02 20:55:11 us=66138 TCP/UDP: Closing socket
2023-08-02 20:55:11 us=66173 SIGUSR1[connection failed(soft),connection-failed] received, process restarting
2023-08-02 20:55:11 us=66259 Restart pause, 1 second(s)
2023-08-02 20:55:12 us=66354 Re-using SSL/TLS context
2023-08-02 20:55:12 us=66448 Control Channel MTU parms [ mss_fix:0 max_frag:0 tun_mtu:1250 tun_max_mtu:0 headroom:126 payload:1600 tailroom:126 ET:0 ]
2023-08-02 20:55:12 us=66466 Data Channel MTU parms [ mss_fix:0 max_frag:0 tun_mtu:1450 tun_max_mtu:1600 headroom:136 payload:1800 tailroom:568 ET:32 ]
2023-08-02 20:55:12 us=66499 TCP/UDP: Preserving recently used remote address: [AF_INET]127.0.0.1:2069
2023-08-02 20:55:12 us=66541 Socket Buffers: R=[212992->425984] S=[212992->425984]
2023-08-02 20:55:12 us=66550 Attempting to establish TCP connection with [AF_INET]127.0.0.1:2069
2023-08-02 20:55:12 us=66618 TCP connection established with [AF_INET]127.0.0.1:2069
2023-08-02 20:55:12 us=67384 SOCKS proxy wants us to send UDP to [AF_INET]127.0.0.1:2069
2023-08-02 20:55:12 us=67401 NOTE: setsockopt TCP_NODELAY=1 failed
2023-08-02 20:55:12 us=67414 UDPv4 link local: (not bound)
2023-08-02 20:55:12 us=67423 UDPv4 link remote: [AF_INET]185.213.154.138:1302
2023-08-02 20:55:13 us=58053 TLS: Initial packet from [AF_INET]185.213.154.138:1302, sid=56293a44 16504030
2023-08-02 20:55:13 us=105546 VERIFY OK: depth=2, C=SE, ST=Gotaland, L=Gothenburg, O=Amagicom AB, OU=Mullvad, CN=Mullvad Root CA v2, emailAddress=security@mullvad.net

To Reproduce Start OpenVPN in UDP mode with SOCKS5 enabled (socks-proxy in config, so socks proxy must support UDP. Ideally it should be on localhost since in my setup it is)

Run it for a while, with moderate load.

P.S. Upon further experiments I discovered that the "strange attempt to bypass socks proxy" also happens upon "normal" connectivity loss events (turning off the wifi router to which the machine with VPN client is connected, waiting for timeout then turning on)

Expected behavior Everything to work without weird connectivity losses when there is no good reason for connectivity loss.

If a connectivity loss happens for normal reasons (such as physical connection getting temporarily disconnected) do not perform useless attempt to connect to remote server bypassing the socks proxy and re-connect to socks proxy (which it eventually does but only after very long wait), when socks proxy option is set the re-connect attempt should be performed to the socks proxy (and not to the last remembered remote OpenVPN server as it appears to attempt in log above)

Version information (please complete the following information):

Tested on two following OpenVPN versions (reproduces):

OpenVPN 2.4.7 x86_64-pc-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [PKCS11] [MH/PKTINFO] [AEAD] built on Apr 28 2021 library versions: OpenSSL 1.1.1n 15 Mar 2022, LZO 2.10

OpenVPN 2.6.5 x86_64-pc-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [PKCS11] [MH/PKTINFO] [AEAD] [DCO] library versions: OpenSSL 3.1.1 30 May 2023, LZO 2.10

Additional context Socks capability is why I'm using openvpn in the first place so any suggestion on how to work around this would be extremely welcome

cron2 commented 1 year ago

Your description definitely sounds like a bug (and a long-standing one at that).

I tried to reproduce this with current master, connecting to a (non-local) SOCKs proxy, and then onward to a TCP server. Triggering the two different kind of reconnects with SIGUSR1 or SIGHUP did not do anything unusual (so it's not totally easily reproduced). Will try timeouting and network errors next.

cron2 commented 1 year ago

"Killing the SOCKS proxy" is also not sufficient, this will produce a correct reconnect towards the SOCKS proxy (and as soon as it's back, proper re-establishment). So, network errors between proxy and server, and timeout next...

cron2 commented 1 year ago

Mmmh, this is going to be a tougher nut. Tried --keepalive 10 60 on the client, without --ping on the server, which triggers a timeout restart after 60 seconds without user traffic

2023-08-03 12:26:49 SIGUSR1[soft,ping-restart] received, process restarting
2023-08-03 12:26:49 Restart pause, 1 second(s)
2023-08-03 12:26:50 TCP/UDP: Preserving recently used remote address: [AF_INET]194.97.144.213:1080
2023-08-03 12:26:50 Socket Buffers: R=[65536->65536] S=[32768->32768]
2023-08-03 12:26:50 Attempting to establish TCP connection with [AF_INET]194.97.144.213:1080
2023-08-03 12:26:50 TCP connection established with [AF_INET]194.97.144.213:1080
2023-08-03 12:26:50 TCPv4_CLIENT link local: (not bound)
2023-08-03 12:26:50 TCPv4_CLIENT link remote: [AF_INET]194.97.144.213:1080
2023-08-03 12:26:50 TLS: Initial packet from [AF_INET]194.97.144.213:1080, sid=1af58c31 a8b415b8

... which nicely reconnects to the SOCKs proxy...

@MixyMisus I wonder if we're doing something special to "localhost" here. Can you reproduce if the SOCKS proxy is accessed via the LAN IP (so, still running on the same machine, but no "127.0.0.1" in internal data structures)? Wild guess, though...

cron2 commented 1 year ago

Okay, ignore the tests above, this was all about TCP (I misread and tested TCP server, not UDP server) - so I retried with UDP, and I can reproduce this.

It does NOT happen on SIGUSR1 or SIGHUP restarts, but it DOES happen on ping-restart timeouts:

SIGHUP:

2023-08-03 12:32:34 TCP/UDP: Preserving recently used remote address: [AF_INET]194.97.144.213:1080
2023-08-03 12:32:34 Socket Buffers: R=[42080->42080] S=[9216->9216]
2023-08-03 12:32:34 Attempting to establish TCP connection with [AF_INET]194.97.144.213:1080
2023-08-03 12:32:34 TCP connection established with [AF_INET]194.97.144.213:1080
2023-08-03 12:32:34 SOCKS proxy wants us to send UDP to [AF_INET]194.97.144.213:26792
2023-08-03 12:32:34 UDPv4 link local: (not bound)
2023-08-03 12:32:34 UDPv4 link remote: [AF_INET]199.102.77.82:51194
2023-08-03 12:32:34 TLS: Initial packet from [AF_INET]199.102.77.82:51194, sid=a954d6f2 78b6de9a

ping-restart misconnects once, then does a full SIGUSR1 "socks-error" restart, and succeeds

2023-08-03 12:37:31 SIGUSR1[soft,ping-restart] received, process restarting
2023-08-03 12:37:31 Restart pause, 1 second(s)
2023-08-03 12:37:32 TCP/UDP: Preserving recently used remote address: [AF_INET]199.102.77.82:51194
2023-08-03 12:37:32 Socket Buffers: R=[42080->42080] S=[9216->9216]
2023-08-03 12:37:32 Attempting to establish TCP connection with [AF_INET]199.102.77.82:51194
2023-08-03 12:37:32 TCP connection established with [AF_INET]199.102.77.82:51194
2023-08-03 12:37:37 socks_handshake: TCP port read timeout expired: Operation now in progress (errno=36)
2023-08-03 12:37:37 SIGUSR1[soft,socks-error] received, process restarting
2023-08-03 12:37:37 Restart pause, 1 second(s)

2023-08-03 12:37:38 TCP/UDP: Preserving recently used remote address: [AF_INET]194.97.144.213:1080
2023-08-03 12:37:38 Socket Buffers: R=[42080->42080] S=[9216->9216]
2023-08-03 12:37:38 Attempting to establish TCP connection with [AF_INET]194.97.144.213:1080
2023-08-03 12:37:38 TCP connection established with [AF_INET]194.97.144.213:1080
2023-08-03 12:37:38 SOCKS proxy wants us to send UDP to [AF_INET]194.97.144.213:45751
2023-08-03 12:37:38 UDPv4 link local: (not bound)
2023-08-03 12:37:38 UDPv4 link remote: [AF_INET]199.102.77.82:51194
2023-08-03 12:37:39 TLS: Initial packet from [AF_INET]199.102.77.82:51194, sid=a1e70ebb a057c8ad

(not sure why it failed so fast, and takes longer for @MixyMisus to give up and retry)

Summoning @schwabe - do you have an idea? I guess we're caching something somewhere that we shouldn't (easy guess), but why is this reset on normal SIGUSR1 restarts, and not on ping-restart SIGUSR1?

MixyMisus commented 1 year ago

@cron2 Ha! Caught a real bug! Lucky me!

Upon some thought, I think that I've encountered at least two more odd quirks (possible bugs) in OpenVPN(UDP)+SOCKS5 behavior, and one is probably related here (the other I'll submit separately if it "keeps happening")

So... The relevant possible additional bug here is odd timeout disconnects with SOCKS proxy (have reproduced it with localhost SOCKS5 and with a remote SOCKS on a VPS) and remote VPN servers (which is relevant by virtue of triggering the above bug)

It is most reliably reproducible when socks proxy is part of proxy chain or a Shadowsocks-rust setup (but I've reproduced it with a vanilla server/client hev-socks5-server )

The setup goes for shadowsocks

OpenVPN(UDP)=SOCKS5(sslocal process)=[remote sserver instance]=[remote OpenVPN*]

For HEV I reproduced like in the following manner: OpenVPN(UDP)=SOCKS5(remote VPS running hev-socks5-server)==[remote OpenVPN*]

*Remote VPN for reproducing can be just any VPN service such a Mullvad VPN service or a free tier Proton VPN server

After some time the connection will timeout, triggering a reset and thus by extension the originally reported bug.

MEANWHILE an OpenVPN instance that connects to the remote server without SOCKS, a-la OpenVPN(UDP)=[remote OpenVPN*] runs just fine for hours and hours and hours (both test boxes share same wifi, so it's unlikely to be a network problem)

LindaFerum commented 7 months ago

Hi! Just found this thing via googling around and reporting that 1) it reproduces on all my OpenVPN installs and 2) it is somewhat disruptive to my stuff. I understand there are higher priority concerns and this is open source product so like, no hard feelings on my part, just leaving my two cents here hoping this bug will be picked up and fixed one day

Thanks :)