gardener / vpn2

Network connector between the control plane (deployed in a Seed cluster) and a Shoot cluster superseding the vpn repository.
Apache License 2.0

Security improvements #53

Closed dimityrmirchev closed 9 months ago

dimityrmirchev commented 1 year ago

What this PR does / why we need it: This PR introduces some security improvements to the openvpn configuration.

Logs from vpn-shoot-client before the changes (log verbosity increased by setting `verb 4` in the configuration).

```
2023-10-24 12:16:20 us=451411 Control Channel: TLSv1.3, cipher TLSv1.3 TLS_AES_256_GCM_SHA384, peer certificate: 3072 bit RSA, signature: RSA-SHA256
2023-10-24 12:16:20 us=451442 [vpn-seed-server] Peer Connection Initiated with [AF_INET]10.2.221.73:8132
2023-10-24 12:16:20 us=451486 TLS: move_session: dest=TM_ACTIVE src=TM_INITIAL reinit_src=1
2023-10-24 12:16:20 us=451558 TLS: tls_multi_process: initial untrusted session promoted to trusted
2023-10-24 12:16:20 us=492880 PUSH: Received control message: 'PUSH_REPLY,route-gateway 192.168.123.1,topology subnet,ping 10,ping-restart 60,ifconfig 192.168.123.10 255.255.255.0,peer-id 0,cipher AES-256-CBC,protocol-flags cc-exit tls-ekm dyn-tls-crypt,tun-mtu 1500'
2023-10-24 12:16:20 us=492932 OPTIONS IMPORT: --ifconfig/up options modified
2023-10-24 12:16:20 us=492947 OPTIONS IMPORT: route-related options modified
2023-10-24 12:16:20 us=492960 OPTIONS IMPORT: tun-mtu set to 1500
2023-10-24 12:16:20 us=493144 TUN/TAP device tun0 opened
2023-10-24 12:16:20 us=493177 TUN/TAP TX queue length set to 1000
2023-10-24 12:16:20 us=493195 do_ifconfig, ipv4=1, ipv6=0
2023-10-24 12:16:20 us=493218 /sbin/ip link set dev tun0 up mtu 1500
2023-10-24 12:16:20 us=494448 /sbin/ip link set dev tun0 up
2023-10-24 12:16:20 us=495397 /sbin/ip addr add dev tun0 192.168.123.10/24
2023-10-24 12:16:20 us=496108 Data Channel MTU parms [ mss_fix:1363 max_frag:0 tun_mtu:1500 tun_max_mtu:1600 headroom:136 payload:1768 tailroom:562 ET:0 ]
2023-10-24 12:16:20 us=496165 Outgoing dynamic tls-crypt: Cipher 'AES-256-CTR' initialized with 256 bit key
2023-10-24 12:16:20 us=496192 Outgoing dynamic tls-crypt: Using 256 bit message hash 'SHA256' for HMAC authentication
2023-10-24 12:16:20 us=496210 Incoming dynamic tls-crypt: Cipher 'AES-256-CTR' initialized with 256 bit key
2023-10-24 12:16:20 us=496225 Incoming dynamic tls-crypt: Using 256 bit message hash 'SHA256' for HMAC authentication
2023-10-24 12:16:20 us=496252 Outgoing Data Channel: Cipher 'AES-256-CBC' initialized with 256 bit key
2023-10-24 12:16:20 us=496268 Outgoing Data Channel: Using 160 bit message hash 'SHA1' for HMAC authentication
2023-10-24 12:16:20 us=496281 Incoming Data Channel: Cipher 'AES-256-CBC' initialized with 256 bit key
2023-10-24 12:16:20 us=496298 Incoming Data Channel: Using 160 bit message hash 'SHA1' for HMAC authentication
2023-10-24 12:16:20 us=496315 Initialization Sequence Completed
2023-10-24 12:16:20 us=496329 Data Channel: cipher 'AES-256-CBC', auth 'SHA1', peer-id: 0
2023-10-24 12:16:20 us=496340 Timers: ping 10, ping-restart 60
2023-10-24 12:16:20 us=496352 Protocol options: protocol-flags cc-exit tls-ekm dyn-tls-crypt
```

Logs from vpn-shoot-client after the changes (log verbosity increased by setting `verb 4` in the configuration).

```
2023-10-24 11:53:28 us=454809 Control Channel: TLSv1.3, cipher TLSv1.3 TLS_AES_256_GCM_SHA384, peer certificate: 3072 bit RSA, signature: RSA-SHA256
2023-10-24 11:53:28 us=454836 [vpn-seed-server] Peer Connection Initiated with [AF_INET]10.2.221.73:8132
2023-10-24 11:53:28 us=454873 TLS: move_session: dest=TM_ACTIVE src=TM_INITIAL reinit_src=1
2023-10-24 11:53:28 us=454952 TLS: tls_multi_process: initial untrusted session promoted to trusted
2023-10-24 11:53:28 us=497657 PUSH: Received control message: 'PUSH_REPLY,route-gateway 192.168.123.1,topology subnet,ping 10,ping-restart 60,ifconfig 192.168.123.10 255.255.255.0,peer-id 0,cipher AES-256-GCM,protocol-flags cc-exit tls-ekm dyn-tls-crypt,tun-mtu 1500'
2023-10-24 11:53:28 us=497702 OPTIONS IMPORT: --ifconfig/up options modified
2023-10-24 11:53:28 us=497715 OPTIONS IMPORT: route-related options modified
2023-10-24 11:53:28 us=497739 OPTIONS IMPORT: tun-mtu set to 1500
2023-10-24 11:53:28 us=497925 TUN/TAP device tun0 opened
2023-10-24 11:53:28 us=497959 TUN/TAP TX queue length set to 1000
2023-10-24 11:53:28 us=497974 do_ifconfig, ipv4=1, ipv6=0
2023-10-24 11:53:28 us=497992 /sbin/ip link set dev tun0 up mtu 1500
2023-10-24 11:53:28 us=499033 /sbin/ip link set dev tun0 up
2023-10-24 11:53:28 us=499796 /sbin/ip addr add dev tun0 192.168.123.10/24
2023-10-24 11:53:28 us=500671 Data Channel MTU parms [ mss_fix:1386 max_frag:0 tun_mtu:1500 tun_max_mtu:1600 headroom:136 payload:1768 tailroom:562 ET:0 ]
2023-10-24 11:53:28 us=500724 Outgoing dynamic tls-crypt: Cipher 'AES-256-CTR' initialized with 256 bit key
2023-10-24 11:53:28 us=500742 Outgoing dynamic tls-crypt: Using 256 bit message hash 'SHA256' for HMAC authentication
2023-10-24 11:53:28 us=500758 Incoming dynamic tls-crypt: Cipher 'AES-256-CTR' initialized with 256 bit key
2023-10-24 11:53:28 us=500779 Incoming dynamic tls-crypt: Using 256 bit message hash 'SHA256' for HMAC authentication
2023-10-24 11:53:28 us=500802 Outgoing Data Channel: Cipher 'AES-256-GCM' initialized with 256 bit key
2023-10-24 11:53:28 us=500817 Incoming Data Channel: Cipher 'AES-256-GCM' initialized with 256 bit key
2023-10-24 11:53:28 us=500829 Initialization Sequence Completed
2023-10-24 11:53:28 us=500839 Data Channel: cipher 'AES-256-GCM', peer-id
```
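The difference between the two log excerpts comes down to the `Data Channel:` summary line. As a rough aid for comparing such logs, the sketch below extracts the cipher and (if present) the auth digest from that line. It is not part of this PR; the function name `data_channel_params` and the regular expression are assumptions derived only from the log format shown above.

```python
import re

# Assumption: the summary line looks like the ones in the logs above, e.g.
#   Data Channel: cipher 'AES-256-CBC', auth 'SHA1', peer-id: 0
# The auth part is optional because the AES-256-GCM log prints none.
# The lowercase "cipher" keeps "Outgoing Data Channel: Cipher ..." lines from matching.
DATA_CHANNEL_RE = re.compile(r"Data Channel: cipher '([^']+)'(?:, auth '([^']+)')?")


def data_channel_params(log_text):
    """Return (cipher, auth) from the last summary line, or None if there is none."""
    matches = DATA_CHANNEL_RE.findall(log_text)
    if not matches:
        return None
    cipher, auth = matches[-1]
    return cipher, (auth or None)


before = "2023-10-24 12:16:20 us=496329 Data Channel: cipher 'AES-256-CBC', auth 'SHA1', peer-id: 0"
after = "2023-10-24 11:53:28 us=500839 Data Channel: cipher 'AES-256-GCM', peer-id"

print(data_channel_params(before))  # ('AES-256-CBC', 'SHA1')
print(data_channel_params(after))   # ('AES-256-GCM', None)
```

The missing auth value in the second result mirrors the log itself: with AES-256-GCM the excerpt no longer reports a separate HMAC digest for the data channel.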

Note that if we enforce TLS 1.2 by setting `tls-version-max 1.2` in the server config, the log shows that an ECDHE key exchange is used.

2023-10-24 11:55:34 us=337638 Control Channel: TLSv1.2, cipher TLSv1.2 ECDHE-RSA-AES256-GCM-SHA384, peer certificate: 3072 bit RSA, signature: RSA-SHA256

Which issue(s) this PR fixes: Fixes #

Special notes for your reviewer:

Release note:

Security improvements to the `openvpn` configuration. Due to a backwards-incompatible change between the VPN server and client, a short downtime is to be expected during the initial upgrade.

axel7born commented 1 year ago

@dimityrmirchev Thanks a lot for that change. The changes look good to me, though I'm not really a VPN expert. I have just one question about the changes.

dimityrmirchev commented 1 year ago

> Why would we want to support AES-128 when we always used AES-256 in the past? This part does not look like a security improvement. Please explain.

Sure. As of now the default values in openvpn for the data-ciphers are as follows:

As stated in the docs, the server will negotiate the first cipher that is also supported by the client. In our case this is AES-256-GCM. My guess is that hardware that supports AES-128-GCM should support AES-256-GCM as well; however, I did not find any concrete examples of that. This is why I explicitly set what openvpn uses as default. As for the 256 vs 128 debate, here is a nice explanation of why the Go team prefers 128 over 256: https://go.dev/blog/tls-cipher-suites (search for "AES-128 is preferred over AES-256 for encryption."). But maybe I misunderstood that statement and it is only valid in the context of TLS 🤔 I am perfectly fine with removing the AES-128-GCM option since it seems that this is the consensus.
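The "first cipher on the server's list that the client also supports" rule can be illustrated with a small sketch. This is a simplified model, not OpenVPN's actual implementation: the function name `negotiate_cipher` is hypothetical, and the colon-separated lists only mimic the format of the `data-ciphers` option.

```python
def negotiate_cipher(server_ciphers, client_ciphers):
    """Return the first cipher in the server's list that the client also offers.

    Both arguments are colon-separated strings, as in `data-ciphers`.
    Returns None when the two lists have nothing in common.
    """
    client_supported = {c.strip().upper() for c in client_ciphers.split(":") if c.strip()}
    for cipher in server_ciphers.split(":"):
        name = cipher.strip().upper()
        if name and name in client_supported:
            return name
    return None


# The server's order decides, not the client's.
print(negotiate_cipher("AES-256-GCM:AES-128-GCM", "AES-128-GCM:AES-256-GCM"))  # AES-256-GCM
# A client that only offers AES-128-GCM still gets a connection.
print(negotiate_cipher("AES-256-GCM:AES-128-GCM", "AES-128-GCM"))              # AES-128-GCM
# No overlap means no usable data-channel cipher.
print(negotiate_cipher("AES-256-GCM", "AES-256-CBC"))                          # None
```

Under this model, keeping AES-128-GCM on the server's list only matters for clients that cannot offer AES-256-GCM; every other client ends up on AES-256-GCM regardless.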

> If we make that change we need to allow both, aes-256-gcm and aes-256-cbc on the server (1)

I agree.

> with the next release change the clients to -gcm (2), and then with the next release remove aes-256-cbc from the server(3).

Can't we directly change the clients to -gcm? The server will roll out before the clients are updated.
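The staged migration discussed here (steps 1 to 3) can be sanity-checked by confirming that server and client share at least one cipher at every step. The sketch below does this for hypothetical cipher lists; the lists, `STEPS`, and both function names are assumptions for illustration and are not taken from the PR's configuration. It models cipher overlap only and says nothing about other options that must match.

```python
def first_common(server, client):
    """First entry of the server list that the client list also contains, else None."""
    return next((c for c in server if c in client), None)


# Hypothetical cipher lists per rollout step: (step name, server list, client list).
STEPS = [
    ("before", ["AES-256-CBC"], ["AES-256-CBC"]),
    ("1: server allows both", ["AES-256-GCM", "AES-256-CBC"], ["AES-256-CBC"]),
    ("2: clients move to GCM", ["AES-256-GCM", "AES-256-CBC"], ["AES-256-GCM"]),
    ("3: server drops CBC", ["AES-256-GCM"], ["AES-256-GCM"]),
]


def check_rollout(steps):
    """Map each step name to the cipher both sides agree on (None means no overlap)."""
    return {name: first_common(server, client) for name, server, client in steps}


for step, cipher in check_rollout(STEPS).items():
    print(f"{step}: {cipher}")

# For contrast: a GCM-only server facing a client that still offers only CBC.
print(first_common(["AES-256-GCM"], ["AES-256-CBC"]))  # None
```

In this model every step keeps a shared cipher, while the contrast case at the end shows the gap that opens if the server drops CBC before the clients have been updated.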

dimityrmirchev commented 1 year ago

@ScheererJ @marwinski PTAL, concerns should be addressed in the latest commits and comments.

dimityrmirchev commented 1 year ago

I tried an upgrade with the latest changes and it seems that the auth SHA256 config will cause the client to fail to connect until its configuration is also updated. I will try to think of a way to apply this without breaking the connection.

/hold

dimityrmirchev commented 1 year ago

> I tried an upgrade with the latest changes and it seems that the auth SHA256 config will cause the client to fail to connect until its configuration is also updated. I will try to think of a way to apply this without breaking the connection.

I did some more testing and it seems that a graceful update is not possible in the current setup if we keep the auth SHA256 config. As a general question: have such incompatible server/client changes been applied before, and is there any strategy for them? I would expect that sooner or later such changes will happen even if we remove the auth SHA256 lines from the current PR.

axel7born commented 12 months ago

I fear there is no strategy or simple solution for such incompatible changes. I just stumbled over https://github.com/gardener/gardener/issues/7471. @hendrikKahl What is your opinion? What will be the impact of a VPN outage during the update?

hendrikKahl commented 12 months ago

🤔 The only situation where things "really" got stuck was during the update of a shoot to an HA VPN. There, the deployment is deleted and replaced by a statefulset. Now, when the cluster has a CRD conversion webhook that is invoked by KCM garbage collection, it breaks the cluster in a specific way:

The only solution there is to manually purge the finalizers.

Now, I'm not sure if the same issue would occur if a cluster is already on an HA VPN.

The best solution I can think of would be to ensure that GRM does not wait for the deletion. Either the deletion mode becomes configurable and is changed to background for the VPN, or the rollout behavior is adapted so that it does not wait.

dimityrmirchev commented 10 months ago

Can you please reevaluate this PR? The mentioned concerns were addressed/explained. In this form this PR introduces a backwards incompatible change to the server https://github.com/gardener/vpn2/pull/53/files#diff-1157b3771de832487f583c709a87a57eb0755d3102a8cf93c7422a6daae7ac87R131 which will cause a short downtime during upgrade.

axel7born commented 9 months ago

I tested the rollout to the new version with the HA setup and the non HA setup together with a failing conversion webhook as described in https://github.com/gardener/gardener/issues/7471. Even if the KCM garbage collection is stuck, the VPN rollout can be completed without manual intervention.

@dimityrmirchev, the VPN downtime should be mentioned in the release notes.

dimityrmirchev commented 9 months ago

@axel7born Thanks for reviewing. I mentioned the short downtime that is expected during the upgrade in the release note.

axel7born commented 9 months ago

/lgtm

axel7born commented 9 months ago

/unhold