cilium / hubble

Hubble - Network, Service & Security Observability for Kubernetes using eBPF

Hubble relay fails to connect to Cilium Agent on Clustermesh with high MTU #1610

Open baurmatt opened 4 weeks ago

baurmatt commented 4 weeks ago

We're deploying Cilium, configured as a clustermesh, in two different OpenStack regions. The two regions are connected via a VPNaaS IPsec tunnel to provide node IP connectivity. The MTU of both the OpenStack network and the VPN tunnels is set to 8942.

Hubble Relay successfully connects to the local (192.168.5.x) Cilium agents, but fails to connect to agents on the other side (192.168.4.x) with the following error: transport: authentication handshake failed: context deadline exceeded:

time="2024-10-28T12:20:23Z" level=info msg="Starting gRPC health server..." addr=":4222" subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg="Starting metrics server..." address=":9966" subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg="Starting gRPC server..." options="{peerTarget:hubble-peer.kube-system.svc.cluster.local:443 dialTimeout:30000000000 retryTimeout:30000000000 listenAddress::4245 healthListenAddress::4222 metricsListenAddress::9966 log:0xc0004402a0 serverTLSConfig:<nil> insecureServer:true clientTLSConfig:0xc0006ce7c8 clusterName:mcloud-stage4-ham1 insecureClient:false observerOptions:[0x22173e0 0x22174c0] grpcMetrics:0xc0006d8a90 grpcUnaryInterceptors:[0x22579e0] grpcStreamInterceptors:[0x2257c40]}" subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=debug msg="Client mtls handshake" config=tls-to-hubble keypair-sn=bb575e2375c1b5cb3e348bf24b2d305 subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg="Received peer change notification" change notification="name:\"mcloud-stage4-ham1/worker-699cd449c9-wkccb\" address:\"192.168.5.211:4244\" type:PEER_ADDED tls:{server_name:\"worker-699cd449c9-wkccb.mcloud-stage4-ham1.hubble-grpc.cilium.io\"}" subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg="Received peer change notification" change notification="name:\"mcloud-stage4-ham1/worker-699cd449c9-d6d9w\" address:\"192.168.5.152:4244\" type:PEER_ADDED tls:{server_name:\"worker-699cd449c9-d6d9w.mcloud-stage4-ham1.hubble-grpc.cilium.io\"}" subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg="Received peer change notification" change notification="name:\"mcloud-stage3-dus2/worker-75fd65fb96-f2fw6\" address:\"192.168.4.178:4244\" type:PEER_ADDED tls:{server_name:\"worker-75fd65fb96-f2fw6.mcloud-stage3-dus2.hubble-grpc.cilium.io\"}" subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg="Received peer change notification" change notification="name:\"mcloud-stage3-dus2/worker-75fd65fb96-k77sp\" address:\"192.168.4.40:4244\" type:PEER_ADDED tls:{server_name:\"worker-75fd65fb96-k77sp.mcloud-stage3-dus2.hubble-grpc.cilium.io\"}" subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg="Received peer change notification" change notification="name:\"mcloud-stage3-dus2/worker-75fd65fb96-sfjhv\" address:\"192.168.4.167:4244\" type:PEER_ADDED tls:{server_name:\"worker-75fd65fb96-sfjhv.mcloud-stage3-dus2.hubble-grpc.cilium.io\"}" subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg="Received peer change notification" change notification="name:\"mcloud-stage4-ham1/worker-699cd449c9-7qs79\" address:\"192.168.5.89:4244\" type:PEER_ADDED tls:{server_name:\"worker-699cd449c9-7qs79.mcloud-stage4-ham1.hubble-grpc.cilium.io\"}" subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg=Connecting address="192.168.5.89:4244" hubble-tls=true peer=mcloud-stage4-ham1/worker-699cd449c9-7qs79 subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg=Connecting address="192.168.4.167:4244" hubble-tls=true peer=mcloud-stage3-dus2/worker-75fd65fb96-sfjhv subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg=Connecting address="192.168.4.178:4244" hubble-tls=true peer=mcloud-stage3-dus2/worker-75fd65fb96-f2fw6 subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg=Connecting address="192.168.5.211:4244" hubble-tls=true peer=mcloud-stage4-ham1/worker-699cd449c9-wkccb subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg=Connecting address="192.168.5.152:4244" hubble-tls=true peer=mcloud-stage4-ham1/worker-699cd449c9-d6d9w subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg=Connecting address="192.168.4.40:4244" hubble-tls=true peer=mcloud-stage3-dus2/worker-75fd65fb96-k77sp subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=debug msg="Client mtls handshake" config=tls-to-hubble keypair-sn=bb575e2375c1b5cb3e348bf24b2d305 subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=debug msg="Client mtls handshake" config=tls-to-hubble keypair-sn=bb575e2375c1b5cb3e348bf24b2d305 subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg=Connected address="192.168.5.89:4244" hubble-tls=true peer=mcloud-stage4-ham1/worker-699cd449c9-7qs79 subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=debug msg="Client mtls handshake" config=tls-to-hubble keypair-sn=bb575e2375c1b5cb3e348bf24b2d305 subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg=Connected address="192.168.5.211:4244" hubble-tls=true peer=mcloud-stage4-ham1/worker-699cd449c9-wkccb subsys=hubble-relay
time="2024-10-28T12:20:23Z" level=info msg=Connected address="192.168.5.152:4244" hubble-tls=true peer=mcloud-stage4-ham1/worker-699cd449c9-d6d9w subsys=hubble-relay
time="2024-10-28T12:20:53Z" level=warning msg="Failed to create gRPC client" address="192.168.4.178:4244" error="context deadline exceeded: connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\"" hubble-tls=true next-try-in=1s peer=mcloud-stage3-dus2/worker-75fd65fb96-f2fw6 subsys=hubble-relay
time="2024-10-28T12:20:53Z" level=warning msg="Failed to create gRPC client" address="192.168.4.167:4244" error="context deadline exceeded: connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\"" hubble-tls=true next-try-in=1s peer=mcloud-stage3-dus2/worker-75fd65fb96-sfjhv subsys=hubble-relay
time="2024-10-28T12:20:53Z" level=warning msg="Failed to create gRPC client" address="192.168.4.40:4244" error="context deadline exceeded: connection error: desc = \"transport: authentication handshake failed: context deadline exceeded\"" hubble-tls=true next-try-in=1s peer=mcloud-stage3-dus2/worker-75fd65fb96-k77sp subsys=hubble-relay
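
A side note on the timing in the log above: the failures are logged exactly when the dial timeout expires, which suggests the handshake hangs rather than being actively rejected. The small sketch below checks that arithmetic. It assumes that dialTimeout:30000000000 is a raw nanosecond count (as Go durations are stored); the timestamps are copied from the "Connecting" and "Failed to create gRPC client" lines.

```python
from datetime import datetime, timedelta


def parse_log_time(value: str) -> datetime:
    """Parse an RFC 3339 timestamp such as 2024-10-28T12:20:23Z."""
    # Older Python versions do not accept the trailing "Z", so spell out UTC.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


# Assumption: the dialTimeout option is printed as nanoseconds.
dial_timeout_ns = 30000000000
dial_timeout = timedelta(microseconds=dial_timeout_ns // 1000)

connecting = parse_log_time("2024-10-28T12:20:23Z")  # msg=Connecting
failed = parse_log_time("2024-10-28T12:20:53Z")      # msg="Failed to create gRPC client"
elapsed = failed - connecting

print(f"dial timeout: {dial_timeout.total_seconds():.0f}s")
print(f"elapsed until failure: {elapsed.total_seconds():.0f}s")
```

Both values come out as 30 seconds, so the remote connections never get past the handshake before the deadline.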

While debugging, we realized that the TLS handshake already fails right after the client hello is sent:

/ # openssl s_client -showcerts -servername worker-699cd449c9-7qs79.mcloud-stage4-ham1.hubble-grpc.cilium.io -connect 192.168.5.89:4244 -CAfile /var/lib/hubble-relay/tls/hubble-server-ca.crt -debug -state
CONNECTED(00000003)
SSL_connect:before SSL initialization
write to 0x7fe304247c70 [0x7fe304231610] (366 bytes => 366 (0x16E))
0000 - 16 03 01 01 69 01 00 01-65 03 03 d4 89 75 1e fa   ....i...e....u..
0010 - 3c 08 af b0 d9 16 2a 07-08 b7 38 da ad e1 e0 3e   <.....*...8....>
0020 - 85 47 0d 21 45 0e 69 b5-8f ba d1 20 58 88 2e b5   .G.!E.i.... X...
0030 - 70 01 9f 7f 5a ce 8a 76-6d 74 7a 72 32 09 1b f8   p...Z..vmtzr2...
0040 - 5d c0 49 ad 6c ef d2 84-3a 71 5f 29 00 3e 13 02   ].I.l...:q_).>..
0050 - 13 03 13 01 c0 2c c0 30-00 9f cc a9 cc a8 cc aa   .....,.0........
0060 - c0 2b c0 2f 00 9e c0 24-c0 28 00 6b c0 23 c0 27   .+./...$.(.k.#.'
0070 - 00 67 c0 0a c0 14 00 39-c0 09 c0 13 00 33 00 9d   .g.....9.....3..
0080 - 00 9c 00 3d 00 3c 00 35-00 2f 00 ff 01 00 00 de   ...=.<.5./......
0090 - 00 00 00 45 00 43 00 00-40 77 6f 72 6b 65 72 2d   ...E.C..@worker-
00a0 - 36 39 39 63 64 34 34 39-63 39 2d 37 71 73 37 39   699cd449c9-7qs79
00b0 - 2e 6d 63 6c 6f 75 64 2d-73 74 61 67 65 34 2d 68   .mcloud-stage4-h
00c0 - 61 6d 31 2e 68 75 62 62-6c 65 2d 67 72 70 63 2e   am1.hubble-grpc.
00d0 - 63 69 6c 69 75 6d 2e 69-6f 00 0b 00 04 03 00 01   cilium.io.......
00e0 - 02 00 0a 00 16 00 14 00-1d 00 17 00 1e 00 19 00   ................
00f0 - 18 01 00 01 01 01 02 01-03 01 04 00 23 00 00 00   ............#...
0100 - 16 00 00 00 17 00 00 00-0d 00 2a 00 28 04 03 05   ..........*.(...
0110 - 03 06 03 08 07 08 08 08-09 08 0a 08 0b 08 04 08   ................
0120 - 05 08 06 04 01 05 01 06-01 03 03 03 01 03 02 04   ................
0130 - 02 05 02 06 02 00 2b 00-05 04 03 04 03 03 00 2d   ......+........-
0140 - 00 02 01 01 00 33 00 26-00 24 00 1d 00 20 b9 71   .....3.&.$... .q
0150 - 53 c9 ce 35 b3 58 04 0c-9a 59 9b 8d 1c f6 ae fb   S..5.X...Y......
0160 - 77 bc 9f 16 d1 b3 5a c6-cf 67 68 ce c2 5e         w.....Z..gh..^
SSL_connect:SSLv3/TLS write client hello
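
For reference, the first bytes of the dump above can be decoded to confirm that the 366 bytes written are one complete ClientHello record, so the small outgoing message itself is sent in full. This is a minimal sketch; the function names are made up for illustration, and only the first nine bytes of the dump are used.

```python
def parse_tls_record_header(data: bytes) -> dict:
    """Decode the 5-byte TLS record header: type, version, payload length."""
    if len(data) < 5:
        raise ValueError("need at least 5 bytes for a TLS record header")
    return {
        "content_type": data[0],                      # 0x16 = handshake
        "version": (data[1], data[2]),                # legacy record version
        "length": int.from_bytes(data[3:5], "big"),   # payload bytes that follow
    }


def parse_handshake_header(data: bytes) -> dict:
    """Decode the 4-byte handshake header that follows the record header."""
    if len(data) < 9:
        raise ValueError("need at least 9 bytes for record + handshake headers")
    return {
        "msg_type": data[5],                          # 0x01 = ClientHello
        "length": int.from_bytes(data[6:9], "big"),   # handshake body length
    }


# First nine bytes of the hex dump: 16 03 01 01 69 01 00 01 65
first_bytes = bytes.fromhex("16 03 01 01 69 01 00 01 65")
record = parse_tls_record_header(first_bytes)
handshake = parse_handshake_header(first_bytes)

print(record)
print(handshake)
print("total bytes on the wire:", 5 + record["length"])
```

The record header announces 361 payload bytes, and 5 + 361 matches the 366 bytes reported by openssl. Nothing is read back after that, which would be consistent with the larger server reply (which carries the certificate) not making it through the tunnel, although the dump alone does not prove it.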

While searching for the error, we found https://stackoverflow.com/questions/40009474/openssl-hangs-at-connected00000003, which suggests that this is an MTU issue. Indeed, the error is gone after lowering the MTU to 1400.
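
To narrow down the largest packet size that actually survives the path, a binary search with non-fragmentable pings can help. The sketch below is only an illustration under several assumptions: IPv4, Linux iputils ping (its -M do option sets the don't-fragment bit), and that every size below the real path MTU passes. The helper names and the search bounds are hypothetical, not part of Cilium or Hubble.

```python
import subprocess

# 20-byte IPv4 header + 8-byte ICMP header (assumes IPv4 without options).
IP_ICMP_OVERHEAD = 28


def ping_fits(host: str, packet_size: int) -> bool:
    """Return True if a don't-fragment ICMP echo of packet_size bytes is answered."""
    payload = packet_size - IP_ICMP_OVERHEAD
    result = subprocess.run(
        ["ping", "-M", "do", "-c", "1", "-W", "1", "-s", str(payload), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def find_path_mtu(fits, low: int = 576, high: int = 8942) -> int:
    """Binary-search the largest size in [low, high] for which fits(size) is True."""
    if not fits(low):
        raise ValueError(f"even {low}-byte packets do not get through")
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if fits(mid):
            low = mid       # mid works, the answer is mid or larger
        else:
            high = mid - 1  # mid is too big, the answer is smaller
    return low
```

Run from a node or pod on one side against a node IP on the other side, for example `find_path_mtu(lambda size: ping_fits("192.168.4.178", size))`. If the result is well below 8942, the tunnel is dropping large packets, which matches the behaviour described above.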