flannel-io / flannel

flannel is a network fabric for containers, designed for Kubernetes

Node is NAT'd and doesn't know its IP address: on a hybrid cluster using wireguard-native, the peer endpoint is wrong #1889

Open vast0906 opened 4 months ago

vast0906 commented 4 months ago

Cluster Configuration:

server:

  1. master EXTERNAL-IP: xx.xx.xx.xx INTERNAL-IP: 10.0.8.17

nodes:

  1. node-x86 (NAT'd and doesn't know its public IP address): EXTERNAL-IP: xx.xx.xx.yy INTERNAL-IP: 192.168.36.22

  2. node-arm: EXTERNAL-IP: xx.xx.xx.zz INTERNAL-IP: 10.0.1.217

k3s server install on the master:

```sh
export PUBLIC_IP=`curl -sSL https://ipconfig.sh`
export INSTALL_K3S_EXEC="--disable servicelb --kube-proxy-arg proxy-mode=ipvs  --kube-proxy-arg masquerade-all=true --kube-proxy-arg metrics-bind-address=0.0.0.0  --disable traefik --node-ip 10.0.8.17 --node-external-ip $PUBLIC_IP --flannel-backend wireguard-native --flannel-external-ip"
curl -sfL https://get.k3s.io | sh -
```
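For completeness, the NAT'd node-x86 was joined as a normal k3s agent; roughly like the sketch below, where the server URL and token are placeholders for this environment:

```sh
# On node-x86 (behind NAT, only knows its internal address 192.168.36.22).
# K3S_URL and K3S_TOKEN values are placeholders for this environment.
export K3S_URL=https://xx.xx.xx.xx:6443
export K3S_TOKEN=<token-from-the-server>
export INSTALL_K3S_EXEC="agent --node-ip 192.168.36.22"
curl -sfL https://get.k3s.io | sh -
```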
- master wg show

```
# wg show flannel-wg
interface: flannel-wg
  public key: Wxxxx
  private key: (hidden)
  listening port: 51820

peer: hldi2xxxx
  endpoint: xx.xx.xx.zz:51820
  allowed ips: 10.42.2.0/24
  latest handshake: 25 seconds ago
  transfer: 11.72 MiB received, 6.53 MiB sent
  persistent keepalive: every 25 seconds

peer: Ap//Dxxx
  endpoint: 192.168.36.22:51820  # It's wrong
  allowed ips: 10.42.5.0/24
  transfer: 0 B received, 33.39 KiB sent
  persistent keepalive: every 25 seconds
```

- node-x86 wg show

```
peer: hldi2xxx
  endpoint: xx.xx.xx.zz:51820
  allowed ips: 10.42.2.0/24
  latest handshake: 28 seconds ago
  transfer: 1.52 KiB received, 3.16 KiB sent
  persistent keepalive: every 25 seconds

peer: Ww7xx
  endpoint: xx.xx.xx.xx:51820
  allowed ips: 10.42.0.0/24
  transfer: 0 B received, 30.06 KiB sent
  persistent keepalive: every 25 seconds
```

- node-arm wg show

```
interface: flannel-wg
  public key: hldi26xxxx
  private key: (hidden)
  listening port: 51820

peer: Ww7xxxx
  endpoint: xx.xx.xx.xx:51820
  allowed ips: 10.42.0.0/24
  latest handshake: 8 seconds ago
  transfer: 6.53 MiB received, 15.16 MiB sent
  persistent keepalive: every 25 seconds

peer: Ap//xxxx
  endpoint: xx.xx.xx.yy:8598  # that's right
  allowed ips: 10.42.5.0/24
  latest handshake: 1 minute, 12 seconds ago
  transfer: 2.86 KiB received, 2.04 KiB sent
  persistent keepalive: every 25 seconds
```


## Expected Behavior

- master wg show 

```
wg show flannel-wg

interface: flannel-wg
  public key: Wxxxx
  private key: (hidden)
  listening port: 51820

peer: hldi2xxxx
  endpoint: xx.xx.xx.zz:51820
  allowed ips: 10.42.2.0/24
  latest handshake: 25 seconds ago
  transfer: 11.72 MiB received, 6.53 MiB sent
  persistent keepalive: every 25 seconds

peer: Ap//Dxxx
  endpoint: xx.xx.xx.yy:8598  # that's right
  allowed ips: 10.42.5.0/24
  transfer: 0 B received, 33.39 KiB sent
  persistent keepalive: every 25 seconds
```


## Current Behavior

- master wg show 

```
wg show flannel-wg

interface: flannel-wg
  public key: Wxxxx
  private key: (hidden)
  listening port: 51820

peer: hldi2xxxx
  endpoint: xx.xx.xx.zz:51820
  allowed ips: 10.42.2.0/24
  latest handshake: 25 seconds ago
  transfer: 11.72 MiB received, 6.53 MiB sent
  persistent keepalive: every 25 seconds

peer: Ap//Dxxx
  endpoint: 192.168.36.22:51820  # It's wrong
  allowed ips: 10.42.5.0/24
  transfer: 0 B received, 33.39 KiB sent
  persistent keepalive: every 25 seconds
```



## Possible Solution

The master and the other nodes should consistently use the endpoint negotiated by WireGuard (the source address and port observed during the latest handshake) instead of the NAT'd node's advertised internal IP, so that a node behind NAT stays reachable.
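As a quick manual check (not the proposed fix), the endpoint that flannel wrote for the NAT'd peer can be overridden on the master with plain `wg`; the public key is truncated and the address/port come from the node-arm output above, and flannel may later overwrite this again:

```sh
# On the master: point the NAT'd peer at the address/port its NAT actually
# presents (xx.xx.xx.yy:8598, as seen from node-arm above).
wg set flannel-wg peer 'Ap//Dxxx...' endpoint xx.xx.xx.yy:8598

# Confirm the handshake starts succeeding
wg show flannel-wg latest-handshakes
```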

## Steps to Reproduce (for bugs)
1. Install a k3s server with `--flannel-backend wireguard-native --flannel-external-ip` and a public `--node-external-ip` (see the commands above).
2. Join an agent node that sits behind NAT and only knows its internal address (node-x86, 192.168.36.22).
3. Join a second agent node with a routable address (node-arm).
4. Run `wg show flannel-wg` on the master: the peer for the NAT'd node lists the node's internal IP as its endpoint instead of the address/port learned from the WireGuard handshake.

## Context

## Your Environment
* Flannel version:
* Backend used (e.g. vxlan or udp): wireguard-native
* Etcd version:
* Kubernetes version (if used): k3s version v1.28.6+k3s2 (https://github.com/k3s-io/k3s/commit/c9f49a3b06cd7ebe793f8cc1dcd0293168e743d9), go version go1.20.13
* Operating System and version:
* Link to your project (optional):

My English is not very good; please see this [issue](https://github.com/k3s-io/k3s/issues/9535) for the full details. Thank you.
manuelbuil commented 4 months ago

Hey again! In your proposal, you are talking about server-client communication, where the client knows the endpoint of the server but the server only knows the public key of the client. In this scenario, the client can communicate with the server, but the server can't communicate with the client until the client contacts it first, right?

The problem with the previous approach in Kubernetes is that the architecture is not server-client when it comes to pod-pod communication. We are creating a mesh of tunnels between the nodes. Imagine a cluster of 3 nodes (node1, node2 and node3); I see, for example, two problems:

1. When node3 comes up, should it know the endpoint of node1 and node2, or only node1? How do we decide that?
2. Imagine it knows the endpoint of both node1 and node2, but node1 and node2 don't know the endpoint of node3. If I understand correctly, node1 and node2 can't communicate with node3 unless node3 tries to communicate with them first. That means pods on node1 and node2 won't be able to contact pods on node3, right?
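To make problem 2 concrete, here is a rough sketch with plain `wg` (outside flannel): node1 can only configure node3 as a peer without an endpoint, because node3 is behind NAT and its public address/port are not known in advance; the public key and subnet below are placeholders:

```sh
# Hypothetical node1: node3's peer entry has no endpoint, because node3 sits
# behind NAT and nobody knows its public address/port until it initiates.
wg set flannel-wg peer '<NODE3_PUBKEY>' allowed-ips 10.42.3.0/24

# Until node3 sends a handshake to node1, traffic from node1 towards
# 10.42.3.0/24 has nowhere to go; the endpoint column stays "(none)".
wg show flannel-wg endpoints
```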

vast0906 commented 4 months ago

> client can communicate with the server but the server can't communicate with the client until the client contacts it first, right?

Yes.

There is no conflict between server-client and pod-pod. The pod-pod network is a tunnel created through the server-client connection; pods can only communicate with each other after the server-client connection has been established and the tunnel exists.
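A simple way to check whether the tunnel actually carries pod-to-pod traffic once the handshake exists (a sketch; the pod name, node name and target pod IP are placeholders):

```sh
# Run a throwaway pod pinned to the NAT'd node (node name is an assumption)
kubectl run nat-test --image=busybox --restart=Never \
  --overrides='{"spec":{"nodeName":"node-x86"}}' -- sleep 3600

# Ping a pod IP on node-arm (10.42.2.0/24 is node-arm's subnet per wg show)
kubectl exec nat-test -- ping -c 3 10.42.2.10
```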

manuelbuil commented 4 months ago

> client can communicate with the server but the server can't communicate with the client until the client contacts it first, right?
>
> Yes.
>
> There is no conflict between server-client and pod-pod. The pod-pod network is a tunnel created through the server-client connection; pods can only communicate with each other after the server-client connection has been established and the tunnel exists.

Right, but the server needs to wait for the client to contact it. What if the client never contacts the server?

vast0906 commented 4 months ago

> Right, but the server needs to wait for the client to contact it. What if the client never contacts the server?

WireGuard contacts the server when it starts up; if the client never contacts the server, that means the node is not ready.
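For reference, this is the persistent keepalive already visible in the `wg show` output: the NAT'd client sends a packet every 25 seconds so the NAT mapping toward the server stays open. A minimal hand-configured equivalent (the server public key is a placeholder):

```sh
# On the NAT'd node: keep the NAT mapping toward the server alive so the
# server always has a fresh, reachable endpoint for this peer.
wg set flannel-wg peer '<SERVER_PUBKEY>' persistent-keepalive 25
```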

manuelbuil commented 4 months ago

> WireGuard contacts the server when it starts up; if the client never contacts the server, that means the node is not ready.

Imagine we have 2 nodes: one node is the k8s control-plane and one node is a k8s agent behind a NAT (let's call it node1). In this case, I can see your suggestion working.

However, what happens if we add a new k8s agent node behind a NAT (let's call it node2)? We need to know the endpoint of node1 or node2 to create the tunnel between those two nodes, right?

vast0906 commented 4 months ago

> Imagine we have 2 nodes: one node is the k8s control-plane and one node is a k8s agent behind a NAT (let's call it node1). In this case, I can see your suggestion working.
>
> However, what happens if we add a new k8s agent node behind a NAT (let's call it node2)? We need to know the endpoint of node1 or node2 to create the tunnel between those two nodes, right?

I'm not sure whether the WireGuard master will synchronize all of the endpoint information to the other nodes.
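In case it helps: with the Kubernetes subnet manager (which k3s uses), flannel publishes each node's public IP and WireGuard public key as node annotations, and every node builds its WireGuard peer list from those. A rough way to check what the other nodes will see (node name taken from this issue):

```sh
# Show the flannel annotations for the NAT'd node; the endpoint the peers
# configure comes from this data, not from WireGuard's own discovery.
kubectl get node node-x86 -o yaml | grep 'flannel.alpha.coreos.com'
```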