CC @brb
@networkop Thanks for the issue. Two things: could you run cilium bpf ipcache get 10.0.255.7 from the node which runs the client pod, and could you provide a sysdump (https://docs.cilium.io/en/v1.8/troubleshooting/#automatic-log-state-collection)?
sure, here's the bpf ipcache output first
$ cilium bpf ipcache get 10.0.255.7
10.0.255.7 maps to identity 6 0 0.0.0.0
@brb where can I upload the sysdump?
oh wow, didn't know about GH's drag-and-drop cilium-sysdump-20200827-160052.zip
@networkop Could you try without tunneling? You can do that by setting auto-direct-node-routes: true and tunnel: disabled in the ConfigMap and then restarting all cilium-agent pods. Also, you need to make sure that the underlying Azure network won't drop packets from/to a subnet it doesn't know about (i.e. the Pod CIDR).
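For reference, that corresponds roughly to the following entries in the cilium-config ConfigMap (just a sketch, assuming the stock Cilium 1.8 ConfigMap layout; all values are strings):

data:
  # switch from vxlan tunneling to native routing
  tunnel: "disabled"
  # have each agent install routes to the other nodes' Pod CIDRs
  auto-direct-node-routes: "true"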
ok, I'll try. do you want me to collect any logs or just verify that it works?
Thanks. Just to verify that it works.
I've done the test; however, I still don't have e2e connectivity. In addition to changing the above flags, I also had to specify native-routing-cidr to stop the agent from crashing; I set it to the same value as the Pod CIDR (in my case 10.0.0.0/18).
Additionally, in Azure I enabled "IP forwarding" on the NIC of the worker node I was testing, to stop it from dropping unknown packets.
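For reference, the cilium-config entries I ended up with look roughly like this (a sketch; the CIDR below is just my environment's Pod CIDR):

data:
  tunnel: "disabled"
  auto-direct-node-routes: "true"
  # needed to stop the agent from crashing once tunneling is disabled
  native-routing-cidr: "10.0.0.0/18"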
The only new thing I've noticed is that the tcpdump now looks a bit more like what you expected in the beginning:
15:14:57.811789 IP 10.0.6.173.50440 > 10.0.255.10.6443: Flags [S], seq 3237958959, win 65280, options [mss 1360,sackOK,TS val 2709627184 ecr 0,nop,wscale 7], length 0
15:14:59.162251 IP 10.0.255.4.54334 > 10.0.255.10.6443: Flags [S], seq 2787558301, win 64240, options [mss 1460,sackOK,TS val 560436112 ecr 0,nop,wscale 7], length 0
15:14:59.163323 IP 10.0.255.10.6443 > 10.0.255.4.54334: Flags [S.], seq 4113293485, ack 2787558302, win 65160, options [mss 1418,sackOK,TS val 2268648707 ecr 560436112,nop,wscale 7], length 0
15:14:59.163377 IP 10.0.255.4.54334 > 10.0.255.10.6443: Flags [.], ack 1, win 502, options [nop,nop,TS val 560436113 ecr 2268648707], length 0
15:14:59.163417 IP 10.0.255.4.54334 > 10.0.255.10.6443: Flags [P.], seq 1:518, ack 1, win 502, options [nop,nop,TS val 560436113 ecr 2268648707], length 517
So it looks like there is a SNAT'ed packet; however, I can't understand why the translated session keeps sending packets when I don't even see a SYN-ACK inside the pod. So maybe it's a red herring.
Hmm. Would it be possible to get access to your Azure cluster? (I'm martynas on Cilium's public Slack.)
Same bug here; marking it to follow.
@Mengkzhaoyun What is your setup and configuration?
I deployed cilium-dev:v1.9.0-rc0 with Kubernetes v1.18.8 in kube-proxy-free mode (3-master HA with kube-vip). When testing the deployment on three Ubuntu 20.04 Hyper-V VMs on Windows 10, I found that my kubernetes-dashboard pod cannot access https://kube-apiserver:6443 (the apiserver service endpoint); the connection times out.
From a shell inside the dashboard pod, I cannot curl the apiserver on the other host, but from a shell on the pod's VM host, curl https://kube-apiserver.othervm.local:6443/version succeeds.
After changing enable-bpf-masquerade to false, my kubernetes-dashboard pod works correctly. Here is my config:
{
"auto-direct-node-routes": "false",
"bpf-lb-map-max": "65536",
"bpf-map-dynamic-size-ratio": "0.0025",
"bpf-policy-map-max": "16384",
"cluster-name": "default",
"cluster-pool-ipv4-cidr": "10.0.0.0/11",
"cluster-pool-ipv4-mask-size": "24",
"debug": "false",
"disable-cnp-status-updates": "true",
"disable-envoy-version-check": "true",
"enable-auto-protect-node-port-range": "true",
"enable-bpf-clock-probe": "true",
"enable-bpf-masquerade": "false",
"enable-endpoint-health-checking": "true",
"enable-external-ips": "true",
"enable-health-check-nodeport": "true",
"enable-host-port": "true",
"enable-host-reachable-services": "true",
"enable-hubble": "true",
"enable-ipv4": "true",
"enable-ipv6": "false",
"enable-node-port": "true",
"enable-remote-node-identity": "true",
"enable-session-affinity": "true",
"enable-well-known-identities": "false",
"enable-xt-socket-fallback": "true",
"hubble-listen-address": ":4244",
"hubble-socket-path": "/var/run/cilium/hubble.sock",
"identity-allocation-mode": "crd",
"install-iptables-rules": "true",
"ipam": "cluster-pool",
"k8s-require-ipv4-pod-cidr": "true",
"k8s-require-ipv6-pod-cidr": "false",
"kube-proxy-replacement": "partial",
"masquerade": "true",
"monitor-aggregation": "medium",
"monitor-aggregation-flags": "all",
"monitor-aggregation-interval": "5s",
"node-port-bind-protection": "true",
"node-port-range": "20,65000",
"operator-api-serve-addr": "127.0.0.1:9234",
"preallocate-bpf-maps": "false",
"sidecar-istio-proxy-image": "cilium/istio_proxy",
"tunnel": "vxlan",
"wait-bpf-mount": "false"
}
@Mengkzhaoyun Thanks. Do you run cilium on the master nodes?
Yes, all 3 master nodes run the cilium agent.
@Mengkzhaoyun Can you please ping me (martynas) on Cilium's public Slack?
@brb I'm pretty sure this is related to the issue I was last discussing with you guys as well, where I had to revert to iptables.
@networkop @Mengkzhaoyun OK, so what happens in your case is that the wrong IPv4 addr is picked for SNAT, because the devices selected for NodePort have more than one IPv4 addr. Currently, cilium-agent can support only one per bpf_host.o instance, and it does not check which IP addr is used to communicate between k8s nodes (aka the k8s Node IP). The fix would be to prefer the k8s Node IP when selecting that single IP addr.
If any of you wants to work on the fix (a good opportunity to familiarize with Cilium internals), I'm happy to guide you.
I'm happy to pick it up @brb
The fix might be simple: in the function https://github.com/cilium/cilium/blob/master/pkg/node/address.go#L128 you need to pass the k8s Node IPv{4,6} addr to the invocations of firstGlobalV{4,6}Addr(...) as the second parameter.
Adding the CILIUM_IPV4_NODE env var to the cilium-agent container can temporarily work around the bug:
env:
  - name: K8S_NODE_NAME
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
  - name: CILIUM_IPV4_NODE
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: spec.nodeName
@Mengkzhaoyun that makes my deployment crash, as the nodeName is a hostname... are you aware of another property that yields the IP definitively?
level=fatal msg="Invalid IPv4 node address" ipAddr=<node hostname> subsys=daemon
Here's a better way, I suppose:
- name: CILIUM_IPV4_NODE
  valueFrom:
    fieldRef:
      apiVersion: v1
      fieldPath: status.podIP
Didn't help my scenario so I must be hitting a different bug altogether.
JFYI: CILIUM_IPV4_NODE doesn't have anything to do with fixing the issue.
> @brb I'm pretty sure this is related to the issue I was last discussing with you guys as well, where I had to revert to iptables.

As per the offline discussion, it's not related.
Here's a brief summary of the issues I was hitting:
The main issue was caused by the fact that one of the k8s nodes had multiple outgoing interfaces. Specifically, the interface used to communicate with the affected nodes was a WireGuard interface, which doesn't have an L2 (MAC) address and won't be supported until #12317 is resolved.
The secondary issue was that the cilium agent picks the wrong SNAT IP for traffic leaving the primary (non-wg) interface. This interface is a bond with one public and one private IP, and Cilium was picking the public IP, ignoring the fact that the k8s node had its InternalIP set to the private IP. The following patch fixed this issue for me:
diff --git a/pkg/node/address.go b/pkg/node/address.go
index a5fa3dadd..7c438a9d8 100644
--- a/pkg/node/address.go
+++ b/pkg/node/address.go
@@ -129,10 +129,14 @@ func InitNodePortAddrs(devices []string) error {
     if option.Config.EnableIPv4 {
         ipv4NodePortAddrs = make(map[string]net.IP, len(devices))
         for _, device := range devices {
-            ip, err := firstGlobalV4Addr(device, nil)
+            ip, err := firstGlobalV4Addr(device, GetK8sNodeIP())
             if err != nil {
                 return fmt.Errorf("Failed to determine IPv4 of %s for NodePort", device)
             }
+            log.WithFields(logrus.Fields{
+                "device": device,
+                "ip":     ip,
+            }).Info("Pinning node IPs")
             ipv4NodePortAddrs[device] = ip
         }
     }
@@ -140,7 +144,7 @@
     if option.Config.EnableIPv6 {
         ipv6NodePortAddrs = make(map[string]net.IP, len(devices))
         for _, device := range devices {
-            ip, err := firstGlobalV6Addr(device, nil)
+            ip, err := firstGlobalV6Addr(device, GetK8sNodeIP())
             if err != nil {
                 return fmt.Errorf("Failed to determine IPv6 of %s for NodePort", device)
             }
diff --git a/pkg/node/address_linux.go b/pkg/node/address_linux.go
index a05a3f38d..bfa16fc73 100644
--- a/pkg/node/address_linux.go
+++ b/pkg/node/address_linux.go
@@ -86,7 +86,7 @@ retryScope:
     }
     if len(ipsPublic) != 0 {
-        if hasPreferred && ip.IsPublicAddr(preferredIP) {
+        if hasPreferred {
             return preferredIP, nil
         }
cilium monitor -t drop helped identify that. Thanks @brb for the support; let me know if you want me to open a PR with the above fix, or if it's OK to just close this issue for now.
@networkop, good job, it works.
@networkop can you push a PR with that fix? that would be great!
PR is ready for review :+1:
Bug report
When Cilium is configured as a kube-proxy replacement, it fails to masquerade the source IP of pods when the target pod is in the hostNetwork of one of the k8s nodes.

General Information
How to reproduce the issue
Additional details
I have a test pod (10.0.6.223) trying to connect to the API server on 10.0.128.1:443. When doing a tcpdump on the underlying node, I cannot see SNAT'ed packets. Since the source IP of the pod is not masqueraded, it gets dropped by Azure's networking stack.
According to @brb's comment, it should have been translated at TC egress; however, I don't see that happening (see the tcpdump above).
As soon as I change enable-bpf-masquerade to "false" and restart the agent on the node, connectivity gets restored. I've got the environment up and running for a while, so I'm happy to collect any additional logs/outputs.
Here's the cilium configmap.
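For quick reference, the workaround described above boils down to this fragment of that ConfigMap (a sketch; only the relevant keys are shown), followed by restarting the cilium-agent pods:

data:
  # fall back to iptables-based masquerading
  enable-bpf-masquerade: "false"
  # keep masquerading itself enabled
  masquerade: "true"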