cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

Wrong MAC address when using Cilium service mesh #22414

Closed (ycydk closed this issue 1 year ago)

ycydk commented 1 year ago

What happened?

cilium config

agent-not-ready-taint-key: node.cilium.io/agent-not-ready
arping-refresh-period: 30s
auto-direct-node-routes: "true"
bpf-lb-external-clusterip: "false"
bpf-lb-map-max: "65536"
bpf-map-dynamic-size-ratio: "0.0025"
bpf-policy-map-max: "16384"
bpf-root: /sys/fs/bpf
cgroup-root: /run/cilium/cgroupv2
cilium-endpoint-gc-interval: 5m0s
cluster-id: "0"
cluster-name: default
custom-cni-conf: "false"
debug: "false"
disable-cnp-status-updates: "true"
disable-endpoint-crd: "false"
enable-auto-protect-node-port-range: "true"
enable-bandwidth-manager: "true"
enable-bgp-control-plane: "false"
enable-bpf-clock-probe: "true"
enable-bpf-masquerade: "false"
enable-bpf-tproxy: "true"
enable-endpoint-health-checking: "true"
enable-endpoint-routes: "false"
enable-envoy-config: "true"
enable-health-check-nodeport: "true"
enable-health-checking: "true"
enable-host-legacy-routing: "false"
enable-hubble: "true"
enable-ingress-controller: "true"
enable-ingress-secrets-sync: "true"
enable-ipv4: "true"
enable-ipv4-masquerade: "true"
enable-ipv6: "false"
enable-ipv6-masquerade: "true"
enable-k8s-terminating-endpoint: "true"
enable-l2-neigh-discovery: "true"
enable-l7-proxy: "true"
enable-local-node-route: "true"
enable-local-redirect-policy: "false"
enable-policy: default
enable-remote-node-identity: "true"
enable-svc-source-range-check: "true"
enable-vtep: "false"
enable-well-known-identities: "false"
enable-xt-socket-fallback: "true"
enforce-ingress-https: "true"
hubble-disable-tls: "false"
hubble-listen-address: :4244
hubble-socket-path: /var/run/cilium/hubble.sock
hubble-tls-cert-file: /var/lib/cilium/tls/hubble/server.crt
hubble-tls-client-ca-files: /var/lib/cilium/tls/hubble/client-ca.crt
hubble-tls-key-file: /var/lib/cilium/tls/hubble/server.key
identity-allocation-mode: crd
ingress-lb-annotation-prefixes: service.beta.kubernetes.io service.kubernetes.io
  cloud.google.com
ingress-secrets-namespace: cilium-secrets
install-iptables-rules: "true"
install-no-conntrack-iptables-rules: "false"
ipam: kubernetes
ipv4-native-routing-cidr: 172.20.0.0/16
kube-proxy-replacement: strict
kube-proxy-replacement-healthz-bind-address: ""
monitor-aggregation: medium
monitor-aggregation-flags: all
monitor-aggregation-interval: 5s
node-port-bind-protection: "true"
nodes-gc-interval: 5m0s
operator-api-serve-addr: 127.0.0.1:9234
preallocate-bpf-maps: "false"
procfs: /host/proc
remove-cilium-node-taints: "true"
set-cilium-is-up-condition: "true"
sidecar-istio-proxy-image: cilium/istio_proxy
synchronize-k8s-nodes: "true"
tofqdns-dns-reject-response-code: refused
tofqdns-enable-dns-compression: "true"
tofqdns-endpoint-max-ip-per-hostname: "50"
tofqdns-idle-connection-grace-period: 0s
tofqdns-max-deferred-connection-deletes: "10000"
tofqdns-min-ttl: "3600"
tofqdns-proxy-response-max-delay: 100ms
tunnel: disabled
unmanaged-pod-watcher-interval: "15"
vtep-cidr: ""
vtep-endpoint: ""
vtep-mac: ""
vtep-mask: ""
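
With tunnel: disabled and auto-direct-node-routes: "true", cross-node pod traffic is natively routed, so the destination MAC of a packet leaving the node should resolve to the peer node (or next hop). As a rough sketch of how to check what the node actually has installed (the interface name is a placeholder and 172.20.0.0/16 is the pod CIDR from the config above):

# Direct routes to the other nodes' pod CIDRs installed by auto-direct-node-routes
ip route show | grep 172.20.

# Neighbor entries for the peer nodes; cilium-agent typically keeps PERMANENT
# entries for them when L2 neighbor discovery is enabled
ip neigh show | grep -i permanent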

What happened

I created an Ingress as follows:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  creationTimestamp: "2022-10-21T06:18:52Z"
  generation: 1
  name: testapp2
  namespace: default
  resourceVersion: "15592191"
  uid: cb272819-97b5-4f76-a5c0-35f5668471ac
spec:
  ingressClassName: cilium
  rules:
  - host: test2.xxxx.xxxx.cn
    http:
      paths:
      - backend:
          service:
            name: apptest2
            port:
              number: 8080
        path: /
        pathType: Exact
status:
  loadBalancer: {}
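
For context, with ingressClassName: cilium the Cilium ingress controller serves this Ingress through the embedded Envoy behind a generated LoadBalancer Service (if I recall correctly it is named cilium-ingress-<ingress-name> in the same namespace). A minimal sanity check, assuming those names, could be:

# Confirm the Ingress got an address and locate the Service fronting it
kubectl get ingress testapp2 -n default
kubectl get svc -n default | grep cilium-ingress

# apptest2:8080 is the backend referenced by the Ingress rule
kubectl get endpoints apptest2 -n default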

HTTP request from the host network namespace

The destination pod is on a remote node.
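
The exact command used was not included; the verbose output below is consistent with a request along these lines (hypothetical invocation, 172.16.60.241 being the ingress address seen in the later pod test):

# -v prints the request/response headers shown below; the host header selects
# the Ingress rule for test2.xxxx.xxxx.cn
curl -v -H 'host: test2.xxxx.xxxx.cn' http://172.16.60.241/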

> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Accept: */*
> host: test2.xxxx.xxxx.cn
>
< HTTP/1.1 200 OK
< server: envoy
< date: Tue, 29 Nov 2022 08:24:40 GMT
< content-type: text/html
< content-length: 612
< last-modified: Tue, 13 Jul 2021 13:54:07 GMT
< etag: "60ed9aff-264"
< accept-ranges: bytes
< x-envoy-upstream-service-time: 1
<
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
    body {
        width: 35em;
        margin: 0 auto;
        font-family: Tahoma, Verdana, Arial, sans-serif;
    }
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>

<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>

<p><em>Thank you for using nginx.</em></p>
</body>
</html>

HTTP request from a pod

The destination pod is on a remote node.

> GET / HTTP/1.1
> User-Agent: curl/7.29.0
> Accept: */*
> host: test2.xxxx.xxxx.cn
>
< HTTP/1.1 503 Service Unavailable
< content-length: 91
< content-type: text/plain
< date: Tue, 29 Nov 2022 08:25:11 GMT
< server: envoy
<
* Connection #0 to host 172.16.60.241 left intact
upstream connect error or disconnect/reset before headers. reset reason: connection failure

tcpdump

I dumped the packets going from Envoy to the remote pod and found that the destination MAC address in the packet is wrong: it should be the MAC address of the remote node, but it appears to be the MAC of the source pod's lxc device.
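
A capture along these lines (a sketch only; the native device name and addresses are placeholders) shows the Ethernet header so the destination MAC can be compared with the remote node's MAC:

# -e prints link-layer (MAC) headers, -n disables name resolution; run this on
# the native device of the node hosting Envoy/the source pod
tcpdump -eni eth0 host <remote-pod-ip> and tcp port 8080

# For comparison, the remote node's MAC as known to this node
ip neigh show | grep <remote-node-ip>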

Cilium Version

Client: 1.12.4 6eaecaf 2022-11-16T05:45:01+00:00 go version go1.18.8 linux/amd64
Daemon: 1.12.4 6eaecaf 2022-11-16T05:45:01+00:00 go version go1.18.8 linux/amd64

Kernel Version

5.10.0-60.56.0.84.oe2203.x86_64

Kubernetes Version

v1.24.3

Sysdump

No response

Relevant log output

No response

Anything else?

No response

pippolo84 commented 1 year ago

Hi @ycydk and thank you for reporting the issue. Could you please attach a sysdump from the environment where you noticed this?

ycydk commented 1 year ago

Hi @pippolo84, thank you for your reply. I collected the sysdump from my environment (attached below). I also found that if I use the following config, it works.

cilium sysdump

cilium-sysdump.zip

cilium config

agent-not-ready-taint-key: node.cilium.io/agent-not-ready
arping-refresh-period: 30s
auto-direct-node-routes: "true"
bpf-lb-external-clusterip: "false"
bpf-lb-map-max: "65536"
bpf-lb-sock: "true"
bpf-lb-sock-hostns-only: "true"
bpf-map-dynamic-size-ratio: "0.0025"
bpf-policy-map-max: "16384"
bpf-root: /sys/fs/bpf
cgroup-root: /run/cilium/cgroupv2
cilium-endpoint-gc-interval: 5m0s
cluster-id: "0"
cluster-name: default
custom-cni-conf: "false"
debug: "false"
disable-cnp-status-updates: "true"
disable-endpoint-crd: "false"
enable-auto-protect-node-port-range: "true"
enable-bandwidth-manager: "true"
enable-bgp-control-plane: "false"
enable-bpf-clock-probe: "true"
enable-bpf-masquerade: "true"
enable-bpf-tproxy: "true"
enable-endpoint-health-checking: "true"
enable-endpoint-routes: "false"
enable-envoy-config: "true"
enable-health-check-nodeport: "true"
enable-health-checking: "true"
enable-host-firewall: "false"
enable-host-legacy-routing: "false"
enable-host-port: "false"
enable-hubble: "true"
enable-ingress-controller: "true"
enable-ingress-secrets-sync: "true"
enable-ipv4: "true"
enable-ipv4-masquerade: "true"
enable-ipv6: "false"
enable-ipv6-masquerade: "true"
enable-k8s-terminating-endpoint: "true"
enable-l2-neigh-discovery: "true"
enable-l7-proxy: "true"
enable-local-node-route: "true"
enable-local-redirect-policy: "false"
enable-node-port: "false"
enable-policy: default
enable-remote-node-identity: "true"
enable-svc-source-range-check: "true"
enable-vtep: "false"
enable-well-known-identities: "false"
enable-xt-socket-fallback: "true"
enforce-ingress-https: "true"
hubble-disable-tls: "false"
hubble-listen-address: :4244
hubble-socket-path: /var/run/cilium/hubble.sock
hubble-tls-cert-file: /var/lib/cilium/tls/hubble/server.crt
hubble-tls-client-ca-files: /var/lib/cilium/tls/hubble/client-ca.crt
hubble-tls-key-file: /var/lib/cilium/tls/hubble/server.key
identity-allocation-mode: crd
ingress-lb-annotation-prefixes: service.beta.kubernetes.io service.kubernetes.io cloud.google.com
ingress-secrets-namespace: cilium-secrets
install-iptables-rules: "true"
install-no-conntrack-iptables-rules: "false"
ipam: kubernetes
ipv4-native-routing-cidr: 172.20.0.0/16
kube-proxy-replacement: strict
kube-proxy-replacement-healthz-bind-address: ""
monitor-aggregation: medium
monitor-aggregation-flags: all
monitor-aggregation-interval: 5s
node-port-bind-protection: "true"
nodes-gc-interval: 5m0s
operator-api-serve-addr: 127.0.0.1:9234
preallocate-bpf-maps: "false"
procfs: /host/proc
remove-cilium-node-taints: "true"
set-cilium-is-up-condition: "true"
sidecar-istio-proxy-image: cilium/istio_proxy
synchronize-k8s-nodes: "true"
tofqdns-dns-reject-response-code: refused
tofqdns-enable-dns-compression: "true"
tofqdns-endpoint-max-ip-per-hostname: "50"
tofqdns-idle-connection-grace-period: 0s
tofqdns-max-deferred-connection-deletes: "10000"
tofqdns-min-ttl: "3600"
tofqdns-proxy-response-max-delay: 100ms
tunnel: disabled
unmanaged-pod-watcher-interval: "15"
vtep-cidr: ""
vtep-endpoint: ""
vtep-mac: ""
vtep-mask: ""

pippolo84 commented 1 year ago

Hi @ycydk , thank you for the additional info.

I'll leave here the differences between the two configs you reported:

config-diff

pchaigno commented 1 year ago

As Fabio highlighted (no pun intended), there are several differences between the two setups. Could you try with just enable-bpf-masquerade enabled and see if that makes a difference?

It would also be useful to have a packet trace of the failing request in the faulty config with cilium monitor or Hubble observe. To collect a full trace, you probably want to disable monitor aggregation first (monitor-aggregation=false).
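
For anyone reproducing this, collecting such a trace could look roughly like the sketch below (pod names are placeholders; depending on the version, the ConfigMap value that disables aggregation may be none rather than false):

# Disable monitor aggregation so every trace event is emitted, then restart
# the agents to pick up the change
kubectl -n kube-system patch configmap cilium-config \
  --type merge -p '{"data":{"monitor-aggregation":"none"}}'
kubectl -n kube-system rollout restart ds/cilium

# Follow datapath trace and drop events on the node hosting the client pod
kubectl -n kube-system exec <cilium-pod> -- cilium monitor --type trace
kubectl -n kube-system exec <cilium-pod> -- cilium monitor --type drop

# Or, with Hubble, filtered to the client pod
hubble observe --follow --pod default/<client-pod>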

ycydk commented 1 year ago

Hi, I did some tests. With just enable-bpf-masquerade enabled, it doesn't work. With both enable-bpf-masquerade and enable-ipv4-masquerade disabled, it works for the service mesh. And if I delete the eBPF program at sec("to-netdev"), it works on that node.
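
For reference, to-netdev is part of the bpf_host program that Cilium attaches at tc egress of the native device, so it can be inspected (and, for experiments only, detached) roughly like this, assuming eth0 is the native device:

# List the Cilium BPF program attached at egress of the native device
tc filter show dev eth0 egress

# bpftool also shows per-interface tc attachments
bpftool net show dev eth0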

pchaigno commented 1 year ago

So it fails with BPF masquerading and works without it.

Is the service you are trying to reach outside the cluster? Should packets to this service be masqueraded? And are they actually masqueraded (you can use tcpdump on the native device to confirm)?
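
Whether masquerading happened can be read off the source IP on the wire: if it is still the pod IP the packet was not SNATed, if it is the node IP it was. A possible check, with placeholder names:

# Expected here: the source stays the pod IP, since the backend pod is inside
# ipv4-native-routing-cidr and should not be masqueraded
tcpdump -ni eth0 dst host <backend-pod-ip> and tcp port 8080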

ycydk commented 1 year ago

The service is inside the cluster (a ClusterIP service). Packets to it should not be masqueraded. But if masquerading is disabled entirely, packets going outside the cluster get dropped. I dumped the traffic on the native device and confirmed that packets to the service are not masqueraded.

pchaigno commented 1 year ago

I'm confused. Is it failing with or without BPF masquerading?

ycydk commented 1 year ago

It fails when accessing a service-mesh service inside the cluster from a pod (not host network) with the first configuration I mentioned (enable-ipv4-masquerade=true). With enable-bpf-masquerade=true it fails as well. But it works with the second configuration.

failing config

agent-not-ready-taint-key: node.cilium.io/agent-not-ready
arping-refresh-period: 30s
auto-direct-node-routes: "true"
bpf-lb-external-clusterip: "false"
bpf-lb-map-max: "65536"
bpf-map-dynamic-size-ratio: "0.0025"
bpf-policy-map-max: "16384"
bpf-root: /sys/fs/bpf
cgroup-root: /run/cilium/cgroupv2
cilium-endpoint-gc-interval: 5m0s
cluster-id: "0"
cluster-name: default
custom-cni-conf: "false"
debug: "false"
disable-cnp-status-updates: "true"
disable-endpoint-crd: "false"
enable-auto-protect-node-port-range: "true"
enable-bandwidth-manager: "true"
enable-bgp-control-plane: "false"
enable-bpf-clock-probe: "true"
enable-bpf-masquerade: "true"
enable-bpf-tproxy: "true"
enable-endpoint-health-checking: "true"
enable-endpoint-routes: "false"
enable-envoy-config: "true"
enable-health-check-nodeport: "true"
enable-health-checking: "true"
enable-host-legacy-routing: "false"
enable-hubble: "true"
enable-ingress-controller: "true"
enable-ingress-secrets-sync: "true"
enable-ipv4: "true"
enable-ipv4-masquerade: "true"
enable-ipv6: "false"
enable-ipv6-masquerade: "true"
enable-k8s-terminating-endpoint: "true"
enable-l2-neigh-discovery: "true"
enable-l7-proxy: "true"
enable-local-node-route: "true"
enable-local-redirect-policy: "false"
enable-policy: default
enable-remote-node-identity: "true"
enable-svc-source-range-check: "true"
enable-vtep: "false"
enable-well-known-identities: "false"
enable-xt-socket-fallback: "true"
enforce-ingress-https: "true"
hubble-disable-tls: "false"
hubble-listen-address: :4244
hubble-socket-path: /var/run/cilium/hubble.sock
hubble-tls-cert-file: /var/lib/cilium/tls/hubble/server.crt
hubble-tls-client-ca-files: /var/lib/cilium/tls/hubble/client-ca.crt
hubble-tls-key-file: /var/lib/cilium/tls/hubble/server.key
identity-allocation-mode: crd
ingress-lb-annotation-prefixes: service.beta.kubernetes.io service.kubernetes.io cloud.google.com
ingress-secrets-namespace: cilium-secrets
install-iptables-rules: "true"
install-no-conntrack-iptables-rules: "false"
ipam: kubernetes
ipv4-native-routing-cidr: 172.20.0.0/16
kube-proxy-replacement: strict
kube-proxy-replacement-healthz-bind-address: ""
monitor-aggregation: medium
monitor-aggregation-flags: all
monitor-aggregation-interval: 5s
node-port-bind-protection: "true"
nodes-gc-interval: 5m0s
operator-api-serve-addr: 127.0.0.1:9234
preallocate-bpf-maps: "false"
procfs: /host/proc
remove-cilium-node-taints: "true"
set-cilium-is-up-condition: "true"
sidecar-istio-proxy-image: cilium/istio_proxy
synchronize-k8s-nodes: "true"
tofqdns-dns-reject-response-code: refused
tofqdns-enable-dns-compression: "true"
tofqdns-endpoint-max-ip-per-hostname: "50"
tofqdns-idle-connection-grace-period: 0s
tofqdns-max-deferred-connection-deletes: "10000"
tofqdns-min-ttl: "3600"
tofqdns-proxy-response-max-delay: 100ms
tunnel: disabled
unmanaged-pod-watcher-interval: "15"
vtep-cidr: ""
vtep-endpoint: ""
vtep-mac: ""
vtep-mask: ""

working config:

agent-not-ready-taint-key: node.cilium.io/agent-not-ready
arping-refresh-period: 30s
auto-direct-node-routes: "true"
bpf-lb-external-clusterip: "false"
bpf-lb-map-max: "65536"
bpf-lb-sock: "true"
bpf-lb-sock-hostns-only: "true"
bpf-map-dynamic-size-ratio: "0.0025"
bpf-policy-map-max: "16384"
bpf-root: /sys/fs/bpf
cgroup-root: /run/cilium/cgroupv2
cilium-endpoint-gc-interval: 5m0s
cluster-id: "0"
cluster-name: default
custom-cni-conf: "false"
debug: "false"
disable-cnp-status-updates: "true"
disable-endpoint-crd: "false"
enable-auto-protect-node-port-range: "true"
enable-bandwidth-manager: "true"
enable-bgp-control-plane: "false"
enable-bpf-clock-probe: "true"
enable-bpf-masquerade: "true"
enable-bpf-tproxy: "true"
enable-endpoint-health-checking: "true"
enable-endpoint-routes: "false"
enable-envoy-config: "true"
enable-health-check-nodeport: "true"
enable-health-checking: "true"
enable-host-firewall: "false"
enable-host-legacy-routing: "false"
enable-host-port: "false"
enable-hubble: "true"
enable-ingress-controller: "true"
enable-ingress-secrets-sync: "true"
enable-ipv4: "true"
enable-ipv4-masquerade: "true"
enable-ipv6: "false"
enable-ipv6-masquerade: "true"
enable-k8s-terminating-endpoint: "true"
enable-l2-neigh-discovery: "true"
enable-l7-proxy: "true"
enable-local-node-route: "true"
enable-local-redirect-policy: "false"
enable-node-port: "false"
enable-policy: default
enable-remote-node-identity: "true"
enable-svc-source-range-check: "true"
enable-vtep: "false"
enable-well-known-identities: "false"
enable-xt-socket-fallback: "true"
enforce-ingress-https: "true"
hubble-disable-tls: "false"
hubble-listen-address: :4244
hubble-socket-path: /var/run/cilium/hubble.sock
hubble-tls-cert-file: /var/lib/cilium/tls/hubble/server.crt
hubble-tls-client-ca-files: /var/lib/cilium/tls/hubble/client-ca.crt
hubble-tls-key-file: /var/lib/cilium/tls/hubble/server.key
identity-allocation-mode: crd
ingress-lb-annotation-prefixes: service.beta.kubernetes.io service.kubernetes.io cloud.google.com
ingress-secrets-namespace: cilium-secrets
install-iptables-rules: "true"
install-no-conntrack-iptables-rules: "false"
ipam: kubernetes
ipv4-native-routing-cidr: 172.20.0.0/16
kube-proxy-replacement: strict
kube-proxy-replacement-healthz-bind-address: ""
monitor-aggregation: medium
monitor-aggregation-flags: all
monitor-aggregation-interval: 5s
node-port-bind-protection: "true"
nodes-gc-interval: 5m0s
operator-api-serve-addr: 127.0.0.1:9234
preallocate-bpf-maps: "false"
procfs: /host/proc
remove-cilium-node-taints: "true"
set-cilium-is-up-condition: "true"
sidecar-istio-proxy-image: cilium/istio_proxy
synchronize-k8s-nodes: "true"
tofqdns-dns-reject-response-code: refused
tofqdns-enable-dns-compression: "true"
tofqdns-endpoint-max-ip-per-hostname: "50"
tofqdns-idle-connection-grace-period: 0s
tofqdns-max-deferred-connection-deletes: "10000"
tofqdns-min-ttl: "3600"
tofqdns-proxy-response-max-delay: 100ms
tunnel: disabled
unmanaged-pod-watcher-interval: "15"
vtep-cidr: ""
vtep-endpoint: ""
vtep-mac: ""
vtep-mask: ""

pchaigno commented 1 year ago

Diff between those two configurations:

5a6,7
> bpf-lb-sock: "true"
> bpf-lb-sock-hostns-only: "true"
27a30
> enable-host-firewall: "false"
28a32
> enable-host-port: "false"
40a45
> enable-node-port: "false"

You've set enable-host-firewall to its default value, false, so that doesn't make any difference. I believe bpf-lb-sock, enable-host-port, and enable-node-port will be forced to true anyway because you have kube-proxy-replacement: strict (you can confirm it with the agent logs).

So the only remaining difference is bpf-lb-sock-hostns-only: true. Could you try to change only that flag to confirm it is the culprit?
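
Changing only that flag and checking what the agent actually enabled could look like this (a sketch; the ConfigMap keys match the entries shown above):

# Flip only the socket-LB host-namespace restriction and restart the agents
kubectl -n kube-system patch configmap cilium-config \
  --type merge -p '{"data":{"bpf-lb-sock-hostns-only":"false"}}'
kubectl -n kube-system rollout restart ds/cilium

# Verify which kube-proxy-replacement features were forced on
kubectl -n kube-system exec ds/cilium -- cilium status --verbose | grep -A12 KubeProxyReplacement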

ycydk commented 1 year ago

It works if bpf-lb-sock-hostns-only is set to false.

config:

agent-not-ready-taint-key: node.cilium.io/agent-not-ready
arping-refresh-period: 30s
auto-direct-node-routes: "true"
bpf-lb-external-clusterip: "false"
bpf-lb-map-max: "65536"
bpf-lb-sock: "true"
bpf-lb-sock-hostns-only: "false"
bpf-map-dynamic-size-ratio: "0.0025"
bpf-policy-map-max: "16384"
bpf-root: /sys/fs/bpf
cgroup-root: /run/cilium/cgroupv2
cilium-endpoint-gc-interval: 5m0s
cluster-id: "0"
cluster-name: default
custom-cni-conf: "false"
debug: "false"
disable-cnp-status-updates: "true"
disable-endpoint-crd: "false"
enable-auto-protect-node-port-range: "true"
enable-bandwidth-manager: "true"
enable-bgp-control-plane: "false"
enable-bpf-clock-probe: "true"
enable-bpf-masquerade: "true"
enable-bpf-tproxy: "true"
enable-endpoint-health-checking: "true"
enable-endpoint-routes: "false"
enable-envoy-config: "true"
enable-health-check-nodeport: "true"
enable-health-checking: "true"
enable-host-firewall: "false"
enable-host-legacy-routing: "false"
enable-host-port: "false"
enable-hubble: "true"
enable-ingress-controller: "true"
enable-ingress-secrets-sync: "true"
enable-ipv4: "true"
enable-ipv4-masquerade: "true"
enable-ipv6: "false"
enable-ipv6-masquerade: "true"
enable-k8s-terminating-endpoint: "true"
enable-l2-neigh-discovery: "true"
enable-l7-proxy: "true"
enable-local-node-route: "true"
enable-local-redirect-policy: "false"
enable-node-port: "false"
enable-policy: default
enable-remote-node-identity: "true"
enable-svc-source-range-check: "true"
enable-vtep: "false"
enable-well-known-identities: "false"
enable-xt-socket-fallback: "true"
enforce-ingress-https: "true"
hubble-disable-tls: "false"
hubble-listen-address: :4244
hubble-socket-path: /var/run/cilium/hubble.sock
hubble-tls-cert-file: /var/lib/cilium/tls/hubble/server.crt
hubble-tls-client-ca-files: /var/lib/cilium/tls/hubble/client-ca.crt
hubble-tls-key-file: /var/lib/cilium/tls/hubble/server.key
identity-allocation-mode: crd
ingress-lb-annotation-prefixes: service.beta.kubernetes.io service.kubernetes.io cloud.google.com
ingress-secrets-namespace: cilium-secrets
install-iptables-rules: "true"
install-no-conntrack-iptables-rules: "false"
ipam: kubernetes
ipv4-native-routing-cidr: 172.20.0.0/16
kube-proxy-replacement: strict
kube-proxy-replacement-healthz-bind-address: ""
monitor-aggregation: medium
monitor-aggregation-flags: all
monitor-aggregation-interval: 5s
node-port-bind-protection: "true"
nodes-gc-interval: 5m0s
operator-api-serve-addr: 127.0.0.1:9234
preallocate-bpf-maps: "false"
procfs: /host/proc
remove-cilium-node-taints: "true"
set-cilium-is-up-condition: "true"
sidecar-istio-proxy-image: cilium/istio_proxy
synchronize-k8s-nodes: "true"
tofqdns-dns-reject-response-code: refused
tofqdns-enable-dns-compression: "true"
tofqdns-endpoint-max-ip-per-hostname: "50"
tofqdns-idle-connection-grace-period: 0s
tofqdns-max-deferred-connection-deletes: "10000"
tofqdns-min-ttl: "3600"
tofqdns-proxy-response-max-delay: 100ms
tunnel: disabled
unmanaged-pod-watcher-interval: "15"
vtep-cidr: ""
vtep-endpoint: ""
vtep-mac: ""
vtep-mask: ""

github-actions[bot] commented 1 year ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] commented 1 year ago

This issue has not seen any activity since it was marked stale. Closing.