cilium / cilium

eBPF-based Networking, Security, and Observability
https://cilium.io
Apache License 2.0

Ingress controller load balancer can not connect to nodes #32556

Closed: carlosrejano closed this issue 4 weeks ago

carlosrejano commented 4 months ago

Is there an existing issue for this?

What happened?

We have an EKS cluster where we are trying to use the Cilium ingress controller, and the load balancer created for the Ingress cannot always connect to the nodes.

What we see is that the load balancer can reach some nodes for periods of time, but the behavior is not consistent, and there is no pattern between the backend nodes it can reach and the ones it cannot.

Connecting to the NodePort opened for the load balancer directly from the nodes does not work either, so it should not be a security group problem; we tried opening traffic from every internal address anyway, with no change. Some nodes work and others do not, and sometimes no nodes at all are reachable from the load balancer.

I checked, and all the nodes have this Cilium LB configuration for the NodePort:

10.218.248.217:31799   0.0.0.0:0 (331) (0) [NodePort, l7-load-balancer]
0.0.0.0:31799          0.0.0.0:0 (333) (0) [NodePort, non-routable, l7-load-balancer]
10.0.243.9:31799       0.0.0.0:0 (330) (0) [NodePort, l7-load-balancer]
169.254.0.11:31799     0.0.0.0:0 (332) (0) [NodePort, l7-load-balancer]
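
For reference, these are roughly the checks we run from the cluster itself (31799 is the NodePort from the listing above; the cilium DaemonSet and namespace names below are the chart defaults):

    # Confirm the per-node BPF LB entries for the ingress NodePort
    # (the in-pod CLI may be named cilium-dbg depending on the version)
    kubectl -n kube-system exec ds/cilium -- cilium bpf lb list | grep 31799

    # From a node itself, try the NodePort directly, bypassing the AWS LB
    curl -sv --connect-timeout 5 http://10.218.248.217:31799/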

Configuration values used:

cni:
  configMap: cni-config
  customConf: true
eni:
  enabled: true
  updateEC2AdapterLimitViaAPI: true
  awsEnablePrefixDelegation: true
  awsReleaseExcessIPs: true
egressMasqueradeInterfaces: eth0
policyEnforcementMode: "never"
ipam:
  mode: eni
hubble:
  relay:
    enabled: true
  ui:
    enabled: true
tunnelProtocol: ""
nodePort:
  enabled: true
nodeinit:
  enabled: true
ingressController:
  enabled: true
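
These values are applied with a plain Helm upgrade, roughly as follows (the release name, namespace, and values file name are just what we happen to use):

    # Apply the values above to the cilium release
    helm upgrade --install cilium cilium/cilium \
      --namespace kube-system \
      -f values.yaml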

cni-config configmap values:

    {
      "cniVersion":"0.3.1",
      "name":"cilium",
      "plugins": [
        {
          "cniVersion":"0.3.1",
          "type":"cilium-cni",
          "eni": {
            "subnet-ids": ["subnet-xxxxxx", "subnet-xxxxxx", "subnet-xxxxxxx"],
            "first-interface-index": 1
          }
        }
      ]
    }
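
To double-check what actually lands on the nodes, we read the rendered conflist back from an agent pod; this assumes the chart default of writing it to /host/etc/cni/net.d/05-cilium.conflist:

    # Inspect the chained CNI configuration written by the agent
    kubectl -n kube-system exec ds/cilium -- \
      cat /host/etc/cni/net.d/05-cilium.conflist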

Cilium Version

We tried it in multiple versions:

Kernel Version

Linux 5.10.215-203.850.amzn2.aarch64

Kubernetes Version

v1.26.15

Regression

No response

Sysdump

Relevant log output

No response

Anything else?

No response

Cilium Users Document

Code of Conduct

squeed commented 4 months ago

Hi there, thanks for the bug report. It's not yet clear to me how exactly traffic is flowing. Could you outline the expected traffic flow, and indicate where you think it is failing?

In particular, I suggest the section on troubleshooting with Hubble to identify where packets are being dropped. Can you go through that section and clarify the problem a bit?
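
For example, something along these lines should show whether the packets from the load balancer are reaching a node and being dropped there (assuming the Hubble CLI is available; the NodePort is taken from your output above, so adjust as needed):

    # Forward the Hubble relay API locally, then watch for drops on the NodePort
    cilium hubble port-forward &
    hubble observe --verdict DROPPED --port 31799 --follow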

Thanks.

sayboras commented 4 months ago

Can you also share your cilium configmap? Thanks.

carlosrejano commented 4 months ago

> Hi there, thanks for the bug report. It's not yet clear to me how exactly traffic is flowing. Could you outline the expected traffic flow, and indicate where you think it is failing?
>
> In particular, I suggest the section on troubleshooting with Hubble to identify where packets are being dropped. Can you go through that section and clarify the problem a bit?
>
> Thanks.

@squeed Hi, sorry for the delay. Yes, let me explain it better, and correct me if I get something wrong. The idea is to use Cilium as an ingress controller: when I create an Ingress object, it creates a Classic AWS LB or an NLB (I tried both), which balances the traffic to the Cilium ingress controller. If I'm not wrong, the Cilium component that handles the traffic coming from the LB is cilium-envoy, which in my case runs inside cilium-agent. After arriving at cilium-envoy, the traffic is sent to the relevant backend of the Ingress. My problem is the communication between the load balancer and Envoy: most of the time the load balancer cannot reach Envoy.
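
On the AWS side, the way we look at this is roughly via the target health of the load balancer (the target group ARN and load balancer name below are placeholders for ours):

    # NLB case: health of the instance targets behind the ingress listener
    aws elbv2 describe-target-health --target-group-arn "$TARGET_GROUP_ARN"

    # Classic ELB case: same idea
    aws elb describe-instance-health --load-balancer-name "$LB_NAME"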

Ask any other questions you need if I still have not explained it well enough.

Thanks for taking a look into this!

carlosrejano commented 4 months ago

> Can you also share your cilium configmap? Thanks.

@sayboras Yes, here it is:

  agent-not-ready-taint-key: node.cilium.io/agent-not-ready
  arping-refresh-period: 30s
  auto-direct-node-routes: "false"
  bpf-lb-acceleration: disabled
  bpf-lb-external-clusterip: "false"
  bpf-lb-map-max: "65536"
  bpf-lb-sock: "false"
  bpf-map-dynamic-size-ratio: "0.0025"
  bpf-policy-map-max: "16384"
  bpf-root: /sys/fs/bpf
  cgroup-root: /run/cilium/cgroupv2
  cilium-endpoint-gc-interval: 5m0s
  cluster-id: "0"
  cluster-name: default
  cluster-pool-ipv4-cidr: 10.0.0.0/8
  cluster-pool-ipv4-mask-size: "24"
  cni-chaining-mode: aws-cni
  cni-exclusive: "false"
  cni-log-file: /var/run/cilium/cilium-cni.log
  custom-cni-conf: "false"
  debug: "false"
  debug-verbose: ""
  egress-gateway-reconciliation-trigger-interval: 1s
  enable-auto-protect-node-port-range: "true"
  enable-bgp-control-plane: "false"
  enable-bpf-clock-probe: "false"
  enable-endpoint-health-checking: "false"
  enable-endpoint-routes: "true"
  enable-envoy-config: "true"
  enable-external-ips: "false"
  enable-gateway-api: "true"
  enable-gateway-api-secrets-sync: "true"
  enable-health-check-loadbalancer-ip: "false"
  enable-health-check-nodeport: "true"
  enable-health-checking: "true"
  enable-host-legacy-routing: "true"
  enable-host-port: "false"
  enable-hubble: "true"
  enable-ingress-controller: "true"
  enable-ingress-proxy-protocol: "false"
  enable-ingress-secrets-sync: "true"
  enable-ipv4: "true"
  enable-ipv4-big-tcp: "false"
  enable-ipv4-masquerade: "false"
  enable-ipv6: "false"
  enable-ipv6-big-tcp: "false"
  enable-ipv6-masquerade: "true"
  enable-k8s-networkpolicy: "true"
  enable-k8s-terminating-endpoint: "true"
  enable-l2-neigh-discovery: "true"
  enable-l7-proxy: "true"
  enable-local-node-route: "false"
  enable-local-redirect-policy: "false"
  enable-masquerade-to-route-source: "false"
  enable-metrics: "true"
  enable-node-port: "true"
  enable-policy: never
  enable-remote-node-identity: "true"
  enable-sctp: "false"
  enable-svc-source-range-check: "true"
  enable-vtep: "false"
  enable-well-known-identities: "false"
  enable-xt-socket-fallback: "true"
  enforce-ingress-https: "true"
  external-envoy-proxy: "false"
  gateway-api-secrets-namespace: cilium-secrets
  hubble-disable-tls: "false"
  hubble-export-file-max-backups: "5"
  hubble-export-file-max-size-mb: "10"
  hubble-listen-address: :4244
  hubble-socket-path: /var/run/cilium/hubble.sock
  hubble-tls-cert-file: /var/lib/cilium/tls/hubble/server.crt
  hubble-tls-client-ca-files: /var/lib/cilium/tls/hubble/client-ca.crt
  hubble-tls-key-file: /var/lib/cilium/tls/hubble/server.key
  identity-allocation-mode: crd
  identity-gc-interval: 15m0s
  identity-heartbeat-timeout: 30m0s
  ingress-default-lb-mode: dedicated
  ingress-lb-annotation-prefixes: service.beta.kubernetes.io service.kubernetes.io
    cloud.google.com
  ingress-secrets-namespace: cilium-secrets
  ingress-shared-lb-service-name: cilium-ingress
  install-no-conntrack-iptables-rules: "false"
  ipam: cluster-pool
  ipam-cilium-node-update-rate: 15s
  k8s-client-burst: "10"
  k8s-client-qps: "5"
  kube-proxy-replacement: "false"
  kube-proxy-replacement-healthz-bind-address: ""
  max-connected-clusters: "255"
  mesh-auth-enabled: "true"
  mesh-auth-gc-interval: 5m0s
  mesh-auth-queue-size: "1024"
  mesh-auth-rotated-identities-queue-size: "1024"
  monitor-aggregation: medium
  monitor-aggregation-flags: all
  monitor-aggregation-interval: 5s
  node-port-bind-protection: "true"
  nodes-gc-interval: 5m0s
  operator-api-serve-addr: 127.0.0.1:9234
  operator-prometheus-serve-addr: :9963
  policy-cidr-match-mode: ""
  preallocate-bpf-maps: "false"
  procfs: /host/proc
  proxy-connect-timeout: "2"
  proxy-idle-timeout-seconds: "60"
  proxy-max-connection-duration-seconds: "0"
  proxy-max-requests-per-connection: "0"
  proxy-prometheus-port: "9964"
  proxy-xff-num-trusted-hops-egress: "0"
  proxy-xff-num-trusted-hops-ingress: "0"
  remove-cilium-node-taints: "true"
  routing-mode: native
  service-no-backend-response: reject
  set-cilium-is-up-condition: "true"
  set-cilium-node-taints: "true"
  sidecar-istio-proxy-image: cilium/istio_proxy
  skip-cnp-status-startup-clean: "false"
  synchronize-k8s-nodes: "true"
  tofqdns-dns-reject-response-code: refused
  tofqdns-enable-dns-compression: "true"
  tofqdns-endpoint-max-ip-per-hostname: "50"
  tofqdns-idle-connection-grace-period: 0s
  tofqdns-max-deferred-connection-deletes: "10000"
  tofqdns-proxy-response-max-delay: 100ms
  unmanaged-pod-watcher-interval: "15"
  vtep-cidr: ""
  vtep-endpoint: ""
  vtep-mac: ""
  vtep-mask: ""
  write-cni-conf-when-ready: /host/etc/cni/net.d/05-cilium.conflist
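
For completeness, this is roughly how the per-node datapath view of these settings (NodePort, kube-proxy replacement) can be read back from an agent pod; the DaemonSet and namespace names are the defaults:

    # Show how the agent on one node actually configured NodePort / KPR
    # (the in-pod CLI may be named cilium-dbg depending on the version)
    kubectl -n kube-system exec ds/cilium -- \
      cilium status --verbose | grep -iE -A2 'kubeproxyreplacement|nodeport'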

Thank you for taking a look into this!

github-actions[bot] commented 1 month ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

github-actions[bot] commented 4 weeks ago

This issue has not seen any activity since it was marked stale. Closing.