angelnu / pod-gateway

Container image used to set a pod gateway
Apache License 2.0

Routed Pod `gateway-init` container fails to resolve gw host #29

Open NoeSamaille opened 1 year ago

NoeSamaille commented 1 year ago


What steps did you take and what happened:

Hi there! I have tried to deploy the pod gateway and admission controller with your Helm chart, using the following values:

image:
  repository: ghcr.io/angelnu/pod-gateway
  # I am using dev version for testing - others should be using latest
  tag: v1.8.1
DNSPolicy: ClusterFirst
webhook:
  image:
    repository: ghcr.io/angelnu/gateway-admision-controller
    # Use dev version
    pullPolicy: Always
    tag: v3.9.0
  namespaceSelector:
    type: label
    label: routed-gateway
  gatewayDefault: true
  gatewayLabel: setGateway
addons:
  vpn:
    enabled: true
    type: gluetun
    gluetun:
      image:
        repository: docker.io/qmcgaw/gluetun
        tag: latest
    env:
    - name:  VPN_SERVICE_PROVIDER
      value: custom
    - name:  VPN_TYPE
      value: wireguard
    - name:  VPN_INTERFACE
      value: wg0
    - name:  FIREWALL
      value: "off"
    - name:  DOT
      value: "off"
    - name: DNS_KEEP_NAMESERVER
      value: "on"

    envFrom:
      - secretRef:
          name: wireguard-config

    livenessProbe:
      exec:
        command:
          - sh
          - -c
          - if [ $(wget -q -O- https://ipinfo.io/country) == 'NL' ]; then exit 0; else exit $?; fi
      initialDelaySeconds: 30
      periodSeconds: 60
      failureThreshold: 3

    networkPolicy:
      enabled: true

      egress:
        - to:
          - ipBlock:
              cidr: 0.0.0.0/0
          ports:
          # VPN traffic
          - port: 51820
            protocol: UDP
        - to:
          - ipBlock:
              cidr: 10.0.0.0/8

settings:
  # -- If using a VPN, interface name created by it
  VPN_INTERFACE: wg0
  # -- Prevent non VPN traffic to leave the gateway
  VPN_BLOCK_OTHER_TRAFFIC: true
  # -- If VPN_BLOCK_OTHER_TRAFFIC is true, allow VPN traffic over this port
  VPN_TRAFFIC_PORT: 51820
  # -- Traffic to these IPs will be sent through the K8S gateway
  VPN_LOCAL_CIDRS: "10.0.0.0/8 192.168.0.0/16"

# -- settings to expose ports, usually through a VPN provider.
# NOTE: if you change it you will need to manually restart the gateway POD
publicPorts:
- hostname: transmission-client.media-center
  IP: 10
  ports:
  - type: udp
    port: 51413
  - type: tcp
    port: 51413

So far so good: I've got the pod gateway and admission controller up and running in my vpn-gateway namespace, with the WireGuard VPN client working on the pod gateway. Now I'm trying to actually route a pod in my media-center routed namespace.
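
A quick way to double-check the VPN side of the gateway (the deployment and container names below are assumptions based on my values above, so adjust to your release):

# Sketch: confirm the gateway pod's egress actually goes through the VPN
# (deployment/container names are assumptions, not taken from the chart docs)
kubectl exec -n vpn-gateway deploy/vpn-gateway-pod-gateway -c gluetun -- \
  wget -q -O- https://ipinfo.io/country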

These are the logs of the gateway-init container of my transmission-client pod:

❯ kubectl logs transmission-client-7bbc685b44-xkk6h -n media-center -c gateway-init

...

++ dig +short vpn-gateway-pod-gateway.vpn-gateway.svc.cluster.local @10.43.0.10
+ GATEWAY_IP=';; connection timed out; no servers could be reached'

It looks like it's not able to resolve vpn-gateway-pod-gateway.vpn-gateway.svc.cluster.local in the init container, even though cluster-local DNS itself works fine. I tried running the same pod in a non-routed namespace, exec'd into it, and nslookup worked fine:

root@transmission-client-7bbc685b44-m8l4r:/# nslookup vpn-gateway-pod-gateway.vpn-gateway.svc.cluster.local 10.43.0.10
Server:         10.43.0.10
Address:        10.43.0.10:53

Name:   vpn-gateway-pod-gateway.vpn-gateway.svc.cluster.local
Address: 10.42.2.61

Any idea what can cause this behavior?

What did you expect to happen:

I was expecting the routed pod's gateway to be updated successfully and the pod to start up.

Anything else you would like to add:

Any help appreciated; there is probably something I'm missing here. Happy to provide more information to debug this, thanks :)

NoeSamaille commented 1 year ago

Found the issue: the service IP range of my cluster is 10.43.0.0/16 and my pod IP range is 10.42.0.0/16, but only the latter is explicitly listed in my pod's routing table:

$ ip route
default via 10.42.1.1 dev eth0 
10.42.0.0/16 via 10.42.1.1 dev eth0 
10.42.1.0/24 dev eth0 proto kernel scope link src 10.42.1.24 

This means that when the client_init.sh script deletes the existing default gateway, the pod is no longer able to reach the DNS server.
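
A quick way to see this is to ask the kernel which route would carry traffic to the cluster DNS service IP:

# Sketch: with the routing table above, 10.43.0.10 matches neither 10.42.0.0/16
# nor 10.42.1.0/24, so it can only be reached through the default route via
# 10.42.1.1 - and that default route is exactly what client_init.sh deletes.
ip route get 10.43.0.10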

To fix that, as a workaround I have manually configured my routed deployment as follows, with a gateway-preinit initContainer that adds a route for 10.43.0.0/16:

...

      initContainers:
      - command: ["/bin/sh","-c"]
        args: ["ip route add 10.43.0.0/16 via 10.42.1.1 dev eth0"]
        image: ghcr.io/angelnu/pod-gateway:v1.8.1
        imagePullPolicy: IfNotPresent
        name: gateway-preinit
        resources: {}
        securityContext:
          capabilities:
            add:
            - NET_ADMIN
            - NET_RAW
          runAsNonRoot: false
          runAsUser: 0

That way, after the default gateway is deleted, the pod is still able to reach the service IP range and therefore the K8s internal DNS server.

@angelnu I'm sure there is a way to have a clean fix by slightly updating the client_init.sh script, e.g. by adding that route (`ip route add ${K8S_DNS_IP}/16 via ${K8S_DEFAULT_GW} dev eth0`) before removing the default GW. Happy to discuss/contribute.
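
Something along these lines (just a sketch; the variable names are assumptions and the service CIDR would have to come from a chart setting rather than being hard-coded):

# Sketch of the proposed ordering in client_init.sh (variable names assumed):
# keep the service CIDR reachable through the original gateway *before* the
# default route is removed.

# the service CIDR would need to be configurable; 10.43.0.0/16 is my cluster's
K8S_SERVICE_CIDR="10.43.0.0/16"

# current default gateway, read before the routing table is rewritten
K8S_DEFAULT_GW=$(ip route | awk '/^default/ {print $3}')

# pin a route to the service range (and therefore to the cluster DNS)
ip route add "$K8S_SERVICE_CIDR" via "$K8S_DEFAULT_GW" dev eth0

# only afterwards delete the default route and point it at the pod gateway
ip route del default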

I'm not sure if it's due to my K8s topology, which is pretty standard: K3s running K8s v1.26 with the default Flannel CNI.

TheAceMan commented 3 months ago

I don't have your exact setup, but have you looked at using NOT_ROUTED_TO_GATEWAY_CIDRS? These values are put into the ip route as you outline, so it may help.
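
For example, something like this in the chart values (a sketch only; the CIDR is the service range from the comments above, and if I remember the chart layout right the setting lives under settings):

settings:
  # -- Traffic to these CIDRs keeps using the original default gateway instead of
  #    the pod gateway, so the service range (and the cluster DNS) stays reachable
  NOT_ROUTED_TO_GATEWAY_CIDRS: "10.43.0.0/16"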