cbeneke / hcloud-fip-controller

Kubernetes controller to (re-)assign floating IPs on hetzner cloud instances
Apache License 2.0

Can't get floating IP and fipcontroller going #30

Closed iohenkies closed 4 years ago

iohenkies commented 4 years ago

Hi,

I'm following along with this: https://community.hetzner.com/tutorials/install-kubernetes-cluster

Basically I have a fully functional cluster (all nodes and deployments healthy, all pods up and running), but I can't get the floating IP and fipcontroller to work. After 2 full days I think it's time to file an issue :)

I install the fipcontroller via the above link or via the slightly different instructions below (daemonset or deployment doesn't matter, same issue): https://github.com/cbeneke/hcloud-fip-controller/blob/master/README.md

The fipcontroller pods keep restarting with this in the logs:

[henkies@kube01] ~ $ kubectl -n fip-controller logs fip-controller-tlvxm
I0301 18:08:34.024801       1 leaderelection.go:235] attempting to acquire leader lease  fip-controller/fip...
I0301 18:08:34.049756       1 leaderelection.go:245] successfully acquired lease fip-controller/fip
time="2020-03-01T18:08:34Z" level=info msg="Started leading" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).onStartedLeading" file="/app/internal/app/fipcontroller/leaderelection.go:53"
time="2020-03-01T18:08:34Z" level=fatal msg="Could not run controller: could not get kubernetes node address: could not find address for node kube02.mydomain.com" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).onStartedLeading" file="/app/internal/app/fipcontroller/leaderelection.go:56"

The culprit seemingly being:

 could not find address for node kube02.mydomain.com

But I do not understand why. DNS is OK on the host and in the cluster, CoreDNS is running, and I'm out of options.
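
For reference, the addresses a node object actually reports (which is what the controller looks up) can be checked with:

kubectl get node kube02.mydomain.com -o jsonpath='{.status.addresses}'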

On a side note, once again following the guide at: https://community.hetzner.com/tutorials/install-kubernetes-cluster

Should the floating IP not be active at this point? The fipcontroller is only for moving the IP, no? I'm wondering because I've created a test service of type LoadBalancer (see below) and it does not get the Hetzner floating IP.
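
For completeness, the test service is roughly this (names are illustrative only):

apiVersion: v1
kind: Service
metadata:
  name: test-lb
spec:
  type: LoadBalancer
  selector:
    app: test
  ports:
    - port: 80
      targetPort: 80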

iohenkies commented 4 years ago

I've found this issue that seems to be the same: https://github.com/cbeneke/hcloud-fip-controller/issues/25

I also do not have an EXTERNAL-IP listed when doing a kubectl get nodes -o wide. Under INTERNAL-IP my public IP addresses are listed :(. The cloud controller manager is deployed and active.

From this thread (#25), setting the node_address_type option to internal (or external) doesn't solve the problem.

What should I do?

iohenkies commented 4 years ago

I've started completely over, reconfiguring the cluster so that INTERNAL-IP indeed shows my internal IPs (extra args for the kubelet), and while setting up with kubeadm, --apiserver-advertise-address also specifies the internal IP. Long story short: exact same error and problem.

Then I started completely over, declaring the external IP in the kubelet extra args and --apiserver-advertise-address (roughly as sketched below): same problem.
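
Roughly what I mean by those two settings (a sketch only; exact file locations and values differ per setup):

# kubelet extra args, e.g. via /etc/default/kubelet or the kubeadm systemd drop-in
KUBELET_EXTRA_ARGS=--node-ip=<internal_or_external_ip>

# kubeadm init on the first master
kubeadm init --apiserver-advertise-address=<internal_or_external_ip> ...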

Of course I would like it all to work on the private network instead of the public IPs, but at this point I would like to see it working either way.

iohenkies commented 4 years ago

This is taking an enormous amount of work. Although I'm several steps further, it still doesn't work.

A few important points (others can hopefully benefit from my painful process):

This is the only thing in the debug logs:

[henkies@kube01] ~ $ k -n fip-controller logs fip-controller-74ldv
I0304 18:04:39.026931       1 leaderelection.go:235] attempting to acquire leader lease  fip-controller/fip...
I0304 18:05:14.974486       1 leaderelection.go:245] successfully acquired lease fip-controller/fip
time="2020-03-04T18:05:14Z" level=info msg="Started leading" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).onStartedLeading" file="/app/internal/app/fipcontroller/leaderelection.go:53"
time="2020-03-04T18:05:14Z" level=debug msg="Checking floating IPs" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).UpdateFloatingIPs" file="/app/internal/app/fipcontroller/controller.go:77"
time="2020-03-04T18:05:14Z" level=debug msg="Found 3 nodes" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).nodeAddress" file="/app/internal/app/fipcontroller/kubernetes.go:32"
time="2020-03-04T18:05:14Z" level=debug msg="Found 3 addresses" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).nodeAddress" file="/app/internal/app/fipcontroller/kubernetes.go:41"
time="2020-03-04T18:05:14Z" level=debug msg="Using address type ExternalIP" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).nodeAddress" file="/app/internal/app/fipcontroller/kubernetes.go:47"
time="2020-03-04T18:05:14Z" level=debug msg="Found node address: 116.203.101.104" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).UpdateFloatingIPs" file="/app/internal/app/fipcontroller/controller.go:83"

And this is what I get for the other pod, which got moved after bringing the node down:

Error from server: Get https://172.16.0.3:10250/containerLogs/fip-controller/fip-controller-xs9ts/fip-controller: dial tcp 172.16.0.3:10250: i/o timeout

cbeneke commented 4 years ago

Hey, sorry for the long reaction time and first of all: thanks for the detailed description!

A few notes beforehand: I wrote the article when kubernetes 1.15.3 was the latest version. It is to be expected that API references etc. change. My private cluster is still running on 1.16 (I just didn't find the time to update it yet), so there might also be an incompatibility with 1.17 in the way it is programmed (I am also not certain if the hetzner cloud controller supports it yet; the changelog only mentions v1.16). I will have to check this when I find the time (which unfortunately is a VERY limited resource for me atm :/ ). Flannel should work fine on a 1.17 cluster; please check the kubernetes installation guide and validate that you are using the correct version.

Regarding the metallb config: are you installing it via helm or manually? The guide focused on helm (which uses a custom name); the default name would be config instead of metallb-config (compare https://github.com/helm/charts/blob/master/stable/metallb/templates/config.yaml).
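
For reference, a minimal manual layer2 config would look something like this (pool name and address are placeholders; when installing via helm the ConfigMap name has to match what the chart expects):

apiVersion: v1
kind: ConfigMap
metadata:
  namespace: metallb-system
  name: config
data:
  config: |
    address-pools:
    - name: default
      protocol: layer2
      addresses:
      - <floating_ip>/32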

Do the logs just stop after Found node address? The address is used to map the hcloud object to the kubernetes object. The direct next call is a hetzner client call against the hetzner API. Is your token working? (Can you use that token to fetch your servers, e.g. as below?) The error message you posted at the end comes, afaict, from the kubernetes API server. It tries to connect to the kubelet (port 10250) and times out. Can your master node reach all the nodes in the cluster correctly?
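
For example, fetching your servers directly with the same token the controller gets via HCLOUD_API_TOKEN should work:

curl -H "Authorization: Bearer $HCLOUD_API_TOKEN" https://api.hetzner.cloud/v1/servers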

iohenkies commented 4 years ago

Hi Christian, thank you for the response.

As said, I will reinstall once again with different Kubernetes versions.

iohenkies commented 4 years ago

The problems are exactly the same for me on Kubernetes 1.16.7. I don't understand, since you have it running on 1.16 if I understand correctly.

And on a side note, Flannel still doesn't work on this fresh install with the correct ports open. It starts and all, but pods cannot communicate with each other. I do not have these problems with Weave or Calico.

cbeneke commented 4 years ago

Yes, my cluster is running on v1.16.7 atm and has no problems (using flannel, hcloud-cloud-controller, metallb and multiple fip-controllers). Could you - just for a test - open your firewall completely? The flannel errors sound a lot like something in your network is set up in a way that pods cannot communicate correctly with each other, and a functional network in your cluster is definitely not optional :) Also, are you using the correct subnet ranges (compare with the installation guide I linked above)?
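
With flannel the usual assumption is also that the cluster was initialized with the matching pod network CIDR, i.e. something like

kubeadm init --pod-network-cidr=10.244.0.0/16

since 10.244.0.0/16 is what flannel's default manifest expects (if you changed it, the flannel net-conf has to match).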

iohenkies commented 4 years ago

Hi Christian. After disabling the firewall completely, flannel does work. Very strange, because all the required ports were open (all kubernetes defaults + the flannel udp ports), but figuring that out is a concern for later, because I still can't get the fip-controller to work...
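
For reference, these are roughly the firewall rules I had in place (ufw syntax; apparently something in this set was still missing or wrong):

ufw allow 6443/tcp        # kube-apiserver
ufw allow 2379:2380/tcp   # etcd
ufw allow 10250/tcp       # kubelet API
ufw allow 10251/tcp       # kube-scheduler
ufw allow 10252/tcp       # kube-controller-manager
ufw allow 30000:32767/tcp # NodePort services
ufw allow 8285/udp        # flannel (udp backend)
ufw allow 8472/udp        # flannel (vxlan backend)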

So I have the version at 1.16.7, flannel up, all pods up, same logs as before. I'm trying with a deployment atm instead of a daemonset, but the result is the same. Two pods say:

I0309 17:43:53.046790       1 leaderelection.go:235] attempting to acquire leader lease  fip-controller/fip...

And one pod says:

I0309 17:43:52.909930       1 leaderelection.go:235] attempting to acquire leader lease  fip-controller/fip...
I0309 17:43:52.931169       1 leaderelection.go:245] successfully acquired lease fip-controller/fip
time="2020-03-09T17:43:52Z" level=info msg="Started leading" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).onStartedLeading" file="/app/internal/app/fipcontroller/leaderelection.go:53"
time="2020-03-09T17:43:52Z" level=debug msg="Checking floating IPs" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).UpdateFloatingIPs" file="/app/internal/app/fipcontroller/controller.go:77"
time="2020-03-09T17:43:52Z" level=debug msg="Found 3 nodes" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).nodeAddress" file="/app/internal/app/fipcontroller/kubernetes.go:32"
time="2020-03-09T17:43:52Z" level=debug msg="Found 3 addresses" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).nodeAddress" file="/app/internal/app/fipcontroller/kubernetes.go:41"
time="2020-03-09T17:43:52Z" level=debug msg="Using address type ExternalIP" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).nodeAddress" file="/app/internal/app/fipcontroller/kubernetes.go:47"
time="2020-03-09T17:43:52Z" level=debug msg="Found node address: 116.203.101.72" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).UpdateFloatingIPs" file="/app/internal/app/fipcontroller/controller.go:83"
time="2020-03-09T17:43:53Z" level=debug msg="Fetched %!s(int=3) servers" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).server" file="/app/internal/app/fipcontroller/hcloud.go:39"
time="2020-03-09T17:43:53Z" level=debug msg="Found matching public IP on server 'kube02.domain.io'" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).server" file="/app/internal/app/fipcontroller/hcloud.go:44"
time="2020-03-09T17:43:53Z" level=debug msg="Found server: kube02.domain.io (4786060)" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).UpdateFloatingIPs" file="/app/internal/app/fipcontroller/controller.go:89"
time="2020-03-09T17:43:53Z" level=info msg="Initialization complete. Starting reconciliation" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).Run" file="/app/internal/app/fipcontroller/controller.go:61"

Repeating itself every 30 seconds.

The floating IP is configured on the 2 worker nodes, but looking in the Hetzner console it is not assigned to any node, and it can't be pinged. Only after assigning it manually can it be pinged, but it does not fail over when I bring the node down (for testing purposes).

Also, when I assign the floating IP manually to node03 instead of node02, the logs say the same (i.e. Found server: kube02.domain.io (4786060) although I manually assigned it to node03).

I really hope you can make something of it.

iohenkies commented 4 years ago

I'm not 100% positive, but I believe the Nginx ingress did not receive an EXTERNAL-IP earlier, and it does now. But failover still does not happen. A kubectl get nodes shows the node as NotReady and pods get evicted, but the floating IP remains unreachable. Although the floating IP in this test case is attached to kube02, the fip controller only talks about kube03.

time="2020-03-10T07:36:03Z" level=debug msg="Checking floating IPs" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).UpdateFloatingIPs" file="/app/internal/app/fipcontroller/controller.go:77"
time="2020-03-10T07:36:03Z" level=debug msg="Found 3 nodes" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).nodeAddress" file="/app/internal/app/fipcontroller/kubernetes.go:32"
time="2020-03-10T07:36:03Z" level=debug msg="Found 3 addresses" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).nodeAddress" file="/app/internal/app/fipcontroller/kubernetes.go:41"
time="2020-03-10T07:36:03Z" level=debug msg="Using address type ExternalIP" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).nodeAddress" file="/app/internal/app/fipcontroller/kubernetes.go:47"
time="2020-03-10T07:36:03Z" level=debug msg="Found node address: 116.203.101.104" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).UpdateFloatingIPs" file="/app/internal/app/fipcontroller/controller.go:83"
time="2020-03-10T07:36:03Z" level=debug msg="Fetched %!s(int=3) servers" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).server" file="/app/internal/app/fipcontroller/hcloud.go:39"
time="2020-03-10T07:36:03Z" level=debug msg="Found matching public IP on server 'kube03.domain.io'" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).server" file="/app/internal/app/fipcontroller/hcloud.go:44"
time="2020-03-10T07:36:03Z" level=debug msg="Found server: kube03.domain.io (4786062)" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).UpdateFloatingIPs" file="/app/internal/app/fipcontroller/controller.go:89"
time="2020-03-10T07:36:33Z" level=debug msg="Checking floating IPs" func="github.com/cbeneke/hcloud-fip-controller/internal/app/fipcontroller.(*Controller).UpdateFloatingIPs" file="/app/internal/app/fipcontroller/controller.go:77"
cbeneke commented 4 years ago

To the controller it does not matter which node the IP is currently attached to. It will - when it is the leader - detach it from wherever it is and attach it to the node it is currently running on. I guess the pod which won the leader election is running on kube03? :)

Which version of the controller are you running? Could you also please paste your (redacted) deployed config? It somehow seems the HcloudFloatingIPs field is not initialized correctly.

The ingress not showing an external IP hints that your metalLB might have been misconfigured / not working properly, since the ingress pulls the external IP from the loadbalancer service object, which gets updated by metalLB.
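
A quick way to verify is to look at the service itself (namespace and service name depend on how the ingress was installed):

kubectl -n ingress-nginx get svc
# EXTERNAL-IP should show the metalLB-assigned address, not <pending>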

iohenkies commented 4 years ago

Hi Christian,

Deployment:

apiVersion: v1
items:
- apiVersion: apps/v1
  kind: Deployment
  metadata:
    annotations:
      deployment.kubernetes.io/revision: "1"
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"annotations":{},"name":"fip-controller","namespace":"fip-controller"},"spec":{"replicas":3,"selector":{"matchLabels":{"app":"fip-controller"}},"strategy":{"rollingUpdate":{"maxSurge":1,"maxUnavailable":1},"type":"RollingUpdate"},"template":{"metadata":{"labels":{"app":"fip-controller"}},"spec":{"containers":[{"env":[{"name":"NODE_NAME","valueFrom":{"fieldRef":{"fieldPath":"spec.nodeName"}}},{"name":"POD_NAME","valueFrom":{"fieldRef":{"fieldPath":"metadata.name"}}},{"name":"NAMESPACE","valueFrom":{"fieldRef":{"fieldPath":"metadata.namespace"}}}],"envFrom":[{"secretRef":{"name":"fip-controller-secrets"}}],"image":"cbeneke/hcloud-fip-controller:v0.3.1","imagePullPolicy":"IfNotPresent","name":"fip-controller","volumeMounts":[{"mountPath":"/app/config","name":"config"}]}],"serviceAccountName":"fip-controller","volumes":[{"configMap":{"name":"fip-controller-config"},"name":"config"}]}}}}
    creationTimestamp: "2020-03-09T17:43:51Z"
    generation: 1
    name: fip-controller
    namespace: fip-controller
    resourceVersion: "84009"
    selfLink: /apis/apps/v1/namespaces/fip-controller/deployments/fip-controller
    uid: d0532a37-3fcb-4cd3-8b7e-7f4e682e0e59
  spec:
    progressDeadlineSeconds: 600
    replicas: 3
    revisionHistoryLimit: 10
    selector:
      matchLabels:
        app: fip-controller
    strategy:
      rollingUpdate:
        maxSurge: 1
        maxUnavailable: 1
      type: RollingUpdate
    template:
      metadata:
        creationTimestamp: null
        labels:
          app: fip-controller
      spec:
        containers:
        - env:
          - name: NODE_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: spec.nodeName
          - name: POD_NAME
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.name
          - name: NAMESPACE
            valueFrom:
              fieldRef:
                apiVersion: v1
                fieldPath: metadata.namespace
          envFrom:
          - secretRef:
              name: fip-controller-secrets
          image: cbeneke/hcloud-fip-controller:v0.3.1
          imagePullPolicy: IfNotPresent
          name: fip-controller
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
          volumeMounts:
          - mountPath: /app/config
            name: config
        dnsPolicy: ClusterFirst
        restartPolicy: Always
        schedulerName: default-scheduler
        securityContext: {}
        serviceAccount: fip-controller
        serviceAccountName: fip-controller
        terminationGracePeriodSeconds: 30
        volumes:
        - configMap:
            defaultMode: 420
            name: fip-controller-config
          name: config
  status:
    availableReplicas: 3
    conditions:
    - lastTransitionTime: "2020-03-09T17:43:54Z"
      lastUpdateTime: "2020-03-09T17:43:54Z"
      message: Deployment has minimum availability.
      reason: MinimumReplicasAvailable
      status: "True"
      type: Available
    - lastTransitionTime: "2020-03-09T17:43:51Z"
      lastUpdateTime: "2020-03-09T17:43:54Z"
      message: ReplicaSet "fip-controller-5c95ff6b4f" has successfully progressed.
      reason: NewReplicaSetAvailable
      status: "True"
      type: Progressing
    observedGeneration: 1
    readyReplicas: 3
    replicas: 3
    updatedReplicas: 3
kind: List
metadata:
  resourceVersion: ""
  selfLink: ""

ConfigMap fip-controller-config:

apiVersion: v1
data:
  config.json: |
    {
      "hcloudFloatingIPs": [ "MYIP" ],
      "nodeAddressType": "external",
      "log_level": "Debug"
    }
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"config.json":"{\n  \"hcloudFloatingIPs\": [ \"MYIP\" ],\n  \"nodeAddressType\": \"external\",\n  \"log_level\": \"Debug\"\n}\n"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"fip-controller-config","namespace":"fip-controller"}}
  creationTimestamp: "2020-03-09T17:43:46Z"
  name: fip-controller-config
  namespace: fip-controller
  resourceVersion: "4093"
  selfLink: /api/v1/namespaces/fip-controller/configmaps/fip-controller-config
  uid: 4339d0c6-472b-46be-98d0-8d2ee6582033

Secret fip-controller-secrets. Here I just discovered (I have redacted both) that the first HCLOUD_API_TOKEN is different from the second HCLOUD_API_TOKEN. Is this normal?

apiVersion: v1
data:
  HCLOUD_API_TOKEN: FIRST TOKEN
kind: Secret
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","kind":"Secret","metadata":{"annotations":{},"name":"fip-controller-secrets","namespace":"fip-controller"},"stringData":{"HCLOUD_API_TOKEN":"SECOND TOKEN"}}
  creationTimestamp: "2020-03-09T17:43:46Z"
  name: fip-controller-secrets
  namespace: fip-controller
  resourceVersion: "4094"
  selfLink: /api/v1/namespaces/fip-controller/secrets/fip-controller-secrets
  uid: d21e636f-2cb4-481c-a9a6-7fdaa00443bd
type: Opaque

Secret fip-controller-token-jbdzm:

apiVersion: v1
data:
  ca.crt: REDACTED
  token: REDACTED
kind: Secret
metadata:
  annotations:
    kubernetes.io/service-account.name: fip-controller
    kubernetes.io/service-account.uid: c8d18922-94bb-416c-8094-d240a6e4ac1f
  creationTimestamp: "2020-03-09T17:43:37Z"
  name: fip-controller-token-jbdzm
  namespace: fip-controller
  resourceVersion: "4079"
  selfLink: /api/v1/namespaces/fip-controller/secrets/fip-controller-token-jbdzm
  uid: beee3e1f-90d4-4342-947e-30381bcd1655
type: kubernetes.io/service-account-token

Let me know if you need anything else. Many thanks for your help.

cbeneke commented 4 years ago

Ah okay, the config format is still the one from v0.1 (compare the changelog for v0.2.0); please adapt the config accordingly:

apiVersion: v1
kind: ConfigMap
metadata:
  name: fip-controller-config
  namespace: fip-controller
data:
  config.json: |
    {
      "hcloud_floating_ips": [ "MYIP" ],
      "node_address_type": "external",
      "log_level": "Debug"
    }

Also, node_address_type external is the default, so you don't need to add it :) Can you try to run the controller with the corrected config?
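
Since the controller reads its config at startup, the pods most likely need a restart after the ConfigMap change, e.g.:

kubectl -n fip-controller rollout restart deployment fip-controller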

iohenkies commented 4 years ago

OMG, it works instantly. Thank you for your help. It's wrong in the guide but OK in the GitHub readme. I'll do some more testing with Kubernetes 1.17 and different CNIs. I can report back if you would like.

cbeneke commented 4 years ago

That's good to hear! Yeah, the guide was written against the v0.1.0 version of the controller; that's why I added the hint about it still being under development. I will try to get an update into the guide over the weekend.

If you spend the time on it anyway, I would like to hear the results, but don't spend extra time if you wouldn't do so anyway :)

iohenkies commented 4 years ago

OK, well, I'm done testing now and can continue with the real purpose of the cluster.

FWIW I can confirm that this Hetzner configuration is compatible with:

I've got one more question before signing off: what is the added benefit of the Hetzner Cloud Container Storage Interface? How can I use this?

cbeneke commented 4 years ago

Thanks for taking the time and writing down your results!

Regarding your question: the Hetzner Cloud CSI (driver) is an implementation of the kubernetes CSI which enables you to use hetzner volumes as native volumes in kubernetes (the controller takes care of commissioning, attaching to the correct node on pod start, decommissioning etc). Have a look at the csi driver docs for info on how to install it!
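
Once the driver is installed you essentially just request volumes through the storage class it ships (called hcloud-volumes if I remember correctly), for example:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
  storageClassName: hcloud-volumes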

I'm closing the issue now :)

rmja commented 4 years ago

Hi all,

just wanted to add some details about the new Hetzner Ubuntu 20.04 image. It uses netplan instead of ifupdown as the tool to configure networking.

The official docs from Hetzner state that a /etc/netplan/60-floating-ip.yaml should be created with the floating IP. That is correct, but for docker (and hence also kubernetes) to work, the list of addresses must also include the server IP address assigned by dhcp as the first entry. An example is:

cat << EOF > /etc/netplan/60-floating-ip.yaml
network:
  version: 2
  renderer: networkd
  ethernets:
    eth0:
      addresses:
        - <server_public_ip_as_assigned_by_dhcp>/32
        - <floating_ip>/32
EOF

Run netplan apply to apply the configuration.

This will make sure that the server IP address remains the primary, i.e. ifconfig eth0 shows the server IP address and not the floating IP.

The floating IP becomes the primary IP if the dhcp-assigned server IP address is not included in the list of addresses. This in turn sets the floating IP as the source for outgoing traffic from within docker, resulting in no internet access within containers. This can be seen by running tcpdump -ni eth0 icmp together with docker run --rm alpine ping -- 1.1.1.1.