costela / hcloud-ip-floater

k8s controller for Hetzner Cloud floating IPs
GNU General Public License v3.0

example on a service/ip #10

Closed ghorofamike closed 4 years ago

ghorofamike commented 4 years ago

Care to provide a concrete example of how the service and floating IP should be labelled?

costela commented 4 years ago

hey @ghorofamike

You don't need any labels: any Service with type: LoadBalancer that gets an IP which hcloud-ip-floater can find among your hcloud floating IPs should work.

Note that - as described in the README - hcloud-ip-floater doesn't deal with IP assignment (only attachment). For that you need something like metallb.
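For example, a plain Service like this is enough - the names here are illustrative, and the actual IP pool configuration lives in metallb, not in this manifest:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app          # illustrative name; no special labels required
  namespace: default
spec:
  type: LoadBalancer    # the only thing hcloud-ip-floater looks for
  selector:
    app: my-app
  ports:
    - port: 80
      targetPort: 8080
```

Once metallb assigns this service an IP that also exists as a floating IP in your hcloud project, hcloud-ip-floater takes care of attaching it to a node.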

ghorofamike commented 4 years ago

I was able to get it to run, and yes, I'm aware of metallb's necessity; I've used metallb with BGP at a different provider before. Now I've hit a different issue, if you have a moment: sometimes the floating IP does not get bound to a node until I kill this service's container. Any idea where I should look?

costela commented 4 years ago

:thinking: this could be a bug. We may be missing some k8s events about the service or its pods. Care to share some logs of the behavior? It would be nice if you could include either the target service's or hcloud-ip-floater's startup in the output. That way we can narrow the problem down a bit.

ghorofamike commented 4 years ago

I was looking around and saw a config parameter, sync-interval; it defaults to 6 min. Maybe that's the reason? I'm not sure I waited the entire 6 min, I may have waited less. Confirm for me that it should have reacted to the event rather than waiting for the next sync, and then we can start looking around the logs.

costela commented 4 years ago

On the k8s side, the 5m sync interval is only a fallback: it exists to cover missed events, which should only happen during network hiccups, etc. (under normal conditions, the informer receives the events as they happen). On the hcloud side there's no "event" interface, so polling is the only option. But as I understand your problem, it's not related to new IPs being added to hcloud, right?
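To make the mechanics concrete, the k8s side follows the standard informer pattern - roughly like this simplified client-go sketch, which is not the project's actual code:

```go
package sketch

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// watchServices shows the informer pattern: handlers fire as service events
// arrive; the resync period only re-delivers cached state as a fallback in
// case an event was missed (e.g. during a network hiccup).
func watchServices(clientset *kubernetes.Clientset) {
	factory := informers.NewSharedInformerFactory(clientset, 5*time.Minute) // fallback resync
	informer := factory.Core().V1().Services().Informer()
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			svc := obj.(*corev1.Service)
			if svc.Spec.Type != corev1.ServiceTypeLoadBalancer {
				return // skip non-LoadBalancer services
			}
			fmt.Printf("new service %s/%s\n", svc.Namespace, svc.Name)
		},
		UpdateFunc: func(oldObj, newObj interface{}) {
			svc := newObj.(*corev1.Service)
			fmt.Printf("service update %s/%s\n", svc.Namespace, svc.Name)
		},
	})
	stop := make(chan struct{})
	factory.Start(stop) // informers run in background goroutines
	cache.WaitForCacheSync(stop, informer.HasSynced)
	<-stop // block; a real controller would wait on a shutdown signal here
}
```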

ghorofamike commented 4 years ago

No, it's not new addresses; rather, no assignment of an IP to a node happens after metallb assigns one to a service. Here is a sample log from metallb:

{"caller":"main.go:75","event":"noChange","msg":"service converged, no change","service":"default/internalsmtp","ts":"2020-06-15T07:56:34.081442387Z"}
{"caller":"main.go:76","event":"endUpdate","msg":"end of service update","service":"default/internalsmtp","ts":"2020-06-15T07:56:34.081489776Z"}
{"caller":"main.go:49","event":"startUpdate","msg":"start of service update","service":"default/wildduck","ts":"2020-06-15T07:56:34.107356371Z"}
{"caller":"service.go:114","event":"ipAllocated","ip":"116.202.182.160","msg":"IP address assigned by controller","service":"default/wildduck","ts":"2020-06-15T07:56:34.107486173Z"}
{"caller":"main.go:96","event":"serviceUpdated","msg":"updated service object","service":"default/wildduck","ts":"2020-06-15T07:56:34.123257367Z"}
{"caller":"main.go:98","event":"endUpdate","msg":"end of service update","service":"default/wildduck","ts":"2020-06-15T07:56:34.123309676Z"}
{"caller":"main.go:49","event":"startUpdate","msg":"start of service update","service":"default/mongo","ts":"2020-06-15T07:56:34.126760015Z"}
{"caller":"service.go:33","event":"clearAssignment","msg":"not a LoadBalancer","reason":"notLoadBalancer","service":"default/mongo","ts":"2020-06-15T07:56:34.126818653Z"}
{"caller":"main.go:75","event":"noChange","msg":"service converged, no change","service":"default/mongo","ts":"2020-06-15T07:56:34.127040167Z"}
{"caller":"main.go:76","event":"endUpdate","msg":"end of service update","service":"default/mongo","ts":"2020-06-15T07:56:34.127070494Z"}
{"caller":"main.go:49","event":"startUpdate","msg":"start of service update","service":"default/wildduck","ts":"2020-06-15T07:56:34.127094279Z"}
{"caller":"main.go:75","event":"noChange","msg":"service converged, no change","service":"default/wildduck","ts":"2020-06-15T07:56:34.127182112Z"}
{"caller":"main.go:76","event":"endUpdate","msg":"end of service update","service":"default/wildduck","ts":"2020-06-15T07:56:34.12720232Z"}

metallb assigned an IP, but there was no reaction from this controller, and zero logs even at HCLOUD_IP_FLOATER_LOG_LEVEL: debug, so I'm unable to provide any output from it.

Here is the kubectl describe output for the service:

...
NodePort:                 pop  32645/TCP
Endpoints:                172.20.0.17:587
Session Affinity:         None
External Traffic Policy:  Local
HealthCheck NodePort:     30534
Events:
  Type    Reason        Age    From                Message
  ----    ------        ----   ----                -------
  Normal  IPAllocated   6m42s  metallb-controller  Assigned IP "XXX.XXX.XXX.XXX"
  Normal  nodeAssigned  4m22s  metallb-speaker     announcing from node "XXXXX-86-fra1d64554xmzy"

And curiously, even after 6 min, I'm not seeing any assignment of the floating IP to a node. That's not all, though: I have another IP that was already assigned successfully earlier, so only this second IP is having issues, and that happens maybe 15-30 minutes after the first IP worked flawlessly.

ghorofamike commented 4 years ago

So the situation goes as follows: I have 2 IPs sitting in hcloud. I assign the first when installing this controller and metallb, and 30-50 min later I assign the other IP, and it remains stuck.

eplightning commented 4 years ago

I think I managed to reproduce that. It seems to get stuck inside the Reconcile loop, most likely inside https://github.com/costela/hcloud-ip-floater/blob/master/internal/fipcontroller/fipcontroller.go#L211

I have no idea why it gets stuck there though; maybe some network issues on Hetzner's side? I'll try to debug it some time later. Either way, it would probably be a good idea to add some timeouts there and replace context.Background().
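Something along these lines, as an untested sketch - the function and variable names here are made up for illustration, not the actual fipcontroller code:

```go
package fipcontroller

import (
	"context"
	"time"

	"github.com/hetznercloud/hcloud-go/hcloud"
)

// attachFIP is a hypothetical sketch of the suggested fix: bound the hcloud
// API calls with a timeout instead of using context.Background(), so a stuck
// request cannot block the reconcile loop forever.
func attachFIP(client *hcloud.Client, fip *hcloud.FloatingIP, srv *hcloud.Server) error {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Second)
	defer cancel() // release the timer even on the happy path

	action, _, err := client.FloatingIP.Assign(ctx, fip, srv)
	if err != nil {
		return err
	}
	// Waiting for the action to complete is bounded as well, since
	// WatchProgress inherits the deadline through ctx.
	_, errCh := client.Action.WatchProgress(ctx, action)
	return <-errCh
}
```

Logs from my attempt to reproduce the issue: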

time="2020-06-15T14:14:57Z" level=info msg="starting hcloud IP floater" version=v0.1.5-rc.2-11-g18dbc13
time="2020-06-15T14:14:57Z" level=info msg="new service" namespace=projectcontour service=envoy
time="2020-06-15T14:14:57Z" level=info msg="adding pod informer" namespace=projectcontour service=envoy
time="2020-06-15T14:14:57Z" level=info msg="skipping non-LoadBalancer service" namespace=cert-manager service=cert-manager
time="2020-06-15T14:14:57Z" level=info msg="new service" namespace=default service=kuard-service2
time="2020-06-15T14:14:57Z" level=info msg="adding pod informer" namespace=default service=kuard-service2
time="2020-06-15T14:14:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=metrics-server
time="2020-06-15T14:14:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=hcloud-csi-controller-metrics
time="2020-06-15T14:14:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=hcloud-csi-node-metrics
time="2020-06-15T14:14:57Z" level=info msg="skipping non-LoadBalancer service" namespace=projectcontour service=contour
time="2020-06-15T14:14:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kuard-test service=kuard
time="2020-06-15T14:14:57Z" level=info msg="skipping non-LoadBalancer service" namespace=default service=kubernetes
time="2020-06-15T14:14:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=kube-dns
time="2020-06-15T14:14:57Z" level=info msg="skipping non-LoadBalancer service" namespace=cert-manager service=cert-manager-webhook
time="2020-06-15T14:14:57Z" level=info msg="new service" namespace=default service=kuard-service
time="2020-06-15T14:14:57Z" level=info msg="adding pod informer" namespace=default service=kuard-service
time="2020-06-15T14:14:58Z" level=info msg="starting reconciliation" component=fipcontroller
time="2020-06-15T14:14:58Z" level=info msg="floating IP already attached" component=fipcontroller fip=xxxx node=kube-worker0
time="2020-06-15T14:19:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=metrics-server
time="2020-06-15T14:19:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=hcloud-csi-controller-metrics
time="2020-06-15T14:19:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=hcloud-csi-node-metrics
time="2020-06-15T14:19:57Z" level=info msg="skipping non-LoadBalancer service" namespace=projectcontour service=contour
time="2020-06-15T14:19:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=kube-dns
time="2020-06-15T14:19:57Z" level=info msg="skipping non-LoadBalancer service" namespace=cert-manager service=cert-manager-webhook
time="2020-06-15T14:19:57Z" level=info msg="service update" namespace=projectcontour service=envoy
time="2020-06-15T14:19:57Z" level=info msg="service unchanged" namespace=projectcontour service=envoy
time="2020-06-15T14:19:57Z" level=info msg="skipping non-LoadBalancer service" namespace=cert-manager service=cert-manager
time="2020-06-15T14:19:57Z" level=info msg="service update" namespace=default service=kuard-service
time="2020-06-15T14:19:57Z" level=info msg="service unchanged" namespace=default service=kuard-service
time="2020-06-15T14:19:57Z" level=info msg="skipping non-LoadBalancer service" namespace=default service=kubernetes
time="2020-06-15T14:19:57Z" level=info msg="service update" namespace=default service=kuard-service2
time="2020-06-15T14:19:57Z" level=info msg="service unchanged" namespace=default service=kuard-service2
time="2020-06-15T14:19:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kuard-test service=kuard
time="2020-06-15T14:24:57Z" level=info msg="service update" namespace=default service=kuard-service2
time="2020-06-15T14:24:57Z" level=info msg="service unchanged" namespace=default service=kuard-service2
time="2020-06-15T14:24:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kuard-test service=kuard
time="2020-06-15T14:24:57Z" level=info msg="skipping non-LoadBalancer service" namespace=default service=kubernetes
time="2020-06-15T14:24:57Z" level=info msg="skipping non-LoadBalancer service" namespace=projectcontour service=contour
time="2020-06-15T14:24:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=kube-dns
time="2020-06-15T14:24:57Z" level=info msg="skipping non-LoadBalancer service" namespace=cert-manager service=cert-manager-webhook
time="2020-06-15T14:24:57Z" level=info msg="service update" namespace=projectcontour service=envoy
time="2020-06-15T14:24:57Z" level=info msg="service unchanged" namespace=projectcontour service=envoy
time="2020-06-15T14:24:57Z" level=info msg="skipping non-LoadBalancer service" namespace=cert-manager service=cert-manager
time="2020-06-15T14:24:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=metrics-server
time="2020-06-15T14:24:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=hcloud-csi-controller-metrics
time="2020-06-15T14:24:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=hcloud-csi-node-metrics
time="2020-06-15T14:24:57Z" level=info msg="service update" namespace=default service=kuard-service
time="2020-06-15T14:24:57Z" level=info msg="service unchanged" namespace=default service=kuard-service
time="2020-06-15T14:29:57Z" level=info msg="skipping non-LoadBalancer service" namespace=cert-manager service=cert-manager-webhook
time="2020-06-15T14:29:57Z" level=info msg="service update" namespace=projectcontour service=envoy
time="2020-06-15T14:29:57Z" level=info msg="service unchanged" namespace=projectcontour service=envoy
time="2020-06-15T14:29:57Z" level=info msg="skipping non-LoadBalancer service" namespace=cert-manager service=cert-manager
time="2020-06-15T14:29:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=metrics-server
time="2020-06-15T14:29:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=hcloud-csi-controller-metrics
time="2020-06-15T14:29:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=hcloud-csi-node-metrics
time="2020-06-15T14:29:57Z" level=info msg="skipping non-LoadBalancer service" namespace=projectcontour service=contour
time="2020-06-15T14:29:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kube-system service=kube-dns
time="2020-06-15T14:29:57Z" level=info msg="service update" namespace=default service=kuard-service
time="2020-06-15T14:29:57Z" level=info msg="service unchanged" namespace=default service=kuard-service
time="2020-06-15T14:29:57Z" level=info msg="service update" namespace=default service=kuard-service2
time="2020-06-15T14:29:57Z" level=info msg="service unchanged" namespace=default service=kuard-service2
time="2020-06-15T14:29:57Z" level=info msg="skipping non-LoadBalancer service" namespace=kuard-test service=kuard
time="2020-06-15T14:29:57Z" level=info msg="skipping non-LoadBalancer service" namespace=default service=kubernetes
eplightning commented 4 years ago

Alright, I think I got it: there's a deadlock that completely stops reconciliation when triggered. After the fix it started working correctly for me.
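For anyone curious: this is the classic non-reentrant mutex pattern. Purely for illustration (this is not the actual controller code, just the general shape of the bug):

```go
package main

import "sync"

// A goroutine that holds a non-reentrant mutex and then calls a helper that
// tries to take the same mutex blocks itself forever, freezing all further
// reconciliation. Illustrative only; names do not match the real code.
type controller struct {
	mu   sync.Mutex
	fips map[string]string // floating IP -> assigned node
}

func (c *controller) reconcile() {
	c.mu.Lock()
	defer c.mu.Unlock()
	for fip := range c.fips {
		c.attach(fip) // deadlock: attach tries to lock mu again
	}
}

func (c *controller) attach(fip string) {
	c.mu.Lock() // blocks forever; sync.Mutex is not reentrant
	defer c.mu.Unlock()
	c.fips[fip] = "some-node"
}

func main() {
	c := &controller{fips: map[string]string{"192.0.2.1": ""}}
	c.reconcile() // fatal error: all goroutines are asleep - deadlock!
}
```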

I have the image with the fix on my fork's Docker Hub repository: bslawianowski/hcloud-ip-floater:latest

@ghorofamike Could you try this image and verify if it fixes the issue for you?

ghorofamike commented 4 years ago

> Alright, I think I got it: there's a deadlock that completely stops reconciliation when triggered. After the fix it started working correctly for me.
>
> I have the image with the fix on my fork's Docker Hub repository: bslawianowski/hcloud-ip-floater:latest
>
> @ghorofamike Could you try this image and verify if it fixes the issue for you?

It seems to me that the problem is solved. Is this the change in the other PR?

eplightning commented 4 years ago

Forgot to create the PR yesterday; it's #13.