Twingate / helm-charts

Official Twingate Helm Charts

feat: Add support for readiness/liveness probes #46

Open ekampf opened 1 month ago

ekampf commented 1 month ago

What this PR does / why we need it:

Resolves #42

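For context, the kind of probe configuration being added would look roughly like the following; the key names, defaults, and probe command are illustrative assumptions, not necessarily what this chart ends up shipping:

```yaml
# Hypothetical values.yaml excerpt -- key names, timings, and the probe
# command are assumptions for illustration, not taken from this chart.
connector:
  livenessProbe:
    exec:
      command: ["/connectorctl", "health"]  # health command discussed below
    initialDelaySeconds: 10
    periodSeconds: 30
    failureThreshold: 3
  readinessProbe:
    exec:
      command: ["/connectorctl", "health"]
    initialDelaySeconds: 5
    periodSeconds: 10
```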

linear[bot] commented 1 month ago

OSS-12 Configure Readiness and Liveness Probes

loganbest commented 2 weeks ago

I can confirm the statement from the related issue:

> Interestingly, this doesn't seem to solve my exact issue, as my pod which is showing as Controller Could not connect in the Twingate Console returns OK in the health check:

This also happens to me. When there's a problem with external-dns in k8s for whatever reason (currently self-inflicted, but not important), the pod spits out the same logs as in issue #42. When I run the /connectorctl health command it still says it's OK, but Twingate still says it's offline, until I kill the pod and let it reschedule.
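(To reproduce that check, something like this works; the namespace and deployment name below are placeholders:)

```sh
# Run the connector's health command inside the running pod.
# Namespace and deployment name are placeholders.
kubectl exec -n twingate deploy/twingate-connector -- /connectorctl health
```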

Readiness/liveness probes are essential, but there's a bigger root problem to fix: why is the health check returning OK when it's clearly not OK?

ekampf commented 1 hour ago

@loganbest the way the health check is implemented today, it just checks that the connector process is up and functioning, not whether the connector is connected. The reason is that if the connector is not connected, it's either deployed with invalid tokens or there's some external reason preventing it from connecting (a network policy forbidding egress, a firewall, ...).

Failing liveness and restarting the pod won't help fix any of these issues; we'll just have a pod that keeps restarting (and potentially, in the invalid-token case, spamming Twingate with invalid connection requests).

loganbest commented 37 minutes ago

> @loganbest the way the health check is implemented today, it just checks that the connector process is up and functioning, not whether the connector is connected.
>
> The reason is that if the connector is not connected, it's either deployed with invalid tokens or there's some external reason preventing it from connecting (a network policy forbidding egress, a firewall, ...).
>
> Failing liveness and restarting the pod won't help fix any of these issues; we'll just have a pod that keeps restarting (and potentially, in the invalid-token case, spamming Twingate with invalid connection requests).

OK, sure, but if the connector doesn't actually reconnect when there's a network/DNS failure, that seems like a pretty big bug that needs to be fixed. Otherwise, what's the point of the health check if the connector isn't actually healthy? Healthy meaning up, functioning, and connected. The health check should be able to ignore token issues as a health problem and treat being unable to connect to Twingate as a real one (see the sketch at the end of this comment). Sure, restarting the pod may not help here, but in the absence of the connector retrying its connection to the Twingate network, it's better than silently dying and staying that way. At least you'll get metrics on excessive pod restarts from k8s events and can see why from the logs.

The last thing anyone wants is an unexpected network/DNS blip (which happens) and the VPN not even being able to reconnect on its own because it just gives up and doesn't restart.
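For illustration, here's a minimal sketch of the semantics proposed above. This is purely hypothetical Go, not the connector's actual code; the state fields are invented to mirror the three cases discussed in this thread:

```go
// Sketch of the proposed health semantics -- illustrative only.
package main

import "fmt"

// connectorState captures the three signals distinguished above.
type connectorState struct {
	processUp    bool // the connector process is running
	connected    bool // the connector has a live session with Twingate
	authRejected bool // tokens were rejected: a restart cannot fix this
}

// healthy implements "up, functioning, and connected", while treating
// token problems as configuration errors rather than liveness failures.
func (s connectorState) healthy() bool {
	if !s.processUp {
		return false
	}
	if s.authRejected {
		// Restart loops can't fix bad tokens; stay "healthy" so the pod
		// isn't killed, and surface the error through logs instead.
		return true
	}
	// A network/DNS blip the connector never recovers from now fails
	// liveness, so Kubernetes restarts the pod instead of leaving it
	// silently offline.
	return s.connected
}

func main() {
	stuck := connectorState{processUp: true, connected: false}
	fmt.Println(stuck.healthy()) // false: restart rather than stay offline
}
```

The design choice this encodes is the one argued for above: a restart is pointless for bad credentials, but for a dropped connection it at least produces visible restart events and a chance to recover.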