canonical / oidc-gatekeeper-operator

Charmed OIDC Gatekeeper
Apache License 2.0

Implement pebble health checks #82

Open i-chvets opened 1 year ago

i-chvets commented 1 year ago

Description

Implement pebble health checks

ca-scribner commented 7 months ago

Ran into this today - workload is reporting errors but the charm reports Active/Idle

Can be reproduced by:

juju deploy oidc-gatekeeper
kubectl logs oidc-gatekeeper-0 oidc-authservice -f

# (truncated the start of the logs)
# ...
2024-01-18T21:08:29.496Z [oidc-authservice] time="2024-01-18T21:08:29Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://test-url/dex/.well-known/openid-configuration\": dial tcp: lookup test-url on 10.152.183.10:53: no such host"
syncronize-issues-to-jira[bot] commented 7 months ago

Thank you for reporting your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5219.

This message was autogenerated

orfeas-k commented 6 months ago

Regarding the health check's implementation: when oidc-authservice cannot start, it logs the error shown in the previous comment. Curling the service port (8080) from inside the pod returns a 503 response with the message "OIDC Setup is not complete yet.".

╰─$ kubectl -n kubeflow exec oidc-gatekeeper-0 -c charm -- curl localhost:8080

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    31  100    31    0     0  31000      0 --:--:-- --:--:-- --:--:-- 31000
OIDC Setup is not complete yet.%  

adding -I results in:

HTTP/1.1 503 Service Unavailable
Content-Type: text/plain
Date: Fri, 02 Feb 2024 12:56:49 GMT
Content-Length: 31

When the OIDC setup is complete and the svc is working, curling the same port results in a successful 302 response

╰─$ kf exec oidc-gatekeeper-0 -c charm -- curl localhost:8080                                                                                                                                                               
(svc ip)

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100   367  100   367    0     0   179k      0 --:--:-- --:--:-- --:--:--  179k
<a href="http://10.64.140.43.nip.io/dex/auth?client_id=authservice-oidc&amp;redirect_uri=%2Fauthservice%2Foidc%2Fcallback&amp;response_type=code&amp;scope=openid+profile+email+groups&amp;state=MTcwNjg3NzQ5MXxOd3dBTkZwS05WQldOVVZPVEZST1FrdEVSVnBHTmpkQk5EVTNORk0wVmt4RlIwUktTekpSTjBOVFYwZE1UazFaUjBwYU5sTk1XRUU9fCd1dGAKv6VgobwzUaKugEVwxTAhYm_TtuhbejMc5Aqf">Found</a>.

adding -I results in:

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0HTTP/1.1 302 Found
Content-Type: text/html; charset=utf-8
Location: http://10.64.140.43.nip.io/dex/auth?client_id=authservice-oidc&redirect_uri=%2Fauthservice%2Foidc%2Fcallback&response_type=code&scope=openid+profile+email+groups&state=MTcwNjg3ODY5MHxOd3dBTkRkWU4xZE9SREpZVjFWWVREZEtORXBNU2tKVVYwTkxWVU0yU1VWQ1FVdGFWMUJGVUZJMVRsaFlORXRZV2xwRU4wWlBSMUU9fOOsNXPYQ1GM4dr1tagMW-aWgVvOTaFK_XuJdDBl8hn-
Set-Cookie: oidc_state_csrf=MTcwNjg3ODY5MHxOd3dBTkRkWU4xZE9SREpZVjFWWVREZEtORXBNU2tKVVYwTkxWVU0yU1VWQ1FVdGFWMUJGVUZJMVRsaFlORXRZV2xwRU4wWlBSMUU9fOOsNXPYQ1GM4dr1tagMW-aWgVvOTaFK_XuJdDBl8hn-; Path=/; Expires=Fri, 24 Jul 2054 13:51:39 GMT; Max-Age=1200000000000
Date: Fri, 02 Feb 2024 12:58:10 GMT

Thus, we will base the health check's implementation on this workload behaviour and implement it as follows:

# self._health_check_name is `oidc-authservice-up`.
# self._http_port is 8080.
...
                    "on-check-failure": {self._health_check_name: "restart"},
                }
            },
            "checks": {
                self._health_check_name: {
                    "override": "replace",
                    "period": "10s",
                    "timeout": "3s",
                    "http": {"url": f"http://localhost:{self._http_port}"},
                }
            },
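
For reference, here is a minimal sketch of what the complete layer could look like with this check wired in. The service name, command, and summary are illustrative assumptions and not copied from the charm; only the check itself mirrors the snippet above:

from ops.pebble import Layer


def oidc_layer(health_check_name: str = "oidc-authservice-up", http_port: int = 8080) -> Layer:
    # Sketch only: the service name, command and summary below are assumptions.
    return Layer(
        {
            "summary": "oidc-authservice layer",
            "services": {
                "oidc-authservice": {
                    "override": "replace",
                    "command": "/home/authservice/oidc-authservice",  # placeholder command
                    "startup": "enabled",
                    # Restart the workload whenever the health check fails.
                    "on-check-failure": {health_check_name: "restart"},
                }
            },
            "checks": {
                health_check_name: {
                    "override": "replace",
                    "period": "10s",
                    "timeout": "3s",
                    # The 503 "OIDC Setup is not complete yet." response fails this check.
                    "http": {"url": f"http://localhost:{http_port}"},
                }
            },
        }
    )

With on-check-failure set to restart, Pebble itself restarts the workload once the check's failure threshold is reached; surfacing the failure in the charm's status is handled separately (see the comments below).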
orfeas-k commented 6 months ago

Charm handling

In order for the charm to reflect the failed health check, it has to query the check's status itself, since Pebble does not surface this in the charm's status automatically.

Following what admission-webhook-operator does, this check will happen on update-status events. If the check is down, the charm's status will be set to Maintenance (with "workload failed health check").
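
A minimal sketch of that handling (the class and handler names are illustrative; the oidc-authservice container and the oidc-authservice-up check come from the earlier comments):

from ops.charm import CharmBase
from ops.model import ActiveStatus, MaintenanceStatus
from ops.pebble import CheckStatus


class OidcGatekeeperCharm(CharmBase):
    def __init__(self, *args):
        super().__init__(*args)
        self.framework.observe(self.on.update_status, self._on_update_status)

    def _on_update_status(self, _event):
        container = self.unit.get_container("oidc-authservice")
        if not container.can_connect():
            return
        # get_checks() returns a mapping of check name -> CheckInfo.
        checks = container.get_checks("oidc-authservice-up")
        check = checks.get("oidc-authservice-up")
        if check is None or check.status != CheckStatus.UP:
            self.unit.status = MaintenanceStatus("Workload failed health check")
        else:
            self.unit.status = ActiveStatus()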

orfeas-k commented 6 months ago

Reiterating on that (see https://github.com/canonical/oidc-gatekeeper-operator/pull/134#pullrequestreview-1888166557): at the moment, as stated above, Pebble does not offer a mechanism to reflect the workload's health check status in the charm's status.

Pebble notices

As of Pebble 1.7 (included in Juju 3.4), Pebble offers a custom notices mechanism (discourse post). These notices allow a workload to use pebble notify and essentially wake up the charm so it can act on them (e.g. update the status, raise an error, etc.).

In order to implement that, we would need to add a script to the workload's container that would be responsible for periodically checking the status of the workload's health check and calling pebble notify when it fails.
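
For illustration, a rough sketch of such a script, assuming Python is available in the workload container; the notice key and polling interval are made up for this example, and the 503 behaviour is the one shown earlier:

#!/usr/bin/env python3
# Hypothetical watchdog for the workload container: poll the authservice port
# and record a custom Pebble notice when it answers with a 5xx (or not at all).
import http.client
import subprocess
import time

NOTICE_KEY = "canonical.com/oidc-gatekeeper/check-failed"  # illustrative key
PORT = 8080


def workload_is_healthy() -> bool:
    try:
        conn = http.client.HTTPConnection("localhost", PORT, timeout=3)
        conn.request("GET", "/")
        status = conn.getresponse().status
        conn.close()
        # The 503 "OIDC Setup is not complete yet." response counts as unhealthy.
        return status < 500
    except (OSError, http.client.HTTPException):
        return False


while True:
    if not workload_is_healthy():
        # "pebble notify <key>" records a custom notice that wakes up the charm.
        subprocess.run(["pebble", "notify", NOTICE_KEY], check=False)
    time.sleep(10)

On the Juju side, these custom notices surface as <container>-pebble-custom-notice events that the charm could observe and translate into a status change.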

The concerns with this are:

  1. We first need to migrate to Juju 3.4 to use pebble notices (we will need to do that anyway very soon)
  2. No other teams are likely using Pebble notices yet (random conversation in MM)
  3. This would require a manual process, while the Pebble team is currently working on a feature (matrix conversation) they plan to deliver before May 2024, which will introduce automated Pebble notices for health check failures. Essentially, it will make it possible to wake up the charm when a check fails or changes state, without having to manually add any script to the workload.