canonical / oidc-gatekeeper-operator

Charmed OIDC Gatekeeper
Apache License 2.0

oidc charm fails to replan after config `public-url` #126

Closed · orfeas-k closed this issue 8 months ago

orfeas-k commented 8 months ago

Bug Description

Logs

- AKS: aks-juju-debug-log.txt, aks-oidc-container-logs.txt, aks-oidc-pod-yaml.txt
- EC2: ec2-juju-debug-log.txt, ec2-oidc-container-logs.txt, ec2-oidc-pod-yaml.txt

To Reproduce

With Juju 3.1 and MicroK8s 1.26, deploy this minimal Kubeflow bundle (bundle-yaml.txt), wait until most charms are green, and then run:

juju config dex-auth public-url=http://10.64.140.43.nip.io
juju config oidc-gatekeeper public-url=http://10.64.140.43.nip.io

IIUC, this is not caught by the charm's integration tests because we pass a dummy public_url and somehow the charm goes to Active.
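For illustration, a minimal sketch of how an integration test could drive the charm with a dummy URL (using pytest-operator; the test body and `APP_NAME` are assumptions, not the charm's actual test code):

```python
import pytest

APP_NAME = "oidc-gatekeeper"


@pytest.mark.abort_on_fail
async def test_dummy_public_url_goes_active(ops_test):
    # "http://test-url" can never resolve (see the container logs below),
    # yet the charm still reports Active because nothing watches the workload.
    await ops_test.model.applications[APP_NAME].set_config(
        {"public-url": "http://test-url"}
    )
    await ops_test.model.wait_for_idle(apps=[APP_NAME], status="active", timeout=600)
```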

Container logs in integration tests:

```
╰─$ kl -ntest-charm-8l1z oidc-gatekeeper-0 -c oidc-authservice
2024-01-17T16:00:49.099Z [pebble] HTTP API server listening on ":38813".
2024-01-17T16:00:49.099Z [pebble] Started daemon.
2024-01-17T16:01:14.670Z [pebble] GET /v1/plan?format=yaml 70.649µs 200
2024-01-17T16:01:14.672Z [pebble] POST /v1/layers 199.264µs 200
2024-01-17T16:01:14.680Z [pebble] POST /v1/services 3.526964ms 202
2024-01-17T16:01:14.683Z [pebble] Service "oidc-authservice" starting: /home/authservice/oidc-authservice
2024-01-17T16:01:14.697Z [oidc-authservice] time="2024-01-17T16:01:14Z" level=info msg="Config: &{ProviderURL:http://test-url/dex ClientID:authservice-oidc ClientSecret:XUCX52JVWCEAUEFUMIN055XEZMG3NX OIDCAuthURL: RedirectURL:/authservice/oidc/callback OIDCScopes:[openid profile email groups] StrictSessionValidation:false OIDCStateStorePath:oidc_state.db AuthserviceURLPrefix:/authservice/ SkipAuthURLs:[/authservice/ /dex/] AuthHeader:Authorization Audiences:[istio-ingressgateway.istio-system.svc.cluster.local] HomepageURL:/authservice/site/homepage AfterLoginURL: AfterLogoutURL:test-url UserIDHeader:kubeflow-userid GroupsHeader:kubeflow-groups UserIDPrefix: UserIDTransformer:{rules:[]} UserIDClaim:email UserIDTokenHeader: GroupsClaim:groups IDTokenHeader:Authorization Hostname: Port:8080 WebServerPort:8082 ReadinessProbePort:8081 CABundlePath: SessionStorePath:bolt.db SessionMaxAge:86400 SessionSameSite:Lax ClientName:AuthService ThemesURL:themes Theme:kubeflow TemplatePath:[web/templates/default] UserTemplateContext:map[] GroupsAllowlist:[*]}"
2024-01-17T16:01:14.697Z [oidc-authservice] time="2024-01-17T16:01:14Z" level=info msg="Starting readiness probe at 8081"
2024-01-17T16:01:14.697Z [oidc-authservice] time="2024-01-17T16:01:14Z" level=info msg="Starting server at :8080"
2024-01-17T16:01:14.697Z [oidc-authservice] time="2024-01-17T16:01:14Z" level=info msg="Starting web server at :8082"
2024-01-17T16:01:14.850Z [oidc-authservice] time="2024-01-17T16:01:14Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://test-url/dex/.well-known/openid-configuration\": dial tcp: lookup test-url on 10.152.183.10:53: no such host"
2024-01-17T16:01:15.687Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.006314378s 200
2024-01-17T16:01:16.231Z [pebble] GET /v1/plan?format=yaml 122.797µs 200
2024-01-17T16:01:16.234Z [pebble] POST /v1/layers 148.084µs 200
2024-01-17T16:01:16.241Z [pebble] POST /v1/services 3.601199ms 202
2024-01-17T16:01:16.252Z [pebble] GET /v1/changes/2/wait?timeout=4.000s 10.629602ms 200
2024-01-17T16:01:24.853Z [oidc-authservice] time="2024-01-17T16:01:24Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://test-url/dex/.well-known/openid-configuration\": dial tcp: lookup test-url on 10.152.183.10:53: no such host"
2024-01-17T16:01:34.854Z [oidc-authservice] time="2024-01-17T16:01:34Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://test-url/dex/.well-known/openid-configuration\": dial tcp: lookup test-url on 10.152.183.10:53: no such host"
2024-01-17T16:01:45.023Z [oidc-authservice] time="2024-01-17T16:01:45Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://test-url/dex/.well-known/openid-configuration\": dial tcp: lookup test-url on 10.152.183.10:53: no such host"
```

Environment

Reproduced both in an EC2 instance and in an AKS cluster.

EC2:

AKS:

Relevant Log Output


Additional Context

No response

syncronize-issues-to-jira[bot] commented 8 months ago

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5205.

This message was autogenerated

orfeas-k commented 8 months ago

We haven't yet gotten to the bottom of this. Summarizing the debugging and findings so far (some of the original discussion is here):

Debugging

Possible root causes

Pebble

Starting from the fact that adding working-dir fixed the issue, this could mean there was a change in Pebble's behaviour. That would probably have broken more charms than just this one, though, so I don't think this theory really stands. Looking at the release changelogs, there was a related change in Pebble, released on 4 September 2023 in v1.3.0:

feat: add support for services to configure working directory

However, looking at the PR, I don't see any changes to Pebble's default behaviour.
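For reference, a minimal sketch of a Pebble layer that sets working-dir, as the fix in #128 does. The service name and command come from the logs above; the rest of the layer structure is an assumption, not the charm's exact code:

```python
from ops.pebble import Layer

OIDC_LAYER = Layer(
    {
        "summary": "oidc-authservice layer",
        "services": {
            "oidc-authservice": {
                "override": "replace",
                "summary": "OIDC AuthService",
                "command": "/home/authservice/oidc-authservice",
                "startup": "enabled",
                # Requires Pebble >= v1.3.0; starts the service from the
                # directory that contains web/templates/default.
                "working-dir": "/home/authservice",
            }
        },
    }
)
```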

Fixing "Waiting for pod startup to complete"

One guess is that this error surfaced after we fixed #112 in #118, meaning it didn't occur before because we never handled pebble_ready events. Keep in mind, though, that #112 was intermittent, while Failed to replan is consistent. I tried removing the line we added in #118 (observing self.on.oidc_authservice_pebble_ready) and rebuilt the charm. The charm went to Active, but the workload didn't operate; instead, it logged repeatedly:

2024-01-19T09:00:57.503Z [oidc-authservice] time="2024-01-19T09:00:57Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://10.64.140.43.nip.io/dex/.well-known/openid-configuration\": dial tcp 10.64.140.43:80: connect: connection refused"
2024-01-19T09:00:58.370Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.008170238s 200
2024-01-19T09:00:59.220Z [pebble] GET /v1/plan?format=yaml 197.978µs 200
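For context, the removed line is the pebble-ready observation that #118 added to the charm's __init__, roughly as below; the handler name self.main is an assumption for illustration:

```python
# The #118 wiring under discussion; `self.main` is an assumed handler name.
self.framework.observe(self.on.oidc_authservice_pebble_ready, self.main)
```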
orfeas-k commented 8 months ago

Our understanding of how the charm works at the moment

(quoting @ca-scribner on this)

  1. public-url must be set. If it is not set, the charm reports as blocked before the workload is started. Once it is set, the workload is started.

  2. public-url must be a valid URL. If it is not a valid URL, we hit the issue in this comment, where the workload retries every 10 seconds to reach that URL. oidc-authservice is not healthy, but it is also not crashing in a way Pebble alerts us about; we have no health checks on the workload, so we cannot notice an unhealthy-but-running workload (see the sketch after this list).

  3. If public-url is valid, then oidc tries to find the web resources and cannot. That's when we get the error message about not finding web/...
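Point 2 above notes the missing workload health check. As an illustration only, here is a hedged sketch of a Pebble check layer that would surface the unhealthy-but-running state; the check name is hypothetical, and the port comes from the readiness probe shown in the logs:

```python
# Hypothetical Pebble check layer; the charm does not currently define one.
CHECKS_LAYER = {
    "checks": {
        "oidc-ready": {
            "override": "replace",
            "level": "ready",
            # 8081 is the workload's readiness probe port, per the logs above.
            "http": {"url": "http://localhost:8081/"},
        }
    }
}
```

With a check like this, Pebble would report the check as failing, giving the charm something to observe instead of silently staying Active.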

ca-scribner commented 8 months ago

This is mitigated by #128, but we probably can't say #128 fixes it unless we better understand why the issue happened.

DnPlas commented 8 months ago

I am not able to reproduce this issue in either 1.8/stable or latest/edge. Here are my steps and relevant information:

#### Charmed Kubeflow 1.8/stable

Env:

Steps:

  1. Deploy Charmed Kubeflow: `juju deploy kubeflow --channel=1.8/stable --trust`
  2. Wait for the oidc-gatekeeper charm to go to `BlockedStatus` with message `Blocked: public-url config required`
  3. Configure the charm: `juju config oidc-gatekeeper public-url=http://10.64.140.43.nip.io`

Logs:

  1. Charm goes to active and idle after setting the value of public-url:

    Model     Controller  Cloud/Region        Version  SLA          Timestamp
    kubeflow  uk8s        microk8s/localhost  3.1.7    unsupported  17:22:57Z

    App              Version  Status  Scale  Charm            Channel         Rev  Address         Exposed  Message
    oidc-gatekeeper           active      1  oidc-gatekeeper  ckf-1.8/stable  294  10.152.183.215  no

    Unit                Workload  Agent  Address      Ports  Message
    oidc-gatekeeper/0*  active    idle   10.1.60.169

2. [Logs from the container](https://pastebin.canonical.com/p/hYB85jwkSN/)

#### Deploying charms from latest/edge

1. Same env as before
2. Deployed `oidc-gatekeeper` and `dex-auth` from `latest/edge`
3. Related `oidc-gatekeeper` to `dex-auth` and `istio-pilot`
4. Waited for the `oidc-gatekeeper` charm to go to `BlockedStatus` with message `Blocked: public-url config required`
5. Configured the charm: `juju config oidc-gatekeeper public-url=http://10.64.140.43.nip.io`

Logs:

* From a `juju debug-code` [debug session](https://pastebin.canonical.com/p/DWCybqPjwx/), we can see that the charm updates the pebble layer and then goes to active
* No error status:

    Model     Controller  Cloud/Region        Version  SLA          Timestamp
    kubeflow  uk8s        microk8s/localhost  3.1.7    unsupported  17:43:24Z

    App              Version  Status  Scale  Charm            Channel      Rev  Address        Exposed  Message
    oidc-gatekeeper           active      1  oidc-gatekeeper  latest/edge  329  10.152.183.84  no

    Unit                Workload  Agent  Address      Ports  Message
    oidc-gatekeeper/0*  active    idle   10.1.60.140

DnPlas commented 8 months ago

UPDATE: I am able to reproduce this issue when deploying oidc-gatekeeper from revision 312, which was published before #128 was merged; that PR helps fix the issue.

I essentially deployed oidc-gatekeeper from latest/edge rev 312, related it to dex-auth and istio-pilot, then waited for the unit to go to BlockedStatus waiting for the public-url config. I then configured public-url and the unit went to Blocked: Failed to replan. Opening a debug session and trying to manually replan yields the following:

(Pdb) self._container.add_layer(self._container.name, self._oidc_layer.services, combine=True)
(Pdb) self._container.replan()
*** ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "oidc-authservice" (cannot start service: exited quickly with code 1)
----- Logs from task 0 -----
2024-01-19T17:53:14Z INFO Most recent service output:
    time="2024-01-19T17:53:14Z" level=info msg="Config: &{ProviderURL:http://10.64.140.43.nip.io/dex ClientID:authservice-oidc ClientSecret:QU36Q0D91861PT09MN6KCLRIK6AR88 OIDCAuthURL: RedirectURL:/authservice/oidc/callback OIDCScopes:[openid profile email groups] StrictSessionValidation:false OIDCStateStorePath:oidc_state.db AuthserviceURLPrefix:/authservice/ SkipAuthURLs:[/authservice/ /dex/] AuthHeader:Authorization Audiences:[istio-ingressgateway.istio-system.svc.cluster.local] HomepageURL:/authservice/site/homepage AfterLoginURL: AfterLogoutURL:http://10.64.140.43.nip.io UserIDHeader:kubeflow-userid GroupsHeader:kubeflow-groups UserIDPrefix: UserIDTransformer:{rules:[]} UserIDClaim:email UserIDTokenHeader: GroupsClaim:groups IDTokenHeader:Authorization Hostname: Port:8080 WebServerPort:8082 ReadinessProbePort:8081 CABundlePath: SessionStorePath:bolt.db SessionMaxAge:86400 SessionSameSite:Lax ClientName:AuthService ThemesURL:themes Theme:kubeflow TemplatePath:[web/templates/default] UserTemplateContext:map[] GroupsAllowlist:[*]}"
    time="2024-01-19T17:53:14Z" level=info msg="Starting readiness probe at 8081"
    time="2024-01-19T17:53:14Z" level=info msg="Starting server at :8080"
    time="2024-01-19T17:53:14Z" level=info msg="Starting web server at :8082"
    time="2024-01-19T17:53:14Z" level=fatal msg="open web/templates/default: no such file or directory"
2024-01-19T17:53:14Z ERROR cannot start service: exited quickly with code 1
-----

which is the error already reported above.

You can also find the container logs here.
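For context, a hedged sketch of how a failed replan would surface as the Blocked: Failed to replan status seen above; the exception-handling structure is an assumption, not the charm's actual source:

```python
from ops.charm import CharmBase
from ops.model import BlockedStatus
from ops.pebble import ChangeError


class OidcCharmSketch(CharmBase):
    """Hedged sketch; the real charm's handler structure may differ."""

    def _update_layer(self) -> None:
        container = self.unit.get_container("oidc-authservice")
        # OIDC_LAYER is the layer sketched earlier in this thread.
        container.add_layer("oidc-authservice", OIDC_LAYER, combine=True)
        try:
            # replan() raises ChangeError when the service exits quickly,
            # as in the Pdb session above.
            container.replan()
        except ChangeError:
            self.unit.status = BlockedStatus("Failed to replan")
```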

DnPlas commented 8 months ago

I deployed oidc-gatekeeper from channel ckf-1.8/stable and executed into the oidc-authservice container, whose working dir automatically defaults to /home/authservice. In contrast, the newer versions of the oidc-authservice containers default the working dir to /, which does not contain the directory and files the service is looking for.

Executing into the container (ckf-1.8/stable)

ubuntu@ip-172-31-15-25:~$ kubectl exec -ti oidc-gatekeeper-0 -coidc-authservice -nkubeflow -- /bin/sh
/home/authservice #

It is not yet clear to me what introduced the change, but it seems like the containers from latest/edge rev 312 and ckf-1.8/stable are different from the ones in latest/edge. In fact, the "older ones" only have /bin/sh while the newer containers have /bin/bash. The only change I see from a working charm to the broken one is this, which changes the image registry. I am not sure if the published image is different or has changed in the months since we released.

I think, though, that #128 is the right fix for this issue. Thoughts @ca-scribner @orfeas-k ?

orfeas-k commented 8 months ago

From the upstream Dockerfile, we can confirm that the executable is expected to be run from the working-dir added in #128:

ENV APP_HOME=/home/$USER
WORKDIR $APP_HOME

# Copy in binary and give permissions
COPY --from=builder /go/bin/oidc-authservice $APP_HOME
COPY web $APP_HOME/web
RUN chmod +x $APP_HOME/oidc-authservice
RUN chown -R $USER:$GROUP $APP_HOME

USER $USER

ENTRYPOINT [ "./oidc-authservice" ]

The question that's left unanswered is why this surfaced now and not earlier.

> In fact the "older ones" only have /bin/sh while the newer containers have /bin/bash. The only change I see from a working charm to the broken one is this, which changes the image registry. Not sure if the published image is different or has changed in the past months after we released.

Comparing the two images from the kubeflowmanifests repo and the arrikto repo, they both have the same digest (SHA), which IIUC means they are exactly the same image, just pushed to a different repo. This doesn't explain what @DnPlas observed about /bin/sh vs /bin/bash, though.
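For reference, a hedged sketch of that digest comparison using the Docker Registry HTTP API v2; the registry host and repository path passed in are illustrative placeholders, not the exact repos compared:

```python
import requests


def manifest_digest(registry: str, repo: str, tag: str, token: str | None = None) -> str:
    """Return the manifest digest for an image tag via the Registry API v2."""
    headers = {"Accept": "application/vnd.docker.distribution.manifest.v2+json"}
    if token:  # some registries require a bearer token even for public pulls
        headers["Authorization"] = f"Bearer {token}"
    # HEAD avoids downloading the manifest body; the digest is in a header.
    resp = requests.head(f"https://{registry}/v2/{repo}/manifests/{tag}", headers=headers)
    resp.raise_for_status()
    return resp.headers["Docker-Content-Digest"]
```

Two tags resolving to the same digest are byte-for-byte the same image, whichever repository they were pushed to.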

ca-scribner commented 8 months ago

The only thing I can think of is that something changed in Pebble's behaviour, but I don't see why that would be the case either.

Do we just close this in light of #128 at least fixing things most of the time? I don't like accepting "we don't know" as the resolution here, but I also don't have a better idea.

orfeas-k commented 8 months ago

Closing, then, since we cannot reproduce this with the current charms. Let's re-open if we stumble on it again (although we shouldn't).