Thank you for reporting your feedback to us!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5205.
This message was autogenerated.
We haven't yet got to the bottom of it. Summarizing the debugging and findings so far (some original discussions here):
Deploying with `public-url` already configured also results in `Blocked (Failed to replan)`. Situation:

```
juju deploy oidc-gatekeeper --channel latest/edge --trust --config public-url=http://10.64.140.43.nip.io
```
`exec`ing into the failing container, we see that the directory actually exists:

```
root@oidc-gatekeeper-0:~# ls /home/authservice/web/templates/default/
after_logout.html  footer.html  header.html  homepage.html
```

contrary to what the container logs:

```
2024-01-17T14:12:36.386Z [oidc-authservice] time="2024-01-17T14:12:36Z" level=fatal msg="open web/templates/default: no such file or directory"
```
Adding a `working-dir` field to the Pebble configuration, as done here, fixes the issue.
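For reference, a minimal sketch of what such a layer could look like in the charm's Python code, assuming an ops version that supports the `working-dir` service field. The service name and command are taken from the logs in this thread; everything else is illustrative, not the charm's actual code:

```python
# Sketch of a Pebble layer with the working-dir fix (#128); the service name
# and command come from the logs above, the rest is an assumption.
from ops.pebble import Layer

oidc_layer = Layer(
    {
        "summary": "oidc-authservice layer",
        "services": {
            "oidc-authservice": {
                "override": "replace",
                "command": "/home/authservice/oidc-authservice",
                "startup": "enabled",
                # Without this, the service may start from a different default
                # working directory, and the relative path
                # web/templates/default cannot be resolved.
                "working-dir": "/home/authservice",
            }
        },
    }
)
```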
Reordering `main` and moving `_check_public_url()` to just before `update_layer()` (as done here) also fixes the issue.
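A sketch of that reordering, with the method names mentioned above and bodies elided (illustrative only):

```python
# Illustrative ordering: validate public-url before touching the Pebble plan,
# so replan only runs once the config is known to be set.
def main(self, event):
    self._check_public_url()  # goes to BlockedStatus if public-url is unset
    self.update_layer()       # add_layer(..., combine=True) + replan()
```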
Indeed, as mentioned in this issue's description, integration tests pass even while the workload is failing, as also described in https://github.com/canonical/oidc-gatekeeper-operator/issues/82#issuecomment-1899208072.
Starting from the fact that adding `working-dir` fixed the issue, this could mean there was a change in Pebble's behaviour, but that would probably have broken more charms than just this one, so I don't think it really stands. Looking at the releases' changelogs, there was one related change in Pebble, released on 4 Sep 2023 in v1.3.0:

> feat: add support for services to configure working directory

However, looking at the PR, I don't see any changes to Pebble's default behaviour.
#### Waiting for pod startup to complete
One guess is that this error surfaced after we fixed #112 in #118, meaning it didn't occur before because we never handled `pebble_ready` events. Keep in mind though that #112 was intermittent, while `Failed to replan` is consistent. I tried removing the line that we added in #118 (observing `self.on.oidc_authservice_pebble_ready`) and rebuilt the charm. It looks like the charm went to `Active`, but the workload didn't operate. Instead, it logged repeatedly:
```
2024-01-19T09:00:57.503Z [oidc-authservice] time="2024-01-19T09:00:57Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://10.64.140.43.nip.io/dex/.well-known/openid-configuration\": dial tcp 10.64.140.43:80: connect: connection refused"
2024-01-19T09:00:58.370Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.008170238s 200
2024-01-19T09:00:59.220Z [pebble] GET /v1/plan?format=yaml 197.978µs 200
```
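For context, the line added in #118 and removed in this experiment is the usual ops wiring for the container-ready event, roughly as follows (the handler name is an assumption):

```python
# Observe the workload container becoming ready (the wiring added in #118);
# the handler name here is illustrative.
self.framework.observe(self.on.oidc_authservice_pebble_ready, self.main)
```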
(quoting @ca-scribner on this:)

* `public-url` must be set. If it is not set, the charm reports as blocked before the workload is started. Once set, the workload will be started.
* `public-url` must be a valid URL. If it is not a valid URL, we hit the issue in this comment where the workload retries every 10 seconds to reach that URL. `oidc-authservice` is not healthy, but also not crashing in a way Pebble alerts us about (we have no health checks on the workload, so we can't notice an unhealthy-but-running workload; see the sketch after this list).
* If `public-url` is valid, then oidc-authservice tries to find the web resources and cannot. That's when we get the error message about not finding `web/...`.
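As a sketch of what such a health check could look like in a Pebble layer (the readiness-probe port 8081 comes from the service logs in this thread; the check name, period, and path are assumptions):

```python
# Illustrative Pebble check, not something the charm has today: an HTTP check
# against the workload's readiness probe, so Pebble would notice an
# unhealthy-but-running workload.
checks_layer = {
    "checks": {
        "oidc-authservice-ready": {
            "override": "replace",
            "level": "ready",
            "period": "30s",
            "http": {"url": "http://localhost:8081/"},
        }
    }
}
```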
This is mitigated by #128, but we probably can't say #128 fixes it unless we better understand why the issue happened.
I am not able to reproduce this issue in either 1.8/stable or latest/edge. Here are my steps and relevant information:
Env:
Steps:
1. `juju deploy kubeflow --channel 1.8/stable --trust`
2. Wait for the `oidc-gatekeeper` charm to go to `BlockedStatus` with message `Blocked: public-url config required`
3. Configure the charm: `juju config oidc-gatekeeper public-url=http://10.64.140.43.nip.io`
Logs:
1. `juju status` output:
```
Model     Controller  Cloud/Region        Version  SLA          Timestamp
kubeflow  uk8s        microk8s/localhost  3.1.7    unsupported  17:22:57Z

App              Version  Status  Scale  Charm            Channel         Rev  Address         Exposed  Message
oidc-gatekeeper           active      1  oidc-gatekeeper  ckf-1.8/stable  294  10.152.183.215  no

Unit                Workload  Agent  Address      Ports  Message
oidc-gatekeeper/0*  active    idle   10.1.60.169
```
2. [Logs from the container](https://pastebin.canonical.com/p/hYB85jwkSN/)
#### Deploying charms from latest/edge
1. Same env as before
2. Deploy `oidc-gatekeeper` and `dex-auth` from `latest/edge`
3. Relate `oidc-gatekeeper` to `dex-auth` and `istio-pilot`
4. Wait for the `oidc-gatekeeper` charm to go to `BlockedStatus` with message `Blocked: public-url config required`
5. Configure the charm: `juju config oidc-gatekeeper public-url=http://10.64.140.43.nip.io`
Logs:
* From a `juju debug-code` [debug session](https://pastebin.canonical.com/p/DWCybqPjwx/), we can see that the charm updates the Pebble layer and then goes to active
* No error status:
```
Model     Controller  Cloud/Region        Version  SLA          Timestamp
kubeflow  uk8s        microk8s/localhost  3.1.7    unsupported  17:43:24Z

App              Version  Status  Scale  Charm            Channel      Rev  Address        Exposed  Message
oidc-gatekeeper           active      1  oidc-gatekeeper  latest/edge  329  10.152.183.84  no

Unit                Workload  Agent  Address      Ports  Message
oidc-gatekeeper/0*  active    idle   10.1.60.140
```
UPDATE: I am able to reproduce this issue when deploying `oidc-gatekeeper` from revision 312, which was published before #128 (which helps fix the issue) was merged.
I essentially deployed `oidc-gatekeeper` from latest/edge rev 312, related it to `dex-auth` and `istio-pilot`, then waited for the unit to go to `BlockedStatus` waiting for the `public-url` config. I then configured the `public-url`, and the unit went to `Blocked: Failed to replan`. Opening a debug session and trying to manually replan yields the following:
```
(Pdb) self._container.add_layer(self._container.name, self._oidc_layer.services, combine=True)
(Pdb) self._container.replan()
*** ops.pebble.ChangeError: cannot perform the following tasks:
- Start service "oidc-authservice" (cannot start service: exited quickly with code 1)
----- Logs from task 0 -----
2024-01-19T17:53:14Z INFO Most recent service output:
time="2024-01-19T17:53:14Z" level=info msg="Config: &{ProviderURL:http://10.64.140.43.nip.io/dex ClientID:authservice-oidc ClientSecret:QU36Q0D91861PT09MN6KCLRIK6AR88 OIDCAuthURL: RedirectURL:/authservice/oidc/callback OIDCScopes:[openid profile email groups] StrictSessionValidation:false OIDCStateStorePath:oidc_state.db AuthserviceURLPrefix:/authservice/ SkipAuthURLs:[/authservice/ /dex/] AuthHeader:Authorization Audiences:[istio-ingressgateway.istio-system.svc.cluster.local] HomepageURL:/authservice/site/homepage AfterLoginURL: AfterLogoutURL:http://10.64.140.43.nip.io UserIDHeader:kubeflow-userid GroupsHeader:kubeflow-groups UserIDPrefix: UserIDTransformer:{rules:[]} UserIDClaim:email UserIDTokenHeader: GroupsClaim:groups IDTokenHeader:Authorization Hostname: Port:8080 WebServerPort:8082 ReadinessProbePort:8081 CABundlePath: SessionStorePath:bolt.db SessionMaxAge:86400 SessionSameSite:Lax ClientName:AuthService ThemesURL:themes Theme:kubeflow TemplatePath:[web/templates/default] UserTemplateContext:map[] GroupsAllowlist:[*]}"
time="2024-01-19T17:53:14Z" level=info msg="Starting readiness probe at 8081"
time="2024-01-19T17:53:14Z" level=info msg="Starting server at :8080"
time="2024-01-19T17:53:14Z" level=info msg="Starting web server at :8082"
time="2024-01-19T17:53:14Z" level=fatal msg="open web/templates/default: no such file or directory"
2024-01-19T17:53:14Z ERROR cannot start service: exited quickly with code 1
-----
```
which matches what was already reported above.
You can also find the container logs here.
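The `Blocked: Failed to replan` status suggests the charm wraps the replan in something like the following (a sketch under that assumption, not the charm's actual code):

```python
import ops
from ops.pebble import ChangeError

def update_layer(self):
    self._container.add_layer(self._container.name, self._oidc_layer, combine=True)
    try:
        self._container.replan()
    except ChangeError:
        # The service exits quickly with code 1 (see the task logs above),
        # so replan raises and the charm surfaces it as a blocked status.
        self.unit.status = ops.BlockedStatus("Failed to replan")
```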
I deployed `oidc-gatekeeper` from channel ckf-1.8/stable and exec'd into the `oidc-authservice` container, whose working dir automatically defaults to `/home/authservice`. In contrast, the newer versions of the `oidc-authservice` charms have the default working dir set to root, which does not contain the directory and files that the service is looking for.
Executing into the container (ckf-1.8/stable):

```
ubuntu@ip-172-31-15-25:~$ kubectl exec -ti oidc-gatekeeper-0 -coidc-authservice -nkubeflow -- /bin/sh
/home/authservice #
```
It is not yet clear to me what introduced the change, but it seems like the containers from latest/edge rev 312 and ckf-1.8/stable are different from the ones in latest/edge. In fact, the "older" ones only have `/bin/sh`, while the newer containers have `/bin/bash`. The only change I see from a working charm to the broken one is this, which changes the image registry. I am not sure if the published image is different or has changed in the months since we released.
I think, though, that #128 is the right fix for this issue. Thoughts @ca-scribner @orfeas-k?
From the upstream Dockerfile, we can confirm that the executable is expected to be run from the `working-dir` added in #128:

```dockerfile
ENV APP_HOME=/home/$USER
WORKDIR $APP_HOME

# Copy in binary and give permissions
COPY --from=builder /go/bin/oidc-authservice $APP_HOME
COPY web $APP_HOME/web
RUN chmod +x $APP_HOME/oidc-authservice
RUN chown -R $USER:$GROUP $APP_HOME

USER $USER
ENTRYPOINT [ "./oidc-authservice" ]
```

Note that the `ENTRYPOINT` uses the relative path `./oidc-authservice` and the web assets are copied under `$APP_HOME/web`, so the binary only finds `web/templates/default` when started from `$APP_HOME`.
The question that's left unanswered is why this surfaced now and not earlier.
> In fact the "older ones" only have `/bin/sh` while the newer containers have `/bin/bash`. The only change I see from a working charm to the broken one is this, which changes the image registry. Not sure if the published image is different or has changed in the past months after we released.

Comparing the two images from the kubeflowmanifests repo and the arrikto repo, they both have the same digest (`sha`), which means, IIUC, that they are the exact same image, just pushed to another repo. This doesn't explain what @DnPlas mentioned about `/bin/sh` and `/bin/bash`.
The only thing I can think of is that something changed in Pebble's behaviour, but I don't see why that would be the case either.
Do we just close this in light of #128 at least fixing things most of the time? I don't like accepting "we don't know" as the resolution here, but I also don't have a better idea.
Closing then, since we cannot reproduce it with the current charms. Let's re-open in case we stumble on this again (although we shouldn't).
#### Bug Description
The charm goes to `Blocked: public-url config required`. After setting the config with `juju config oidc-gatekeeper public-url=http://10.64.140.43.nip.io`, the unit goes to `Blocked: Failed to replan`.
#### Logs
* aks-juju-debug-log.txt, aks-oidc-container-logs.txt, aks-oidc-pod-yaml.txt
* ec2-juju-debug-log.txt, ec2-oidc-container-logs.txt, ec2-oidc-pod-yaml.txt
#### To Reproduce
With Juju 3.1 and MicroK8s 1.26, deploy this minimal kubeflow bundle (bundle-yaml.txt). Wait until most charms are green and
IIUC, this is not caught by the charm's integration tests because we pass a dummy `public_url` and somehow the charm goes to `Active`.

Container's logs in integration tests:
```
╰─$ kl -ntest-charm-8l1z oidc-gatekeeper-0 -c oidc-authservice
2024-01-17T16:00:49.099Z [pebble] HTTP API server listening on ":38813".
2024-01-17T16:00:49.099Z [pebble] Started daemon.
2024-01-17T16:01:14.670Z [pebble] GET /v1/plan?format=yaml 70.649µs 200
2024-01-17T16:01:14.672Z [pebble] POST /v1/layers 199.264µs 200
2024-01-17T16:01:14.680Z [pebble] POST /v1/services 3.526964ms 202
2024-01-17T16:01:14.683Z [pebble] Service "oidc-authservice" starting: /home/authservice/oidc-authservice
2024-01-17T16:01:14.697Z [oidc-authservice] time="2024-01-17T16:01:14Z" level=info msg="Config: &{ProviderURL:http://test-url/dex ClientID:authservice-oidc ClientSecret:XUCX52JVWCEAUEFUMIN055XEZMG3NX OIDCAuthURL: RedirectURL:/authservice/oidc/callback OIDCScopes:[openid profile email groups] StrictSessionValidation:false OIDCStateStorePath:oidc_state.db AuthserviceURLPrefix:/authservice/ SkipAuthURLs:[/authservice/ /dex/] AuthHeader:Authorization Audiences:[istio-ingressgateway.istio-system.svc.cluster.local] HomepageURL:/authservice/site/homepage AfterLoginURL: AfterLogoutURL:test-url UserIDHeader:kubeflow-userid GroupsHeader:kubeflow-groups UserIDPrefix: UserIDTransformer:{rules:[]} UserIDClaim:email UserIDTokenHeader: GroupsClaim:groups IDTokenHeader:Authorization Hostname: Port:8080 WebServerPort:8082 ReadinessProbePort:8081 CABundlePath: SessionStorePath:bolt.db SessionMaxAge:86400 SessionSameSite:Lax ClientName:AuthService ThemesURL:themes Theme:kubeflow TemplatePath:[web/templates/default] UserTemplateContext:map[] GroupsAllowlist:[*]}"
2024-01-17T16:01:14.697Z [oidc-authservice] time="2024-01-17T16:01:14Z" level=info msg="Starting readiness probe at 8081"
2024-01-17T16:01:14.697Z [oidc-authservice] time="2024-01-17T16:01:14Z" level=info msg="Starting server at :8080"
2024-01-17T16:01:14.697Z [oidc-authservice] time="2024-01-17T16:01:14Z" level=info msg="Starting web server at :8082"
2024-01-17T16:01:14.850Z [oidc-authservice] time="2024-01-17T16:01:14Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://test-url/dex/.well-known/openid-configuration\": dial tcp: lookup test-url on 10.152.183.10:53: no such host"
2024-01-17T16:01:15.687Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.006314378s 200
2024-01-17T16:01:16.231Z [pebble] GET /v1/plan?format=yaml 122.797µs 200
2024-01-17T16:01:16.234Z [pebble] POST /v1/layers 148.084µs 200
2024-01-17T16:01:16.241Z [pebble] POST /v1/services 3.601199ms 202
2024-01-17T16:01:16.252Z [pebble] GET /v1/changes/2/wait?timeout=4.000s 10.629602ms 200
2024-01-17T16:01:24.853Z [oidc-authservice] time="2024-01-17T16:01:24Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://test-url/dex/.well-known/openid-configuration\": dial tcp: lookup test-url on 10.152.183.10:53: no such host"
2024-01-17T16:01:34.854Z [oidc-authservice] time="2024-01-17T16:01:34Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://test-url/dex/.well-known/openid-configuration\": dial tcp: lookup test-url on 10.152.183.10:53: no such host"
2024-01-17T16:01:45.023Z [oidc-authservice] time="2024-01-17T16:01:45Z" level=error msg="OIDC provider setup failed, retrying in 10 seconds: Get \"http://test-url/dex/.well-known/openid-configuration\": dial tcp: lookup test-url on 10.152.183.10:53: no such host"
```

#### Environment
Both on an EC2 instance and an AKS cluster.
EC2:
AKS:
#### Relevant Log Output
#### Additional Context
No response