NohaIhab opened 4 months ago
Thank you for reporting your feedback to us!
The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5651.
This message was autogenerated
To gather more information, I've tested:
- `1.8/edge` dashboard charm on juju 3.1: it was functioning correctly.
- `1.8/stable` dashboard charm on juju 3.5: this also functioned correctly.

So this concludes that the `1.8/edge` charm has a diff from `1.8/stable` that makes it break with juju 3.5 specifically. It's worth noting that we are using the rock in `1.8/edge`, but not in `1.8/stable`.
The Juju 3.5 release notes include the change https://github.com/juju/juju/pull/17070, I'm curious if this has to do with the errors we're seeing, and relating to the rock that we are using.
Comparing the pod definitions when deploying with 3.4 and 3.5, this security context is additional in 3.5:

securityContext:
  fsGroup: 170
  supplementalGroups:
    - 170
I can reproduce with:
juju deploy kubeflow-dashboard --channel 1.8/edge --trust
juju deploy kubeflow-profiles --channel 1.8/stable --trust
juju relate kubeflow-dashboard kubeflow-profiles
But in my case, kubeflow-dashboard's charm goes to `error` because ops's pebble client raises an exception during charm execution:
2024-05-09T15:18:02.818Z [container-agent] 2024-05-09 15:18:02 INFO juju-log Pebble plan updated with new configuration, replaning
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Error in sys.excepthook:
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Traceback (most recent call last):
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/usr/lib/python3.8/logging/__init__.py", line 954, in handle
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed self.emit(record)
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/log.py", line 41, in emit
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed self.model_backend.juju_log(record.levelname, self.format(record))
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/usr/lib/python3.8/logging/__init__.py", line 929, in format
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed return fmt.format(record)
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/usr/lib/python3.8/logging/__init__.py", line 676, in format
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed record.exc_text = self.formatException(record.exc_info)
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/usr/lib/python3.8/logging/__init__.py", line 626, in formatException
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed traceback.print_exception(ei[0], ei[1], tb, None, sio)
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/usr/lib/python3.8/traceback.py", line 103, in print_exception
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed for line in TracebackException(
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/usr/lib/python3.8/traceback.py", line 617, in format
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed yield from self.format_exception_only()
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/usr/lib/python3.8/traceback.py", line 566, in format_exception_only
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed stype = smod + '.' + stype
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
2024-05-09T15:18:03.849Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed
2024-05-09T15:18:03.849Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Original exception was:
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Traceback (most recent call last):
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "./src/charm.py", line 212, in _update_layer
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed self.container.replan()
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/model.py", line 1915, in replan
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed self._pebble.replan_services()
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/pebble.py", line 1680, in replan_services
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed return self._services_action('replan', [], timeout, delay)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/pebble.py", line 1761, in _services_action
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed raise ChangeError(change.err, change)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed ops.pebble.ChangeError: cannot perform the following tasks:
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed - Start service "serve" (cannot start service: exited quickly with code 0)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed ----- Logs from task 0 -----
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed 2024-05-09T15:18:03Z INFO Most recent service output:
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed (...)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed > kubeflow-centraldashboard@0.0.2 serve
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed > node dist/server.js
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Initializing Kubernetes configuration
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed [SEVERE] unhandledRejection Error: EACCES: permission denied, open '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed at Object.openSync (fs.js:462:3)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed at Object.readFileSync (fs.js:364:35)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed at KubeConfig.loadFromCluster (/app/node_modules/@kubernetes/client-node/dist/config.js:137:27)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed at KubeConfig.loadFromDefault (/app/node_modules/@kubernetes/client-node/dist/config.js:173:18)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed at new KubernetesService (/app/dist/k8s_service.js:45:25)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed at /app/dist/server.js:46:28
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed at Generator.next (<anonymous>)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed at /app/dist/server.js:8:71
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed at new Promise (<anonymous>)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed at __awaiter (/app/dist/server.js:4:12) {
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed errno: -13,
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed syscall: 'open',
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed code: 'EACCES',
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed path: '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed }
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed 2024-05-09T15:18:03Z ERROR cannot start service: exited quickly with code 0
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed -----
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed The above exception was the direct cause of the following exception:
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Traceback (most recent call last):
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "./src/charm.py", line 304, in <module>
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed main(KubeflowDashboardOperator)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/main.py", line 441, in main
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed _emit_charm_event(charm, dispatcher.event_name)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed event_to_emit.emit(*args, **kwargs)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/framework.py", line 344, in emit
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed framework._emit(event)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/framework.py", line 841, in _emit
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed self._reemit(event_path)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/framework.py", line 930, in _reemit
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed custom_handler(event)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "./src/charm.py", line 284, in main
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed self._update_layer()
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed File "./src/charm.py", line 214, in _update_layer
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed raise GenericCharmRuntimeError("Failed to replan") from e
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed <unknown>GenericCharmRuntimeError: Failed to replan
I can also reproduce this behaviour directly on Kubernetes using the rock charmedkubeflow/kubeflow-central-dashboard (source rockcraft.yaml) by:
rockcraft pack
skopeo --insecure-policy copy oci-archive:kubeflow-central-dashboard_1.8_amd64.rock docker-daemon:centraldashboard:rock
docker save centraldashboard:rock > tempimage
microk8s ctr image import tempimage
rm tempimage
kubectl apply -f broken-dashboard-rock.yaml
where broken-dashboard-rock.yaml is:
apiVersion: v1
kind: Pod
metadata:
  name: kubeflow-dashboard-with-security-context
spec:
  automountServiceAccountToken: true
  securityContext:
    fsGroup: 170
    supplementalGroups:
      - 170
  containers:
    - name: kubeflow-dashboard
      image: centraldashboard:rock
      securityContext:
        runAsGroup: 0
        runAsUser: 0
      volumeMounts:
        - name: sec-ctx-vol
          mountPath: /data/demo
        - mountPath: /charm/bin/pebble
          name: charm-data
          readOnly: true
          subPath: charm/bin/pebble
        - mountPath: /charm/container
          name: charm-data
          subPath: charm/containers/kubeflow-dashboard
        - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
          name: kube-api-access-b4hms
          readOnly: true
  volumes:
    - name: sec-ctx-vol
      emptyDir: {}
    - name: charm-data
      emptyDir: {}
    - name: kube-api-access-b4hms
      projected:
        defaultMode: 420
        sources:
          - serviceAccountToken:
              expirationSeconds: 3607
              path: token
          - configMap:
              items:
                - key: ca.crt
                  path: ca.crt
              name: kube-root-ca.crt
          - downwardAPI:
              items:
                - fieldRef:
                    apiVersion: v1
                    fieldPath: metadata.namespace
                  path: namespace
(Note that I'm reusing the `ServiceAccount` that `juju` created above for kubeflow-dashboard.)

This raises the same errors as @NohaIhab mentioned above. If we remove the `securityContext`s, the app works as normal.
Looking at the serviceaccount token that npm reports we cannot access, we see that it is in group `170`:

kubectl exec kubeflow-dashboard-with-security-context -- ls -l /var/run/secrets/kubernetes.io/serviceaccount/token
lrwxrwxrwx 1 root 170 12 May 9 17:51 /var/run/secrets/kubernetes.io/serviceaccount/token -> ..data/token

which makes sense, because setting `Pod.spec.securityContext.fsGroup: 170` means that all `volumeMounts` will be part of group `170`.
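Why the `_daemon_` user then hits EACCES follows directly from the classic Unix permission check. A minimal sketch (assuming the token's target file is mode 0640 owned by root:170, which matches the observed behaviour; the function name and constants are mine, not from the charm):

```python
import stat

def may_read(mode: int, file_uid: int, file_gid: int,
             proc_uid: int, proc_gids: set) -> bool:
    """Classic Unix read-permission check (simplified: no root override, no ACLs)."""
    if proc_uid == file_uid:
        return bool(mode & stat.S_IRUSR)   # owner bits apply
    if file_gid in proc_gids:
        return bool(mode & stat.S_IRGRP)   # group bits apply
    return bool(mode & stat.S_IROTH)       # otherwise, "other" bits

# Assumed token file: mode 0640, owned by root (uid 0), group 170 (fsGroup).
# _daemon_ (uid 584792) WITH group 170 granted, as Kubernetes does for the entrypoint:
assert may_read(0o640, 0, 170, 584792, {584792, 170})
# _daemon_ WITHOUT group 170, as pebble's child apparently runs -> EACCES:
assert not may_read(0o640, 0, 170, 584792, {584792})
```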
Interestingly, if we do the same example as above but use upstream's image (`kubeflownotebookswg/centraldashboard:v1.8.0`), the dashboard starts successfully. Summary of some attempts:

- `kubeflownotebookswg/centraldashboard:v1.8.0` with container `securityContext: {runAsGroup: 0, runAsUser: 0}`:
  - `kubectl exec kubeflow-dashboard-with-pod-security-context-upstream-image -- whoami` shows `root`
  - `kubectl exec kubeflow-dashboard-with-pod-security-context-upstream-image -- groups` shows `root 170 ...` (some other groups)
- `kubeflownotebookswg/centraldashboard:v1.8.0` without container `securityContext`:
  - `kubectl exec kubeflow-dashboard-with-pod-security-context-upstream-image -- whoami` shows `node` (which is expected because the upstream image sets `USER node`)
  - `kubectl exec kubeflow-dashboard-with-pod-security-context-upstream-image -- groups` shows `node 170`

So for the above cases, whether the upstream image is executed as `root` or `node`, both appear to have group `170` added to their existing groups. Strangely, typing `groups node` or `groups root` in those sessions does not show `170` in the list, making me think Kubernetes somehow adds that group on the fly to the existing user.
Interestingly, if I `kubectl exec` into the container (which gives me the root user) or `runc exec -u 584792` (which gives me the rockcraft `_daemon_` user) and then type `groups`, I see that `170` is in my current groups and I can view the serviceaccount token. So somehow in these ad hoc sessions the users are being granted access to group `170`, just like above for the upstream docker images.
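This is consistent with how `groups` works: `groups <user>` computes membership from the user database (`/etc/group` and friends), while a bare `groups` reports the current process's actual supplementary groups, which the container runtime can set directly on the process regardless of the user database. A small sketch of the database-side lookup (helper name is mine); a runtime-granted group like `170` would never appear in its output:

```python
import os
import pwd

def db_groups(username: str) -> list:
    """Group IDs for *username* according to the user database only --
    supplementary groups injected at runtime (e.g. via fsGroup) won't show up,
    which is why `groups node` / `groups root` don't list 170."""
    primary = pwd.getpwnam(username).pw_gid
    return sorted(set(os.getgrouplist(username, primary)))

# A bare `groups` in a session reflects the live process instead:
# os.getgroups()  -> includes any runtime-injected supplementary groups
```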
My guess at what's happening is that we're hitting trouble because rocks run their services via `pebble`. The `pebble` entrypoint is likely getting the added group `170`, like we see with the docker images, but `pebble` does not pass this group down to the processes that it executes. So when pebble runs our service as the `_daemon_` user, that process does not have group `170` and thus cannot access the service token.
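To test this theory, one could run the same tiny diagnostic from a `kubectl exec` session and from a pebble-launched service, and compare the group lists and readability. A minimal sketch, assuming Python is available in the image (it may not be; the helper name is mine):

```python
import os
import stat

def diagnose(path: str) -> dict:
    """Report who the current process is and whether *path* is readable.
    Hypothetical diagnostic helper, not part of the charm or rock."""
    st = os.stat(path)
    return {
        "uid": os.getuid(),
        "gids": sorted({os.getgid(), *os.getgroups()}),   # what a bare `groups` shows
        "file_owner": (st.st_uid, st.st_gid),
        "file_mode": stat.filemode(st.st_mode),
        "readable": os.access(path, os.R_OK),
    }

# e.g. diagnose("/var/run/secrets/kubernetes.io/serviceaccount/token")
# If the theory holds: "gids" contains 170 under kubectl exec, but not when
# the same call runs as a pebble service, and "readable" flips accordingly.
```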
(A debugging idea: have the service's startup script run `groups` and `whoami` just to see what it is running as. Could also do some `cat /var/run/secretss...` etc. in that script to see what it says.)

Tried deploying the rock pod with the following changes (full yaml given below):

- `fsGroup=584792` (to match the `_daemon_` user pebble would use to execute services in the rock)
- with and without the container `securityContext` (it didn't matter whether it was there or not - I tried both and got the same result)

So I think this reinforces the theory above that kubernetes natively grants the fsGroup to the entrypoint's process, but pebble does not do the same for its children.
Based on @NohaIhab's suggestion to use the upstream image instead of the rock, I have pushed two temporary patches (#191 and #192) to help alleviate the issue and unblock other workflows. Also, based on some comments from the Juju team, it seems like the issue is rather on their side, so there is not much more we can do on the charm and rock.
FYI, this issue has been reported in juju, let's keep an eye on https://bugs.launchpad.net/juju/+bug/2066517.
When deploying the dashboard charm from `latest/edge` now (this deploys the rock due to #194), we are seeing:

In Juju 3.5.0, the charm is `active` but it's not functioning correctly, and we can see in the workload logs:
2024-06-04T15:09:52.430Z [pebble] Service "kubeflow-dashboard" starting: npm start
2024-06-04T15:09:52.912Z [kubeflow-dashboard] npm ERR! code ENOENT
2024-06-04T15:09:52.913Z [kubeflow-dashboard] npm ERR! syscall open
2024-06-04T15:09:52.913Z [kubeflow-dashboard] npm ERR! path //package.json
2024-06-04T15:09:52.913Z [kubeflow-dashboard] npm ERR! errno -2
2024-06-04T15:09:52.915Z [kubeflow-dashboard] npm ERR! enoent ENOENT: no such file or directory, open '//package.json'
2024-06-04T15:09:52.915Z [kubeflow-dashboard] npm ERR! enoent This is related to npm not being able to find a file.
2024-06-04T15:09:52.915Z [kubeflow-dashboard] npm ERR! enoent
2024-06-04T15:09:52.917Z [kubeflow-dashboard]
2024-06-04T15:09:52.918Z [kubeflow-dashboard] npm ERR! A complete log of this run can be found in:
2024-06-04T15:09:52.918Z [kubeflow-dashboard] npm ERR! /root/.npm/_logs/2024-06-04T15_09_52_732Z-debug-0.log
2024-06-04T15:09:52.928Z [pebble] Service "kubeflow-dashboard" stopped unexpectedly with code 254
2024-06-04T15:09:52.928Z [pebble] Service "kubeflow-dashboard" on-failure action is "restart", waiting ~30s before restart (backoff 7)
2024-06-04T15:10:01.913Z [pebble] Service "serve" starting: /usr/bin/npm start
2024-06-04T15:10:02.645Z [serve]
2024-06-04T15:10:02.646Z [serve] > kubeflow-centraldashboard@0.0.2 start
2024-06-04T15:10:02.646Z [serve] > npm run serve
2024-06-04T15:10:02.646Z [serve]
2024-06-04T15:10:03.393Z [serve] glob error [Error: EACCES: permission denied, scandir '/root/.npm/_logs'] {
2024-06-04T15:10:03.393Z [serve] errno: -13,
2024-06-04T15:10:03.393Z [serve] code: 'EACCES',
2024-06-04T15:10:03.393Z [serve] syscall: 'scandir',
2024-06-04T15:10:03.393Z [serve] path: '/root/.npm/_logs'
2024-06-04T15:10:03.393Z [serve] }
2024-06-04T15:10:03.396Z [serve] npm WARN logfile Error: EACCES: permission denied, scandir '/root/.npm/_logs'
2024-06-04T15:10:03.397Z [serve] npm WARN logfile error cleaning log files [Error: EACCES: permission denied, scandir '/root/.npm/_logs'] {
2024-06-04T15:10:03.397Z [serve] npm WARN logfile errno: -13,
2024-06-04T15:10:03.397Z [serve] npm WARN logfile code: 'EACCES',
2024-06-04T15:10:03.397Z [serve] npm WARN logfile syscall: 'scandir',
2024-06-04T15:10:03.397Z [serve] npm WARN logfile path: '/root/.npm/_logs'
2024-06-04T15:10:03.397Z [serve] npm WARN logfile }
2024-06-04T15:10:03.402Z [serve]
2024-06-04T15:10:03.402Z [serve] > kubeflow-centraldashboard@0.0.2 serve
2024-06-04T15:10:03.402Z [serve] > node dist/server.js
2024-06-04T15:10:03.402Z [serve]
2024-06-04T15:10:04.426Z [serve] Initializing Kubernetes configuration
2024-06-04T15:10:04.432Z [serve] [SEVERE] unhandledRejection Error: EACCES: permission denied, open '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-06-04T15:10:04.432Z [serve] at Object.openSync (fs.js:462:3)
2024-06-04T15:10:04.432Z [serve] at Object.readFileSync (fs.js:364:35)
2024-06-04T15:10:04.432Z [serve] at KubeConfig.loadFromCluster (/app/node_modules/@kubernetes/client-node/dist/config.js:137:27)
2024-06-04T15:10:04.432Z [serve] at KubeConfig.loadFromDefault (/app/node_modules/@kubernetes/client-node/dist/config.js:173:18)
2024-06-04T15:10:04.432Z [serve] at new KubernetesService (/app/dist/k8s_service.js:45:25)
2024-06-04T15:10:04.432Z [serve] at /app/dist/server.js:46:28
2024-06-04T15:10:04.432Z [serve] at Generator.next (<anonymous>)
2024-06-04T15:10:04.432Z [serve] at /app/dist/server.js:8:71
2024-06-04T15:10:04.432Z [serve] at new Promise (<anonymous>)
2024-06-04T15:10:04.432Z [serve] at __awaiter (/app/dist/server.js:4:12) {
2024-06-04T15:10:04.432Z [serve] errno: -13,
2024-06-04T15:10:04.432Z [serve] syscall: 'open',
2024-06-04T15:10:04.432Z [serve] code: 'EACCES',
2024-06-04T15:10:04.432Z [serve] path: '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-06-04T15:10:04.432Z [serve] }
2024-06-04T15:10:04.486Z [pebble] Service "serve" stopped unexpectedly with code 0
2024-06-04T15:10:04.486Z [pebble] Service "serve" on-success action is "restart", waiting ~30s before restart (backoff 7)
2024-06-04T15:10:16.859Z [pebble] GET /v1/notices?after=2024-06-04T15%3A09%3A16.846321248Z&timeout=30s 30.000718606s 200
In Juju 3.5.1, the charm went to `error`, and we can see in the workload logs:
kubectl logs -nkubeflow kubeflow-dashboard-0 -c kubeflow-dashboard
2024-06-04T15:18:20.917Z [pebble] HTTP API server listening on ":38813".
2024-06-04T15:18:20.917Z [pebble] Started daemon.
2024-06-04T15:18:51.009Z [pebble] GET /v1/notices?timeout=30s 30.000488277s 200
2024-06-04T15:19:12.203Z [pebble] GET /v1/plan?format=yaml 644.109µs 200
2024-06-04T15:19:12.228Z [pebble] POST /v1/layers 785.985µs 200
2024-06-04T15:19:12.257Z [pebble] POST /v1/services 11.122061ms 202
2024-06-04T15:19:12.263Z [pebble] GET /v1/notices?timeout=30s 21.254288538s 200
2024-06-04T15:19:12.263Z [pebble] Service "kubeflow-dashboard" starting: npm start
2024-06-04T15:19:12.806Z [kubeflow-dashboard] npm ERR! code ENOENT
2024-06-04T15:19:12.808Z [kubeflow-dashboard] npm ERR! syscall open
2024-06-04T15:19:12.808Z [kubeflow-dashboard] npm ERR! path //package.json
2024-06-04T15:19:12.809Z [kubeflow-dashboard] npm ERR! errno -2
2024-06-04T15:19:12.813Z [kubeflow-dashboard] npm ERR! enoent ENOENT: no such file or directory, open '//package.json'
2024-06-04T15:19:12.814Z [kubeflow-dashboard] npm ERR! enoent This is related to npm not being able to find a file.
2024-06-04T15:19:12.814Z [kubeflow-dashboard] npm ERR! enoent
2024-06-04T15:19:12.820Z [kubeflow-dashboard]
2024-06-04T15:19:12.821Z [kubeflow-dashboard] npm ERR! A complete log of this run can be found in:
2024-06-04T15:19:12.821Z [kubeflow-dashboard] npm ERR! /var/lib/pebble/default/.npm/_logs/2024-06-04T15_19_12_546Z-debug-0.log
2024-06-04T15:19:12.837Z [pebble] Change 1 task (Start service "kubeflow-dashboard") failed: cannot start service: exited quickly with code 254
2024-06-04T15:19:12.850Z [pebble] GET /v1/notices?after=2024-06-04T15%3A19%3A12.257312723Z&timeout=30s 585.796055ms 200
2024-06-04T15:19:12.850Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 592.081916ms 200
2024-06-04T15:19:20.371Z [pebble] GET /v1/plan?format=yaml 488.842µs 200
2024-06-04T15:19:20.402Z [pebble] POST /v1/layers 264.427µs 200
2024-06-04T15:19:20.419Z [pebble] POST /v1/services 7.468515ms 202
2024-06-04T15:19:20.426Z [pebble] GET /v1/notices?after=2024-06-04T15%3A19%3A12.837810917Z&timeout=30s 7.57522781s 200
2024-06-04T15:19:20.426Z [pebble] Service "kubeflow-dashboard" starting: npm start
2024-06-04T15:19:20.999Z [kubeflow-dashboard] npm ERR! code ENOENT
2024-06-04T15:19:21.000Z [kubeflow-dashboard] npm ERR! syscall open
2024-06-04T15:19:21.000Z [kubeflow-dashboard] npm ERR! path //package.json
2024-06-04T15:19:21.001Z [kubeflow-dashboard] npm ERR! errno -2
2024-06-04T15:19:21.003Z [kubeflow-dashboard] npm ERR! enoent ENOENT: no such file or directory, open '//package.json'
2024-06-04T15:19:21.003Z [kubeflow-dashboard] npm ERR! enoent This is related to npm not being able to find a file.
2024-06-04T15:19:21.003Z [kubeflow-dashboard] npm ERR! enoent
2024-06-04T15:19:21.004Z [kubeflow-dashboard]
2024-06-04T15:19:21.005Z [kubeflow-dashboard] npm ERR! A complete log of this run can be found in:
2024-06-04T15:19:21.005Z [kubeflow-dashboard] npm ERR! /var/lib/pebble/default/.npm/_logs/2024-06-04T15_19_20_762Z-debug-0.log
2024-06-04T15:19:21.019Z [pebble] Change 2 task (Start service "kubeflow-dashboard") failed: cannot start service: exited quickly with code 254
2024-06-04T15:19:21.035Z [pebble] GET /v1/changes/2/wait?timeout=4.000s 614.686733ms 200
2024-06-04T15:19:21.035Z [pebble] GET /v1/notices?after=2024-06-04T15%3A19%3A20.419724889Z&timeout=30s 608.795658ms 200
2024-06-04T15:19:33.805Z [pebble] GET /v1/plan?format=yaml 1.599755ms 200
2024-06-04T15:19:33.831Z [pebble] POST /v1/layers 261.411µs 200
2024-06-04T15:19:33.849Z [pebble] POST /v1/services 6.548049ms 202
2024-06-04T15:19:33.859Z [pebble] GET /v1/notices?after=2024-06-04T15%3A19%3A21.019663273Z&timeout=30s 12.823734778s 200
It looks like in Juju 3.5.1 we are no longer seeing the service account token `Permission denied` error that we were seeing in Juju 3.5.0 (see the Juju bug that was fixed), BUT we are seeing a different error: `no such file or directory, open '//package.json'`. We need to investigate whether this is a different issue, and if so, file it separately.
On Juju 3.5.1, when refreshing the charm to use the upstream image instead, like:
juju refresh kubeflow-dashboard --resource oci-image=docker.io/kubeflownotebookswg/centraldashboard:v1.8.0
we no longer see any error in the workload logs. So we can safely proceed with fixing #194 by making sure the charm uses the resource specified in the metadata.yaml, i.e. the upstream image.
I have tested the charm (1.8/stable) with the rock (`charmedkubeflow/kubeflow-central-dashboard:1.8-5a3d6ea`) and I'm having the same issue as @NohaIhab reported in https://github.com/canonical/kubeflow-dashboard-operator/issues/188#issuecomment-2147934751:
2024-06-19T13:46:11.056Z [pebble] GET /v1/changes/4/wait?timeout=4.000s 426.897172ms 200
2024-06-19T13:46:17.949Z [pebble] GET /v1/notices?timeout=30s 30.001642379s 200
2024-06-19T13:46:47.952Z [pebble] GET /v1/notices?timeout=30s 30.001880459s 200
2024-06-19T13:46:54.960Z [pebble] GET /v1/plan?format=yaml 180.599µs 200
2024-06-19T13:46:54.981Z [pebble] POST /v1/layers 191.613µs 200
2024-06-19T13:46:54.997Z [pebble] POST /v1/services 10.006459ms 202
2024-06-19T13:46:55.006Z [pebble] Service "kubeflow-dashboard" starting: npm start
2024-06-19T13:46:55.370Z [kubeflow-dashboard] npm ERR! code ENOENT
2024-06-19T13:46:55.372Z [kubeflow-dashboard] npm ERR! syscall open
2024-06-19T13:46:55.372Z [kubeflow-dashboard] npm ERR! path //package.json
2024-06-19T13:46:55.372Z [kubeflow-dashboard] npm ERR! errno -2
2024-06-19T13:46:55.373Z [kubeflow-dashboard] npm ERR! enoent ENOENT: no such file or directory, open '//package.json'
2024-06-19T13:46:55.373Z [kubeflow-dashboard] npm ERR! enoent This is related to npm not being able to find a file.
2024-06-19T13:46:55.373Z [kubeflow-dashboard] npm ERR! enoent
2024-06-19T13:46:55.374Z [kubeflow-dashboard]
2024-06-19T13:46:55.374Z [kubeflow-dashboard] npm ERR! A complete log of this run can be found in:
2024-06-19T13:46:55.374Z [kubeflow-dashboard] npm ERR! /root/.npm/_logs/2024-06-19T13_46_55_207Z-debug-0.log
2024-06-19T13:46:55.408Z [pebble] GET /v1/changes/5/wait?timeout=4.000s 410.115382ms 200
2024-06-19T13:47:17.953Z [pebble] GET /v1/notices?timeout=30s 30.000991436s 200
This could potentially be an issue with the rock, as the `package.json` error is more related to `npm` than to juju.
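One hint supporting that: npm forms the path as `<cwd>/package.json`, so the doubled slash in `//package.json` suggests the service's working directory is `/` rather than the app directory. A tiny illustration of that inference (the exact path-joining is my assumption about npm's behaviour, not something confirmed in the rock):

```python
def npm_package_json_path(cwd: str) -> str:
    """Mimic forming '<cwd>/package.json' by plain concatenation (illustrative only)."""
    return f"{cwd}/package.json"

# A working directory of '/' reproduces the doubled slash seen in the npm error:
assert npm_package_json_path("/") == "//package.json"
# An app working directory would not:
assert npm_package_json_path("/app") == "/app/package.json"
```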
EDIT: I have filed https://github.com/canonical/kubeflow-rocks/issues/104, it seems like this is a rocks issue rather than a juju issue.
Bug Description
Note: This behavior is seen with `1.8/edge` and `latest/edge`, but not `1.8/stable`.

This issue is blocking the CI of PR https://github.com/canonical/dex-auth-operator/pull/187. This came up during the CI updates to Juju 3.5, specifically in the dex-auth integration CI, where `test_login` is timing out logging in to the dashboard. When trying to access the dashboard from the browser, it's unsuccessful with this error shown: the response status is `503 Service Unavailable` and the server is `istio-envoy`.

Observing the `istio-ingressgateway` workload logs: `delayed connection error 111` typically means the application port is not open.

Observing the `kubeflow-dashboard` workload logs: we can see that the npm server is not able to start, and the pebble service is constantly restarting with permission denied errors when accessing the files.

The kubeflow dashboard is down in this case, but the charm status still shows up as `active`. This is problematic because the charm's state is not representative.

To Reproduce
Environment
juju 3.5/stable
microk8s 1.25-strict/stable
Relevant Log Output
Additional Context
No response