canonical / kubeflow-dashboard-operator

Operator for Kubeflow Dashboard
Apache License 2.0

`kubeflow-dashboard` channels `1.8/edge` and `latest/edge` are not functional when deployed with Juju 3.5 #188

Open NohaIhab opened 4 months ago

NohaIhab commented 4 months ago

Bug Description

Note: This behavior is seen with 1.8/edge and latest/edge, but not 1.8/stable. This issue is blocking the CI of PR https://github.com/canonical/dex-auth-operator/pull/187. It came up during the CI updates to Juju 3.5, specifically in the dex-auth integration CI, where test_login is timing out when logging in to the dashboard. When trying to access the dashboard from the browser, it fails with this error shown:

upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure, transport failure reason: delayed connect error: 111

the response status is 503 Service Unavailable and the server is istio-envoy

Observing the istio-ingressgateway workload logs:

[2024-05-09T10:48:49.956Z] "GET /?next=%2F HTTP/1.1" 503 UF,URX upstream_reset_before_response_started{connection_failure,delayed_connect_error:_111} - "delayed_connect_error:_111" 0 168 64 - "172.31.18.195" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/124.0.6367.155 Safari/537.36" "60b6e7d2-4ee7-461f-aef4-9d510109ccbf" "10.64.140.43.nip.io" "10.1.199.227:8082" outbound|8082||kubeflow-dashboard.kubeflow.svc.cluster.local - 10.1.199.236:8080 172.31.18.195:3212 - -

A delayed connect error 111 typically means the application port is not open (111 is "connection refused").
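
As a quick check, this can be confirmed by probing the upstream address Envoy reports (10.1.199.227:8082 in the log line above; the address is specific to this environment), for example from a throwaway pod:

# "Connection refused" here corresponds to errno 111 in the Envoy log
kubectl run -n kubeflow curl-test --rm -it --restart=Never --image=curlimages/curl -- \
  curl -sv --max-time 5 http://10.1.199.227:8082/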

and observing the kubeflow-dashboard workload logs:

2024-05-09T09:42:41.298Z [pebble] HTTP API server listening on ":38813".
2024-05-09T09:42:41.299Z [pebble] Started daemon.
2024-05-09T09:43:07.129Z [pebble] GET /v1/plan?format=yaml 230.782µs 200
2024-05-09T09:43:07.148Z [pebble] POST /v1/layers 291.684µs 200
2024-05-09T09:43:07.163Z [pebble] POST /v1/services 7.638296ms 202
2024-05-09T09:43:07.170Z [pebble] GET /v1/notices?timeout=30s 25.7898025s 200
2024-05-09T09:43:07.170Z [pebble] Service "serve" starting: /usr/bin/npm start
2024-05-09T09:43:07.646Z [serve] 
2024-05-09T09:43:07.646Z [serve] > kubeflow-centraldashboard@0.0.2 start
2024-05-09T09:43:07.646Z [serve] > npm run serve
2024-05-09T09:43:07.646Z [serve] 
2024-05-09T09:43:08.112Z [serve] glob error [Error: EACCES: permission denied, scandir '/root/.npm/_logs'] {
2024-05-09T09:43:08.112Z [serve]   errno: -13,
2024-05-09T09:43:08.112Z [serve]   code: 'EACCES',
2024-05-09T09:43:08.112Z [serve]   syscall: 'scandir',
2024-05-09T09:43:08.112Z [serve]   path: '/root/.npm/_logs'
2024-05-09T09:43:08.112Z [serve] }
2024-05-09T09:43:08.114Z [serve] npm WARN logfile Error: EACCES: permission denied, scandir '/root/.npm/_logs'
2024-05-09T09:43:08.114Z [serve] npm WARN logfile  error cleaning log files [Error: EACCES: permission denied, scandir '/root/.npm/_logs'] {
2024-05-09T09:43:08.114Z [serve] npm WARN logfile   errno: -13,
2024-05-09T09:43:08.115Z [serve] npm WARN logfile   code: 'EACCES',
2024-05-09T09:43:08.115Z [serve] npm WARN logfile   syscall: 'scandir',
2024-05-09T09:43:08.115Z [serve] npm WARN logfile   path: '/root/.npm/_logs'
2024-05-09T09:43:08.115Z [serve] npm WARN logfile }
2024-05-09T09:43:08.118Z [serve] 
2024-05-09T09:43:08.118Z [serve] > kubeflow-centraldashboard@0.0.2 serve
2024-05-09T09:43:08.118Z [serve] > node dist/server.js
2024-05-09T09:43:08.118Z [serve] 
2024-05-09T09:43:08.177Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.012609377s 200
2024-05-09T09:43:08.177Z [pebble] GET /v1/notices?after=2024-05-09T09%3A43%3A07.164028187Z&timeout=30s 1.006409154s 200
2024-05-09T09:43:08.896Z [serve] Initializing Kubernetes configuration
2024-05-09T09:43:08.904Z [serve] [SEVERE] unhandledRejection Error: EACCES: permission denied, open '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-05-09T09:43:08.904Z [serve]     at Object.openSync (fs.js:462:3)
2024-05-09T09:43:08.904Z [serve]     at Object.readFileSync (fs.js:364:35)
2024-05-09T09:43:08.904Z [serve]     at KubeConfig.loadFromCluster (/app/node_modules/@kubernetes/client-node/dist/config.js:137:27)
2024-05-09T09:43:08.904Z [serve]     at KubeConfig.loadFromDefault (/app/node_modules/@kubernetes/client-node/dist/config.js:173:18)
2024-05-09T09:43:08.904Z [serve]     at new KubernetesService (/app/dist/k8s_service.js:45:25)
2024-05-09T09:43:08.904Z [serve]     at /app/dist/server.js:46:28
2024-05-09T09:43:08.904Z [serve]     at Generator.next (<anonymous>)
2024-05-09T09:43:08.904Z [serve]     at /app/dist/server.js:8:71
2024-05-09T09:43:08.904Z [serve]     at new Promise (<anonymous>)
2024-05-09T09:43:08.904Z [serve]     at __awaiter (/app/dist/server.js:4:12) {
2024-05-09T09:43:08.904Z [serve]   errno: -13,
2024-05-09T09:43:08.904Z [serve]   syscall: 'open',
2024-05-09T09:43:08.904Z [serve]   code: 'EACCES',
2024-05-09T09:43:08.904Z [serve]   path: '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-05-09T09:43:08.904Z [serve] }
2024-05-09T09:43:08.946Z [pebble] Service "serve" stopped unexpectedly with code 0
2024-05-09T09:43:08.946Z [pebble] Service "serve" on-success action is "restart", waiting ~500ms before restart (backoff 1)

We can see that the npm server is not able to start, and the pebble service is constantly restarting with permission-denied errors when accessing these files.

The Kubeflow dashboard is down in this case, but the charm status still shows up as active. This is problematic because the charm's status is not representative of the workload's actual state.
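
The mismatch can be seen by putting the unit status next to the workload container logs (names below are from this environment and may differ):

juju status kubeflow-dashboard                                                  # unit reports active/idle
kubectl logs -n kubeflow kubeflow-dashboard-0 -c kubeflow-dashboard --tail=20  # shows the restart loop above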

To Reproduce

  1. Deploy the following charms from latest/edge or corresponding 1.8/edge channels:
    dex-auth 
    istio-ingressgateway
    istio-pilot
    kubeflow-dashboard
    kubeflow-profiles
    oidc-gatekeeper
  2. Try to access the dashboard by visiting the public-url (a command-line check is sketched below)
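
For step 2, the same 503 can also be reproduced without a browser by sending a request to the Istio gateway; the host below is the public-url from this environment and will differ elsewhere:

# expect "503 Service Unavailable" with "server: istio-envoy" while the bug is present
curl -kv 'http://10.64.140.43.nip.io/?next=%2F'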

Environment

juju 3.5/stable
microk8s 1.25-strict/stable

Relevant Log Output

upstream connect error or disconnect/reset before headers. retried and the latest reset reason: connection failure, transport failure reason: delayed connect error: 111

Additional Context

No response

syncronize-issues-to-jira[bot] commented 4 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5651.

This message was autogenerated

NohaIhab commented 4 months ago

To gather more information, I've tested:

This concludes that the 1.8/edge charm has a difference from 1.8/stable that makes it break with Juju 3.5 specifically.

It's worth noting that we are using the rock in 1.8/edge, but not in 1.8/stable.

The Juju 3.5 release notes include the change https://github.com/juju/juju/pull/17070; I'm curious whether this is related to the errors we're seeing, given the rock we are using.

NohaIhab commented 4 months ago

Comparing the pod definitions when deploying with 3.4 and 3.5, this security context is additional in 3.5:

  securityContext:
    fsGroup: 170
    supplementalGroups:
    - 170
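
A sketch of how to see this diff yourself (pod and namespace names are from this environment): dump the rendered pod spec from a model deployed with each Juju version and compare:

kubectl get pod -n kubeflow kubeflow-dashboard-0 -o yaml > pod-3.4.yaml   # run against the Juju 3.4 model
kubectl get pod -n kubeflow kubeflow-dashboard-0 -o yaml > pod-3.5.yaml   # run against the Juju 3.5 model
diff pod-3.4.yaml pod-3.5.yaml                                            # the extra securityContext shows up here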
ca-scribner commented 4 months ago

I can reproduce with:

But in my case, the kubeflow-dashboard charm goes to Error because ops's Pebble client raises an exception during charm execution:

2024-05-09T15:18:02.818Z [container-agent] 2024-05-09 15:18:02 INFO juju-log Pebble plan updated with new configuration, replaning
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Error in sys.excepthook:
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Traceback (most recent call last):
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/usr/lib/python3.8/logging/__init__.py", line 954, in handle
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     self.emit(record)
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/log.py", line 41, in emit
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     self.model_backend.juju_log(record.levelname, self.format(record))
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/usr/lib/python3.8/logging/__init__.py", line 929, in format
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     return fmt.format(record)
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/usr/lib/python3.8/logging/__init__.py", line 676, in format
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     record.exc_text = self.formatException(record.exc_info)
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/usr/lib/python3.8/logging/__init__.py", line 626, in formatException
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     traceback.print_exception(ei[0], ei[1], tb, None, sio)
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/usr/lib/python3.8/traceback.py", line 103, in print_exception
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     for line in TracebackException(
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/usr/lib/python3.8/traceback.py", line 617, in format
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     yield from self.format_exception_only()
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/usr/lib/python3.8/traceback.py", line 566, in format_exception_only
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     stype = smod + '.' + stype
2024-05-09T15:18:03.848Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed TypeError: unsupported operand type(s) for +: 'NoneType' and 'str'
2024-05-09T15:18:03.849Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed 
2024-05-09T15:18:03.849Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Original exception was:
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Traceback (most recent call last):
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "./src/charm.py", line 212, in _update_layer
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     self.container.replan()
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/model.py", line 1915, in replan
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     self._pebble.replan_services()
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/pebble.py", line 1680, in replan_services
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     return self._services_action('replan', [], timeout, delay)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/pebble.py", line 1761, in _services_action
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     raise ChangeError(change.err, change)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed ops.pebble.ChangeError: cannot perform the following tasks:
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed - Start service "serve" (cannot start service: exited quickly with code 0)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed ----- Logs from task 0 -----
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed 2024-05-09T15:18:03Z INFO Most recent service output:
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     (...)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     > kubeflow-centraldashboard@0.0.2 serve
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     > node dist/server.js
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     Initializing Kubernetes configuration
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     [SEVERE] unhandledRejection Error: EACCES: permission denied, open '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at Object.openSync (fs.js:462:3)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at Object.readFileSync (fs.js:364:35)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at KubeConfig.loadFromCluster (/app/node_modules/@kubernetes/client-node/dist/config.js:137:27)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at KubeConfig.loadFromDefault (/app/node_modules/@kubernetes/client-node/dist/config.js:173:18)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at new KubernetesService (/app/dist/k8s_service.js:45:25)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at /app/dist/server.js:46:28
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at Generator.next (<anonymous>)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at /app/dist/server.js:8:71
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at new Promise (<anonymous>)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed         at __awaiter (/app/dist/server.js:4:12) {
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed       errno: -13,
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed       syscall: 'open',
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed       code: 'EACCES',
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed       path: '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     }
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed 2024-05-09T15:18:03Z ERROR cannot start service: exited quickly with code 0
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed -----
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed 
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed The above exception was the direct cause of the following exception:
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed 
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed Traceback (most recent call last):
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "./src/charm.py", line 304, in <module>
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     main(KubeflowDashboardOperator)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/main.py", line 441, in main
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     _emit_charm_event(charm, dispatcher.event_name)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/main.py", line 149, in _emit_charm_event
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     event_to_emit.emit(*args, **kwargs)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/framework.py", line 344, in emit
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     framework._emit(event)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/framework.py", line 841, in _emit
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     self._reemit(event_path)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "/var/lib/juju/agents/unit-kubeflow-dashboard-0/charm/venv/ops/framework.py", line 930, in _reemit
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     custom_handler(event)
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "./src/charm.py", line 284, in main
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     self._update_layer()
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed   File "./src/charm.py", line 214, in _update_layer
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed     raise GenericCharmRuntimeError("Failed to replan") from e
2024-05-09T15:18:03.850Z [container-agent] 2024-05-09 15:18:03 WARNING config-changed <unknown>GenericCharmRuntimeError: Failed to replan
ca-scribner commented 4 months ago

I can also reproduce this behaviour directly using Kubernetes and the rock charmedkubeflow/kubeflow-central-dashboard (source rockcraft.yaml) by:

rockcraft pack
skopeo --insecure-policy copy oci-archive:kubeflow-central-dashboard_1.8_amd64.rock docker-daemon:centraldashboard:rock
docker save centraldashboard:rock > tempimage
microk8s ctr image import tempimage
rm tempimage
kubectl apply -f broken-dashboard-rock.yaml

where broken-dashboard-rock.yaml is:

apiVersion: v1
kind: Pod
metadata:
  name: kubeflow-dashboard-with-security-context
spec:
  automountServiceAccountToken: true
  securityContext:
    fsGroup: 170
    supplementalGroups:
    - 170
  volumes:
  - name: sec-ctx-vol
    emptyDir: {}
  containers:
  - name: kubeflow-dashboard
    image: centraldashboard:rock
    volumeMounts:
    - name: sec-ctx-vol
      mountPath: /data/demo
    securityContext:
      runAsGroup: 0
      runAsUser: 0
    volumeMounts:
    - mountPath: /charm/bin/pebble
      name: charm-data
      readOnly: true
      subPath: charm/bin/pebble
    - mountPath: /charm/container
      name: charm-data
      subPath: charm/containers/kubeflow-dashboard
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-b4hms
      readOnly: true
  volumes:
  - emptyDir: {}
    name: charm-data
  - name: kube-api-access-b4hms
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace

(note that I'm reusing the ServiceAccount that juju created above for kubeflow-dashboard)

This raises the same errors as @NohaIhab mentioned above. If we remove the securityContexts, the app works as normal.

Looking at the serviceaccount/token that npm reports it cannot access, we see that it belongs to group 170:

kubectl exec kubeflow-dashboard-with-security-context -- ls -l /var/run/secrets/kubernetes.io/serviceaccount/token
lrwxrwxrwx 1 root 170 12 May  9 17:51 /var/run/secrets/kubernetes.io/serviceaccount/token -> ..data/token

which makes sense, because setting Pod.spec.securityContext.fsGroup: 170 means all volume mounts are owned by group 170.

ca-scribner commented 4 months ago

Interestingly, if we do the same example as above but use upstream's image (kubeflownotebookswg/centraldashboard:v1.8.0), the dashboard starts successfully. Summary of some attempts:

So for the above cases, whether the upstream image is executed as root or node, both appear to have group 170 added to their existing groups. Strangely, typing `groups node` or `groups root` in those sessions does not show 170 in the list, making me think Kubernetes somehow adds that group to the existing user on the fly.

Interestingly, if I kubectl exec (which gives me the root user) or runc exec -u 584792 (which gives me the rockcraft _daemon_ user) and then type groups, I see 170 in my current groups and I can view the serviceaccount token. So somehow in these ad-hoc sessions the users are granted group 170, just like above for the upstream docker images.

A theory of what is going on ...

My guess at what is happening is that we're hitting trouble because rocks run their services via Pebble. The Pebble entrypoint is likely getting the added group 170, like we see with the docker images, but Pebble does not pass this group down to the processes it executes. So when Pebble runs our service as the _daemon_ user, the service does not have group 170 and thus cannot access the service account token.
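
One way to check this part of the theory (a sketch, assuming sh and grep are available in the image and that the service happens to be running at that moment) is to compare the supplementary groups of the Pebble entrypoint (PID 1) with those of the node process it spawned:

# supplementary groups of the Pebble entrypoint
kubectl exec kubeflow-dashboard-with-security-context -- grep Groups /proc/1/status
# supplementary groups of the node server started by Pebble
kubectl exec kubeflow-dashboard-with-security-context -- sh -c \
  'for p in /proc/[0-9]*; do grep -q server.js "$p/cmdline" 2>/dev/null && grep Groups "$p/status"; done'

If the theory holds, 170 would appear in the first output but not the second.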

Next steps

ca-scribner commented 4 months ago

Tried deploying the rock pod with the following changes (full yaml given below):

pod.yaml:

apiVersion: v1
kind: Pod
metadata:
  name: kubeflow-dashboard-with-security-context-as-584792
spec:
  automountServiceAccountToken: true
  securityContext:
    fsGroup: 584792
    supplementalGroups:
    - 584792
  serviceAccount: kubeflow-dashboard
  serviceAccountName: kubeflow-dashboard
  volumes:
  - name: sec-ctx-vol
    emptyDir: {}
  containers:
  - name: kubeflow-dashboard
    image: kubeflow-central-dashboard:1.8
    volumeMounts:
    - name: sec-ctx-vol
      mountPath: /data/demo
    securityContext:
      runAsGroup: 584792
      runAsUser: 584792
    volumeMounts:
    - mountPath: /charm/bin/pebble
      name: charm-data
      readOnly: true
      subPath: charm/bin/pebble
    - mountPath: /charm/container
      name: charm-data
      subPath: charm/containers/kubeflow-dashboard
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-b4hms
      readOnly: true
  volumes:
  - emptyDir: {}
    name: charm-data
  - name: kube-api-access-b4hms
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace

So I think this reinforces the theory above that Kubernetes natively grants the fsGroup to the entrypoint's process, but Pebble does not do the same for its children.

DnPlas commented 4 months ago

Based on @NohaIhab's suggestion to use the upstream image instead of the rock, I have pushed two temporary patches (#191 and #192) to help alleviate the issue and unblock other workflows.

Also, based on some comments from the Juju team, it seems the issue is on their side, so there is not much more we can do in the charm or the rock.

DnPlas commented 4 months ago

FYI, this issue has been reported in Juju; let's keep an eye on https://bugs.launchpad.net/juju/+bug/2066517.

NohaIhab commented 4 months ago

When deploying the dashboard charm from latest/edge now (this deploys the rock due to #194), we are seeing the following. In Juju 3.5.0, the charm is active but not functioning correctly, and we can see in the workload logs:

2024-06-04T15:09:52.430Z [pebble] Service "kubeflow-dashboard" starting: npm start
2024-06-04T15:09:52.912Z [kubeflow-dashboard] npm ERR! code ENOENT
2024-06-04T15:09:52.913Z [kubeflow-dashboard] npm ERR! syscall open
2024-06-04T15:09:52.913Z [kubeflow-dashboard] npm ERR! path //package.json
2024-06-04T15:09:52.913Z [kubeflow-dashboard] npm ERR! errno -2
2024-06-04T15:09:52.915Z [kubeflow-dashboard] npm ERR! enoent ENOENT: no such file or directory, open '//package.json'
2024-06-04T15:09:52.915Z [kubeflow-dashboard] npm ERR! enoent This is related to npm not being able to find a file.
2024-06-04T15:09:52.915Z [kubeflow-dashboard] npm ERR! enoent 
2024-06-04T15:09:52.917Z [kubeflow-dashboard] 
2024-06-04T15:09:52.918Z [kubeflow-dashboard] npm ERR! A complete log of this run can be found in:
2024-06-04T15:09:52.918Z [kubeflow-dashboard] npm ERR!     /root/.npm/_logs/2024-06-04T15_09_52_732Z-debug-0.log
2024-06-04T15:09:52.928Z [pebble] Service "kubeflow-dashboard" stopped unexpectedly with code 254
2024-06-04T15:09:52.928Z [pebble] Service "kubeflow-dashboard" on-failure action is "restart", waiting ~30s before restart (backoff 7)
2024-06-04T15:10:01.913Z [pebble] Service "serve" starting: /usr/bin/npm start
2024-06-04T15:10:02.645Z [serve] 
2024-06-04T15:10:02.646Z [serve] > kubeflow-centraldashboard@0.0.2 start
2024-06-04T15:10:02.646Z [serve] > npm run serve
2024-06-04T15:10:02.646Z [serve] 
2024-06-04T15:10:03.393Z [serve] glob error [Error: EACCES: permission denied, scandir '/root/.npm/_logs'] {
2024-06-04T15:10:03.393Z [serve]   errno: -13,
2024-06-04T15:10:03.393Z [serve]   code: 'EACCES',
2024-06-04T15:10:03.393Z [serve]   syscall: 'scandir',
2024-06-04T15:10:03.393Z [serve]   path: '/root/.npm/_logs'
2024-06-04T15:10:03.393Z [serve] }
2024-06-04T15:10:03.396Z [serve] npm WARN logfile Error: EACCES: permission denied, scandir '/root/.npm/_logs'
2024-06-04T15:10:03.397Z [serve] npm WARN logfile  error cleaning log files [Error: EACCES: permission denied, scandir '/root/.npm/_logs'] {
2024-06-04T15:10:03.397Z [serve] npm WARN logfile   errno: -13,
2024-06-04T15:10:03.397Z [serve] npm WARN logfile   code: 'EACCES',
2024-06-04T15:10:03.397Z [serve] npm WARN logfile   syscall: 'scandir',
2024-06-04T15:10:03.397Z [serve] npm WARN logfile   path: '/root/.npm/_logs'
2024-06-04T15:10:03.397Z [serve] npm WARN logfile }
2024-06-04T15:10:03.402Z [serve] 
2024-06-04T15:10:03.402Z [serve] > kubeflow-centraldashboard@0.0.2 serve
2024-06-04T15:10:03.402Z [serve] > node dist/server.js
2024-06-04T15:10:03.402Z [serve] 
2024-06-04T15:10:04.426Z [serve] Initializing Kubernetes configuration
2024-06-04T15:10:04.432Z [serve] [SEVERE] unhandledRejection Error: EACCES: permission denied, open '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-06-04T15:10:04.432Z [serve]     at Object.openSync (fs.js:462:3)
2024-06-04T15:10:04.432Z [serve]     at Object.readFileSync (fs.js:364:35)
2024-06-04T15:10:04.432Z [serve]     at KubeConfig.loadFromCluster (/app/node_modules/@kubernetes/client-node/dist/config.js:137:27)
2024-06-04T15:10:04.432Z [serve]     at KubeConfig.loadFromDefault (/app/node_modules/@kubernetes/client-node/dist/config.js:173:18)
2024-06-04T15:10:04.432Z [serve]     at new KubernetesService (/app/dist/k8s_service.js:45:25)
2024-06-04T15:10:04.432Z [serve]     at /app/dist/server.js:46:28
2024-06-04T15:10:04.432Z [serve]     at Generator.next (<anonymous>)
2024-06-04T15:10:04.432Z [serve]     at /app/dist/server.js:8:71
2024-06-04T15:10:04.432Z [serve]     at new Promise (<anonymous>)
2024-06-04T15:10:04.432Z [serve]     at __awaiter (/app/dist/server.js:4:12) {
2024-06-04T15:10:04.432Z [serve]   errno: -13,
2024-06-04T15:10:04.432Z [serve]   syscall: 'open',
2024-06-04T15:10:04.432Z [serve]   code: 'EACCES',
2024-06-04T15:10:04.432Z [serve]   path: '/var/run/secrets/kubernetes.io/serviceaccount/token'
2024-06-04T15:10:04.432Z [serve] }
2024-06-04T15:10:04.486Z [pebble] Service "serve" stopped unexpectedly with code 0
2024-06-04T15:10:04.486Z [pebble] Service "serve" on-success action is "restart", waiting ~30s before restart (backoff 7)
2024-06-04T15:10:16.859Z [pebble] GET /v1/notices?after=2024-06-04T15%3A09%3A16.846321248Z&timeout=30s 30.000718606s 200

In Juju 3.5.1, the charm went to error, and we can see in the workload logs:

kubectl logs -nkubeflow kubeflow-dashboard-0 -c kubeflow-dashboard
2024-06-04T15:18:20.917Z [pebble] HTTP API server listening on ":38813".
2024-06-04T15:18:20.917Z [pebble] Started daemon.
2024-06-04T15:18:51.009Z [pebble] GET /v1/notices?timeout=30s 30.000488277s 200
2024-06-04T15:19:12.203Z [pebble] GET /v1/plan?format=yaml 644.109µs 200
2024-06-04T15:19:12.228Z [pebble] POST /v1/layers 785.985µs 200
2024-06-04T15:19:12.257Z [pebble] POST /v1/services 11.122061ms 202
2024-06-04T15:19:12.263Z [pebble] GET /v1/notices?timeout=30s 21.254288538s 200
2024-06-04T15:19:12.263Z [pebble] Service "kubeflow-dashboard" starting: npm start
2024-06-04T15:19:12.806Z [kubeflow-dashboard] npm ERR! code ENOENT
2024-06-04T15:19:12.808Z [kubeflow-dashboard] npm ERR! syscall open
2024-06-04T15:19:12.808Z [kubeflow-dashboard] npm ERR! path //package.json
2024-06-04T15:19:12.809Z [kubeflow-dashboard] npm ERR! errno -2
2024-06-04T15:19:12.813Z [kubeflow-dashboard] npm ERR! enoent ENOENT: no such file or directory, open '//package.json'
2024-06-04T15:19:12.814Z [kubeflow-dashboard] npm ERR! enoent This is related to npm not being able to find a file.
2024-06-04T15:19:12.814Z [kubeflow-dashboard] npm ERR! enoent 
2024-06-04T15:19:12.820Z [kubeflow-dashboard] 
2024-06-04T15:19:12.821Z [kubeflow-dashboard] npm ERR! A complete log of this run can be found in:
2024-06-04T15:19:12.821Z [kubeflow-dashboard] npm ERR!     /var/lib/pebble/default/.npm/_logs/2024-06-04T15_19_12_546Z-debug-0.log
2024-06-04T15:19:12.837Z [pebble] Change 1 task (Start service "kubeflow-dashboard") failed: cannot start service: exited quickly with code 254
2024-06-04T15:19:12.850Z [pebble] GET /v1/notices?after=2024-06-04T15%3A19%3A12.257312723Z&timeout=30s 585.796055ms 200
2024-06-04T15:19:12.850Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 592.081916ms 200
2024-06-04T15:19:20.371Z [pebble] GET /v1/plan?format=yaml 488.842µs 200
2024-06-04T15:19:20.402Z [pebble] POST /v1/layers 264.427µs 200
2024-06-04T15:19:20.419Z [pebble] POST /v1/services 7.468515ms 202
2024-06-04T15:19:20.426Z [pebble] GET /v1/notices?after=2024-06-04T15%3A19%3A12.837810917Z&timeout=30s 7.57522781s 200
2024-06-04T15:19:20.426Z [pebble] Service "kubeflow-dashboard" starting: npm start
2024-06-04T15:19:20.999Z [kubeflow-dashboard] npm ERR! code ENOENT
2024-06-04T15:19:21.000Z [kubeflow-dashboard] npm ERR! syscall open
2024-06-04T15:19:21.000Z [kubeflow-dashboard] npm ERR! path //package.json
2024-06-04T15:19:21.001Z [kubeflow-dashboard] npm ERR! errno -2
2024-06-04T15:19:21.003Z [kubeflow-dashboard] npm ERR! enoent ENOENT: no such file or directory, open '//package.json'
2024-06-04T15:19:21.003Z [kubeflow-dashboard] npm ERR! enoent This is related to npm not being able to find a file.
2024-06-04T15:19:21.003Z [kubeflow-dashboard] npm ERR! enoent 
2024-06-04T15:19:21.004Z [kubeflow-dashboard] 
2024-06-04T15:19:21.005Z [kubeflow-dashboard] npm ERR! A complete log of this run can be found in:
2024-06-04T15:19:21.005Z [kubeflow-dashboard] npm ERR!     /var/lib/pebble/default/.npm/_logs/2024-06-04T15_19_20_762Z-debug-0.log
2024-06-04T15:19:21.019Z [pebble] Change 2 task (Start service "kubeflow-dashboard") failed: cannot start service: exited quickly with code 254
2024-06-04T15:19:21.035Z [pebble] GET /v1/changes/2/wait?timeout=4.000s 614.686733ms 200
2024-06-04T15:19:21.035Z [pebble] GET /v1/notices?after=2024-06-04T15%3A19%3A20.419724889Z&timeout=30s 608.795658ms 200
2024-06-04T15:19:33.805Z [pebble] GET /v1/plan?format=yaml 1.599755ms 200
2024-06-04T15:19:33.831Z [pebble] POST /v1/layers 261.411µs 200
2024-06-04T15:19:33.849Z [pebble] POST /v1/services 6.548049ms 202
2024-06-04T15:19:33.859Z [pebble] GET /v1/notices?after=2024-06-04T15%3A19%3A21.019663273Z&timeout=30s 12.823734778s 200

It looks like in Juju 3.5.1 we are no longer seeing the service account token permission-denied error that we were seeing in Juju 3.5.0 (see the Juju bug that was fixed), BUT we are seeing a different error: no such file or directory, open '//package.json'. We need to investigate whether this is a different issue and, if so, file it separately.

On Juju 3.5.1, when refreshing the charm to use the upstream image instead like:

juju refresh kubeflow-dashboard --resource oci-image=docker.io/kubeflownotebookswg/centraldashboard:v1.8.0

we no longer see any errors in the workload logs. So we can safely proceed with fixing #194 by making sure the charm uses the resource specified in metadata.yaml, i.e. the upstream image.

DnPlas commented 3 months ago

I have tested the charm (1.8/stable) with the rock (charmedkubeflow/kubeflow-central-dashboard:1.8-5a3d6ea) and I'm having the same issue as @NohaIhab reported in https://github.com/canonical/kubeflow-dashboard-operator/issues/188#issuecomment-2147934751:

2024-06-19T13:46:11.056Z [pebble] GET /v1/changes/4/wait?timeout=4.000s 426.897172ms 200
2024-06-19T13:46:17.949Z [pebble] GET /v1/notices?timeout=30s 30.001642379s 200
2024-06-19T13:46:47.952Z [pebble] GET /v1/notices?timeout=30s 30.001880459s 200
2024-06-19T13:46:54.960Z [pebble] GET /v1/plan?format=yaml 180.599µs 200
2024-06-19T13:46:54.981Z [pebble] POST /v1/layers 191.613µs 200
2024-06-19T13:46:54.997Z [pebble] POST /v1/services 10.006459ms 202
2024-06-19T13:46:55.006Z [pebble] Service "kubeflow-dashboard" starting: npm start
2024-06-19T13:46:55.370Z [kubeflow-dashboard] npm ERR! code ENOENT
2024-06-19T13:46:55.372Z [kubeflow-dashboard] npm ERR! syscall open
2024-06-19T13:46:55.372Z [kubeflow-dashboard] npm ERR! path //package.json
2024-06-19T13:46:55.372Z [kubeflow-dashboard] npm ERR! errno -2
2024-06-19T13:46:55.373Z [kubeflow-dashboard] npm ERR! enoent ENOENT: no such file or directory, open '//package.json'
2024-06-19T13:46:55.373Z [kubeflow-dashboard] npm ERR! enoent This is related to npm not being able to find a file.
2024-06-19T13:46:55.373Z [kubeflow-dashboard] npm ERR! enoent
2024-06-19T13:46:55.374Z [kubeflow-dashboard]
2024-06-19T13:46:55.374Z [kubeflow-dashboard] npm ERR! A complete log of this run can be found in:
2024-06-19T13:46:55.374Z [kubeflow-dashboard] npm ERR!     /root/.npm/_logs/2024-06-19T13_46_55_207Z-debug-0.log
2024-06-19T13:46:55.408Z [pebble] GET /v1/changes/5/wait?timeout=4.000s 410.115382ms 200
2024-06-19T13:47:17.953Z [pebble] GET /v1/notices?timeout=30s 30.000991436s 200

This could potentially be an issue with the rock, as the package.json error is related to how npm is invoked rather than to Juju.
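
A hedged way to narrow this down (the Pebble socket path for Juju workload containers is assumed here to be /charm/container/pebble.socket and may differ) is to inspect the plan Pebble is actually running, since '//package.json' suggests npm start may be executed without a working directory pointing at the app sources:

juju ssh --container kubeflow-dashboard kubeflow-dashboard/0 \
  PEBBLE_SOCKET=/charm/container/pebble.socket /charm/bin/pebble plan
# check the command and working directory of the "kubeflow-dashboard" service in the output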

EDIT: I have filed https://github.com/canonical/kubeflow-rocks/issues/104; it seems this is a rocks issue rather than a Juju issue.