canonical / kubeflow-dashboard-operator

Operator for Kubeflow Dashboard
Apache License 2.0

upstream dashboard image does not have permission to write logs #232

Open orfeas-k opened 2 months ago

orfeas-k commented 2 months ago

Bug Description

Following up from #222, we noticed that when the charm is run with the upstream image, the service exits because npm run serve does not have the required permissions. As a result, it cannot write to /root/.npm (its logs in /root/.npm/_logs and its cache in /root/.npm/_cacache) and exits with error code 243.

As explained in https://github.com/canonical/kubeflow-dashboard-operator/issues/222#issuecomment-2277782508, the Pebble service command, npm start, changes which user runs npm run serve. This is due to native npm behaviour, as described in the npm docs:

When npm is run as root, scripts are always run with the effective uid and gid of the working directory owner.
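To check which account npm will switch to inside the dashboard container, one can look at who owns the working directory. This is a minimal sketch, assuming GNU coreutils stat and that /usr/src/app (the path in the stack trace in the logs below) is the service's working directory; the function name npm_script_owner is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the uid:gid that npm (v7+) uses for package
# scripts when npm itself runs as root, i.e. the owner of the working directory.
# Assumes GNU coreutils `stat`; /usr/src/app is taken from the stack trace.
npm_script_owner() {
  local workdir="${1:-/usr/src/app}"
  stat -c '%u:%g' "$workdir"
}

# Example, inside the workload container:
#   npm_script_owner /usr/src/app
# An output of 1000:1000 would be consistent with the npm error hint in the logs.
```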

A proper fix would be to ensure that user node (or 1000:1000, as the logs below suggest) has read/write permissions on /root/.npm, covering both /root/.npm/_logs and /root/.npm/_cacache, before the service starts.
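The ownership change recommended by npm's own error message can be sketched as a preparation step run as root before the service starts. The function name prepare_npm_dir and the idea of running it as a pre-start step are assumptions, not the charm's current code. Also note that uid 1000 needs execute (traverse) permission on /root itself to reach /root/.npm; if the image's /root is not traversable by that user, changing ownership of /root/.npm alone will not be enough.

```shell
#!/usr/bin/env bash
# Hypothetical pre-start step (run as root): make npm's log and cache
# directories writable by the account npm drops to (1000:1000 per the logs).
prepare_npm_dir() {
  local npm_dir="${1:-/root/.npm}"
  local owner="${2:-1000:1000}"
  # Create the directories npm failed on: _logs (scandir) and _cacache/tmp (mkdir)
  mkdir -p "$npm_dir/_logs" "$npm_dir/_cacache/tmp"
  # Same ownership change that npm's error message recommends
  chown -R "$owner" "$npm_dir"
}

# Example, as root in the workload container:
#   prepare_npm_dir /root/.npm 1000:1000
```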

Workaround

A workaround is to change the Pebble service command to npm run serve. This avoids npm running another npm script (i.e. npm start running npm run serve), so npm run serve, started directly by Pebble, runs as root and has all the permissions it needs.
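The workaround can be sketched as a Pebble layer that overrides only the service command. The service name kubeflow-dashboard and the /sbin/tini prefix come from the logs below; using an override: merge entry (so the rest of the existing service definition is kept), the layer label, and applying it through the pebble CLI are assumptions. In the charm itself, this change would normally be made in the layer the charm already defines.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a Pebble layer that replaces only the command of
# the kubeflow-dashboard service, so Pebble starts `npm run serve` directly
# instead of `npm start` (which spawns a nested npm process).
dashboard_layer() {
  cat <<'EOF'
summary: kubeflow-dashboard workaround layer
services:
  kubeflow-dashboard:
    override: merge
    command: /sbin/tini -- npm run serve
EOF
}

# Example, assuming a pebble binary on PATH that talks to the workload's Pebble:
#   dashboard_layer > /tmp/dashboard-layer.yaml
#   pebble add dashboard-workaround /tmp/dashboard-layer.yaml
#   pebble replan
```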

To Reproduce

juju deploy kubeflow-dashboard --channel latest/edge --trust --resource oci-image=docker.io/kubeflownotebookswg/centraldashboard:v1.9.0
juju deploy kubeflow-profiles --channel latest/edge --trust
juju relate kubeflow-dashboard kubeflow-profiles

Environment

MicroK8s 1.29, Juju 3.4.4

Relevant Log Output

$ kl -n kubeflow kubeflow-dashboard-0 -c kubeflow-dashboard -f
2024-07-25T01:46:56.015Z [pebble] HTTP API server listening on ":38813".
2024-07-25T01:46:56.015Z [pebble] Started daemon.
2024-07-25T01:47:26.190Z [pebble] GET /v1/notices?timeout=30s 30.003339425s 200
2024-07-25T01:47:26.791Z [pebble] GET /v1/plan?format=yaml 186.506µs 200
2024-07-25T01:47:26.812Z [pebble] POST /v1/layers 415.857µs 200
2024-07-25T01:47:26.836Z [pebble] POST /v1/services 13.34502ms 202
2024-07-25T01:47:26.841Z [pebble] Service "kubeflow-dashboard" starting: /sbin/tini -- npm start
2024-07-25T01:47:26.842Z [kubeflow-dashboard] [WARN  tini (109)] Tini is not running as PID 1 and isn't registered as a child subreaper.
2024-07-25T01:47:26.842Z [kubeflow-dashboard] Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
2024-07-25T01:47:26.842Z [kubeflow-dashboard] To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.
2024-07-25T01:47:27.533Z [kubeflow-dashboard] 
2024-07-25T01:47:27.533Z [kubeflow-dashboard] > kubeflow-centraldashboard@0.0.2 start
2024-07-25T01:47:27.533Z [kubeflow-dashboard] > npm run serve
2024-07-25T01:47:27.533Z [kubeflow-dashboard] 
2024-07-25T01:47:27.848Z [pebble] GET /v1/changes/1/wait?timeout=4.000s 1.010314497s 200
2024-07-25T01:47:28.055Z [kubeflow-dashboard] npm WARN logfile Error: EACCES: permission denied, scandir '/root/.npm/_logs'
2024-07-25T01:47:28.055Z [kubeflow-dashboard] npm WARN logfile  error cleaning log files [Error: EACCES: permission denied, scandir '/root/.npm/_logs'] {
2024-07-25T01:47:28.055Z [kubeflow-dashboard] npm WARN logfile   errno: -13,
2024-07-25T01:47:28.056Z [kubeflow-dashboard] npm WARN logfile   code: 'EACCES',
2024-07-25T01:47:28.056Z [kubeflow-dashboard] npm WARN logfile   syscall: 'scandir',
2024-07-25T01:47:28.056Z [kubeflow-dashboard] npm WARN logfile   path: '/root/.npm/_logs'
2024-07-25T01:47:28.056Z [kubeflow-dashboard] npm WARN logfile }
2024-07-25T01:47:28.101Z [kubeflow-dashboard] 
2024-07-25T01:47:28.101Z [kubeflow-dashboard] > kubeflow-centraldashboard@0.0.2 serve
2024-07-25T01:47:28.101Z [kubeflow-dashboard] > node dist/server.js
2024-07-25T01:47:28.101Z [kubeflow-dashboard] 
2024-07-25T01:47:28.167Z [kubeflow-dashboard] npm ERR! code EACCES
2024-07-25T01:47:28.167Z [kubeflow-dashboard] npm ERR! syscall mkdir
2024-07-25T01:47:28.167Z [kubeflow-dashboard] npm ERR! path /root/.npm/_cacache/tmp
2024-07-25T01:47:28.167Z [kubeflow-dashboard] npm ERR! errno -13
2024-07-25T01:47:28.170Z [kubeflow-dashboard] npm ERR! 
2024-07-25T01:47:28.170Z [kubeflow-dashboard] npm ERR! Your cache folder contains root-owned files, due to a bug in
2024-07-25T01:47:28.170Z [kubeflow-dashboard] npm ERR! previous versions of npm which has since been addressed.
2024-07-25T01:47:28.170Z [kubeflow-dashboard] npm ERR! 
2024-07-25T01:47:28.170Z [kubeflow-dashboard] npm ERR! To permanently fix this problem, please run:
2024-07-25T01:47:28.170Z [kubeflow-dashboard] npm ERR!   sudo chown -R 1000:1000 "/root/.npm"
2024-07-25T01:47:28.170Z [kubeflow-dashboard] 
2024-07-25T01:47:28.170Z [kubeflow-dashboard] npm ERR! Log files were not written due to an error writing to the directory: /root/.npm/_logs
2024-07-25T01:47:28.170Z [kubeflow-dashboard] npm ERR! You can rerun the command with `--loglevel=verbose` to see the logs in your terminal
2024-07-25T01:47:28.179Z [kubeflow-dashboard] npm notice 
2024-07-25T01:47:28.179Z [kubeflow-dashboard] npm notice New major version of npm available! 8.19.4 -> 10.8.2
2024-07-25T01:47:28.179Z [kubeflow-dashboard] npm notice Changelog: <https://github.com/npm/cli/releases/tag/v10.8.2>
2024-07-25T01:47:28.179Z [kubeflow-dashboard] npm notice Run `npm install -g npm@10.8.2` to update!
2024-07-25T01:47:28.179Z [kubeflow-dashboard] npm notice 
2024-07-25T01:47:29.434Z [kubeflow-dashboard] Initializing Kubernetes configuration
2024-07-25T01:47:29.533Z [kubeflow-dashboard] Unable to fetch Application information: 404 page not found
2024-07-25T01:47:29.533Z [kubeflow-dashboard] 
2024-07-25T01:47:29.537Z [kubeflow-dashboard] "other" is not a supported platform for Metrics
2024-07-25T01:47:29.537Z [kubeflow-dashboard] Using Profiles service at http://kubeflow-profiles.kubeflow:8081/kfam
2024-07-25T01:47:29.543Z [kubeflow-dashboard] Server listening on port http://localhost:8082 (in production mode)
2024-07-25T01:47:31.513Z [pebble] GET /v1/plan?format=yaml 254.443µs 200
2024-07-25T01:47:32.694Z [pebble] Service "kubeflow-dashboard" stopped unexpectedly with code 243
2024-07-25T01:47:32.694Z [pebble] Service "kubeflow-dashboard" on-failure action is "restart", waiting ~500ms before restart (backoff 1)
2024-07-25T01:47:33.221Z [pebble] Service "kubeflow-dashboard" starting: /sbin/tini -- npm start
2024-07-25T01:47:33.225Z [kubeflow-dashboard] [WARN  tini (161)] Tini is not running as PID 1 and isn't registered as a child subreaper.
2024-07-25T01:47:33.225Z [kubeflow-dashboard] Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
2024-07-25T01:47:33.225Z [kubeflow-dashboard] To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.
2024-07-25T01:47:34.368Z [kubeflow-dashboard] 
2024-07-25T01:47:34.368Z [kubeflow-dashboard] > kubeflow-centraldashboard@0.0.2 start
2024-07-25T01:47:34.368Z [kubeflow-dashboard] > npm run serve
2024-07-25T01:47:34.368Z [kubeflow-dashboard] 
2024-07-25T01:47:35.492Z [kubeflow-dashboard] npm WARN logfile Error: EACCES: permission denied, scandir '/root/.npm/_logs'
2024-07-25T01:47:35.492Z [kubeflow-dashboard] npm WARN logfile  error cleaning log files [Error: EACCES: permission denied, scandir '/root/.npm/_logs'] {
2024-07-25T01:47:35.493Z [kubeflow-dashboard] npm WARN logfile   errno: -13,
2024-07-25T01:47:35.493Z [kubeflow-dashboard] npm WARN logfile   code: 'EACCES',
2024-07-25T01:47:35.493Z [kubeflow-dashboard] npm WARN logfile   syscall: 'scandir',
2024-07-25T01:47:35.493Z [kubeflow-dashboard] npm WARN logfile   path: '/root/.npm/_logs'
2024-07-25T01:47:35.495Z [kubeflow-dashboard] npm WARN logfile }
2024-07-25T01:47:35.566Z [kubeflow-dashboard] 
2024-07-25T01:47:35.566Z [kubeflow-dashboard] > kubeflow-centraldashboard@0.0.2 serve
2024-07-25T01:47:35.566Z [kubeflow-dashboard] > node dist/server.js
2024-07-25T01:47:35.566Z [kubeflow-dashboard] 
2024-07-25T01:47:35.703Z [kubeflow-dashboard] npm ERR! code EACCES
2024-07-25T01:47:35.703Z [kubeflow-dashboard] npm ERR! syscall mkdir
2024-07-25T01:47:35.711Z [kubeflow-dashboard] npm ERR! path /root/.npm/_cacache/tmp
2024-07-25T01:47:35.711Z [kubeflow-dashboard] npm ERR! errno -13
2024-07-25T01:47:35.729Z [kubeflow-dashboard] npm ERR! 
2024-07-25T01:47:35.737Z [kubeflow-dashboard] npm ERR! Your cache folder contains root-owned files, due to a bug in
2024-07-25T01:47:35.738Z [kubeflow-dashboard] npm ERR! previous versions of npm which has since been addressed.
2024-07-25T01:47:35.738Z [kubeflow-dashboard] npm ERR! 
2024-07-25T01:47:35.738Z [kubeflow-dashboard] npm ERR! To permanently fix this problem, please run:
2024-07-25T01:47:35.739Z [kubeflow-dashboard] npm ERR!   sudo chown -R 1000:1000 "/root/.npm"
2024-07-25T01:47:35.746Z [kubeflow-dashboard] 
2024-07-25T01:47:35.746Z [kubeflow-dashboard] npm ERR! Log files were not written due to an error writing to the directory: /root/.npm/_logs
2024-07-25T01:47:35.746Z [kubeflow-dashboard] npm ERR! You can rerun the command with `--loglevel=verbose` to see the logs in your terminal
2024-07-25T01:47:37.075Z [kubeflow-dashboard] Initializing Kubernetes configuration
2024-07-25T01:47:37.102Z [pebble] GET /v1/plan?format=yaml 258.397µs 200
2024-07-25T01:47:37.142Z [kubeflow-dashboard] Unable to fetch Application information: 404 page not found
2024-07-25T01:47:37.142Z [kubeflow-dashboard] 
2024-07-25T01:47:37.148Z [kubeflow-dashboard] "other" is not a supported platform for Metrics
2024-07-25T01:47:37.148Z [kubeflow-dashboard] Using Profiles service at http://kubeflow-profiles.kubeflow:8081/kfam
2024-07-25T01:47:37.159Z [kubeflow-dashboard] [SEVERE] uncaughtException Error: listen EADDRINUSE: address already in use :::8082
2024-07-25T01:47:37.159Z [kubeflow-dashboard]     at Server.setupListenHandle [as _listen2] (node:net:1463:16)
2024-07-25T01:47:37.159Z [kubeflow-dashboard]     at listenInCluster (node:net:1511:12)
2024-07-25T01:47:37.159Z [kubeflow-dashboard]     at Server.listen (node:net:1599:7)
2024-07-25T01:47:37.159Z [kubeflow-dashboard]     at Function.listen (/usr/src/app/node_modules/express/lib/application.js:635:24)
2024-07-25T01:47:37.159Z [kubeflow-dashboard]     at /usr/src/app/dist/server.js:83:13
2024-07-25T01:47:37.159Z [kubeflow-dashboard]     at Generator.next (<anonymous>)
2024-07-25T01:47:37.159Z [kubeflow-dashboard]     at fulfilled (/usr/src/app/dist/server.js:5:58)
2024-07-25T01:47:37.159Z [kubeflow-dashboard]     at processTicksAndRejections (node:internal/process/task_queues:96:5) {
2024-07-25T01:47:37.159Z [kubeflow-dashboard]   code: 'EADDRINUSE',
2024-07-25T01:47:37.159Z [kubeflow-dashboard]   errno: -98,
2024-07-25T01:47:37.159Z [kubeflow-dashboard]   syscall: 'listen',
2024-07-25T01:47:37.159Z [kubeflow-dashboard]   address: '::',
2024-07-25T01:47:37.159Z [kubeflow-dashboard]   port: 8082
2024-07-25T01:47:37.159Z [kubeflow-dashboard] }
2024-07-25T01:47:37.180Z [pebble] Service "kubeflow-dashboard" stopped unexpectedly with code 243
2024-07-25T01:47:37.180Z [pebble] Service "kubeflow-dashboard" on-failure action is "restart", waiting ~1s before restart (backoff 2)
2024-07-25T01:47:38.244Z [pebble] Service "kubeflow-dashboard" starting: /sbin/tini -- npm start
2024-07-25T01:47:38.246Z [kubeflow-dashboard] [WARN  tini (209)] Tini is not running as PID 1 and isn't registered as a child subreaper.
2024-07-25T01:47:38.246Z [kubeflow-dashboard] Zombie processes will not be re-parented to Tini, so zombie reaping won't work.
2024-07-25T01:47:38.246Z [kubeflow-dashboard] To fix the problem, use the -s option or set the environment variable TINI_SUBREAPER to register Tini as a child subreaper, or run Tini as PID 1.
2024-07-25T01:47:3

Additional Context

No response

syncronize-issues-to-jira[bot] commented 2 months ago

Thank you for reporting us your feedback!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-6128.

This message was autogenerated

orfeas-k commented 2 months ago

Up to and including Kubeflow 1.8, the centraldashboard image was built with Node 14.x.x, which ships with npm 6. The same section of the npm 6 docs says:

If npm was invoked with root privileges, then it will change the uid to the user account or uid specified by the user config, which defaults to nobody. Set the unsafe-perm flag to run scripts with root privileges.

We still need to check the permissions of user nobody to be certain that this change is the root cause. However, since Kubeflow 1.9 bumped the Node version in the image to 16, whose npm (8.19.4 in the logs above) has the changed behaviour, this change is the most likely cause of the missing permissions.
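The version dependence described in the two doc excerpts can be summarised as a small lookup. This is a sketch, assuming the behaviour change landed in npm 7 (npm 6 ships with Node 14, and the 1.9 image reports npm 8.19.4 in the logs); the function name and output strings are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical lookup: which account runs package scripts when npm itself is
# started as root, depending on the npm major version.
#   $1: npm version, e.g. 8.19.4
#   $2: npm 6's unsafe-perm setting (true/false, default false)
script_account_as_root() {
  local major="${1%%.*}"
  local unsafe_perm="${2:-false}"
  if [ "$major" -le 6 ]; then
    # npm 6: switch to the `user` config (default: nobody) unless unsafe-perm is set
    if [ "$unsafe_perm" = "true" ]; then
      echo "root"
    else
      echo "user config (default: nobody)"
    fi
  else
    # npm 7 and later: use the owner of the working directory
    echo "owner of working directory"
  fi
}

# Example:
#   script_account_as_root 8.19.4   # prints: owner of working directory
```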

What remains a mystery is why we have seen the charm work with the 1.9 upstream image without any changes, albeit very rarely.