RasaHQ / rasa-x-helm

Rasa Enterprise Helm chart for deploying on Kubernetes (K8s) and OpenShift.
Apache License 2.0

OpenShift Installation Issues #254

Closed michaelsteigman closed 2 years ago

michaelsteigman commented 2 years ago

I have an OpenShift (Azure) 4.6 cluster on which I have been trying to install RasaX. I have a fair amount of experience with OpenShift, both building and deploying my own projects and bootstrapping open source projects.

I followed these instructions. After much reading, digging around and tinkering, I have arrived at what I believe is a working installation. It wasn't easy, though.

I wanted to share some of the roadblocks and solutions I came up with. This was a frustrating experience, especially considering the fact the docs suggest real support for OpenShift. (I don't just want to whine, mind you! I am hoping this information will be helpful to others and I am also willing to do some additional testing for the community if it will help with supporting OpenShift.)

I am also open to feedback about mistakes I have made while going about this.

First steps

I did initially receive the error the instructions warn about relating to user 1001 and got past it by setting securityContext.fsGroup to null, as mentioned.
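For reference, that override would look roughly like this in a values file. This is a minimal sketch: the top-level placement follows the description later in this thread, and the exact form should be checked against the install instructions.

```yaml
# Top level of the values file passed to the rasa-x chart.
# Clearing fsGroup leaves the group for OpenShift to assign.
securityContext:
  fsGroup: null
```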

However, the Postgresql instance would still not start up:

postgresql 17:14:39.98 INFO  ==> Initializing PostgreSQL database...
chmod: changing permissions of '/bitnami/postgresql/data': Operation not permitted
postgresql 17:14:40.01 WARN  ==> Lack of permissions on data directory!
chmod: changing permissions of '/bitnami/postgresql/data': Operation not permitted
postgresql 17:14:40.01 WARN  ==> Lack of permissions on data directory!

I read through the Bitnami chart values and issue tracker and tried to set the following, which is recommended for OpenShift:

postgresql:
  volumePermissions:
    securityContext:
      runAsUser: "auto"
  securityContext:
    enabled: false
  containerSecurityContext:
    enabled: false
  shmVolume:
    chmod:
      enabled: false

No luck.

Similar permissions issue with Nginx:

/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: can not modify /etc/nginx/conf.d/default.conf (read-only file system?)
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
20-envsubst-on-templates.sh: ERROR: /etc/nginx/templates exists, but /etc/nginx/conf.d is not writable
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
2021/11/15 17:14:56 [emerg] 1#1: mkdir() "/etc/nginx/client_body" failed (13: Permission denied)
nginx: [emerg] mkdir() "/etc/nginx/client_body" failed (13: Permission denied)

I posted on the forums and got a suggestion to use an unprivileged Nginx image. I added a value to the nginx part of the chart to override the image. That seemed to work.

Deeper into the weeds with Bitnami charts

I still had no PG, Redis or RabbitMQ instance. All the Rasa pods were failing and I could only guess that it might have something to do with the fact there was no database available. I noticed that the subchart version numbers were rather far behind the upstream charts. I tried installing the current PG Bitnami chart and it worked out of the box. My values for the chart are:

securityContext:
  enabled: false
containerSecurityContext:
  enabled: false

I moved on to Redis and was able to get that working with the following values:

master:
  podSecurityContext:
    enabled: false
    fsGroup: ""
  containerSecurityContext:
    enabled: false
    runAsUser: "auto"
replica:
  podSecurityContext:
    enabled: false
    fsGroup: ""
  containerSecurityContext:
    enabled: false
    runAsUser: "auto"
sentinel:
  enabled: true
  containerSecurityContext:
    enabled: false
    runAsUser: "auto"
metrics:
  containerSecurityContext:
    enabled: false
    runAsUser: "auto"
volumePermissions:
  securityContext:
    runAsUser: "auto"

Same for RabbitMQ:

podSecurityContext:
  enabled: false
  fsGroup: ""
  runAsUser: "auto"
rbac:
  create: false
serviceAccount:
  create: false
clustering:
  enabled: false

Once I had the PG, Redis and RabbitMQ backends running, I turned off the installs in the RasaX chart and added the existingHost and related settings to my values files.
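As a rough sketch of that last step, the overrides in the Rasa X values file might look like the following. Only `existingHost` is named in the post. The `install` flag and all hostnames are assumptions for illustration, and the credential-related settings are omitted, so the chart's own values.yaml remains the reference for the exact keys.

```yaml
# Sketch only: disable the bundled backends and point the chart at
# instances installed separately from the upstream Bitnami charts.
postgresql:
  install: false                  # assumed name of the toggle
  existingHost: "pg-postgresql"   # hypothetical service name
redis:
  install: false                  # assumed name of the toggle
  existingHost: "redis-master"    # hypothetical service name
rabbitmq:
  install: false                  # assumed name of the toggle
  existingHost: "rabbitmq"        # hypothetical service name
```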

Home Stretch

After redeploying, the db migration service logs indicated that the migrations were run and I could see new relations in the database.

However, some of the Rasa pods were displaying an authorization error. I found this issue and noticed that my password salt (randomly generated) had a + in it. I removed the + and the authorization error went away.

Finally, just about everything appeared to be working. The event service, however, would not come up - the readiness probe was failing, leading to constant restarts and eventually, a CrashLoopBackOff. I disabled the probes to see what would happen and to my surprise, the service started up just fine. It appears the initialProbeDelay is just too short. I set it to 30 seconds for now and it seems to be working.
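A values override along these lines would express the longer delay. This is a sketch under assumptions: the post names only `initialProbeDelay` and the event service, so the nesting under `eventService` and the per-probe keys are guesses that should be checked against the chart's values.yaml.

```yaml
# Sketch only: give the event service 30 seconds before the first probe.
# The key layout below is assumed, not taken from the post.
eventService:
  livenessProbe:
    initialProbeDelay: 30
  readinessProbe:
    initialProbeDelay: 30
```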

Conclusions

I wonder why the RasaX chart doesn't hew to the upstream Bitnami charts? It appears there is work going on there to ensure compatibility with k8s distros. Couldn't RasaX pin their chart to the image version for compatibility while taking advantage of improvements in the charts? The flexibility to use existing hosts for these backends is nice but it shouldn't be required, should it?

Same thing goes for Nginx. I feel like I've got a bit of a Frankenstein on my hands here and it seems unnecessary.

The authorization issue (for example, what are the restrictions on passwords and salts?) probably ought to be mentioned somewhere in the docs.

It might be good to bump the initialProbeDelay on the event service probes at least.

That's it for now. As I said, I am happy to help with additional testing. Thanks for reading.

sara-tagger commented 2 years ago

Thanks for the issue, @melindaloubser1 will get back to you about it soon!

You may find help in the docs and the forum, too 🤗

indam23 commented 2 years ago

Thank you for the comprehensive write up @michaelsteigman! @virtualroot Seems like some of these pain points could be alleviated with adjustments to the charts and/or better documentation.

tmbo commented 2 years ago

Thanks a lot for all these suggestions, this is incredibly helpful!

@virtualroot how much effort is it to make the necessary documentation and chart changes?

michaelsteigman commented 2 years ago

Just circling back around to check on this. Thanks for the responses @sara-tagger , @melindaloubser1 and @tmbo.

Also wanted to link to #154, which I stumbled on shortly after creating the issue. I probably didn't find it in my initial searches because OpenShift is written with a hyphen. The install instructions tell OpenShift users to set fsGroup to null, but this setting, whether passed on the command line or placed at the top level of the values file, has no impact on RabbitMQ or Redis, which both fail to start. I also tried what the OP in that issue tried, then ended up throwing up my hands and using the newer Bitnami charts directly.

I have not taken down my running stack to try this yet, but I assume it still works.

tczekajlo commented 2 years ago

The chart dependencies and docs were updated in https://github.com/RasaHQ/rasa-x-helm/pull/259.

zoobab commented 2 years ago

Can you share your yamls with the bitnami/nginx?

RasaX claims to run on OpenShift, but I have doubts about that.

michaelsteigman commented 2 years ago

I did get everything running on OpenShift, though it's been a while and I haven't tried to bootstrap the project since my suggestions were incorporated.

That said, as I wrote above, I used an unprivileged Nginx image at the time. From values.yaml:

nginx:
  name: nginxinc/nginx-unprivileged
  tag: stable