geerlingguy / tower-operator

DEPRECATED: This project was moved and renamed to: https://github.com/ansible/awx-operator

Make sure Tower Operator can be deployed easily on OpenShift #15

Closed geerlingguy closed 4 years ago

geerlingguy commented 4 years ago

From a comment on my blog:

  1. On OpenShift, the PostgreSQL image is unable to write to the data directory, so to make it work we need to set the anyuid SCC for the namespace.
  2. Maybe for the same reason, the "updating" status keeps running forever. (I had to create the route manually.)
  3. The tower-task app is unable to get 1 replica running.
  4. I had to change the YAML to place the tower-operator service account in the project I created. Everything is set up to run in the default project.

I have only been testing the operator in Kubernetes clusters (Minikube and KinD); I haven't been testing in CRC or other OpenShift-ish clusters. I would like to make sure the operator is easy to deploy into OpenShift/OKD as well, and I know there can be restrictions around things like PVs (which are required for this operator, because otherwise Tower's data would get blown away any time you updated or any time that container stopped).

tylerauerbeck commented 4 years ago

@geerlingguy I was going to start taking a look at what all is involved in getting this running smoothly on OpenShift. Is there any existing work already in progress or completed for this? If not, I'd be happy to lend a hand here.

geerlingguy commented 4 years ago

@tylerauerbeck - Nothing substantial yet. Half of the battle is just trying to install it on OpenShift and seeing what (if anything) fails.

tylerauerbeck commented 4 years ago

@geerlingguy Sounds good. I'm happy to help out here then. I hacked around on this briefly this afternoon and it looks like right now the main sticking point is the Postgres instance. Everything else seems to come up without issue. The error it throws is:

chmod: changing permissions of '/var/lib/postgresql/data': Operation not permitted

I kind of expected this as I've seen this when trying to run helm charts that rely on the postgres image from dockerhub. The two options I see here are:

  1. Hack around with this image to see what we can do to make it work nicely in OpenShift
  2. Add some logic that detects whether the operator is running in OpenShift and, if so, uses the Postgres image provided inside OpenShift

If I remember correctly, the ansible-tower playbooks themselves do something similar to option 2 here. But I'm happy to play around with option 1 if you think that's a better idea.

tylerauerbeck commented 4 years ago

Went with approach 1 on this and found that the only thing needed to fix this was to add the following env variable to the postgres statefulset:

        - name: PGDATA
          value: /var/lib/postgresql/data/pgdata

According to the dockerhub page:

PGDATA: This optional variable can be used to define another location - like a subdirectory - for the database files. The default is /var/lib/postgresql/data, but if the data volume you're using is a filesystem mountpoint (like with GCE persistent disks), Postgres initdb recommends a subdirectory (for example /var/lib/postgresql/data/pgdata ) be created to contain the data.

As part of the postgres image init script, it was trying to chown the data dir that it uses, which is owned by root. Since in OpenShift the image isn't running as root, it doesn't have the ability to do that. But if we just change the data dir to be a directory underneath that directory -- it then has the ability to chown the directory appropriately. After that Postgres boots up and we're in good shape.
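For context, here is a rough sketch of where that variable would sit in a StatefulSet. This is not the operator's actual manifest: the resource names, labels, image tag, and storage size are all assumptions for illustration, and other required settings (such as the database credentials) are left out.

```yaml
# Sketch only: names, labels, image tag, and sizes are illustrative assumptions.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: tower-postgres              # hypothetical name
spec:
  serviceName: tower-postgres
  replicas: 1
  selector:
    matchLabels:
      app: tower-postgres
  template:
    metadata:
      labels:
        app: tower-postgres
    spec:
      containers:
        - name: postgres
          image: postgres:10        # assumed tag
          env:
            # Point Postgres at a subdirectory of the mount so the non-root
            # user can create and own that directory itself.
            - name: PGDATA
              value: /var/lib/postgresql/data/pgdata
            # Credentials and other env vars omitted from this sketch.
          volumeMounts:
            # The volume itself is still mounted at the parent directory.
            - name: postgres-data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:
    - metadata:
        name: postgres-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 8Gi            # assumed size
```

The mount path stays the same; only the directory Postgres initializes moves one level down.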

@geerlingguy As long as you're okay with this approach, I can get a PR opened to address this.

geerlingguy commented 4 years ago

@tylerauerbeck - That sounds reasonable to me. Maybe we make that a variable, default it to /var/lib/postgresql/data/pgdata, and then in the notes on that PR (which people can look to when they update the operator if they're using the alpha version already), we can show how the variable can be set to /var/lib/postgresql/data if they've already set up Tower/AWX instances. Otherwise they would lose data as the folder location would change when they update the operator!
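Roughly, that could look like the sketch below. The variable name `tower_postgres_data_path`, the file paths in the comments, and the custom resource's `apiVersion`/`kind` are assumptions for illustration, not what the operator necessarily uses.

```yaml
# Role default (e.g. roles/tower/defaults/main.yml; path is an assumption).
# New installs get the subdirectory.
tower_postgres_data_path: /var/lib/postgresql/data/pgdata
---
# Fragment of the Postgres StatefulSet template that consumes the variable.
env:
  - name: PGDATA
    value: "{{ tower_postgres_data_path }}"
---
# Existing alpha installs could pin the old location in their custom
# resource so the data directory does not move on upgrade.
apiVersion: tower.ansible.com/v1alpha1   # assumed group/version
kind: Tower                              # assumed kind
metadata:
  name: tower
spec:
  tower_postgres_data_path: /var/lib/postgresql/data
```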

tylerauerbeck commented 4 years ago

@geerlingguy So it looks like the last hurdle here is going to be the tower_task deployment. They set up that image to run with

securityContext:
  privileged: true

So the only two ways around this are to either grant the serviceaccount the privileged scc or to remove that securityContext. Not sure if you have a better idea of why it requires this, but my preference would be to remove it. However, I'm also not sure what removing it might break.

From looking at the official installer, it looks like they also take this approach (privileged container, give serviceaccount a privileged scc), but I think the general preference is to avoid privileged containers if possible -- so would like to see what we could do there if we can.

geerlingguy commented 4 years ago

@tylerauerbeck - I wanted to figure that out too. It looks like the openshift installer docs have the answer:

When using OpenShift for deploying AWX make sure you have correct privileges to add the security context 'privileged', otherwise the installation will fail. The privileged context is needed because of the use of the bubblewrap tool to add an additional layer of security when using containers.

And it looks like the AWX installer actually adds the privileged security context constraint (scc) to the project's service account:

- name: Add privileged SCC to service account
  shell: |
    {{ openshift_oc_bin }} adm policy add-scc-to-user privileged system:serviceaccount:{{ openshift_project }}:awx

Maybe we need to do something like that here (with a flag for whether it's running in openshift vs. plain kubernetes?).
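Something along these lines, as a rough sketch. The variables `tower_openshift`, `tower_namespace`, and `tower_service_account` are hypothetical, and this assumes the `oc` binary is available wherever the task runs and that the operator has permission to modify SCCs.

```yaml
# Hypothetical variables: tower_openshift, tower_namespace, tower_service_account.
# Assumes `oc` is on the PATH of the host or container running this task.
- name: Add privileged SCC to the service account (OpenShift only)
  command: >-
    oc adm policy add-scc-to-user privileged
    system:serviceaccount:{{ tower_namespace }}:{{ tower_service_account }}
  # Skipped on plain Kubernetes, where SCCs do not exist.
  when: tower_openshift | default(false) | bool
```

Defaulting the flag to false would keep plain Kubernetes installs unchanged unless someone opts in.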

geerlingguy commented 4 years ago

It seems that bubblewrap can be disabled in the tower settings file (job isolation, it's called in the UI)... but I'm not sure what the plan is for handling bubblewrap or finding a non-privileged way to handle this long-term.

tylerauerbeck commented 4 years ago

Gotcha. So looking at how to disable this, it looks like bubblewrap comes into play when you're running a multi-tenant tower. So it may be safe (and we may be able to remove the privileged role) if you're just looking to run a single-tenant tower?

Or maybe provide a way to do both and just write up the docs that describe some of the benefits/risks of both?