galaxyproject / galaxy-helm

Minimal setup required to run Galaxy under Kubernetes
MIT License
41 stars 39 forks

User deployment not working #214

Open viktoriaas opened 3 years ago

viktoriaas commented 3 years ago

Hi, I'm trying to deploy this Helm chart as a regular user and it crashes fatally. I've already managed to deploy it as admin, which works fine, but I would also like to let users deploy this chart. I use the RKE provider, Kubernetes version 1.19.3, and Docker 20.10.2.

First, it is important to clarify whether this should be possible at all, or whether you have considered it. If yes, the following notes are for you. If no, may I ask why? It would certainly help multitenant environments where users might want to run their own instances of Galaxy.

  1. Users are forbidden from creating RBACs, therefore I set rbac.enabled: false. After this, I can deploy the chart.
  2. The deployment crashes on the postgres StatefulSet. I use a storage provider which doesn't support dynamic provisioning, so I have to create storage separately. I am allowed to use an already existing claim for Galaxy but not for postgres, and when I create one, the StatefulSet keeps crashing because the claim has to have a certain name (it's not clear to me how this naming is determined).
     2.1 When I match the expected name, the claim is created under root with drwx------ permissions. Later, the container itself crashes on mkdir: cannot create directory ‘/bitnami/postgresql/data’: Permission denied. I can manually chmod the created share (it's on NFS), but it can't work like that. The fix should be somewhere in how the claim is mounted, but there is no trace of postgres configs in the chart.
  3. When the storage issues are fixed, the StatefulSet runs, but other components end up in the Error state after some time.
     3.1 galaxy-job, galaxy-web, and galaxy-workflow all fail on sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL: password authentication failed for user "galaxydbuser". I haven't touched this value at all.
     3.2 I thought it might be connected to the disabled RBAC, so I created the ClusterRole and binding manually and deployed them into the cluster. The Helm chart then gets deployed, but the password problem persists.

Could you advise me whether there is any way to make user deployment possible?

almahmoud commented 3 years ago

Hey! First of all, I'd want to clarify what you mean by the user/admin distinction. My current understanding is that you're referring to something like a Rancher/Kubernetes admin vs a namespace-restricted user (but if I misunderstood, this answer might be less relevant than I thought). It would help if you could give us a bit more info so we fully understand the distinction in terms of permissions and the specific requirements of your setup. Our current recommended solution for a multi-tenant cluster of Galaxies, where each user gets their own Galaxy instance, is the Genomics Virtual Lab (GVL). Our presentation from BCC2020 (https://bcc2020.sched.com/event/cst8/the-all-new-genomics-virtual-lab) is a great overview if you'd like to consider whether it can meet your goal. From what I understood, I suspect the GVL could be a solution or starting point for what you're trying to accomplish. The stack is a bit complicated from a technical perspective, but it's easy to launch, we're making it as easy as possible to manage long term, and it's all open source.

We have the cloudman layer which abstracts the actual helm install and makes it easy to deploy multiple isolated (or semi-isolated, depending on what you want) Galaxies in the same cluster. In the GVL, we have the notion of a project, which is generally mapped to a separate namespace. For example, you can have yourself be a GVL admin, which has access to the backend, the cluster itself (and we use Rancher), and all the projects. Then you can create a project X for user A and project Y for user B, for example, where you can make each of user A and B a project admin in their respective project, so they can change some of the Galaxy configs and launch/update/tear down their Galaxy within their project, without affecting other users. You can also add more users to each project that would be simple project users if you'd like, so they would have single sign-on for Galaxy and would be able to use it, but would not have rights to modify configs. If that sounds like the kind of scenario you're envisioning, I highly recommend you give the GVL a try (it is deployable on multiple clouds, and I can offer to get you started if you need help).

For the actual questions:

1) We haven't really thought about the chart as a user-deployable resource. In general, we've assumed the chart is used by an admin, and when we wanted to make some features (like launching, updating, etc.) accessible to users that shouldn't have cluster-wide permissions, we've used the CloudMan layer in the GVL to create that abstraction. So CloudMan still deploys resources as an admin from the cluster perspective in the back, but from the user perspective they are limited to manipulating only the apps within their project/namespace boundary at the CloudMan auth level.

2) We're using the bitnami postgresql chart as a dependency, which is why you don't see all possible postgres configs here. We have added default working values for the Galaxy chart, but the postgres chart is separately documented by its own developers, as we do not maintain it ourselves. The current dependency points to the "stable" chart that you can find at https://github.com/helm/charts/tree/master/stable/postgresql but, given the deprecation of the "stable" helm repository, we will very soon be moving to the updated chart that moved here: https://github.com/bitnami/charts/tree/master/bitnami/postgresql . We have not taken the time to document more than a basic set of values for postgres, but if you're trying to accomplish something specific, I'd be happy to advise if you need some help. For using an already created PVC, you should be able to just set the postgres.persistence.existingClaim value, similar to how existingClaim works in our chart.

3) There was some recent work done by Nuwan to allow the postgres password to survive restarts even when randomly generated, which used to be a problem in the past. This was accomplished by using the helm lookup function to retrieve the password from the k8s secret if it exists and set it in the helm values post-generation if it's random. Our best guess for why you might have a problem is that the lookup function cannot retrieve the password (potentially due to limited permissions accessing the secret on the cluster) and therefore resorts to randomly generating a password, despite one already being used by postgres. I think the solution for you here would be to pre-generate a password and specify the value postgresql.galaxyDatabasePassword at launch.
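As a minimal sketch of that suggestion (the postgresql.galaxyDatabasePassword key is the one named above; the filename and password value are placeholders, so verify the key against the values.yaml of your chart version), the pre-generated password could be supplied in an override file:

```yaml
# values-override.yaml -- sketch only; the password below is a placeholder
postgresql:
  galaxyDatabasePassword: "pre-generated-stable-password"
```

and passed at install time with helm install -f values-override.yaml, so the same password survives upgrades and restarts.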

viktoriaas commented 3 years ago

Hello, thanks for the reply. I think this chart doesn't support what I'm looking for.

To clarify user/admin distinction - yes, I'm referring to something like a Rancher/Kubernetes admin vs namespace-restricted user. I have an admin that is cluster-wide, thus eligible to perform any action. And then I have namespaces and users are restricted only to those namespaces.

The GVL looks interesting; however, I am looking for a container-based solution, not a cloud one. That's why I tried the Helm chart, since it is based on containers.

To give a note to questions:

  1. Yes, I had a look at the whole chart and it feels like only an admin should deploy it. That's both good and bad: there might be some settings that need admin privileges (although I don't know whether the whole deployment couldn't be enclosed in a namespace and live there), but on the other hand, if I have a bioinformatician who is constantly working on his Galaxy instance and adds/removes things from the configs, it is very inflexible to always do the redeploy for him. If I assume that only values.yaml changes and a redeploy is then needed, it's not much work.
  2. Yes, it is possible to point to an existing claim with postgres.persistence.existingClaim, but this option wasn't written (or was commented out) in values.yaml, therefore I didn't know about it. When I add it, it works.
  3. I think the problem is related to privileges; when I deploy as admin, everything is fine.
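The existingClaim override from point 2 could be sketched roughly as follows. Note the top-level key is spelled postgres in the comments in this thread but postgresql in some chart versions, so verify against your values.yaml; the claim name is a placeholder:

```yaml
postgresql:
  persistence:
    enabled: true
    existingClaim: "my-precreated-postgres-pvc"  # placeholder: a PVC created ahead of time
```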

However, there is still one problem. When I enable PVC creation and set the storageClass, the postgres container still complains about permissions. The PVC is not owned by the postgres user (it is root with drwx------), so it keeps failing on mkdir: cannot create directory ‘/bitnami/postgresql/data’: Permission denied. Is there a way to chown this PVC from the chart? I don't think it's good practice to manually chown it on the server.

almahmoud commented 3 years ago

The GVL is in fact a container-based (and even specifically rancher-based) solution, and uses this chart to deploy Galaxy. It is primarily designed for the cloud, but it can also be deployed on local infrastructure. The only real prerequisite is having a kubernetes cluster. If you're deploying locally it will require some manual deployment rather than just using cloudlaunch to target a cloud, but it is possible and would allow you to expose a UI from which users can configure, deploy, shut down, and more generally manage their own Galaxy instances within a single namespace. Are you starting from a VM on which you're deploying the cluster or are you deploying it on a local machine?

2) You should be looking at the original chart repo to see the full configurability of the postgres chart that we're just using as a dependency. We do not plan on documenting all options for our dependency charts, given that their maintainers have a very well documented README and we cannot anticipate all customizations that someone might want to use. I will add to my list to link the postgres chart in our Readme so users can more easily find that documentation.

For the issue relating to postgres, it might be better to open an issue at https://github.com/bitnami/charts/tree/master/bitnami/postgresql and ask the postgres maintainers what they would suggest as the best way to do so. My suggestion would probably be to use extraInitContainers on the postgres chart to accomplish it, but I am not entirely sure if that is the best solution.
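For illustration, an extraInitContainers value along the lines suggested above might look like the sketch below. This is not a tested configuration: the data mount path and the 1001 UID/GID are assumptions based on the bitnami postgres image, and the exact shape of extraInitContainers (string template vs list) differs between chart versions, so check the postgres chart's README.

```yaml
postgresql:
  extraInitContainers: |
    - name: fix-data-permissions
      image: busybox
      # chown the data dir so the non-root postgres user (1001 in bitnami
      # images) can create /bitnami/postgresql/data
      command: ['sh', '-c', 'chown -R 1001:1001 /bitnami/postgresql']
      volumeMounts:
        - name: data
          mountPath: /bitnami/postgresql
```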

viktoriaas commented 3 years ago

Well, I might not have understood you correctly, but if I am able to deploy the GVL on Kubernetes, that would be awesome. Our setup: Rancher Kubernetes distribution v2.5.1, with cluster nodes deployed in OpenStack. So I'm starting from a VM. I've read this blogpost but couldn't find any information on how to deploy differently than via https://launch.usegalaxy.org/catalog/appliance/genomics-virtual-lab Could you please point me in any direction? I don't mind a manual installation.

And thank you for the postgres pointer, I couldn't find that info. I will ask in their repo.

nuwang commented 3 years ago

@viktoriaas From what I've gathered, it seems like you already have a specific managed k8s environment and user management model, and as such, the GVL may not necessarily suit without some significant effort to delve into its guts.

However, I believe that the Galaxy helm-chart should be usable in the way you've outlined. Apart from RBAC controls (which can be disabled) and the priorityclass (which currently cannot be disabled, but enabling that is a small change and I'll get it done today), all other resources are namespaced, and should be deployable in the manner you envision.

For postgres, have you tried specifying an fsGroup in the postgres securityContext? https://github.com/bitnami/charts/blob/d4603391fbf3f95148d315536cf383143b7ce6af/bitnami/postgresql/templates/statefulset.yaml#L64
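In values terms, the suggestion above would look something like the sketch below. The nesting and the enabled flag follow the stable postgresql chart's securityContext block linked above; treat them as assumptions for other chart versions:

```yaml
postgresql:
  securityContext:
    enabled: true
    fsGroup: 1001   # group ownership applied to mounted volumes
    runAsUser: 1001 # non-root user used by the bitnami postgres image
```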

viktoriaas commented 3 years ago

Hello, @nuwang,

You are right about the priority class: it can't be disabled now. However, for testing purposes I enabled it for all users through a ClusterRole and disabled RBAC, and then I faced the postgres password problem. I performed all of this as a user, but as it wasn't functional, I am again using admin, and it still doesn't work. Maybe if the problem with the DB PASSWORD is resolved, everything will also work fine as a user.

I've tried to specify fsGroup for postgres but it didn't help. Specifying

postgres:
  volumePermissions:
    enabled: true

did the trick: the PVC is created under the right user and group (1001), and the postgres StatefulSet is not failing anymore.

Two problems persist:

  1. The deployment creates a total of 4 PVCs, one of them the Galaxy PVC. This PVC is owned by user 101, but there is a config directory inside owned by root with drwxr-xr-x, and galaxy-init-mounts fails on

    cp: cannot create regular file '/galaxy/server/config/mutable/integrated_tool_panel.xml': Permission denied

    Again, for testing purposes I've manually chmodded this whole directory to 777, and that particular error disappears. (cp: cannot stat '/galaxy/server/config/sanitize_whitelist.txt': No such file or directory still exists)

  2. galaxy-db-migrations still errors with the DB password:

    sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) FATAL:  password authentication failed for user "galaxydbuser"

nuwang commented 3 years ago

Glad to hear the postgres permissions issue was sorted out.

For the password issue, does this happen even after setting postgresql.galaxyDatabasePassword on a fresh install? It needs to be a fresh install though, since once the database password is set in postgres, you can't change it (other than manually from within postgres).

viktoriaas commented 3 years ago

@nuwang After quite a complex search and log study, I discovered that all of these passwords have to be set:

  galaxyDatabasePassword: XXX                                           
  postgresqlPassword: XXX                                                  
  postgresqlPostgresPassword: XXX
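Assuming these keys all nest under the postgresql block of the chart's values (an assumption worth double-checking against your values.yaml), the combined override would look like:

```yaml
postgresql:
  galaxyDatabasePassword: XXX       # password for the galaxy DB user
  postgresqlPassword: XXX           # password for the default postgresql user
  postgresqlPostgresPassword: XXX   # password for the postgres superuser
```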

The persisting problem is that the directory /galaxy/server/config/mutable/ keeps appearing in the logs of one of the galaxy-job containers, with something like a Python IO error: shed_tool_data_table_conf.xml can't be parsed. I think it is related to the permissions of the PVCs, actually only to the /galaxy/server/config directory after mounting.

In deployment-job.yaml I had to add the recursive option to the run command of the init-postgres container:

'sh', '-c', 'chown -R 101:101 {{ $.Values.persistence.mountPath }}; until nc -z -w3 {{ template "galaxy-postgresql.fullname" $ }} 5432; do echo waiting for galaxy-postgres service; sleep 1; done;'

Init kept failing with permission denied on, again, the config directory, which was owned by root and could not be accessed. I think the /galaxy/server/config directory has a similar problem, or some key in values.yaml is not set correctly.

I've managed to deploy the whole chart as a user, but I had to deploy the chart, change permissions, uninstall the chart, and deploy it again with initdbScriptsSecret set to null. The whole workflow:

  1. disable RBAC
  2. add a cluster-wide role for the priority class (let me know when disabling is supported!)
  3. create PVCs, set persistence to the existing claim + initdbScriptsSecret: "{{ .Release.Name }}-galaxy-initdb"
  4. deploy the first time
  5. galaxy-job fails on something like a Python IO error: shed_tool_data_table_conf.xml can't be parsed -> check the config dir in the mounted PVC
    drwxr-xr-x  2 root root  4096 Jan 28 23:43 config # change owner and group
    chown -R 101:101 pvc-5293a22b-4fc0-4113-a011-e7595a48c248/config/
  6. uninstall the chart
  7. change initdbScriptsSecret: null, everything else stays as is
  8. deploy again
  9. works!

I feel like some steps shouldn't be necessary. Once, I deployed the chart correctly on the first try, but that was as admin.

pcm32 commented 3 years ago

I suspect that if you remove the RBAC, you won't be able to send jobs to run through the k8s runner (or you will send them, but they won't run on the k8s cluster; or they might run, but the Galaxy pod won't be able to monitor them). Have you checked whether that works? Make sure we are talking about jobs that are sent to containers; there is a chance that things like Upload will still run locally. Unless you ask the admin to add those permissions separately (which I think defeats the purpose a bit).

I wonder if one could allow users to handle permissions inside specific namespaces.

pcm32 commented 3 years ago

My recollection from when I first added RBACs back then (version 1, easily) is that it was for exactly this reason. At some point in the Kubernetes versions (I think after 1.3 or 1.6, some very old version) you needed the RBACs to allow the service account in the Galaxy pod to create/monitor/stop k8s Job API objects (which are used by the Galaxy k8s runner).

viktoriaas commented 3 years ago

@pcm32 That's an interesting note, I hadn't thought about it. I would like to add some specific options to shed_tool.conf, which can be done easily, but I don't know if new tools can be easily added to /galaxy/server/config/tools. I suspect all tools are already part of the Docker image?

Then I can test it, but if disabling RBACs actually breaks Galaxy, it's useless. I will try.

pcm32 commented 3 years ago

I suspect that still many of the built-in tools will send jobs as containers through kubernetes. The setup should also allow you to install tools from the toolshed through the Galaxy UI (just to test a few tools) in the admin section.

viktoriaas commented 3 years ago

I tried to load some files, but the job can't be scheduled because the cvmfs PVC is not found. I haven't enabled CVMFS; I don't want to use it. Is there another way to store loaded files, e.g. with NFS?

pcm32 commented 3 years ago

@almahmoud you were asking me if there were traces of CVMFS in another issue; I think that this

I tried to load some files but the job can't be scheduled because cvmfs pvc is not found. I haven't enabled cvmfs, I don't want to use it. Is there another way how can I store loaded files, e.g with nfs?

answers your question.

pcm32 commented 3 years ago

In previous versions I have had to explicitly disable CVMFS for things to work.
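For reference, explicitly disabling CVMFS is a values toggle. The exact key name has varied across chart versions (cvmfs.deploy in some, cvmfs.enabled in others), so the sketch below is an assumption to verify against the values.yaml of your release:

```yaml
cvmfs:
  deploy: false  # key name varies by chart version; check your values.yaml
```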

viktoriaas commented 3 years ago

@pcm32 I resolved this in a very basic way: I created PVCs with the names galaxy2-cvmfs-gxy-main-pvc and galaxy2-cvmfs-gxy-data-pvc and with my preferred storage class (the jobs asked for these specific names), and then files were added. I tried to do the same with the Helm chart by specifying the storage class under cvmfs as csi-nfs (our preferred one), but the whole deployment failed.

Now I'm trying to use my added tool. It appears in Galaxy and looks right, but when I try to run it, an error appears: /galaxy/server/database/jobs_directory/000/6/tool_script.sh: line 9: /usr/bin/python: No such file or directory

Checking for /usr/bin/python in the galaxy_job container wasn't successful; it's not there. I will try to manually create a link from python3 to python inside every container, it might help... Or could the problem be in my tool? (I'm not a Galaxy user, I'm a sysadmin, and I'm trying to deploy it before I present it to users.)

EDIT: it was my tool, I had to change the Python version.

pcm32 commented 3 years ago

So did it run the tool without RBAC on k8s, or was it a local execution (running inside the same Galaxy container)? You can check whether Galaxy job pods have been created in k8s. A big part of running Galaxy in k8s is offloading the jobs as containers into the k8s cluster; if turning off RBACs takes that away, it is less appealing (I think).

nuwang commented 3 years ago

@viktoriaas The pod priority class fix is in: https://github.com/galaxyproject/galaxy-helm/pull/215

viktoriaas commented 3 years ago

@nuwang thanks. @pcm32 sorry for the late reply. For a while I haven't been working on user deployment; instead I'm trying to deploy using an admin account. This went fine, and I have a working Galaxy instance. Also, I have successfully added my own tool. However, when I submit the job I have a missing dependency:

    import bz2file as bz2
ModuleNotFoundError: No module named 'bz2file'

I would like to ask how to resolve this. I thought there might be some directory in the PVC which serves as a dependency dir, but I haven't seen any option like this in values.yaml. It is very likely that I need to pack everything needed in one container. But then, how can I add my tool? I add the appropriate section in tool_conf.xml, but how do I map it? Should I add a new tool id under this section? I can't understand how to add my own tool.

Thanks for help!

pcm32 commented 3 years ago

If your tool is not resolved (via the requirements field within its XML) to a bioconda package, then you would need an explicit mapping to a container that has everything needed for your tool. If your tool uses more than a single conda package as a dependency, it might be that that combination has not yet made it into a multi-tool mulled container (https://github.com/BioContainers/multi-package-containers). My suggestion would be to make sure that your tool resolves to a bioconda package; then Galaxy in k8s should automatically find a container to run it (and avoid the issue that you see).

I used to handle explicit tool-to-container mappings through dynamic destinations using some YAML files, but I haven't used the tool-to-container mapping that is set up in the current version of the Helm chart; surely @nuwang or @almahmoud know the details (it is probably documented in the values file).

viktoriaas commented 3 years ago

@pcm32 Thanks, I went through the files a couple of times and achieved a successful run!

The job did finish and produced some results. Suppose I would like to download them; is that possible? Right now, when I click on the save icon, I get nginx 404 not found. Also, I can't seem to find any information on whether the output is saved on any PVC, and if yes, what its name is.

pcm32 commented 3 years ago

Now, when I click on save icon, I get nginx 404 not found.

This should work and is most likely a bug of the helm setup (provided that the tool didn't just generate an empty file).

viktoriaas commented 3 years ago

Could you point me in any direction as to what could be wrong? For storage I have used dynamic provisioning on NFS. CVMFS is disabled, although I had to create a PVC for uploading files (without it, the upload failed), but I actually haven't found any of the input or output files on the shared storage.

viktoriaas commented 3 years ago

Today I have successfully deployed as a user. I had to add 2 ClusterRoles for this particular user, but that is perfectly okay.

There is still one issue that always happens when deploying Galaxy. In the galaxy-job container, the deployment fails with

Traceback (most recent call last):
  File "/galaxy/server/scripts/galaxy-main", line 298, in <module>
    main()
  File "/galaxy/server/scripts/galaxy-main", line 294, in main
    app_loop(args, log)
  File "/galaxy/server/scripts/galaxy-main", line 137, in app_loop
    galaxy_app = load_galaxy_app(
  File "/galaxy/server/scripts/galaxy-main", line 105, in load_galaxy_app
    app = UniverseApplication(
  File "/galaxy/server/lib/galaxy/app.py", line 117, in __init__
    self._configure_tool_data_tables(from_shed_config=False)
  File "/galaxy/server/lib/galaxy/config/__init__.py", line 1164, in _configure_tool_data_tables
    self.tool_data_tables.load_from_config_file(config_filename=self.config.shed_tool_data_table_config,
  File "/galaxy/server/lib/galaxy/tools/data/__init__.py", line 120, in load_from_config_file
    tree = util.parse_xml(filename)
  File "/galaxy/server/lib/galaxy/util/__init__.py", line 233, in parse_xml
    tree = etree.parse(fname, parser=parser)
  File "src/lxml/etree.pyx", line 3521, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1839, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1865, in lxml.etree._parseDocumentFromURL
  File "src/lxml/parser.pxi", line 1769, in lxml.etree._parseDocFromFile
  File "src/lxml/parser.pxi", line 1163, in lxml.etree._BaseParser._parseDocFromFile
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 638, in lxml.etree._raiseParseError
OSError: Error reading file '/galaxy/server/config/mutable/shed_tool_data_table_conf.xml': failed to load external entity "/galaxy/server/config/mutable/shed_tool_data_table_conf.xml"

When I check the physical location, the config directory is the only one owned by root.

[root@ntc1 wes]$ ls -la pvc-5a9a9f62-dac5-4c12-a5ae-c712ccc4f221
total 64
drwxr-xr-x  8  101  101  4096 Feb 17 15:35 .
drwxrwxrwx 90 1001 1001 65536 Feb 17 15:34 ..
drwx------  2  101  101  4096 Feb 17 15:35 compiled_templates
drwxr-xr-x  2 root root  4096 Feb 17 15:35 config

I suspect user 101 cannot access the config directory in the init-mounts container, which is confirmed by the error log from that container:

cp: cannot create regular file '/galaxy/server/config/mutable/integrated_tool_panel.xml': Permission denied

I can solve this problem by adding -R in deployment-job.yaml, deployment-workflow.yaml, and deployment-web.yaml in the init container for postgres:

      initContainers:                                                           
        - name: {{ $.Chart.Name }}-init-postgres                                
          image: alpine:3.7                                                     
          command: ['sh', '-c', 'chown -R 101:101 {{ $.Values.persistence.mountPath }}; until nc -z -w3 {{ template "galaxy-postgresql.fullname" $ }} 5432; do echo waiting for galaxy-postgres service; sleep 1; done;']

It ensures that before the config files are copied, the directory is owned by 101. I tried to perform this chown using extraInitCommands or extraInitContainers, but with no luck. I would create a PR for this, but maybe there is a reason why the config directory is not owned by user 101. If that's the case, what would be the correct solution to the problem?
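One commonly used alternative to a recursive chown in an init container is a pod-level fsGroup, so that Kubernetes itself applies group ownership to mounted volumes. Whether the chart exposes this as a value depends on the version, so the fragment below is a hypothetical raw pod-spec addition (101 is the galaxy user in the image, per the logs above):

```yaml
# Hypothetical pod-spec fragment for the galaxy deployments
securityContext:
  fsGroup: 101  # volumes mounted into the pod become group-owned by 101
```

Note that fsGroup only helps for volume types that support ownership management; on some NFS setups it has no effect, which may be why it didn't help for the postgres pod earlier in this thread.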