galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

After toolshed tools installation through REST API (ephemeris), I need to restart web handlers or job handlers #9483

Open pcm32 opened 4 years ago

pcm32 commented 4 years ago

I see this issue both on Kubernetes and on HPC. When running multiple web handlers and sending calls to install tools from the toolshed with ephemeris, I find that I need to restart the web handlers afterwards, otherwise each web handler sees a different set of tools. I have also noticed that I sometimes need to restart workflow handlers.

How is the successful installation of a tool communicated by the installing web handler to the other handler processes? Am I missing some message-passing configuration here or something of the sort?

Thanks!

mvdbeek commented 4 years ago

The handlers choose one among themselves to watch the config files. The watcher then sends a message via kombu to all other handlers telling them to reload the config files. This means you need to have unique handler names -- is that the case?
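
A heavily simplified sketch of the idea (illustrative only, not Galaxy's actual queue code; the exchange, queue names, server names and transport URL are invented for the demo): each process consumes from a control queue keyed by its unique server name, and in this sketch the sender publishes a copy of the reload task to each of those queues -- which is also why duplicate server names collapse onto the same queue and some processes never see the message.

from kombu import Connection, Exchange, Queue

control_exchange = Exchange("control_demo", type="direct")
server_names = ["handler_host_1_0", "handler_host_1_1"]  # must be unique per process

with Connection("memory://") as conn:  # demo transport only; Galaxy uses the DB or an AMQP broker
    queues = [
        Queue("control.%s" % name, control_exchange, routing_key="control.%s" % name)
        for name in server_names
    ]
    received = []

    def on_message(body, message):
        # a real handler would reload its toolbox at this point
        received.append(body["task"])
        message.ack()

    with conn.Consumer(queues, callbacks=[on_message], accept=["json"]):
        producer = conn.Producer(serializer="json")
        for queue in queues:  # one copy of the control task per uniquely named queue
            producer.publish(
                {"task": "reload_toolbox"},
                exchange=control_exchange,
                routing_key=queue.routing_key,
                declare=[control_exchange, queue],
            )
        while len(received) < len(server_names):
            conn.drain_events(timeout=2)

print(received)  # ['reload_toolbox', 'reload_toolbox']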

pcm32 commented 4 years ago

Yes, my handlers do have unique names AFAIK. How does kombu communicate across different processes and machines? Through the file system? Network connections - does it need any ports to be accessible? Is there any logging from kombu? Could it be that it is overwhelmed by too many tool installations? Thanks @mvdbeek!

mvdbeek commented 4 years ago

It uses the database connection, or an AMQP connection if that is configured. Can you post the logs of a tool installation that doesn't trigger a reload?
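
(For reference, the effective kombu transport URL then looks something like sqlalchemy+postgresql://galaxy:***@db-host/galaxy when it rides on the database connection, or amqp://galaxy:***@rabbitmq-host:5672/galaxy with a dedicated broker -- illustrative values only, the exact form depends on your config.)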

pcm32 commented 4 years ago

Will try to find one; I remember looking at them at some point when installing but didn't see anything strange.

pcm32 commented 4 years ago

Ok, I'm going to give as much detail as I can here.

1. How we run the processes (to establish whether they have different names):

Currently on our HPC, processes are controlled by supervisor and they are executed like this:

supervisor_programs:
  - name: 'nginx'
    command: /usr/bin/bash -c "source ~/.bashrc && nginx -c {{ nginx_config_file }} -g 'daemon off;'"
    state: present
    configuration: |
      autostart: true
      autorestart: true
      stopasgroup: true
      stopsignal: QUIT
  - name: 'galaxy_web'
    command: "{{ galaxy_venv_dir }}/bin/uwsgi --virtualenv {{ galaxy_venv_dir }} --{{ galaxy_config_style }} {{ galaxy_cf_file_sv }} --logdate --thunder-lock --master --processes {{ galaxy_config.uwsgi.processes }} --threads {{ galaxy_config.uwsgi.threads }} --logto {{ galaxy_server_dir }}/galaxy_web_%(host_node_name)s.log --socket {{ galaxy_config.uwsgi.socket }} --pythonpath lib --pidfile2 {{ galaxy_server_dir }}/galaxy_web_%(host_node_name)s.pid --stats 127.0.0.1:9191 -b 16384"
    state: present
    configuration: |
      directory: {{ galaxy_server_dir }}
      environment: TMPDIR="{{ nobackup_base }}galaxy_working_dirs/tmp_gxa_galaxy"
      umask: 022
      autostart: true
      autorestart: true
      startsecs: 20
      environment: PATH="{{ galaxy_venv_dir }}:{{ galaxy_venv_dir }}/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"
      numprocs: 1
      stopsignal: QUIT
      startretries: 1
  - name: 'galaxy_handler'
    command: /usr/bin/bash -c "source ~/.bashrc && source {{ galaxy_venv_dir }}/bin/activate && python ./lib/galaxy/main.py -c {{ galaxy_cf_file_sv }} --server-name=handler_%(host_node_name)s_%(process_num)s --attach-to-pool job-handlers --log-file={{ galaxy_server_dir }}/handler_%(host_node_name)s_%(process_num)s.log"
    state: present
    configuration: |
      directory: {{ galaxy_server_dir }}
      environment: TMPDIR="{{ nobackup_base }}galaxy_working_dirs/tmp_gxa_galaxy"
      process_name: galaxy_handler%(process_num)s
      process_num: 1
      numprocs: {{ galaxy_handler_processes }}
      umask: 022
      autostart: true
      autorestart: true
      startsecs: 20
      ;user: {{ galaxy_user }}
      stopasgroup: true
      startretries: 1
  - name: 'galaxy_workflow_handler_refresh'
    command: /usr/bin/bash -c "source ~/.bashrc && refresh_workflow_handler.sh"
    state: present
    configuration: |
      environment: PGPASSFILE="/homes/{{ galaxy_user }}/.pgpass",PGUSER="{{ db_user }}",PGHOST="{{ db_host }}",PGDATABASE="{{ db_database }}"
      stopsignal: KILL
      stopasgroup: true

So that should guarantee that each job handler has a different server name, even across hosts (see the expansion sketched below).
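
To make that concrete, and assuming supervisor expands %(host_node_name)s to the node's host name and %(process_num)s counts up from numprocs_start (default 0), the handler template above should yield server names like:

--server-name=handler_host_1_0
--server-name=handler_host_1_1
--server-name=handler_host_1_2
--server-name=handler_host_2_0
...

i.e. one unique name per handler process and per host.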

2. Logs on update of tools

For this setup, the last update of tools via ephemeris happened on 18/03/2020 at roughly 4:12 AM (and onwards). If I look at the logs of all the handlers at that time:

host_1_handler_1: nothing on the tool update:

galaxy.jobs DEBUG 2020-03-17 17:07:23,735 job 22472 ended (finish() executed in (333.733 ms))
galaxy.jobs.handler DEBUG 2020-03-18 18:54:52,952 Grabbed job(s): 22493

host_1_handler_2:

galaxy.jobs DEBUG 2020-03-17 17:08:19,572 job 22481 ended (finish() executed in (683.609 ms))
galaxy.jobs.handler DEBUG 2020-03-18 18:54:57,084 Grabbed job(s): 22501, 22500

host_1_handler_3: nothing on the tool update:

galaxy.jobs DEBUG 2020-03-17 17:08:07,949 job 22479 ended (finish() executed in (361.555 ms))
galaxy.jobs.handler DEBUG 2020-03-18 18:55:04,011 Grabbed job(s): 22518, 22519, 22517

host_1_webhandler: you see a lot about installs:

tool_shed.galaxy_install.repository_dependencies.repository_dependency_manager DEBUG 2020-03-18 04:14:15,905 [p:115667,w:2,m:0] [uWSGIWorker2Core0] Creating repository dependency objects...
tool_shed.galaxy_install.repository_dependencies.repository_dependency_manager DEBUG 2020-03-18 04:14:17,892 [p:115667,w:2,m:0] [uWSGIWorker2Core0] Skipping installation of revision 7558a3d8338e of repository 'scanpy_find_variable_genes' because it was installed with the (possibly updated) revision cb007db0857d and its current installation status is 'Installed'.
tool_shed.galaxy_install.repository_dependencies.repository_dependency_manager DEBUG 2020-03-18 04:14:17,892 [p:115667,w:2,m:0] [uWSGIWorker2Core0] Building repository dependency relationships...
10.7.127.33 - - [18/Mar/2020:04:14:14 +0100] "POST /api/tool_shed_repositories/new/install_repository_revision?key=33d4655493e67634ecedfeef9c9b4158 HTTP/1.1" 200 - "-" "python-requests/2.22.0"
[pid: 115667|app: 0|req: 6966/27873] 10.7.127.33 () {42 vars in 773 bytes} [Wed Mar 18 04:14:14 2020] POST /api/tool_shed_repositories/new/install_repository_revision?key=33d4655493e67634ecedfeef9c9b4158 => generated 131 bytes in 2906 msecs (HTTP/1.1 200) 3 headers in 139 bytes (1 switches on core 0)
...
tool_shed.galaxy_install.repository_dependencies.repository_dependency_manager DEBUG 2020-03-18 04:16:38,954 [p:115669,w:3,m:0] [uWSGIWorker3Core0] Skipping installation of revision 544d45fca48f of repository 'scanpy_run_pca' because it was installed with the (possibly updated) revision 242cf7e1fd0c and its current installation status is 'Installed'.
tool_shed.galaxy_install.repository_dependencies.repository_dependency_manager DEBUG 2020-03-18 04:16:38,955 [p:115669,w:3,m:0] [uWSGIWorker3Core0] Building repository dependency relationships...
10.7.127.33 - - [18/Mar/2020:04:16:36 +0100] "POST /api/tool_shed_repositories/new/install_repository_revision?key=33d4655493e67634ecedfeef9c9b4158 HTTP/1.1" 200 - "-" "python-requests/2.22.0"
[pid: 115669|app: 0|req: 6976/27918] 10.7.127.33 () {42 vars in 773 bytes} [Wed Mar 18 04:16:36 2020] POST /api/tool_shed_repositories/new/install_repository_revision?key=33d4655493e67634ecedfeef9c9b4158 => generated 131 bytes in 2745 msecs (HTTP/1.1 200) 3 headers in 139 bytes (1 switches on core 0)
tool_shed.galaxy_install.repository_dependencies.repository_dependency_manager DEBUG 2020-03-18 04:16:39,915 [p:115671,w:4,m:0] [uWSGIWorker4Core0] Creating repository dependency objects...
tool_shed.galaxy_install.repository_dependencies.repository_dependency_manager DEBUG 2020-03-18 04:16:42,327 [p:115671,w:4,m:0] [uWSGIWorker4Core0] Skipping installation of revision c990e40fbd46 of repository 'scanpy_run_pca' because it was installed with the (possibly updated) revision 8b4dae3748d3 and its current installation status is 'Installed'.
tool_shed.galaxy_install.repository_dependencies.repository_dependency_manager DEBUG 2020-03-18 04:16:42,327 [p:115671,w:4,m:0] [uWSGIWorker4Core0] Building repository dependency relationships...
10.7.127.33 - - [18/Mar/2020:04:16:38 +0100] "POST /api/tool_shed_repositories/new/install_repository_revision?key=33d4655493e67634ecedfeef9c9b4158 HTTP/1.1" 200 - "-" "python-requests/2.22.0"
[pid: 115671|app: 0|req: 6989/27919] 10.7.127.33 () {42 vars in 773 bytes} [Wed Mar 18 04:16:38 2020] POST /api/tool_shed_repositories/new/install_repository_revision?key=33d4655493e67634ecedfeef9c9b4158 => generated 131 bytes in 3363 msecs (HTTP/1.1 200) 3 headers in 139 bytes (1 switches on core 0)
tool_shed.galaxy_install.repository_dependencies.repository_dependency_manager DEBUG 2020-03-18 04:16:43,447 [p:115665,w:1,m:0] [uWSGIWorker1Core0] Creating repository dependency objects...
tool_shed.galaxy_install.repository_dependencies.repository_dependency_manager DEBUG 2020-03-18 04:16:45,387 [p:115665,w:1,m:0] [uWSGIWorker1Core0] Skipping installation of revision c990e40fbd46 of repository 'scanpy_run_pca' because it was installed with the (possibly updated) revision 8b4dae3748d3 and its current installation status is 'Installed'.
...
tool_shed.galaxy_install.repository_dependencies.repository_dependency_manager DEBUG 2020-03-18 04:17:27,565 [p:115669,w:3,m:0] [uWSGIWorker3Core0] Creating repository dependency objects...
tool_shed.galaxy_install.repository_dependencies.repository_dependency_manager DEBUG 2020-03-18 04:17:29,465 [p:115669,w:3,m:0] [uWSGIWorker3Core0] Skipping installation of revision dbbe1ea8ecb1 of repository 'scanpy_plot' because it was installed with the (possibly updated) revision 8599fd07d140 and its current installation status is 'Installed'.
tool_shed.galaxy_install.repository_dependencies.repository_dependency_manager DEBUG 2020-03-18 04:17:29,466 [p:115669,w:3,m:0] [uWSGIWorker3Core0] Building repository dependency relationships...
10.7.127.33 - - [18/Mar/2020:04:17:26 +0100] "POST /api/tool_shed_repositories/new/install_repository_revision?key=33d4655493e67634ecedfeef9c9b4158 HTTP/1.1" 200 - "-" "python-requests/2.22.0"
[pid: 115669|app: 0|req: 6981/27938] 10.7.127.33 () {42 vars in 773 bytes} [Wed Mar 18 04:17:26 2020] POST /api/tool_shed_repositories/new/install_repository_revision?key=33d4655493e67634ecedfeef9c9b4158 => generated 131 bytes in 2769 msecs (HTTP/1.1 200) 3 headers in 139 bytes (1 switches on core 0)
10.52.16.158 - - [18/Mar/2020:12:14:07 +0100] "GET / HTTP/1.1" 200 - "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36"

So nothing out of the ordinary.

3. Error manifestation: over the next hours we see in handler 3:

galaxy.workflow.modules INFO 2020-03-18 18:55:27,869 Exact tool specified during workflow module creation for [toolshed.g2.bx.psu.edu/repos/ebi-gxa/scanpy_run_pca/scanpy_run_pca/1.4.3+galaxy9] but couldn't find correct version [1.4.3+galaxy10].
galaxy.workflow.modules WARNING 2020-03-18 18:55:27,873 The tool 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/scanpy_run_pca/scanpy_run_pca/1.4.3+galaxy9' is missing. Cannot build workflow module.
galaxy.workflow.run ERROR 2020-03-18 18:55:27,873 Failed to execute scheduled workflow.
Traceback (most recent call last):
  File "/nfs/galaxy-19.05/lib/galaxy/workflow/run.py", line 83, in __invoke
    outputs = invoker.invoke()
  File "/nfs/galaxy-19.05/lib/galaxy/workflow/run.py", line 174, in invoke
    remaining_steps = self.progress.remaining_steps()
  File "/nfs/galaxy-19.05/lib/galaxy/workflow/run.py", line 306, in remaining_steps
    self.module_injector.inject(step, step_args=self.param_map.get(step.id, {}))
  File "/nfs/galaxy-19.05/lib/galaxy/workflow/modules.py", line 1492, in inject
    module.add_dummy_datasets(connections=step.input_connections, steps=steps)
  File "/nfs/galaxy-19.05/lib/galaxy/workflow/modules.py", line 1056, in add_dummy_datasets
    tool_id=self.tool_id)
ToolMissingException: Tool toolshed.g2.bx.psu.edu/repos/ebi-gxa/scanpy_run_pca/scanpy_run_pca/1.4.3+galaxy9 missing. Cannot add dummy datasets.

The same workflow runs fine in the other handlers. I have sometimes seen web handlers fail similarly, although in that case the problem manifests differently in the web application. I have posted an example from our HPC, but the same happens on our k8s setup. The issue is fixed by restarting the handlers. Thanks!

mvdbeek commented 4 years ago

Is that 19.05? So many things have changed since then regarding tool reloads that I am not comfortable with changing anything there, and most likely this has been resolved already.

pcm32 commented 4 years ago

Well, on k8s the same happens on 20.01.

pcm32 commented 4 years ago

We are on 19.05 at the HPC because the ansible roles had issues when trying to move to 19.09 or so at the end of last year.

mvdbeek commented 4 years ago

That's something we can look into then. You're looking for

galaxy.queue_worker INFO 2020-03-24 14:54:14,302 Sending reload_toolbox control task.
galaxy.queue_worker INFO 2020-03-24 14:54:14,309 Instance 'main' received 'reload_toolbox' task, executing now.

in the logs.
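
With the log file layout from your supervisor config above, something like

grep -H "reload_toolbox" galaxy_web_*.log handler_*.log

run from the Galaxy server directory should show which process sent the control task and which processes received it (adjust the paths to wherever your logs actually end up).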

mvdbeek commented 4 years ago

And could you maybe check if https://github.com/galaxyproject/galaxy/pull/9528/commits/b67044e6377ec955c683da0a57c370adcbdfc65e makes a difference, in case you're not seeing all of your handlers respond with Instance '<your server name>' received 'reload_toolbox' task, executing now.

pcm32 commented 4 years ago

Thanks, will give those a try/check and report as soon as I can.

pcm32 commented 4 years ago

I spotted an error when Galaxy starts (20.01 release branch), where kombu throws an exception in some of the uWSGI processes, which I believe could be leaving the kombu part inactive on those processes (and hence they don't get updated on new tool installs):

https://gist.github.com/pcm32/7b7ca9c53105570e30d445ef7be2b0fb#file-logs_with_issues_kombu_tools_installations-log-L1639 (this is right at the beginning).

The database was set up through the ansible role galaxyproject.galaxy, and this is the first run on top of it (so before the ansible role, the db was completely empty).

This is set up with ansible roles on CentOS. Let me know @mvdbeek if you need to know any details of the config that I'm setting up in galaxy.yml or nginx.

That also contains other errors on tool installations that I need to investigate (socket related - any advice appreciated), but this instance also shows the issue that tools don't appear on all the web processes (so the tools that you see depend on which web process you land on). As I mentioned, I see this behaviour in both 19.05 and 20.01.

pcm32 commented 4 years ago

Also, you can see the syncing errors for one of the uWSGI processes here: https://gist.github.com/pcm32/7b7ca9c53105570e30d445ef7be2b0fb#file-logs_with_issues_kombu_tools_installations-log-L2074

pcm32 commented 4 years ago

All of those occurrences happen on uWSGI worker 4, which is the same one that has the initial exception, so I guess that explains it. For some reason, on startup, one of the web workers suffers the problem shown at L1639 above, and then every time another process sends out a tool update notification, it fails to pick it up.

pcm32 commented 4 years ago

For the record @mvdbeek https://github.com/galaxyproject/galaxy/commit/b67044e6377ec955c683da0a57c370adcbdfc65e is already part of what I'm trying :-(. Any ideas? Thanks!

mvdbeek commented 4 years ago

@pcm32 you need to make sure you don't see https://gist.github.com/pcm32/7b7ca9c53105570e30d445ef7be2b0fb#file-logs_with_issues_kombu_tools_installations-log-L2390

One way to deal with this is to switch to using RabbitMQ or any other transport supported by kombu. amqp_internal_connection is the setting in galaxy.yml.
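
For example (hypothetical broker URL and credentials, adjust to your RabbitMQ setup), in galaxy.yml:

galaxy:
  # any broker/transport URL that kombu understands works here
  amqp_internal_connection: pyamqp://galaxy:secret@rabbitmq.example.org:5672/galaxy_internal

All web and handler processes need to be able to reach that broker.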