galaxyproject / galaxy-helm

Minimal setup required to run Galaxy under Kubernetes
MIT License

Galaxy workflow handler seems to have issues communicating with RabbitMQ/AMQP #392

Open pcm32 opened 1 year ago

pcm32 commented 1 year ago

For our setup, we normally install tools (and keep them updated to new versions) through ephemeris shed-tools calls from a CI pipeline.
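
For reference, a typical invocation looks roughly like the sketch below (the Galaxy URL, API key, and tool list entry are placeholders, and exact flag names can vary between ephemeris versions):

shed-tools install \
  -g https://galaxy.example.org \
  -a "$GALAXY_API_KEY" \
  -t tool_list.yaml

# tool_list.yaml (illustrative entry, matching the tool from the traceback below)
tools:
  - name: scanpy_multiplet_scrublet
    owner: ebi-gxa
    tool_panel_section_label: Scanpy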

It seems that the process of making the other processes (beyond the web handler) aware of tool installations is not working well. First, I see authentication/timeout issues between the workflow container and AMQP:

galaxy.queue_worker INFO 2022-12-04 16:03:04,510 [pN:workflow_scheduler0,p:8,tN:Thread-6 (check)] Queuing sync task reload_toolbox for workflow_scheduler0.
galaxy.queue_worker ERROR 2022-12-04 16:03:14,519 [pN:workflow_scheduler0,p:8,tN:Thread-6 (check)] Error waiting for task: '{'task': 'reload_toolbox', 'kwargs': {}}' sent with routing key 'control.workflow_scheduler0@galaxy-dev-workflow-7b8577f98c-v4kq5'
Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/queue_worker.py", line 124, in send_task
    self.connection.drain_events(timeout=timeout)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/kombu/connection.py", line 316, in drain_events
    return self.transport.drain_events(self.connection, **kwargs)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/kombu/transport/pyamqp.py", line 169, in drain_events
    return connection.drain_events(**kwargs)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/amqp/connection.py", line 525, in drain_events
    while not self.blocking_read(timeout):
  File "/galaxy/server/.venv/lib/python3.10/site-packages/amqp/connection.py", line 530, in blocking_read
    frame = self.transport.read_frame()
  File "/galaxy/server/.venv/lib/python3.10/site-packages/amqp/transport.py", line 294, in read_frame
    frame_header = read(7, True)
  File "/galaxy/server/.venv/lib/python3.10/site-packages/amqp/transport.py", line 627, in _read
    s = recv(n - len(rbuf))
TimeoutError: timed out
galaxy.queue_worker INFO 2022-12-04 16:03:14,520 [pN:workflow_scheduler0,p:8,tN:Thread-6 (check)] Sending reload_toolbox control task.

and then when executing a workflow:

galaxy.workflow.modules WARNING 2022-12-05 09:25:33,720 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] The tool 'toolshed.g2.bx.psu.edu/repos/ebi-gxa/scanpy_multiplet_scrublet/scanpy_multiplet_scrublet/1.8.1+3+galaxy0' is missing. Cannot build workflow module.
galaxy.workflow.run ERROR 2022-12-05 09:25:33,721 [pN:workflow_scheduler0,p:8,tN:WorkflowRequestMonitor.monitor_thread] Failed to execute scheduled workflow.
Traceback (most recent call last):
  File "/galaxy/server/lib/galaxy/workflow/run.py", line 42, in __invoke
    outputs = invoker.invoke()
  File "/galaxy/server/lib/galaxy/workflow/run.py", line 142, in invoke
    remaining_steps = self.progress.remaining_steps()
  File "/galaxy/server/lib/galaxy/workflow/run.py", line 275, in remaining_steps
    self.module_injector.inject(step, step_args=self.param_map.get(step.id, {}))
  File "/galaxy/server/lib/galaxy/workflow/modules.py", line 2194, in inject
    module.add_dummy_datasets(connections=step.input_connections, steps=steps)
  File "/galaxy/server/lib/galaxy/workflow/modules.py", line 1749, in add_dummy_datasets
    raise ToolMissingException(f"Tool {self.tool_id} missing. Cannot add dummy datasets.", tool_id=self.tool_id)
galaxy.exceptions.ToolMissingException: Tool toolshed.g2.bx.psu.edu/repos/ebi-gxa/scanpy_multiplet_scrublet/scanpy_multiplet_scrublet/1.8.1+3+galaxy0 missing. Cannot add dummy datasets.

The different pods look like this:

galaxy-dev-celery-7b9d9f9585-4vmdp                                1/1     Running     0               2d22h
galaxy-dev-celery-beat-77b64fd4db-gdq2j                           1/1     Running     0               2d22h
galaxy-dev-job-0-6745c74d47-ht7jr                                 1/1     Running     1 (2d22h ago)   2d22h
galaxy-dev-maintenance-27833880-2zzgl                             0/1     Completed   0               2d7h
galaxy-dev-maintenance-27835320-lnb82                             0/1     Completed   0               31h
galaxy-dev-maintenance-27836760-wkc4h                             0/1     Completed   0               7h37m
galaxy-dev-nginx-75fc94497f-m7w4v                                 1/1     Running     0               2d22h
galaxy-dev-postgres-77d867c998-8bq7f                              1/1     Running     0               2d22h
galaxy-dev-rabbitmq-865b44f65f-vgvkl                              1/1     Running     0               2d22h
galaxy-dev-rabbitmq-messaging-topology-operator-7b67965f9444lsk   1/1     Running     0               2d22h
galaxy-dev-rabbitmq-server-server-0                               1/1     Running     0               2d22h
galaxy-dev-tusd-6bf6456765-k7ksw                                  1/1     Running     0               2d22h
galaxy-dev-web-568f8c6f75-lmmx5                                   1/1     Running     2 (2d22h ago)   2d22h
galaxy-dev-workflow-7b8577f98c-v4kq5                              1/1     Running     1 (2d22h ago)   2d22h
galaxy-galaxy-dev-postgres-0                                      1/1     Running     0               18d

So everything looks healthy, as far as I can tell.

My values.yaml doesn't change any aspect of the RabbitMQ config. I suspect that on a restart the handlers will pick up the tools, which would fix the problem transiently; but of course new tool installations should become visible to all processes (web, job, and workflow handlers) without a restart.
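
One way to narrow this down (a sketch: the pod name comes from the listing above, and the AMQP URL is a placeholder for whatever amqp_internal_connection the chart renders into galaxy.yml) is to test the broker connection from inside the workflow pod with kombu, the same library Galaxy uses:

kubectl exec -it galaxy-dev-workflow-7b8577f98c-v4kq5 -- \
  /galaxy/server/.venv/bin/python -c "
import kombu
# Substitute the real amqp_internal_connection value from the running config
conn = kombu.Connection('amqp://USER:PASS@galaxy-dev-rabbitmq-server:5672//')
conn.connect()  # raises on authentication or network failure
print('connected:', conn.connected)
conn.release()
"

If that connects cleanly but the handler still times out waiting on its control queue, the problem is more likely on the routing/consumer side than basic connectivity or credentials.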

pcm32 commented 1 year ago

The requested tool does seem to be installed on web:

[screenshot: the requested tool shown as installed on the web handler]
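
Another way to confirm what the web handler sees (a sketch; the host and API key are placeholders) is to ask the tools API for the exact tool id from the traceback:

curl -s "https://galaxy.example.org/api/tools/toolshed.g2.bx.psu.edu/repos/ebi-gxa/scanpy_multiplet_scrublet/scanpy_multiplet_scrublet/1.8.1+3+galaxy0?key=$GALAXY_API_KEY"

A successful response here, alongside the ToolMissingException in the workflow handler, would confirm that the two processes have diverging toolboxes.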

pcm32 commented 1 year ago

Restarting the workflow and job containers made this work (but I haven't tried re-installing tools to see whether they would be picked up).
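
For anyone hitting the same thing, the equivalent restart (deployment names inferred from the pod listing above) would be something like:

kubectl rollout restart deployment/galaxy-dev-workflow deployment/galaxy-dev-job-0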

nuwang commented 1 year ago

@pcm32 I spoke to Enis and Keith, and none of us recall experiencing this issue on k8s. If you're experiencing this regularly, maybe there is a networking issue in your k8s cluster? Is there anything in the kube-proxy logs or other host logs that might indicate an issue?

We have, however, seen this error on usegalaxy.au (non-Kubernetes), where a RabbitMQ restart would result in the above error and the handlers would need to be restarted to recover. That is a resilience issue on the Galaxy side, and probably needs a bug logged in the Galaxy repo.