Open pcm32 opened 1 year ago
The requested tool does seem to be installed on web:
Restarting the workflow and job containers made this work (but I haven't re-installing tools to see if they would be picked up).
@pcm32 I spoke to Enis and Keith, and none of us recall experiencing this issue on k8s. If you're experiencing this regularly, maybe there is a networking issue in your k8s cluster? Anything in the kubeproxy logs or other host logs that may indicate an issue?
We have however, seen this error on usegalaxy.au (non-kubernetes), where a rabbitmq restart would result in the above error, and handlers would need to be restarted to recover. That is a resilience issue on the Galaxy side, and probably needs a bug logged in the Galaxy repo.
For our setup, we normally install (and keep tools updated to new versions) through ephemeris shed-tools calls from a CI.
It seems that the process of making all the different processes aware beyond the web handler on tools installation is not working well. First I see authentication/timeout issues between the workflow container and AMPQ:
and then when executing a workflow:
The different pods look like this:
so all healthy in my view.
my values.yaml doesn't change any aspect of the rabbit config. I suspect that on a restart, the install will pick up the tools, sorting the problem transiently. But of course what should happen is that new tool version installs should appear on all processes (web, job, workflow handlers) without a restart.