Open nuwang opened 5 years ago
While we have a liveness/readiness probe for the web handler, we do not have a readiness/liveness probe for the job handler.
A first pass at the job handler readiness probe (by @luke-c-sargent): https://github.com/galaxyproject/galaxy-helm/pull/45
There is also some WIP on the same branch for the liveness probe, which relies on a (future) PR to Galaxy that will provide a way to determine the status of a job handler. This is a more challenging undertaking because job handlers operate as two independent loops. One monitors running jobs at a regular interval (e.g., https://github.com/galaxyproject/galaxy/blob/aa442b3dc0958cec697fd90997be13508e0555e7/lib/galaxy/jobs/runners/__init__.py#L674). The 'outer' loop that checks for submitted jobs, however, blocks until something shows up in the queue (e.g., https://github.com/galaxyproject/galaxy/blob/aa442b3dc0958cec697fd90997be13508e0555e7/lib/galaxy/jobs/runners/__init__.py#L115), so it cannot update the process status at regular intervals. One option that was discussed with @natefoo was to change that loop so that it loops continuously with a sleep instead of the current blocking wait.
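To make the blocking-loop problem concrete, here is a rough sketch of how a queue-consuming loop could report liveness on a regular cadence. It is not Galaxy code: `HeartbeatWorker`, `poll_interval`, `last_heartbeat`, and `is_alive` are hypothetical names, and it uses `queue.Queue.get(timeout=...)` as one variant of the "loop with sleep" idea rather than a literal `sleep` call.

```python
import queue
import threading
import time


class HeartbeatWorker:
    """Hypothetical stand-in for a job handler's queue-consuming loop."""

    def __init__(self, poll_interval=0.05):
        self.work_queue = queue.Queue()
        self.poll_interval = poll_interval  # assumed value, tune as needed
        self.last_heartbeat = time.monotonic()
        self.processed = []
        self._stop = threading.Event()

    def run(self):
        while not self._stop.is_set():
            # Record liveness on every pass, whether or not work arrived.
            self.last_heartbeat = time.monotonic()
            try:
                # A bare get() would block indefinitely on an empty queue;
                # the timeout bounds how long the loop can stay silent.
                item = self.work_queue.get(timeout=self.poll_interval)
            except queue.Empty:
                continue
            self.processed.append(item)

    def stop(self):
        self._stop.set()

    def is_alive(self, max_age):
        # What a liveness probe would ask: was there a recent heartbeat?
        return (time.monotonic() - self.last_heartbeat) <= max_age
```

With something along these lines, a probe only needs to compare the heartbeat age against a threshold. In a real deployment the timestamp would have to be exposed outside the process (a file or an endpoint, for example), which this sketch leaves out.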
Finally, while not directly related to liveness/readiness probes, removing a job handler does not reassign its previously associated jobs to the new handlers. Hence, if a handler is removed while a job is running, the job will be 'lost'. It is probably best to track that as a separate issue under the galaxy repo, but I wanted to note it here for now until we experiment with it and better understand what needs to be done.
To start off, we can reuse the probes from the v2 chart: https://github.com/galaxyproject/galaxy-kubernetes/blob/develop/galaxy-stable/templates/deployment.yaml#L96