galaxyproject / pulsar

Distributed job execution application built for Galaxy
https://pulsar.readthedocs.io
Apache License 2.0

Kubernetes Coexecution jobs are only removed under some circumstances #286

Open natefoo opened 2 years ago

natefoo commented 2 years ago

Such as:

Circumstances where the job is not removed:

This last one is a source of job "loss" (jobs stuck in a non-terminal state) because Pulsar will never send a terminal status update. The runner should probably poll for this case (as in galaxyproject/galaxy#9911).
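To make the polling idea concrete, here is a minimal sketch of a poll-based fallback. It is not taken from Galaxy or Pulsar: `poll_until_terminal`, the `get_state` callable, and the state names in `TERMINAL_STATES` are all hypothetical and only illustrate how a runner could detect a terminal outcome on its own instead of waiting for a message that never arrives.

```python
import time
from typing import Callable, Optional

# Hypothetical state names; a real runner would map these from whatever
# the Kubernetes API reports for the job (or from its absence).
TERMINAL_STATES = {"succeeded", "failed", "missing"}


def poll_until_terminal(
    get_state: Callable[[], str],
    interval: float = 5.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Optional[str]:
    """Call get_state() repeatedly until it reports a terminal state.

    Returns the terminal state, or None if max_polls is reached first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        polls += 1
        sleep(interval)
    return None


# Usage with a canned sequence of states standing in for real API lookups.
_states = iter(["running", "running", "failed"])
result = poll_until_terminal(lambda: next(_states), interval=0, sleep=lambda _: None)
```

The `sleep` parameter is injectable only so the loop can be exercised without real delays; in a runner the polling would more likely hang off its existing monitor loop.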

The quickest and easiest (and, IMO, correct) solution would be to set the TTL in the job template, as described in the docs. It would also be a good idea to call MessageCoexecutionPodJobClient.kill() for every job once its terminal message is received.
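For reference, a rough sketch of what the TTL option could look like, assuming "the TTL" refers to the Kubernetes Job field `ttlSecondsAfterFinished` (the issue itself does not name the field). The `with_ttl` helper and the example manifest are hypothetical and not part of Pulsar; how Pulsar actually builds its job template may differ.

```python
import copy


def with_ttl(job_manifest: dict, ttl_seconds: int = 300) -> dict:
    """Return a copy of a Kubernetes Job manifest with a finished-job TTL set.

    Assumes the TTL in question is spec.ttlSecondsAfterFinished, which asks
    Kubernetes to delete the Job (and its pods) some time after it finishes.
    """
    if ttl_seconds < 0:
        raise ValueError("ttl_seconds must be non-negative")
    manifest = copy.deepcopy(job_manifest)  # leave the caller's dict untouched
    manifest.setdefault("spec", {})["ttlSecondsAfterFinished"] = ttl_seconds
    return manifest


# Hypothetical minimal Job manifest, for illustration only.
example_job = {
    "apiVersion": "batch/v1",
    "kind": "Job",
    "metadata": {"name": "example-coexecution-job"},
    "spec": {"template": {"spec": {"restartPolicy": "Never", "containers": []}}},
}

patched = with_ttl(example_job, ttl_seconds=600)
```

With a TTL in place, cleanup of finished jobs no longer depends on Pulsar or the runner noticing the terminal state, which is why it covers the cases where no explicit removal happens.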