Open AlanCoding opened 2 years ago
maybe we could make use of the AutoscalePool here
This should be tremendously simple to do. The callback receiver is not an autoscale pool, nor do we want it to be. If you look at `AutoscalePool.cleanup`, it forgets about the workers by doing `self.workers.remove(w)`. In the case of the callback receiver, you don't need to do any of the other stuff it does there. Eventually, the removed worker will be remedied by a natural call to `self.pool.up()`, which could be called directly in this case. Alternatively, you could do something as mindlessly simple as `w.start() for w in self.pool.workers if (not w.alive)`. I don't know if that's the best idea, but at surface level it would fix the issue I described here.
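One caveat on the `w.start()` idea: a `multiprocessing.Process` object can be started at most once, so if a worker wraps a `Process` (an assumption about the pool's internals, not something verified here), a dead worker has to be replaced rather than restarted. A minimal sketch of that, where `Worker`, `replace_dead_workers`, and `make_worker` are hypothetical names for illustration and not AWX code:

```python
import multiprocessing


class Worker:
    """Hypothetical stand-in for a pool worker; not AWX's actual class."""

    def __init__(self, target, args=()):
        self.process = multiprocessing.Process(target=target, args=args, daemon=True)

    def start(self):
        self.process.start()

    @property
    def alive(self):
        return self.process.is_alive()


def replace_dead_workers(workers, make_worker):
    """Swap every dead worker in the list for a freshly started one.

    `workers` is a mutable list of objects exposing `.alive` and `.start()`.
    `make_worker` is a zero-argument factory returning an unstarted worker.
    Returns the number of workers that were replaced.
    """
    replaced = 0
    for idx, worker in enumerate(workers):
        if not worker.alive:
            # A Process cannot be started twice, so build a new worker
            # and put it in the dead worker's slot.
            fresh = make_worker()
            fresh.start()
            workers[idx] = fresh
            replaced += 1
    return replaced
```

Assuming `self.pool.workers` is a plain list, the parent's main loop could call something like `replace_dead_workers(self.pool.workers, factory)` on each iteration instead of only sleeping.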
@AlanCoding please link that other issue
Yes, the related issue I had in mind was https://github.com/ansible/awx/issues/12103
Then out of the solutions discussed for that issue, one was that the API sends a websocket message when all events for a job are processed. The WIP I toyed with for that purpose involved having the callback receiver main process aggregate counts from worker queues.
This is only relevant because, from the development side, it would be the responsibility of the same process discussed here: the main callback receiver process, which currently sleeps in an infinite loop.
Feature Summary
To demonstrate with the development environment, run `ps aux --forest`.
If you kill the process `421`, then supervisor will restart it, as you would expect. However, if you kill any of the subprocesses, like `915`, then they will never restart, and the system will continue on forever with fewer workers (including none). You can observe this if you log `[w.alive for w in self.pool.workers]` in the main loop for the parent process.
As @jainnikhil30 made a config change (that will affect some systems, eventually maybe all systems) to give more callback receiver workers, that somewhat elevates the priority of this, since with more workers, I can imagine more potential problems.
Normally, workers will never exit, because everything they do should be fully captured by a try-except inside the worker loop. You can force an exit by manually running a kill command against the worker process, but that contrived situation is the only case I know of where this happens.
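The contrived kill can be reproduced outside AWX. A minimal sketch, assuming a Unix host and plain `multiprocessing` children rather than the callback receiver's own worker class; `idle_forever` and `kill_one_child` are hypothetical names used only for this illustration:

```python
import multiprocessing
import os
import signal
import time


def idle_forever():
    """Stand-in for a worker's run loop."""
    while True:
        time.sleep(1)


def kill_one_child(num_workers=3):
    """Start children, SIGKILL the first, and report liveness before and after."""
    # The "fork" start method is assumed to keep the sketch short; it is Unix-only.
    ctx = multiprocessing.get_context("fork")
    workers = [ctx.Process(target=idle_forever, daemon=True) for _ in range(num_workers)]
    for w in workers:
        w.start()
    before = [w.is_alive() for w in workers]

    # Simulate an operator running `kill -9` on one subprocess.
    os.kill(workers[0].pid, signal.SIGKILL)
    workers[0].join(timeout=5)  # reap the killed child
    after = [w.is_alive() for w in workers]

    # Clean up the surviving children.
    for w in workers[1:]:
        w.terminate()
        w.join(timeout=5)
    return before, after
```

Nothing in `multiprocessing` brings the killed child back on its own, so the second list keeps a `False` entry until the parent acts on it.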
This issue proposes that the parent process should recognize when a worker is dead, and restart it.