Callback receiver worker processes are not restarted if they are killed

AlanCoding commented 2 years ago

Please confirm the following

[X] I agree to follow this project's code of conduct.
[X] I have checked the current issues for duplicates.
[X] I understand that AWX is open source software provided for free and that I might not receive a timely response.

Feature Summary

To demonstrate with the development environment, running ps aux --forest:

awx          191  0.0  0.0   6044  2692 pts/0    S    14:01   0:00  \_ make receiver
awx          421  1.4  0.4 159236 136004 pts/0   S    14:02   0:03  |   \_ python3.9 manage.py run_callback_receiver
awx          915  0.0  0.3 158788 122820 pts/0   S    14:02   0:00  |       \_ python3.9 manage.py run_callback_receiver
awx          918  0.0  0.3 158800 122716 pts/0   S    14:02   0:00  |       \_ python3.9 manage.py run_callback_receiver
awx          921  0.0  0.3 163316 127912 pts/0   S    14:02   0:00  |       \_ python3.9 manage.py run_callback_receiver
awx          924  0.0  0.3 158824 122524 pts/0   S    14:02   0:00  |       \_ python3.9 manage.py run_callback_receiver

If you kill process 421, supervisor will restart it, as you would expect. However, if you kill any of the subprocesses, such as 915, it is never restarted, and the system continues indefinitely with fewer workers (possibly none).

You can observe this if you log [w.alive for w in self.pool.workers] in the main loop for the parent process.
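As a rough illustration of that observation outside of AWX, here is a minimal sketch using only the standard library. The names `idle_worker`, `start_workers`, and `liveness` are hypothetical, and the "fork" start method assumes a POSIX host; this is not the callback receiver's actual code.

```python
import multiprocessing
import time

# Assumption: POSIX host, so the "fork" start method is available.
ctx = multiprocessing.get_context("fork")


def idle_worker():
    """Hypothetical worker body that never exits on its own."""
    while True:
        time.sleep(0.1)


def start_workers(count):
    """Start `count` child processes and return them."""
    procs = [ctx.Process(target=idle_worker, daemon=True) for _ in range(count)]
    for p in procs:
        p.start()
    return procs


def liveness(procs):
    """Equivalent in spirit to logging [w.alive for w in self.pool.workers]."""
    return [p.is_alive() for p in procs]
```

Killing one child (for example with `procs[0].kill()` followed by `procs[0].join()`) flips its entry to `False`, and nothing in the parent brings it back.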

@jainnikhil30 made a config change that gives more callback receiver workers (it will affect some systems and, eventually, maybe all of them). That somewhat elevates the priority of this, since with more workers I can imagine more potential problems.

Normally, workers never exit, and everything should be fully captured by a try-except in the worker loop. You can reproduce the problem by manually running kill against a worker process, but that contrived situation is the only way I know of to trigger it.

This issue proposes that the parent process should recognize when a worker is dead, and restart it.

Select the relevant components

[ ] UI
[X] API
[ ] Docs
[ ] Collection
[ ] CLI
[ ] Other

fosterseth commented 2 years ago

maybe we could make use of the AutoscalePool here

AlanCoding commented 2 years ago

This should be tremendously simple to do. The callback receiver is not an autoscale pool, nor do we want it to be. If you look at AutoscalePool.cleanup, it forgets about the workers by doing self.workers.remove(w). In the case of the callback receiver, you don't need to do any of the other stuff it does there. Eventually, the removed worker will be remedied by a natural call to self.pool.up(), which could be called directly in this case. Alternatively, you could do something as mindlessly simple as calling w.start() for each w in self.pool.workers where not w.alive. I don't know if that's the best idea, but at surface level it would fix the issue I described here.
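A minimal sketch of the "forget the dead worker and bring up a new one" idea, using only the standard library. `Worker`, `worker_loop`, and `replace_dead_workers` are hypothetical names, not AWX code, and the "fork" start method assumes a POSIX host. One caveat worth noting: a standard-library `multiprocessing.Process` cannot be started twice, so this sketch builds a fresh worker instead of calling `start()` again on the dead one. Whether that applies to AWX's own worker wrapper is not confirmed here.

```python
import multiprocessing

# Assumption: POSIX host, so the "fork" start method is available.
ctx = multiprocessing.get_context("fork")


def worker_loop(queue):
    """Hypothetical worker body: block on the queue until told to stop."""
    while True:
        if queue.get() is None:
            return


class Worker:
    """Hypothetical stand-in for a pool worker: one queue plus one process."""

    def __init__(self):
        self.queue = ctx.Queue()
        self.process = ctx.Process(
            target=worker_loop, args=(self.queue,), daemon=True
        )

    def start(self):
        self.process.start()

    @property
    def alive(self):
        return self.process.is_alive()


def replace_dead_workers(workers):
    """Swap each dead worker for a freshly started one; return the count replaced."""
    replaced = 0
    for i, w in enumerate(workers):
        if not w.alive:
            w.process.join(timeout=1)  # reap the dead child
            workers[i] = Worker()
            workers[i].start()
            replaced += 1
    return replaced
```

A parent process could call something like `replace_dead_workers(workers)` on each pass of its main loop. Any messages still sitting in the dead worker's queue are dropped in this sketch, so a real fix would need to decide what to do with them.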

jbradberry commented 2 years ago

@AlanCoding please link that other issue

AlanCoding commented 2 years ago

Yes, the related issue I had in mind was https://github.com/ansible/awx/issues/12103

Of the solutions discussed for that issue, one was for the API to send a websocket message once all events for a job are processed. The WIP I toyed with for that purpose involved having the callback receiver main process aggregate counts from worker queues.

This is only relevant because, from the development side, it would be the responsibility of the same process discussed here: the main callback receiver process, which currently sleeps in an infinite loop.
