API should not report job finished running while events are still processing

ansible / awx

AWX provides a web-based user interface, REST API, and task engine built on top of Ansible. It is one of the upstream projects for Red Hat Ansible Automation Platform.

Other

14.08k stars 3.43k forks source link

API should not report job finished running while events are still processing #12103

Open akus062381 opened 2 years ago

akus062381 commented 2 years ago

in the job output screen, the API reports a job is finished running while the UI is still gathering and rendering all of the events. As a result, the job output expand/collapse feature does not always render when it should, and the user must know that they need to refresh in order to see that feature.

What should be happening is that the job output screen should automatically refresh to show the expand/collapse arrows when the job completes. The user should not have to click refresh to see those.

After job finishes, before clicking refresh:

After clicking refresh:

AlanCoding commented 2 years ago

I believe I follow what the issue is, but I don't love the title of this issue.

the API reports a job is finished running while the UI is still gathering and rendering all of the events

I don't want to change the definition of the API "status" field. There is an event lag which causes events to come in after the status has changed, and this will worsen depending on overall system load. It is well-established that events may continue to come in after the API status changes to a finished status, and after the job status websocket message is sent. The API does not currently provide a hook (like a websocket message) for the UI to know when all events have been processed. We can talk about adding that, but it would be separate from the status change.

keithjgrant commented 2 years ago

@AlanCoding Yeah I agree there's an important distinction between status change and processing finished. A pertinent question is how to treat that interim between in the UI.

The current flow is once the status changes to successful or failed, the UI refreshed events in the hope of getting the job event tree information, but that information is not available until processing is complete. I suppose a reasonable UX might be to change the status icon when the status changes, but not refresh the events tree until processing completes, assuming we can indicate that to the UI somehow

AlanCoding commented 2 years ago

Let me ping @fosterseth, we discussed at pretty good length the options we have to do this.

Right now, there's just no precedent for collecting data from all the callback receiver worker processes. The only possible exception to this would be the --status command, gives gives data compiled by redis from all workers. This also happens to have been broken.

It's technically conceivable to develop a hook for "all events processed" by possibly:

some inspection of the subprocess object in the main process, by self.pool.workers, which gives the PoolWorker object. It could put data into the IPC queue for us to get, but that is determined to be very risky
most likely workable would be to have the workers stash stats (about number of events processed) in Redis, and then the main process would read and gather those. But specifically when? This could be more polling ultimately.

So for the current phase of things, we would like to decline developing a trigger for the "all events processed" criteria.

AlanCoding commented 2 years ago

I went and developed this proof of concept of the idea:

https://github.com/ansible/awx/compare/devel...AlanCoding:ipc_done?expand=1

I'm liking this idea more and more, however, it's still incomplete. It needs to handle that case where events are lost. In the current branch, it will wait forever for events to come in, and that's not what we want.

I can propose solutions... but it starts to get very complex. Each worker saves stats on a regular interval, and this includes the Redis queue size. Since there's only 1 queue, I'd like to move that knowledge of queue size up to the main thread, and have it contribute to the status data as well. Then, if the queue is lower than a certain threshold, you know that events are not lagging. If this is the case, then it may be acceptable to do a database query to get all in-memory jobs with a finished status. Out of these, you might reasonably filter for those that finished a time greater than some threshold, and decide that those will never get all its events, and force sending of notifications and whatnot.

I'm not sure if I'll ever get the time to finish that work, as there is still some risk around the ability to wrap up this exception case.

hesmithrh commented 2 years ago

per refinement, "quick fix" idea:

In the short-term can we just poll for the value of processing_finished? It sounds like changing websocket events could be a bigger lift.

mabashian commented 2 years ago

Long-term, we should deep dive on this and discuss a more elegant solution around sending an event back to the UI to indicate that event_processing_finished has flipped to true but in the short term we'll do this: https://github.com/ansible/awx/issues/12543

AlanCoding commented 2 years ago

It sounded like there was interest in an API-side deep dive on this topic. I can elaborate in much greater detail about my last comment, and what I consider a must-have before that can be merged. I believe a part of the problem is scope-creep here. A hook for all-events-processed is just too useful of a construct in our app, and there's a lot of unrelated logic out there that this code hook should rightfully absorb, but that makes it a daunting piece of work with a much larger potential bug surface. Instead of abandoning a job's count by cross-referencing with Redis and the database, we could just set an ultimate give-up point at 1 week and call it done.

We also need to need to discuss some very technical points of that. In the branch I have, I commandeer an existing unused queue. After reading and understanding more about the dispatcher, I've become convinced that we shouldn't use this, and should instead use the worker finished queue, which does not yet exist, but can be turned on with a simple kwarg.