Open amfage opened 7 months ago
after a second or maybe 5 seconds, show the correct status
That sounds correct. The Workflows List page runs a ListWatch
on the visible k8s resources, so there can be a time delay between when the retry occurs and when it shows up in the ListWatch
.
The Workflows Details page can have similar issues as it relies on a RetryWatch
.
Albeit there is a difference, due to a copy+paste issue, when a retry is started, the page refreshes, same as with resubmit (the code is copied from resubmit, but a retry does not create a new Workflow, so there isn't actually a need to change pages. and a page change should happen via the internal router as well instead of the browser URL). The retry does not necessarily happen immediately though (especially not after #12538), so it might not be caught during the refresh either.
The main way to improve that would be for the UI to keep a cache of Workflows and reconcile them with the Server, that way the Details page and the List page could re-use the data for a faster update, but that would require a sizeable refactor.
It is definitely the case that sometimes the status does not change to running at all - that happened the first time I submitted a retry on version 3.5.5 and I have seen it several times since.
but sometimes it stays displaying the failed status.
It is definitely the case that sometimes the status does not change to running at all - that happened the first time I submitted a retry on version 3.5.5 and I have seen it several times since.
You mentioned that before, thanks for confirming it's happened multiple times. That's odd, that really shouldn't happen that an update is missed... there are some race conditions that can happen, but not frequently 🤔 Do you have a rough estimate of how often that happens? Like "1/20 retries" for example
I also checked the UI merge logic which seems to be correct afaict
This is how it behaved in the previous version we were running which was 3.4.11.
This part is interesting, suggesting a regression. A few things changed in 3.5 here. The options panel was added for retries in #11632, which also added the hard refresh after a retry. And most of the UI was refactored to use React hooks; in particular, I did a massive refactor of the Workflows List page in #11891 which made it substantially more efficient as there were many duplicate network calls. That PR has had a few (less than one-liner) bugs in it though. This sounds like one of the duplicate network calls may have been catching this in 3.4.x 🤔 IIRC one of them was a full re-list, which should refresh the state... However the hard refresh after a retry should also end up refreshing the state
technically speaking, the back-end k8s client Informer libraries rebuild their cache on a certain timeframe as well (the "resync period" is every 20min by default), specifically because they might miss an update (cache invalidation is hard)... I wonder if we should perhaps do some similar logic here in the UI 🤔
You mentioned that before, thanks for confirming it's happened multiple times. That's odd, that really shouldn't happen that an update is missed... there are some race conditions that can happen, but not frequently 🤔 Do you have a rough estimate of how often that happens? Like "1/20 retries" for example
I am not sure how often it happens but it did happen last week and was consistent for at least 10 minutes over several workflows, but then reverted to the behaviour of updating the status after about 5 seconds just as I was about to troubleshoot it. I do know that some of those workflow retries were stuck in a "Pending" state (InvalidImageName
error, but the workflow didn't fail) but I haven't been able to reproduce it since.
@amfage i see a related PR was merged, have u tested on latest version?
Pre-requisites
:latest
What happened/what did you expect to happen?
If a workflow fails, when viewing the workflow and clicking "retry", when returning to the main workflows listing page, the workflow still shows failed, although the workflow is running. Sometimes the workflow listing page will, after a second or maybe 5 seconds, show the correct status, but sometimes it stays displaying the failed status. The expected behaviour is that the icon shows the blue circle running icon, and the started, finished, duration and message areas show that the workflow is running, as it would if the workflow was running for the first time. This is how it behaved in the previous version we were running which was 3.4.11.
Version
v3.5.5
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.
Logs from the workflow controller
Logs from in your workflow's wait container