argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0
14.53k stars 3.12k forks source link

workflows list page delay in showing retry status from details page #12868

Open amfage opened 2 months ago

amfage commented 2 months ago

Pre-requisites

What happened/what did you expect to happen?

If a workflow fails, when viewing the workflow and clicking "retry", when returning to the main workflows listing page, the workflow still shows failed, although the workflow is running. Sometimes the workflow listing page will, after a second or maybe 5 seconds, show the correct status, but sometimes it stays displaying the failed status. retry-workflow-argo-3-5-5 The expected behaviour is that the icon shows the blue circle running icon, and the started, finished, duration and message areas show that the workflow is running, as it would if the workflow was running for the first time. This is how it behaved in the previous version we were running which was 3.4.11.

Version

v3.5.5

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflows that uses private images.

# This example sleeps for 30 seconds then fails, so it can be retried
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: test-simple-fail-for-retry-
spec:
  entrypoint: template-fail
  templates:
  - name: template-fail
    script:
      image: python:alpine3.6
      command: ["python"]
      # fail
      source: |
        import sys;
        import time;
        time.sleep(30); 
        sys.exit(1)

Logs from the workflow controller

kubectl logs -n argo deploy/workflow-controller | grep ${workflow}

Logs from in your workflow's wait container

kubectl logs -n argo -c wait -l workflows.argoproj.io/workflow=${workflow},workflow.argoproj.io/phase!=Succeeded
agilgur5 commented 2 months ago

after a second or maybe 5 seconds, show the correct status

That sounds correct. The Workflows List page runs a ListWatch on the visible k8s resources, so there can be a time delay between when the retry occurs and when it shows up in the ListWatch.

The Workflows Details page can have similar issues as it relies on a RetryWatch.

Albeit there is a difference, due to a copy+paste issue, when a retry is started, the page refreshes, same as with resubmit (the code is copied from resubmit, but a retry does not create a new Workflow, so there isn't actually a need to change pages. and a page change should happen via the internal router as well instead of the browser URL). The retry does not necessarily happen immediately though (especially not after #12538), so it might not be caught during the refresh either.

The main way to improve that would be for the UI to keep a cache of Workflows and reconcile them with the Server, that way the Details page and the List page could re-use the data for a faster update, but that would require a sizeable refactor.

amfage commented 2 months ago

It is definitely the case that sometimes the status does not change to running at all - that happened the first time I submitted a retry on version 3.5.5 and I have seen it several times since.

agilgur5 commented 2 months ago

but sometimes it stays displaying the failed status.

It is definitely the case that sometimes the status does not change to running at all - that happened the first time I submitted a retry on version 3.5.5 and I have seen it several times since.

You mentioned that before, thanks for confirming it's happened multiple times. That's odd, that really shouldn't happen that an update is missed... there are some race conditions that can happen, but not frequently 🤔 Do you have a rough estimate of how often that happens? Like "1/20 retries" for example

I also checked the UI merge logic which seems to be correct afaict

This is how it behaved in the previous version we were running which was 3.4.11.

This part is interesting, suggesting a regression. A few things changed in 3.5 here. The options panel was added for retries in #11632, which also added the hard refresh after a retry. And most of the UI was refactored to use React hooks; in particular, I did a massive refactor of the Workflows List page in #11891 which made it substantially more efficient as there were many duplicate network calls. That PR has had a few (less than one-liner) bugs in it though. This sounds like one of the duplicate network calls may have been catching this in 3.4.x 🤔 IIRC one of them was a full re-list, which should refresh the state... However the hard refresh after a retry should also end up refreshing the state

technically speaking, the back-end k8s client Informer libraries rebuild their cache on a certain timeframe as well (the "resync period" is every 20min by default), specifically because they might miss an update (cache invalidation is hard)... I wonder if we should perhaps do some similar logic here in the UI 🤔

amfage commented 2 months ago

You mentioned that before, thanks for confirming it's happened multiple times. That's odd, that really shouldn't happen that an update is missed... there are some race conditions that can happen, but not frequently 🤔 Do you have a rough estimate of how often that happens? Like "1/20 retries" for example

I am not sure how often it happens but it did happen last week and was consistent for at least 10 minutes over several workflows, but then reverted to the behaviour of updating the status after about 5 seconds just as I was about to troubleshoot it. I do know that some of those workflow retries were stuck in a "Pending" state (InvalidImageName error, but the workflow didn't fail) but I haven't been able to reproduce it since.