argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

Workflow Archiving: `Workflow gone` error shortly after Workflow completion #13305

Closed ryancurrah closed 1 month ago

ryancurrah commented 1 month ago

What happened/what did you expect to happen?

Issue Summary: Since enabling Workflow Archiving, in the UI we have been observing an error popup saying "Workflow gone" which appears shortly after a workflow completes in the UI. This issue does not seem to leave any relevant information in the argo-server logs.

Steps to Reproduce:

  1. Enable Workflow Archiving.
  2. Execute a workflow.
  3. Observe the UI shortly after the workflow completes.

Observed Behavior:

Expected Behavior:

Additional Information:

Screenshots:

[Image: Screenshot 2024-07-04 at 5 38 12 PM]

Workaround:

References:

Version

v3.5.8

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

Any workflow that uses Workflow Archiving will work.

Logs from the workflow controller

N/A

Logs from your workflow's wait container

N/A

agilgur5 commented 1 month ago

> Workflow TTL is configured to 1 hour, but the popup appears before the 1-hour TTL, right after the workflow completes.

🤔 The `Workflow gone` error message is only supposed to show when a Workflow is deleted, but in this case yours wasn't deleted (yet), just archived. That sounds like the Server is incorrectly sending a delete event?

This case shouldn't happen in 3.4 either, since your Workflow still exists in-cluster.

agilgur5 commented 1 month ago

> Alan Clucas mentioned that the error message made sense when workflows were moving between tabs in the UI, but now it isn't relevant and should be handled better.

That would be a nice improvement for 3.5 if the Workflow is both archived and deleted, which is a separate case since yours was not deleted. The error message should still show if it is only deleted (e.g. you don't have the Workflow Archive enabled).

ryancurrah commented 1 month ago

Yes that would be a nice improvement to workflow archiving, which is not needing to set a workflow TTL. Which it's not obvious a TTL needs to be set, I expected workflows to delete automatically once archived.

agilgur5 commented 1 month ago

> which is not needing to set a workflow TTL.

That's not what I was referring to. That would be a separate feature entirely, and one I would personally reject as well.

> Which it's not obvious a TTL needs to be set, I expected workflows to delete automatically once archived.

They're separate and independent features, and you can use one without the other. Also note there's another feature for cleaning up Workflows, `retentionPolicy`, which can be used together with or independently of the other two.
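
To make the independence concrete, here is a minimal TypeScript sketch that models the three features as separate optional settings. The `CleanupSettings` type and both helper functions are hypothetical and exist only for illustration; the comments note which Argo configuration keys (`persistence.archive`, `ttlStrategy.secondsAfterCompletion`, `retentionPolicy`) each field is assumed to mirror.

```typescript
// Hypothetical model for illustration; these are not Argo's actual types.
interface CleanupSettings {
  archiveEnabled?: boolean;            // assumed to mirror persistence.archive
  ttlSecondsAfterCompletion?: number;  // assumed to mirror ttlStrategy.secondsAfterCompletion
  retentionPolicy?: { completed?: number; failed?: number; errored?: number };
}

// Lists which of the three features are switched on. Each check is
// independent: no setting implies or requires another.
function activeFeatures(s: CleanupSettings): string[] {
  const active: string[] = [];
  if (s.archiveEnabled === true) active.push("archive");
  if (s.ttlSecondsAfterCompletion !== undefined) active.push("ttl");
  if (s.retentionPolicy !== undefined) active.push("retentionPolicy");
  return active;
}

// In this model, archiving only stores a copy of a completed Workflow;
// only a TTL or a retention policy removes it from the cluster.
function removesLiveWorkflows(s: CleanupSettings): boolean {
  return s.ttlSecondsAfterCompletion !== undefined || s.retentionPolicy !== undefined;
}
```

Under this sketch, `removesLiveWorkflows({ archiveEnabled: true })` is `false`, which matches the point above: enabling the archive alone does not delete Workflows from the cluster.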

Combining them actually causes a lot more confusing edge cases very, very quickly (for an existing example within Argo, stop vs terminate is one of the single most confusing things -- precisely because the difference is in dependent feature edge cases). Independent features follow the single-responsibility principle (SRP) or the Unix philosophy more closely, and from personal experience maintaining OSS libraries for years, I can concretely say that following it helps a heck of a lot with sustainability, maintainability, extensibility, and usability 😅

The improvement I was referring to was to automatically try retrieving an archived workflow upon deletion in the UI in Argo 3.5+, which is what Alan said as well. That is half bug, half missing feature for 3.5. But your error shouldn't happen at all, since your Workflow wasn't deleted in the first place; that's entirely a bug.
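
The intended behaviour described above can be sketched roughly as follows. This is not the Argo UI's actual code: `Workflow`, `WatchEvent`, `ViewState`, `nextViewState`, and the `lookupArchived` callback are all hypothetical names, and the archive lookup is kept synchronous for brevity where a real one would be an asynchronous API call. The only assumption taken from outside the thread is that Kubernetes-style watch streams report `ADDED`, `MODIFIED`, and `DELETED` events.

```typescript
interface Workflow {
  name: string;
  phase: string;
}

// Kubernetes-style watch event carrying the affected object.
interface WatchEvent {
  type: "ADDED" | "MODIFIED" | "DELETED";
  object: Workflow;
}

type ViewState =
  | { kind: "live"; workflow: Workflow }
  | { kind: "archived"; workflow: Workflow }
  | { kind: "gone"; message: string };

// Decides what the UI should show after a watch event.
// lookupArchived is a hypothetical stand-in for an archive query.
function nextViewState(
  event: WatchEvent,
  lookupArchived: (name: string) => Workflow | undefined
): ViewState {
  // Anything other than a deletion (including a Workflow that was merely
  // archived) keeps showing the live object and never raises the error.
  if (event.type !== "DELETED") {
    return { kind: "live", workflow: event.object };
  }
  // On deletion, try the archive before giving up.
  const archived = lookupArchived(event.object.name);
  if (archived !== undefined) {
    return { kind: "archived", workflow: archived };
  }
  // Deleted and not archived (e.g. the archive is disabled): show the error.
  return { kind: "gone", message: "Workflow gone" };
}
```

In this sketch the reported bug corresponds to the first branch being skipped: a Workflow that is archived but still in the cluster should only ever produce a `live` state.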

ryancurrah commented 1 month ago

Ah, I'm going to do the Canadian thing and say... Sorry! I misunderstood what you were talking about, but thank you for clarifying; it makes sense now.