fix: mark taskresult complete when failed or error. Fixes #12993, Fixes #13533 - Githubissues

argoproj / argo-workflows

Workflow Engine for Kubernetes

https://argo-workflows.readthedocs.io/

Apache License 2.0

15.11k stars 3.21k forks source link

fix: mark taskresult complete when failed or error. Fixes #12993, Fixes #13533 #13798

Closed isubasinghe closed 3 weeks ago

isubasinghe commented 1 month ago

Fixes #12993 and Fixes #13533

Motivation

The previous fix relied upon a Message field. There is no guarantee that this Message is always given to us. We now directly check if a pod exists.

Modifications

Check if pod exists.

Verification

Unable to verify with certainty due to being a rare edge case.

isubasinghe commented 4 weeks ago

/retest

isubasinghe commented 4 weeks ago

/retest

isubasinghe commented 4 weeks ago

Seems to be failing to pull argocli:latest since the imagePullPolicy is Never. I guess this image doesn't exist on the cluster as well.

isubasinghe commented 4 weeks ago

This won't quite work. The previous comment ^ is incorrect. The code fails tests due to a race condition. Turns out we can use a mounted volume as side channel effectively.

isubasinghe commented 4 weeks ago

Made a change that checks if the pod exists instead of the node Message.

isubasinghe commented 4 weeks ago

/retest

Joibel commented 3 weeks ago

The main race I can think of here is if the Task Result was seen in the Informer before the Pod was. Wasn't this effectively the purpose of the POD_ABSENT_TIMEOUT from #13454?

I have a worry around this too. I'd like there to be some timeout between when we've noticed that a pod has disappeared so that a delayed but completed WorkflowTaskResult can arrive and be actioned.

As this PR is now I don't think we're guaranteed to see the completed task result before the pod removal event, and something like POD_ABSENT_TIMEOUT (or similar) would give us that window.

agilgur5 commented 3 weeks ago

As this PR is now I don't think we're guaranteed to see the completed task result before the pod removal event

Ah I actually said the inverse race, which is less likely; this variant is more likely and possible too.

agilgur5 commented 3 weeks ago

I approved on the basis that this would catch more races than the code before this, but it creates a few too 😅 A more holistic fix would be great either way for sure