Closed by isubasinghe 3 weeks ago
/retest
/retest
Seems to be failing to pull argocli:latest since the imagePullPolicy is Never. I guess this image doesn't exist on the cluster either.
This won't quite work. The previous comment ^ is incorrect. The code fails tests due to a race condition. It turns out we can effectively use a mounted volume as a side channel.
Made a change that checks whether the pod exists instead of inspecting the node Message.
/retest
The main race I can think of here is if the Task Result was seen in the Informer before the Pod was. Wasn't this effectively the purpose of the POD_ABSENT_TIMEOUT from #13454?
I have a worry around this too. I'd like there to be some timeout after we've noticed that a pod has disappeared, so that a delayed but completed WorkflowTaskResult can arrive and be actioned.
As this PR is now I don't think we're guaranteed to see the completed task result before the pod removal event, and something like POD_ABSENT_TIMEOUT (or similar) would give us that window.
> As this PR is now I don't think we're guaranteed to see the completed task result before the pod removal event
Ah, I actually described the inverse race, which is less likely; this variant is more likely, and both are possible.
I approved on the basis that this would catch more races than the code before this, but it creates a few too 😅. A more holistic fix would be great either way.
Fixes #12993 and Fixes #13533
Motivation
The previous fix relied on the node's Message field, but there is no guarantee that this Message is always populated. We now check directly whether the pod exists.
Modifications
Check if pod exists.
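As a rough illustration of the change described above (not the actual diff), the sketch below decides whether a pod is gone by looking it up in a cache rather than by reading the node's Message. `podStore`, `node`, and `podGone` are hypothetical names; `podStore` stands in for an informer-backed pod cache keyed by namespace/name.

```go
package main

// podStore is a hypothetical stand-in for an informer-backed pod cache,
// keyed by "namespace/name".
type podStore map[string]struct{}

// exists reports whether the pod is currently present in the cache.
func (s podStore) exists(namespace, name string) bool {
	_, ok := s[namespace+"/"+name]
	return ok
}

// node is a hypothetical, trimmed-down workflow node status.
type node struct {
	PodName string
	Message string // may be empty, so it is not used for the decision
}

// podGone reports whether the node's pod is absent from the cache.
// The result depends only on the lookup, never on n.Message.
func podGone(store podStore, namespace string, n node) bool {
	return !store.exists(namespace, n.PodName)
}
```

Because the lookup ignores `Message`, a node whose message was never set is handled the same way as one whose message was set. On its own, this check does not address the ordering concerns raised in the review comments.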
Verification
Unable to verify with certainty because this is a rare edge case.