argoproj / argo-rollouts

Progressive Delivery for Kubernetes
https://argo-rollouts.readthedocs.io/
Apache License 2.0
2.73k stars 853 forks source link

Job Analysis Run never fails when the job image can't be pulled (it just succeeds and promotes). #3562

Open MohammedShetaya opened 4 months ago

MohammedShetaya commented 4 months ago

Checklist:

Describe the bug

Analysis Run of type job never fails when the job image can't be pulled (it just succeeds and promotes). When Job pods can't pull image it goes into ErrImagePull indefinitely and the job itself does not Fail. However, Argo rollouts waits for some time and considers this as success and then promotes the canary.

To Reproduce

  1. create a namespace called test-namespace
  2. run the following kubernetes configs
  3. Wait for the pods to come up
  4. change the deployment image to nginx:1.19.0 to trigger a rollout
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
  namespace: test-namespace
spec:
  replicas: 0
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.9.8 # update it to 1.19.0 to trigger a rollout, as the first time will only create the pods.
          ports:
            - containerPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-stable
  namespace: test-namespace
spec:
  selector:
    app: nginx
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-canary
  namespace: test-namespace
spec:
  selector:
      app: nginx
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: argoproj.io/v1alpha1
kind: ClusterAnalysisTemplate
metadata:
  name: acceptance-test
spec:
  metrics:
    - name: acceptance-test
      provider:
        job:
          spec:
            backoffLimit: 2
            completions: 1
            template:
              spec:
                containers:
                  - name: test
                    image: docker.io/library/lol:5 # this image does not exist.
                    command: ["bash"]
                    args: ["-c", "echo 'Hello world! going to exit with 0 (success).' && sleep 10 && exit 0"]
                restartPolicy: Never
---
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: awesome-rollout
  namespace: test-namespace
spec:
  replicas: 2
  workloadRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-deployment
  strategy:
    canary:
      canaryService: nginx-canary
      stableService: nginx-stable
      analysis:
        templates:
          - templateName: acceptance-test
            clusterScope: true
      steps:
        - setWeight: 20
        - pause: {duration: 10s}
        - setWeight: 40
        - pause: {duration: 10s}
        - setWeight: 80
        - pause: {duration: 10s}

Expected behavior

Screenshots

Screenshot 2024-05-17 at 4 41 52 PM Screenshot 2024-05-17 at 4 42 09 PM

Version Tested on v1.6.6 and previous versions.

Impacted by this bug? Give it a 👍. We prioritize the issues with the most 👍.

MohammedShetaya commented 4 months ago

@zachaller any update on this issue?

kostis-codefresh commented 2 months ago

@MohammedShetaya According to this issue https://github.com/argoproj/argo-rollouts/issues/2250 some people might prefer the result of the analysis to be inconclusive if there is an image pull error.

Would you agree with this implementation?

MohammedShetaya commented 2 months ago

@kostis-codefresh It should stay inconclusive then, in our case it considers the inconclusive state as success and proceeds with the rollout. This allows the canary deployment to go unchecked.

kostis-codefresh commented 1 month ago

Also see discussion at https://github.com/argoproj/argo-rollouts/issues/1329

zachaller commented 1 month ago

I think if Jobs supported successConditions you could maybe feed successConditions the output of a few things like pod exits codes and pod errors that could get used to define a success condition. This support would need to be added I think though.

MohammedShetaya commented 1 month ago

I think if Jobs supported successConditions you could maybe feed successConditions the output of a few things like pod exits codes and pod errors that could get used to define a success condition. This support would need to be added I think though.

Exactly, I tried to do it assuming jobs, like other metrics, support successConditions, but it didn't work.