fluxcd / flagger

Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments)
https://docs.flagger.app
Apache License 2.0
4.89k stars 731 forks

Canary status stuck in WaitingPromotion #1450

Open kurekbharath opened 1 year ago

kurekbharath commented 1 year ago

Canary status stuck in WaitingPromotion for a long duration.

The Canary has been stuck in the WaitingPromotion status for hours with the message Halt sampleapp.testnamespace advancement waiting for promotion approval pre-rollout. In the canary manifest we set a webhook timeout of 2-3 minutes, so if the webhook doesn't return a 200 response within that time, we expect the canary to be marked as failed. Instead, it stays in WaitingPromotion.

I have tried webhooks of type confirm-promotion and pre-rollout for this test; the status still gets stuck in WaitingPromotion.

To Reproduce

Deploy a new change. Once the canary load test succeeds and the webhook returns 200, the changes are rolled out to the primary pods (this works). If the webhook doesn't return 200 within a certain period of time (a 2m timeout in our case), the canary should time out and be marked as failed (this does not work).

Below is the sample canary yaml file used

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: sampleapp-sampleapp
  namespace: testnamespace
spec:
  analysis:
    interval: 1m
    maxWeight: 40
    metrics:
      - interval: 15s
        name: 2xx 3xx percentage
        templateRef:
          name: sampleapp
          namespace: testnamespace
        thresholdRange:
          min: 80
    stepWeight: 10
    threshold: 3
    webhooks:
      - metadata:
          type: canary-deployment
        name: pre-rollout
        timeout: 2m
        type: confirm-promotion
        url: <API endpoint which returns 200 if exist>
      - metadata:
          cmd: >-
            hey -z 20m -q 10 -c 2
            <sampleapp endpoint>
        name: load-test
        timeout: 5s
        url: http://flagger-loadtester.flagger/
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: sampleapp-sampleapp
  progressDeadlineSeconds: 600
  service:
    appProtocol: TCP
    gateways:
      - default/cobalt-ingressgateway
    headers:
      response:
        set:
          Strict-Transport-Security: max-age=31536000; includeSubDomains
    match:
      - uri:
          prefix: /sampleapp-qa/
      - uri:
          prefix: /sampleapp-sampleapp-qa/
      - uri:
          prefix: /sampleapp-qa/
    name: sampleapp-sampleapp
    port: 80
    portDiscovery: true
    portName: sampleapp-port
    rewrite:
      uri: /
    targetPort: 80
    timeout: 10s
    trafficPolicy:
      tls:
        mode: DISABLE
  skipAnalysis: false
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sampleapp-sampleapp

Below is the canary event

  Normal   Synced  28m (x2 over 3d22h)  flagger  New revision detected! Restarting analysis for sampleapp-sampleapp.testnamespace
  Warning  Synced  24m (x4 over 27m)    flagger  canary deployment sampleapp-sampleapp.testnamespace not ready: waiting for rollout to finish: 1 old replicas are pending termination
  Normal   Synced  23m (x5 over 3d22h)  flagger  Starting canary analysis for sampleapp-sampleapp.testnamespace
  Normal   Synced  23m (x5 over 3d22h)  flagger  Advance sampleapp-sampleapp.testnamespace canary weight 10
  Normal   Synced  22m (x5 over 3d22h)  flagger  Advance sampleapp-sampleapp.testnamespace canary weight 20
  Normal   Synced  21m (x5 over 3d22h)  flagger  Advance sampleapp-sampleapp.testnamespace canary weight 30
  Normal   Synced  20m (x5 over 3d22h)  flagger  Advance sampleapp-sampleapp.testnamespace canary weight 40
  Warning  Synced  19m                  flagger  Halt sampleapp-sampleapp.testnamespace advancement waiting for promotion approval pre-rollout

Expected behavior

If the webhook doesn't return 200 within the configured timeout, mark the canary as failed.

kurekbharath commented 1 year ago

Can someone provide input on this issue? Is this expected behavior, is something in my config wrong, or does this need to be fixed in the code?

aryan9600 commented 1 year ago

This is expected behavior. From https://fluxcd.io/flagger/usage/webhooks/:

confirm-promotion hooks are executed before the promotion step. The canary promotion is paused until the hooks return HTTP 200. While the promotion is paused, Flagger will continue to run the metrics checks and rollout hooks.

If you want to roll back, specify another webhook of type rollback and have the webhook server return a 2xx status code once the Canary has been stuck at WaitingPromotion for longer than your desired timeout.
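
As a rough sketch of that suggestion, the webhooks section of the manifest above might gain a second entry like the one below. The name rollback-gate and its URL placeholder are hypothetical and not from the original manifest; the endpoint is assumed to be a service you run yourself that starts answering 2xx once the promotion has waited longer than you are willing to allow. Note that, as far as I can tell, the per-webhook timeout field only bounds each individual HTTP call and is not a deadline for the gate as a whole, which would explain why the 2m value never fails the canary.

```yaml
spec:
  analysis:
    webhooks:
      # Existing gate from the manifest above: promotion stays paused
      # until this endpoint returns HTTP 200.
      - name: pre-rollout
        type: confirm-promotion
        timeout: 2m
        url: <API endpoint which returns 200 if exist>
      # Hypothetical rollback gate (name and URL are assumptions).
      # A 2xx response here tells Flagger to stop the analysis and
      # mark the canary as failed; any other status keeps it waiting.
      - name: rollback-gate
        type: rollback
        timeout: 5s
        url: <endpoint that returns 2xx once the wait should be abandoned>
```

With this in place, the deadline logic lives in the rollback endpoint rather than in Flagger: it has to track how long the canary has been waiting and flip to a 2xx response once that exceeds the desired limit.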

cxftrue commented 9 months ago

@aryan9600 If a canary is stuck in the Promoting state, how do I make the Canary fail?