fluxcd / flagger

Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments)
https://docs.flagger.app
Apache License 2.0

Rollout with canary analysis is going into a loop when HPA updates replica count #634

Closed · nishaad78 closed this 4 years ago

nishaad78 commented 4 years ago

I'm using Istio 1.4 on Kubernetes 1.13 with Flagger v1.0.0, and here's my canary spec:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: nishaad-test
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  autoscalerRef:
    apiVersion: autoscaling/v2beta1
    kind: HorizontalPodAutoscaler
    name: podinfo
  progressDeadlineSeconds: 3600
  service:
    port: 80
    targetPort: 9898
    hosts:
    - podinfo.nishaad-test.svc.cluster.local
    trafficPolicy:
      tls:
        # use ISTIO_MUTUAL when mTLS is enabled
        mode: DISABLE
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: "gateway-error,connect-failure,refused-stream"
  analysis:
    # schedule interval (default 60s)
    interval: 60s
    # max number of failed metric checks before rollback
    threshold: 10
    maxWeight: 80
    stepWeight: 10
    stepWeightPromotion: 10
    metrics:
    - name: request-success-rate
      # minimum req success rate (non 5xx responses)
      # percentage (0-100)
      thresholdRange:
        min: 99
      interval: 1m
    - name: latency
      templateRef:
        name: latency
        namespace: fluxcd
      thresholdRange:
        max: 0.500
      interval: 1m

When the canary weight changes during the rollout, the HPA adjusts the number of replicas accordingly, and I see that Flagger restarts the canary analysis. This is from the Flagger logs:

{"level":"info","ts":"2020-06-23T10:12:11.621Z","caller":"controller/events.go:16","msg":"Advance podinfo.nishaad-test canary weight 30","canary":"podinfo.nishaad-test"}
{"level":"info","ts":"2020-06-23T10:13:11.645Z","caller":"controller/events.go:16","msg":"Advance podinfo.nishaad-test canary weight 40","canary":"podinfo.nishaad-test"}
{"level":"info","ts":"2020-06-23T10:14:11.591Z","caller":"router/kubernetes_default.go:211","msg":"Service podinfo updated","canary":"podinfo.nishaad-test"}
{"level":"info","ts":"2020-06-23T10:14:11.654Z","caller":"router/istio.go:237","msg":"VirtualService podinfo.nishaad-test updated","canary":"podinfo.nishaad-test"}
{"level":"info","ts":"2020-06-23T10:14:11.709Z","caller":"controller/events.go:16","msg":"Starting canary analysis for podinfo.nishaad-test","canary":"podinfo.nishaad-test"}
{"level":"info","ts":"2020-06-23T10:14:11.745Z","caller":"controller/events.go:16","msg":"Advance podinfo.nishaad-test canary weight 10","canary":"podinfo.nishaad-test"}

I also noticed that when restarting the canary analysis, Flagger does not take stepWeightPromotion into account. This can be a problem if the primary has scaled down during the canary analysis phase. I'll create another ticket for that.

stefanprodan commented 4 years ago

@nishaad78 please make sure your deployment doesn't have a replicas field in the spec, as that makes Flux fight the HPA controller.

nishaad78 commented 4 years ago

This is my deployment spec (no replicas field):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: podinfo
  namespace: nishaad-test
  labels:
    app: podinfo
spec:
  strategy:
    rollingUpdate:
      maxUnavailable: 0
    type: RollingUpdate
  selector:
    matchLabels:
      app: podinfo
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "true"
        prometheus.io/scrape: "true"
        prometheus.io/port: "9797"
      labels:
        app: podinfo
    spec:
      initContainers:
      - name: init
        image: alpine:3.10.1
        command:
        - sleep
        - "1"
      containers:
      - name: podinfod
        image: stefanprodan/podinfo:3.1.0
        imagePullPolicy: IfNotPresent
        ports:
          - name: http
            containerPort: 9898
            protocol: TCP
          - name: http-metrics
            containerPort: 9797
            protocol: TCP
          - name: grpc
            containerPort: 9999
            protocol: TCP
        command:
          - ./podinfo
          - --port=9898
          - --port-metrics=9797
          - --grpc-port=9999
          - --grpc-service-name=podinfo
          - --level=info
          - --random-delay=false
          - --random-error=false
        env:
          - name: PODINFO_UI_COLOR
            value: "#34577c"
        livenessProbe:
          httpGet:
            path: /healthz
            port: 9898
        readinessProbe:
          httpGet:
            path: /readyz
            port: 9898
        resources:
          limits:
            cpu: 500m
            memory: 128Mi
          requests:
            cpu: 100m
            memory: 64Mi

stefanprodan commented 4 years ago

How did you trigger the canary? Did you use kubectl, or did you change the deployment in Git? If you used kubectl, then Flux reverted the change.

nishaad78 commented 4 years ago

The canary was triggered when Flux detected changes in Git. What's the expected scaling behaviour during a canary analysis? Is it supposed to lock the number of replicas?

stefanprodan commented 4 years ago

What's the expected scaling behaviour during a canary analysis? Is it supposed to lock the number of replicas?

No, Flagger doesn't take changes to the replicas field into account, so the HPA shouldn't restart the analysis.

stefanprodan commented 4 years ago

Ah, I think I know what's going on: Flagger generates the ClusterIP and Flux overrides it. Can you please remove the Service manifest from Git and try once more?
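
To illustrate, a Service kept in Git like the hypothetical one below gets re-applied by Flux with the original app: podinfo selector, while Flagger rewrites the selector of the generated service to point at the primary pods, so every sync breaks routing until Flagger reconciles it:

apiVersion: v1
kind: Service
metadata:
  name: podinfo
  namespace: nishaad-test
spec:
  # Flagger manages this selector (pointing it at the primary workload);
  # Flux re-applying this manifest reverts it and breaks the traffic split
  selector:
    app: podinfo
  ports:
  - name: http
    port: 80
    targetPort: 9898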

nishaad78 commented 4 years ago

You're right. Removing the Service and VirtualService from my Git repo solved this issue. However, I think this is just a workaround, since I would still want those definitions in Git.

stefanprodan commented 4 years ago

Why would you want those in Git? Every time Flux applies them, the routing will be broken until Flagger corrects it. The Canary resource contains a service section that defines the desired state of the app routing, including ClusterIPs and Istio services/destination rules.

nishaad78 commented 4 years ago

Thanks for your quick responses.

We use Helm templates to generate manifests for our microservices and rely on CD to deploy all the generated components, including the Service and VirtualService.

Our manifests are the source for declarative and reproducible deployments, so I was expecting the canary analysis not to be disrupted when Flux syncs with Git.

Seems like this is not something you plan on supporting?

stefanprodan commented 4 years ago

You can place the Canary definition inside the chart and add a toggle that disables the Service and VirtualService templates when the canary is enabled.
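
A minimal sketch of such a toggle; the canary.enabled flag and file layout are illustrative, not from this thread:

# values.yaml
canary:
  enabled: true

# templates/service.yaml
# Rendered only when the canary is disabled, so Flagger owns the
# Service (and routing) whenever the Canary resource exists.
{{- if not .Values.canary.enabled }}
apiVersion: v1
kind: Service
metadata:
  name: podinfo
spec:
  selector:
    app: podinfo
  ports:
  - port: 80
    targetPort: 9898
{{- end }}

# templates/canary.yaml
{{- if .Values.canary.enabled }}
apiVersion: flagger.app/v1beta1
kind: Canary
# ... canary spec as above ...
{{- end }}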