fluxcd / flagger

Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments)
https://docs.flagger.app
Apache License 2.0
4.92k stars 736 forks

Canary deployment failing #1720

Open infrawizard opened 3 weeks ago

infrawizard commented 3 weeks ago

I'm implementing canary deployments for my application using Flagger. However, despite configuring the request-success-rate metric, Flagger isn't recording any metrics or requests for the endpoint. I am using the traefik provider.

I am installing Flagger like this:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: flagger
  namespace: kube-system
spec:
  releaseName: flagger
  chart:
    spec:
      chart: flagger
      version: 1.36.0
      interval: 6h
      sourceRef:
        kind: HelmRepository
        name: flagger
        namespace: flux-system
      verify:
        provider: cosign 
  values:
    meshProvider: traefik
    prometheus:
      install: true    
    nodeSelector:
      kubernetes.io/os: linux
  install:
    crds: CreateReplace
  upgrade:
    crds: CreateReplace
  interval: 1h

And the canary with this:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: test-service
  namespace: test
spec:
  provider: traefik
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: test-service
  progressDeadlineSeconds: 600
  service:
    port: 3000
    targetPort: 3000
  analysis:
    interval: 10s
    threshold: 10
    maxWeight: 50
    stepWeight: 5
    metrics:
      - name: request-success-rate
        interval: 1m
        thresholdRange:
          min: 99
      - name: request-duration
        interval: 1m
        thresholdRange:
          max: 500
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 10s
        metadata:
          type: bash
          cmd: "curl -X GET http://test-service:3000/ping"
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.test/
        timeout: 5s
        metadata:
          type: cmd
          cmd: "hey -z 10s -q 10 -c 2 http://test-service:3000/ping"
          logCmdOutput: "true"

The canary succeeds without the metrics field, but fails when it is added:

Events:
Type Reason Age From Message

Warning Synced 4m19s flagger test-service-primary.test not ready: waiting for rollout to finish: observed deployment generation less than desired generation
Warning Synced 3m29s (x5 over 4m9s) flagger test-service-primary.test not ready: waiting for rollout to finish: 0 of 1 (readyThreshold 100%) updated replicas are available
Normal Synced 3m19s (x7 over 4m19s) flagger all the metrics providers are available!
Normal Synced 3m19s flagger Initialization done! test-service.test
Normal Synced 2m49s flagger New revision detected! Scaling up test-service.test
Warning Synced 119s (x5 over 2m39s) flagger canary deployment test-service.test not ready: waiting for rollout to finish: 0 of 1 (readyThreshold 100%) updated replicas are available
Normal Synced 109s flagger Starting canary analysis for test-service.test
Normal Synced 109s flagger Pre-rollout check acceptance-test passed
Normal Synced 109s flagger Advance test-service.test canary weight 5
Warning Synced 89s (x2 over 99s) flagger Halt advancement no values found for traefik metric request-success-rate probably test-service.test is not receiving traffic: running query failed: no values found
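
Worth checking with this kind of halt: Traefik only records its traefik_service_* request metrics for traffic that actually passes through the proxy. A load test aimed straight at the ClusterIP service (http://test-service:3000/ping) bypasses Traefik entirely, so the built-in request-success-rate query can legitimately find no samples. Flagger's Traefik tutorial routes the load test through the Traefik service with a Host header, along these lines (the proxy address and hostname below are placeholders, not taken from this setup):

```yaml
webhooks:
  - name: load-test
    type: rollout
    url: http://flagger-loadtester.test/
    timeout: 5s
    metadata:
      # Route traffic through the Traefik proxy so its service metrics are
      # recorded; "traefik.kube-system" and the -host value are placeholders
      # for your actual proxy service and route hostname.
      cmd: "hey -z 10s -q 10 -c 2 -host test-service.example.com http://traefik.kube-system"
```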

Below is my traefik config:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: traefik
  namespace: kube-system
spec:
  chart:
    spec:
      chart: traefik
      sourceRef:
        kind: HelmRepository
        name: traefik
        namespace: flux-system
      version: '23.0.1'
  values:
    additionalArguments:
      - "--entryPoints.web.forwardedHeaders.trustedIPs=10.0.0.0/16,10.5.0.0/16,10.21.0.0/16"
    updateStrategy:
      rollingUpdate:
        maxUnavailable: 1

    providers:
      kubernetesCRD:
        enabled: true
        allowCrossNamespace: true
        allowExternalNameServices: true

      kubernetesIngress:
        enabled: true
        allowExternalNameServices: true

    ports:
      web:
        nodePort: 32080
      websecure:
        nodePort: 32443

    service:
      type: NodePort

  interval: 10m0s
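
One thing the Traefik values above don't include: Flagger's bundled Prometheus discovers scrape targets through prometheus.io/* pod annotations. Flagger's Traefik tutorial installs Traefik with its Prometheus metrics endpoint enabled and annotated, roughly like this (a sketch of the relevant chart values, assuming the chart's default metrics port 9100):

```yaml
metrics:
  prometheus:
    # Expose Traefik's Prometheus metrics on the "metrics" entryPoint
    entryPoint: metrics
deployment:
  podAnnotations:
    # Let Flagger's bundled Prometheus discover and scrape the Traefik pods
    prometheus.io/port: "9100"
    prometheus.io/scrape: "true"
```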

I am installing Prometheus with Flagger. The setup works without the metrics section but fails when it's added. I'm not sure if I'm missing anything; I do see the flagger-prometheus pod running. Do I need to install anything else for the built-in metrics to work, or is something else missing in my setup?

hrvatskibogmars commented 1 week ago

I am having the same issue with Istio.

I see that Flagger is hitting Prometheus. I can see the query, but for some reason unknown to me it's just not getting any traffic to the new pod. The canary deployment metric has a value of 0 or 1 when I query it. Traffic to the old pod works and shows up in Prometheus.

infrawizard commented 1 week ago

@stefanprodan would really appreciate your input here.

hrvatskibogmars commented 1 week ago
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: request-duration
  namespace: flagger
spec:
  provider:
    type: prometheus
    address: http://mimir-distributed-gateway.observability:8080/prometheus
  query: |
    histogram_quantile(0.99,
      sum(
        irate(
          istio_request_duration_milliseconds_bucket{
            reporter="destination",
            destination_workload=~"{{ target }}",
            destination_workload_namespace=~"{{ namespace }}"
          }[{{ interval }}]
        )
      ) by (le)
    )

---
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: request-success-rate
  namespace: flagger
spec:
  provider:
    type: prometheus
    address: http://mimir-distributed-gateway.observability:8080/prometheus
  query: |
    sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace=~"{{ namespace }}",
              destination_workload=~"{{ target }}",
              response_code!~"5.*"
            }[{{ interval }}]
        )
    )
    /
    sum(
        rate(
            istio_requests_total{
              reporter="destination",
              destination_workload_namespace=~"{{ namespace }}",
              destination_workload=~"{{ target }}"
            }[{{ interval }}]
        )
    )
---
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: echo-server-cannary
  namespace: debug
spec:
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: echo-server
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rollback (default 600s)
  progressDeadlineSeconds: 600
  # HPA reference (optional)
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: echo-server
  service:
    # service port number
    port: 80
    # container port number or name (optional)
    targetPort: 80
    # Istio gateways (optional)
    gateways:
    - default/gw-dev-imba-com
    # Istio virtual service host names (optional)
    hosts:
    - imba.com
    match:
      - uri:
          prefix: /api/echo
    # Istio traffic policy (optional)
    trafficPolicy:
      tls:
        # use ISTIO_MUTUAL when mTLS is enabled
        mode: ISTIO_MUTUAL
    # Istio retry policy (optional)
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: "gateway-error,connect-failure,refused-stream"
  analysis:
    # schedule interval (default 60s)
    interval: 1m
    # max number of failed metric checks before rollback
    threshold: 10
    # max traffic percentage routed to canary
    # percentage (0-100)
    maxWeight: 50
    # canary increment step
    # percentage (0-100)
    stepWeight: 10
    metrics:
      - name: request-success-rate
        templateRef:
          name: request-success-rate
          namespace: flagger
        thresholdRange:
          # success rate is a percentage, so guard its lower bound
          min: 99
        interval: 5m
      - name: request-duration
        templateRef:
          name: request-duration
          namespace: flagger
        thresholdRange:
          max: 500
        interval: 5m
    # testing (optional)
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: https://imba.com/api/echo
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' https://imba.com/api/echo | grep token"
      - name: load-test
        url: https://imba.com/api/echo
        timeout: 5s
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://imba.com/api/echo"

I found that the problem is with the metrics. The test isn't generating enough traffic to produce any values for the given metric, which results in a failed rollout.
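
As a rough sanity check on volume (my arithmetic, not figures from the thread): hey -z 1m -q 10 -c 2 caps each of the 2 workers at 10 requests/second, and at the first 10% step only a fraction of that reaches the canary:

```python
# Back-of-the-envelope traffic for "hey -z 1m -q 10 -c 2" during the first step.
workers = 2            # hey -c: concurrent workers
qps_per_worker = 10    # hey -q: rate limit per worker
duration_s = 60        # hey -z 1m: test duration in seconds

total_rps = workers * qps_per_worker       # 20 requests/second through the route
canary_share = 0.10                        # stepWeight: 10 -> 10% to the canary
canary_rps = total_rps * canary_share      # ~2 requests/second hit the canary
canary_requests = canary_rps * duration_s  # ~120 canary requests per minute

print(total_rps, canary_rps, canary_requests)  # prints: 20 2.0 120.0
```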

➜  ~ istioctl version
client version: 1.24.0
control plane version: 1.21.0
data plane version: 1.21.0 (61 proxies)

infrawizard commented 1 week ago

@hrvatskibogmars it's working for me too when I remove the metrics part, but not when I add it. Apparently we need Prometheus for that, but the Prometheus that ships with Flagger isn't getting the metrics. Are you trying anything else?

infrawizard commented 1 week ago

@aryan9600 I would really appreciate your input here