fluxcd / flagger

Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments)
https://docs.flagger.app
Apache License 2.0

Progressive Canary with Istio uses default URL to Prometheus #1671

Open joedborg opened 1 week ago

joedborg commented 1 week ago

Describe the bug

When a Canary is defined with Istio, Flagger appears to query a default Prometheus address (http://prometheus:9090) for its builtin metrics. The Canary is defined as follows:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-canary
  namespace: api
spec:
  analysis:
    canaryReadyThreshold: 100
    interval: 10m
    maxWeight: 100
    metrics:
    - interval: 1m
      name: request-success-rate
      thresholdRange:
        min: 100
    - interval: 1m
      name: request-duration
      thresholdRange:
        max: 500
    - interval: 5m
      name: kafka-tx
      templateRef:
        name: kafka-tx-bytes
      thresholdRange:
        min: 100
    - interval: 5m
      name: kafka-rx
      templateRef:
        name: kafka-rx-bytes
      thresholdRange:
        min: 100
    primaryReadyThreshold: 100
    stepWeight: 10
    threshold: 5
  autoscalerRef:
    apiVersion: autoscaling/v2beta2
    kind: HorizontalPodAutoscaler
    name: api-hpa
  progressDeadlineSeconds: 900
  service:
    gateways:
    - mesh-ingress-gateway.istio-system.svc.cluster.local
    hosts:
    - myaddress.io
    port: 8080
    portDiscovery: true
    retries:
      attempts: 3
      perTryTimeout: 1s
      retryOn: gateway-error,connect-failure,refused-stream
    targetPort: 8080
    trafficPolicy:
      tls:
        mode: DISABLE
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api

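The custom MetricTemplates (kafka-tx-bytes, kafka-rx-bytes) can define their own endpoint, but the builtin Istio metric lookups (request-success-rate, request-duration) seem to only try the default Prometheus address, which does not resolve in this cluster, leading to errors like this: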

{"level":"error","ts":"2024-06-27T22:45:48.929Z","caller":"controller/events.go:39","msg":"Prometheus query failed: running query failed: request failed: Get \"http://prometheus:9090/api/v1/query?query=+sum%28+rate%28+istio_requests_total%7B+reporter%3D%22destination%22%2C+destination_workload_namespace%3D%22api%22%2C+destination_workload%3D~%22api%22%2C+response_code%21~%225.%2A%22+%7D%5B1m%5D+%29+%29+%2F+sum%28+rate%28+istio_requests_total%7B+reporter%3D%22destination%22%2C+destination_workload_namespace%3D%22api%22%2C+destination_workload%3D~%22api%22+%7D%5B1m%5D+%29+%29+%2A+100\": dial tcp: lookup prometheus on 10.0.0.10:53: no such host","canary":"api-canary.api","stacktrace":"github.com/fluxcd/flagger/pkg/controller.(*Controller).recordEventErrorf\n\t/workspace/pkg/controller/events.go:39\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).runBuiltinMetricChecks\n\t/workspace/pkg/controller/scheduler_metrics.go:145\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).runAnalysis\n\t/workspace/pkg/controller/scheduler.go:748\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).advanceCanary\n\t/workspace/pkg/controller/scheduler.go:442\ngithub.com/fluxcd/flagger/pkg/controller.CanaryJob.Start.func1\n\t/workspace/pkg/controller/job.go:39"}

Is it possible to set a custom endpoint for the builtin Istio metrics, or will I have to define all of these myself as MetricTemplates? It doesn't seem that I can add a provider block to the Canary spec.
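For reference, a minimal sketch of what replacing the builtin request-success-rate check with a custom template might look like. The query is the one from the error above, with Flagger's `{{ namespace }}`, `{{ target }}` and `{{ interval }}` template variables substituted for the literal values. The template name, the Prometheus address, and the Secret name are placeholders, not values from this cluster.

```yaml
# Hypothetical MetricTemplate mirroring the builtin success-rate query.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: istio-success-rate        # placeholder name
  namespace: api
spec:
  provider:
    type: prometheus
    address: https://prometheus.example.com   # placeholder external endpoint
    secretRef:
      name: prometheus-credentials            # placeholder Secret holding the credentials
  query: |
    sum(
      rate(
        istio_requests_total{
          reporter="destination",
          destination_workload_namespace="{{ namespace }}",
          destination_workload=~"{{ target }}",
          response_code!~"5.*"
        }[{{ interval }}]
      )
    )
    /
    sum(
      rate(
        istio_requests_total{
          reporter="destination",
          destination_workload_namespace="{{ namespace }}",
          destination_workload=~"{{ target }}"
        }[{{ interval }}]
      )
    )
    * 100
```

The Canary would then reference it the same way as the Kafka templates, in place of the builtin entry:

```yaml
    metrics:
    - interval: 1m
      name: request-success-rate
      templateRef:
        name: istio-success-rate   # placeholder name from the sketch above
      thresholdRange:
        min: 100
```

The same would have to be repeated for request-duration, which is the duplication the question above is hoping to avoid.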

To Reproduce

Deploy an Istio-backed Canary that uses the builtin metrics, with Prometheus reachable only at an external endpoint.

Expected behavior

A field on the Canary CRD to define a custom Prometheus endpoint for the builtin Istio metrics.

Additional context

joedborg commented 1 week ago

Digging into the source, it looks like I might be able to specify the address here:

https://github.com/fluxcd/flagger/blob/main/cmd/flagger/main.go#L96

...by setting the argument on the Flagger Deployment. However, it seems that I cannot pass a secret ref there:

https://github.com/fluxcd/flagger/blob/main/pkg/metrics/observers/factory.go#L34

That means I still cannot reach an external provider that requires credentials.
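Assuming the argument in question is Flagger's `-metrics-server` flag (the line number in the link may have drifted), setting it on the controller would look roughly like the fragment below. The Deployment name, namespace, container name, and Prometheus URL are assumptions for illustration.

```yaml
# Fragment of the Flagger controller Deployment, not a complete manifest.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flagger              # assumed Deployment name
  namespace: istio-system    # assumed install namespace
spec:
  template:
    spec:
      containers:
      - name: flagger        # assumed container name
        args:
        - -mesh-provider=istio
        # Address used for the builtin metric checks; placeholder URL.
        - -metrics-server=http://prometheus.monitoring:9090
```

This only changes the address used for the builtin checks. It does not supply credentials, so it does not cover the secret ref limitation described above.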