fluxcd / flagger

Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments)
https://docs.flagger.app
Apache License 2.0

Seeing `Not Implemented` error with Canary and MetricTemplate #1669

Open joedborg opened 5 days ago

joedborg commented 5 days ago

Describe the bug

I'm getting this error as a new image is being rolled out:

{"level":"error","ts":"2024-06-27T15:12:22.835Z","caller":"controller/events.go:39","msg":"Metric query failed for consumer-lag: error response: {\"code\":5,\"message\":\"Not Implemented (category=INVALID_REQUEST_ERROR code=NOT_FOUND)\",\"details\":[{\"type_url\":\"type.googleapis.com/apierrors.Error\",\"value\":\"CAIQoNQYGg9Ob3QgSW1wbGVtZW50ZWQ=\"}]}","canary":"my-canary.my-ns","stacktrace":"github.com/fluxcd/flagger/pkg/controller.(*Controller).recordEventErrorf\n\t/workspace/pkg/controller/events.go:39\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).runMetricChecks\n\t/workspace/pkg/controller/scheduler_metrics.go:285\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).runAnalysis\n\t/workspace/pkg/controller/scheduler.go:753\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).advanceCanary\n\t/workspace/pkg/controller/scheduler.go:442\ngithub.com/fluxcd/flagger/pkg/controller.CanaryJob.Start.func1\n\t/workspace/pkg/controller/job.go:39"}

With these manifests:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-canary
spec:
  provider: kubernetes
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  progressDeadlineSeconds: 60
  service:
    port: 8080
  analysis:
    interval: 30s
    iterations: 10
    threshold: 2
    metrics:
    - name: consumer-lag
      templateRef:
        name: my-deployment-lag
      thresholdRange:
        max: 1500
      interval: 30m
---
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: my-deployment-lag
spec:
  provider:
    type: prometheus
    address: https://myorg.chronosphere.io:443
    secretRef:
      name: chronosphere
  query: |
    sum by (
      kafka_id, topic, consumer_group_id
    ) (
      confluent_kafka_server_consumer_lag_offsets{
        job="my-job",
        cluster="my-cluster",
        consumer_group_id="my-consumer-group"
      }
    )

This results in a failed canary:

NAME        STATUS   WEIGHT   LASTTRANSITIONTIME
my-canary   Failed   0        2024-06-27T15:13:22Z

My first guess is that Chronosphere's API isn't fully compatible with the Prometheus HTTP API, but I'm not sure.
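
One way to get a bit more out of the error is to decode the base64 `value` in the `details` array of the log line. The payload looks like a small protobuf message, so the sketch below walks the wire format by hand. The schema of `apierrors.Error` is not shown anywhere in this issue, so the field numbers are printed as-is and any meaning attached to them is an assumption.

```python
import base64

# The "value" string copied verbatim from the "details" entry in the log line.
DETAILS_VALUE = "CAIQoNQYGg9Ob3QgSW1wbGVtZW50ZWQ="


def read_varint(buf, pos):
    """Read one protobuf varint starting at pos; return (value, next_pos)."""
    result = 0
    shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def decode_fields(buf):
    """Decode a flat protobuf message into {field_number: value}.

    Only wire types 0 (varint) and 2 (length-delimited) are handled,
    which is all this particular payload uses.
    """
    fields = {}
    pos = 0
    while pos < len(buf):
        tag, pos = read_varint(buf, pos)
        field_number, wire_type = tag >> 3, tag & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value = buf[pos:pos + length]
            pos += length
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields[field_number] = value
    return fields


fields = decode_fields(base64.b64decode(DETAILS_VALUE))
print(fields)  # {1: 2, 2: 404000, 3: b'Not Implemented'}
```

Field 3 repeats the `Not Implemented` message, and field 2 decodes to `404000`. If that number is a 404-style code (an assumption, but it matches the `NOT_FOUND` in the message), the backend may be rejecting the request path rather than the query itself.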

To Reproduce

Use manifests above and attempt a rollout.
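
To check whether the problem is in the endpoint rather than in Flagger, the query can also be sent directly. This is a minimal sketch, assuming the backend exposes the standard Prometheus instant-query path `/api/v1/query` and accepts a bearer token; both are assumptions about Chronosphere, and the request Flagger actually sends may differ. `build_query_url`, `parse_instant_result`, and `run_query` are hypothetical helper names.

```python
import json
import urllib.request
from urllib.parse import urlencode, urlsplit, urlunsplit


def build_query_url(address, query, ts=None):
    """Build a Prometheus instant-query URL under the given base address."""
    parts = urlsplit(address)
    # Assumption: the standard Prometheus HTTP API path applies here.
    path = parts.path.rstrip("/") + "/api/v1/query"
    params = {"query": query}
    if ts is not None:
        params["time"] = str(ts)
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(params), ""))


def parse_instant_result(body):
    """Return [(labels, value)] from a Prometheus instant-query response body."""
    if body.get("status") != "success":
        raise ValueError(f"query failed: {body}")
    return [
        (series.get("metric", {}), float(series["value"][1]))
        for series in body["data"]["result"]
    ]


def run_query(address, query, token):
    """Send the query; the bearer-token header is an assumed auth scheme."""
    request = urllib.request.Request(
        build_query_url(address, query),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_instant_result(json.load(response))


# Example usage (requires network access and a valid token):
# for labels, value in run_query(
#     "https://myorg.chronosphere.io:443",
#     'sum by (kafka_id, topic, consumer_group_id) '
#     '(confluent_kafka_server_consumer_lag_offsets{job="my-job"})',
#     token="<token>",
# ):
#     print(labels, value)
```

If this request fails with the same `Not Implemented` response, the endpoint path or HTTP method is the likely culprit; if it succeeds, the difference lies in how Flagger builds its request.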

Expected behavior

I expect the metric query to succeed and the canary to be promoted.

Additional context

joedborg commented 5 days ago

Tried to address this in https://github.com/fluxcd/flagger/pull/1670