Open joedborg opened 5 days ago
I'm getting this error as a new image is being rolled out:
```
{"level":"error","ts":"2024-06-27T15:12:22.835Z","caller":"controller/events.go:39","msg":"Metric query failed for consumer-lag: error response: {\"code\":5,\"message\":\"Not Implemented (category=INVALID_REQUEST_ERROR code=NOT_FOUND)\",\"details\":[{\"type_url\":\"type.googleapis.com/apierrors.Error\",\"value\":\"CAIQoNQYGg9Ob3QgSW1wbGVtZW50ZWQ=\"}]}","canary":"my-canary.my-ns","stacktrace":"github.com/fluxcd/flagger/pkg/controller.(*Controller).recordEventErrorf\n\t/workspace/pkg/controller/events.go:39\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).runMetricChecks\n\t/workspace/pkg/controller/scheduler_metrics.go:285\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).runAnalysis\n\t/workspace/pkg/controller/scheduler.go:753\ngithub.com/fluxcd/flagger/pkg/controller.(*Controller).advanceCanary\n\t/workspace/pkg/controller/scheduler.go:442\ngithub.com/fluxcd/flagger/pkg/controller.CanaryJob.Start.func1\n\t/workspace/pkg/controller/job.go:39"}
```
With
```yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: my-canary
spec:
  provider: kubernetes
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-deployment
  progressDeadlineSeconds: 60
  service:
    port: 8080
  analysis:
    interval: 30s
    iterations: 10
    threshold: 2
    metrics:
      - name: consumer-lag
        templateRef:
          name: my-deployment-lag
        thresholdRange:
          max: 1500
        interval: 30m
```

```yaml
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: my-deployment-lag
spec:
  provider:
    type: prometheus
    address: https://myorg.chronosphere.io:443
    secretRef:
      name: chronosphere
  query: |
    sum by (
      kafka_id,
      topic,
      consumer_group_id
    ) (
      confluent_kafka_server_consumer_lag_offsets{
        job="my-job",
        cluster="my-cluster",
        consumer_group_id="my-consumer-group"
      }
    )
```
Which results in
```
NAME        STATUS   WEIGHT   LASTTRANSITIONTIME
my-canary   Failed   0        2024-06-27T15:13:22Z
```
My first guess would be that Chronosphere's API isn't exactly the same as Prometheus', but I'm not sure.
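One way to check that guess is to send the query to the endpoint by hand and compare the response with what a real Prometheus server returns. Below is a minimal Python sketch, assuming the provider issues a standard Prometheus instant query against `/api/v1/query` (I haven't confirmed exactly which paths Flagger calls). The helper names `build_instant_query_url` and `first_sample_value` are made up for illustration and are not part of Flagger.

```python
import json
from urllib.parse import urlencode, urljoin


def build_instant_query_url(address: str, query: str) -> str:
    """Build a Prometheus instant-query URL (assumed path: /api/v1/query)."""
    # Normalise the base so urljoin appends the path instead of replacing it.
    base = address.rstrip("/") + "/"
    return urljoin(base, "api/v1/query") + "?" + urlencode({"query": query})


def first_sample_value(body: str) -> float:
    """Return the first sample value from a Prometheus instant-query response."""
    payload = json.loads(body)
    # Prometheus wraps every response in a {"status": ..., "data": ...} envelope.
    if payload.get("status") != "success":
        raise ValueError(f"unexpected response: {body}")
    results = payload["data"]["result"]
    if not results:
        raise ValueError("no values found")
    # Each vector sample is encoded as [unix_timestamp, "value-as-string"].
    return float(results[0]["value"][1])
```

Fetching the generated URL with `curl` and the credentials from the `chronosphere` secret should show whether the backend answers with the Prometheus envelope. The body in the log above (`code`, `message`, `details`) does not have that shape, so `first_sample_value` would reject it.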
Apply the manifests above and trigger a rollout.
I expect the metric query to succeed without this error and the canary to be promoted.
Tried to address this in https://github.com/fluxcd/flagger/pull/1670