fluxcd / flagger

Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments)
https://docs.flagger.app
Apache License 2.0
4.85k stars, 725 forks

Flagger with StackDriver Metric Template: Request was missing field name #1336

Open AnkitAdarsh opened 1 year ago

AnkitAdarsh commented 1 year ago

We are facing issues with MQL while trying to integrate StackDriver with Flagger to perform canary analysis. We have a GKE cluster set up, with Workload Identity configured for the service account.

During the analysis, events reported are as below:

test             0s          Normal    Synced                    canary/ankit                                                 Starting canary analysis for podinfo.test
test             0s          Normal    Synced                    canary/ankit                                                 Pre-rollout check acceptance-test passed
test             0s          Normal    Synced                    canary/ankit                                                 Advance ankit.test canary weight 10
test             0s          Warning   Synced                    canary/ankit                                                 Metric query failed for error-rate: error requesting stackdriver: rpc error: code = InvalidArgument desc = Request was missing field name.
test             0s          Warning   Synced                    canary/ankit                                                 Metric query failed for error-rate: error requesting stackdriver: rpc error: code = InvalidArgument desc = Request was missing field name.
test             0s          Warning   Synced                    canary/ankit                                                 Metric query failed for error-rate: error requesting stackdriver: rpc error: code = InvalidArgument desc = Request was missing field name.
test             0s          Warning   Synced                    canary/ankit                                                 Metric query failed for error-rate: error requesting stackdriver: rpc error: code = InvalidArgument desc = Request was missing field name.
test             0s          Warning   Synced                    canary/ankit                                                 Metric query failed for error-rate: error requesting stackdriver: rpc error: code = InvalidArgument desc = Request was missing field name.

The metric template below uses an MQL query to fetch a sample metric, which in this case is CPU limit utilization.

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
  namespace: test
spec:
  provider:
    type: stackdriver
  query: |
    fetch k8s_container
    | metric 'kubernetes.io/container/cpu/limit_utilization'
    | filter (resource.namespace_name == 'flagger-system')
    | align delta(1m)
    | every 1m
    | group_by 1m, [value_limit_utilization_mean: mean(value.limit_utilization)]

Could you please provide suggestions on where we might be going wrong?

russellrc-keebo commented 9 months ago

I am facing the same issue. In my case, I am trying to use a PromQL query with Stackdriver as the provider. The error I get is this:

Metric query failed for request-success-rate-stackdriver: error requesting stackdriver: rpc error: code = InvalidArgument desc = Request was missing field name

russellrc-keebo commented 9 months ago

I tried using PromQL in a custom stackdriver metric, but it didn't work. I think the Flagger stackdriver MetricTemplate only supports MQL queries, so passing a PromQL query fails as a bad request.

On the other hand, Cloud Monitoring does support PromQL through a specific endpoint (https://monitoring.googleapis.com/v1/projects/PROJECT_ID/location/global/prometheus/api/v1/query), but that endpoint uses OAuth 2 authentication, and I don't know whether or how Flagger supports it.

See: https://cloud.google.com/stackdriver/docs/managed-prometheus/query#api-prometheus

russellrc-keebo commented 9 months ago

I tried using an MQL query to compute the request success rate in a custom stackdriver metric, but I got the same "Request was missing field name" error message in the flagger logs. Here are the error snippet and my query:

Error:

msg: "Metric query failed for request-success-rate-gcp: error requesting stackdriver: rpc error: code = InvalidArgument desc = Request was missing field name."
stacktrace: "github.com/fluxcd/flagger/pkg/controller.(*Controller).recordEventErrorf
    /workspace/pkg/controller/events.go:39
github.com/fluxcd/flagger/pkg/controller.(*Controller).runMetricChecks
    /workspace/pkg/controller/scheduler_metrics.go:285
github.com/fluxcd/flagger/pkg/controller.(*Controller).runAnalysis
    /workspace/pkg/controller/scheduler.go:744
github.com/fluxcd/flagger/pkg/controller.(*Controller).advanceCanary
    /workspace/pkg/controller/scheduler.go:433
github.com/fluxcd/flagger/pkg/controller.CanaryJob.Start.func1
    /workspace/pkg/controller/job.go:39"

Query:

    fetch istio_canonical_service
    | {
        metric istio.io/service/server/request_count
        | filter resource.namespace_name=="{{ namespace }}" && resource.canonical_service_name=="{{ target }}" && metric.destination_service_name=="{{ target }}" && metric.response_code!=200
        | align rate(1m)
        | every 1m
        | sum
        ;
        metric istio.io/service/server/request_count
        | filter resource.namespace_name=="{{ namespace }}" && resource.canonical_service_name=="{{ target }}" && metric.destination_service_name=="{{ target }}"
        | align rate(1m)
        | every 1m
        | sum
      }
    | outer_join 0
    | div
    | mul(-100)
    | add(100)

russellrc-keebo commented 9 months ago

@AnkitAdarsh After delving into the flagger source code, I figured out the problem: the MetricTemplate is missing the secret that includes the name of the GCP project. To fix it, follow these steps from the stackdriver metric documentation (https://docs.flagger.app/usage/metrics#google-cloud-monitoring-stackdriver):

Create a secret that contains your project-id (and, if workload identity is not enabled on your cluster, your service account json). Then reference the secret in the metric template. Note: The particular MQL query used here works if Istio is installed on GKE.
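For illustration, the two steps could look roughly like the manifests below, applied to the error-rate template from the original post. This is a sketch, not a verified configuration: the secret name gcloud-sa, the key name project, and the placeholder project ID my-gcp-project-id are assumptions based on the example in the Flagger metrics documentation, and the secret is assumed to live in the same namespace as the MetricTemplate. Check all of them against the docs for your Flagger version.

```yaml
# Step 1: a secret holding the GCP project ID.
# Assumption: Flagger reads the project ID from a key named "project".
# "gcloud-sa" and "my-gcp-project-id" are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: gcloud-sa
  namespace: test
type: Opaque
stringData:
  project: my-gcp-project-id
---
# Step 2: the original MetricTemplate, now referencing the secret
# through spec.provider.secretRef.
apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
  namespace: test
spec:
  provider:
    type: stackdriver
    secretRef:
      name: gcloud-sa
  query: |
    fetch k8s_container
    | metric 'kubernetes.io/container/cpu/limit_utilization'
    | filter (resource.namespace_name == 'flagger-system')
    | align delta(1m)
    | every 1m
    | group_by 1m, [value_limit_utilization_mean: mean(value.limit_utilization)]
```

With Workload Identity enabled, as in the original setup, the project ID should be the only value the secret needs; without it, the documentation describes adding the service account JSON to the same secret.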