fluxcd / flagger

Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments)
https://docs.flagger.app
Apache License 2.0

Incomplete error messages obscure HTTP 403 issue when accessing prometheus #1434

Open cer opened 1 year ago

cer commented 1 year ago

I'm following the tutorial at https://docs.flagger.app/tutorials/linkerd-progressive-delivery

When I describe the canary deployment - which ultimately failed - I see the following:

  Warning  Synced  28m (x6 over 30m)    flagger  Error checking metric providers: prometheus not avaiable: running query failed: error response:
  Normal   Synced  28m                  flagger  Initialization done! customer-service.default

  Warning  Synced  45s (x10 over 22m)   flagger  Prometheus query failed: running query failed: error response:

The first warning contains a typo ("avaiable") and was most likely caused by pods that were still starting up.

The second warning occurred during the canary rollout, which failed. The message ends at "error response:" with no further detail, so it's unclear what the problem is.

I successfully executed these queries in the Prometheus console, so the data is present.

sum(
    rate(
        response_total{
            namespace="default",
            deployment=~"customer-service-canary",
            classification!="failure",
            direction="inbound"
        }[30s]
    )
)
/
sum(
    rate(
        response_total{
            namespace="default",
            deployment=~"customer-service-canary",
            direction="inbound"
        }[30s]
    )
)
* 100

histogram_quantile(
    0.99,
    sum(
        rate(
            response_latency_ms_bucket{
                namespace="default",
                deployment=~"customer-service",
                direction="inbound"
            }[30s]
        )
    ) by (le)
)

aryan9600 commented 1 year ago

Hello @cer, could you please post the versions of Flagger and Linkerd you're running? Thanks.

stefanprodan commented 1 year ago

@aryan9600 i think he is running https://github.com/stefanprodan/gitops-linkerd which uses latest charts

cer commented 1 year ago

> @aryan9600 i think he is running https://github.com/stefanprodan/gitops-linkerd which uses latest charts

Yes.

The underlying problem was that the Flagger installation was missing the linkerdAuthPolicy.create=true setting, which resulted in HTTP 403s from Prometheus:

kubectl logs -n linkerd-viz prometheus-5cbbbcd594-7lsvn -f

[ 46648.721175s]  INFO ThreadId(01) inbound:server{port=9090}: linkerd_app_inbound::policy::http: Request denied server.group=policy.linkerd.io server.kind=server server.name=prometheus-admin route.group= route.kind=default route.name=default client.tls=None(NoClientHello) client.ip=10.244.0.62

aryan9600 commented 1 year ago

@cer yes, linkerdAuthPolicy.create=true is required for Linkerd >= 2.13, because an AuthorizationPolicy is now needed to access the Prometheus server running in the linkerd-viz namespace. See: https://github.com/fluxcd/flagger/blob/main/CHANGELOG.md#1310

aryan9600 commented 1 year ago

Could we also include the status code in the error message? @stefanprodan