fluxcd / flagger

Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments)
https://docs.flagger.app
Apache License 2.0

Flagger is broken with Gloo v1.8 #976

Closed · Boes-man closed this issue 3 years ago

Boes-man commented 3 years ago

Describe the bug

Applying the image update does not trigger a progressive rollout.

kubectl get canaries -Aw
NAMESPACE   NAME      STATUS        WEIGHT   LASTTRANSITIONTIME
test        podinfo   Progressing   0        2021-08-13T07:03:14Z
test        podinfo   Progressing   5        2021-08-13T07:03:44Z
test        podinfo   Progressing   5        2021-08-13T07:03:54Z
test        podinfo   Progressing   5        2021-08-13T07:04:04Z
test        podinfo   Progressing   5        2021-08-13T07:04:14Z
test        podinfo   Progressing   5        2021-08-13T07:04:24Z
test        podinfo   Progressing   5        2021-08-13T07:04:34Z
test        podinfo   Failed        0        2021-08-13T07:04:44Z
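(The image update referred to here is the usual tutorial step of bumping the podinfo container image; the command below is a sketch and the tag is illustrative, not necessarily the one used.)

kubectl -n test set image deployment/podinfo podinfod=ghcr.io/stefanprodan/podinfo:6.0.1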

Events:
  Type     Reason  Age                    From     Message
  ----     ------  ----                   ----     -------
  Warning  Synced  6m2s                   flagger  podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less then desired generation
  Warning  Synced  5m52s                  flagger  podinfo-primary.test not ready: waiting for rollout to finish: 0 of 2 updated replicas are available
  Normal   Synced  5m42s (x3 over 6m2s)   flagger  all the metrics providers are available!
  Normal   Synced  5m42s                  flagger  Initialization done! podinfo.test
  Normal   Synced  3m42s                  flagger  New revision detected! Scaling up podinfo.test
  Warning  Synced  3m32s                  flagger  canary deployment podinfo.test not ready: waiting for rollout to finish: 0 of 2 updated replicas are available
  Warning  Synced  3m22s                  flagger  canary deployment podinfo.test not ready: waiting for rollout to finish: 1 of 2 updated replicas are available
  Normal   Synced  3m12s                  flagger  Starting canary analysis for podinfo.test
  Normal   Synced  3m12s                  flagger  Pre-rollout check acceptance-test passed
  Normal   Synced  3m12s                  flagger  Advance podinfo.test canary weight 5
  Warning  Synced  2m22s (x5 over 3m2s)   flagger  Halt advancement no values found for gloo metric request-success-rate probably podinfo.test is not receiving traffic: running query failed: no values found
  Warning  Synced  2m12s                  flagger  Rolling back podinfo.test failed checks threshold reached 5
  Warning  Synced  2m12s                  flagger  Canary failed! Scaling down podinfo.test
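The check that halts the rollout is the builtin request-success-rate metric from the canary analysis. For reference, that part of the Canary spec looks roughly like this (values paraphrased from the Gloo tutorial and may differ slightly from the manifest actually applied):

  analysis:
    interval: 10s
    threshold: 5
    maxWeight: 50
    stepWeight: 5
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99
      interval: 1m

The "no values found" warning means the Prometheus query behind that metric returns nothing, i.e. the gloo metrics provider sees no traffic for the canary.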

To Reproduce

On macOS:

microk8s install --cpu 4 --mem 8 -y
microk8s enable dns rbac storage metallb:192.168.64.50-192.168.64.100

Then followed the Gloo canary tutorial from the docs; the remaining install steps are sketched below.
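Roughly, from memory of the tutorial (exact chart values may differ from what was actually run):

helm repo add gloo https://storage.googleapis.com/solo-public-helm
helm upgrade -i gloo gloo/gloo --namespace gloo-system --create-namespace

helm repo add flagger https://flagger.app
helm upgrade -i flagger flagger/flagger \
  --namespace gloo-system \
  --set prometheus.install=true \
  --set meshProvider=gloo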

Did a port-forward to flagger-prometheus and it's up.
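For reference, the port-forward was along these lines (service name taken from the -metrics-server flag on the Flagger deployment; the local port is arbitrary):

kubectl -n gloo-system port-forward svc/flagger-prometheus 9090:9090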

Expected behavior

kubectl -n test describe canary/podinfo

Status:
  Canary Weight:  0
  Failed Checks:  0
  Phase:          Succeeded
Events:
  Type     Reason  Age   From     Message
  ----     ------  ----  ----     -------
  Normal   Synced  3m    flagger  New revision detected podinfo.test
  Normal   Synced  3m    flagger  Scaling up podinfo.test
  Warning  Synced  3m    flagger  Waiting for podinfo.test rollout to finish: 0 of 1 updated replicas are available
  Normal   Synced  3m    flagger  Advance podinfo.test canary weight 5
  Normal   Synced  3m    flagger  Advance podinfo.test canary weight 10
  Normal   Synced  3m    flagger  Advance podinfo.test canary weight 15
  Normal   Synced  2m    flagger  Advance podinfo.test canary weight 20
  Normal   Synced  2m    flagger  Advance podinfo.test canary weight 25
  Normal   Synced  1m    flagger  Advance podinfo.test canary weight 30
  Normal   Synced  1m    flagger  Advance podinfo.test canary weight 35
  Normal   Synced  55s   flagger  Advance podinfo.test canary weight 40
  Normal   Synced  45s   flagger  Advance podinfo.test canary weight 45
  Normal   Synced  35s   flagger  Advance podinfo.test canary weight 50
  Normal   Synced  25s   flagger  Copying podinfo.test template spec to podinfo-primary.test
  Warning  Synced  15s   flagger  Waiting for podinfo-primary.test rollout to finish: 1 of 2 updated replicas are available
  Normal   Synced  5s    flagger  Promotion completed! Scaling down podinfo.test

Additional context

Have tried it on KinD and a 3-node GKE cluster too, same result. I suspect the loadtester (load generator) isn't working?
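If the loadtester (or the traffic it generates) is the culprit, the place to look is the load-test webhook in the Canary analysis. A minimal sketch, assuming the default flagger-loadtester install in the test namespace (in the Gloo tutorial the generated traffic goes through the Gloo gateway-proxy so the gloo metrics provider can observe it; the exact cmd may differ):

  webhooks:
    - name: load-test
      url: http://flagger-loadtester.test/
      timeout: 5s
      metadata:
        cmd: "hey -z 1m -q 10 -c 2 -host podinfo.test http://gateway-proxy.gloo-system"

Exec-ing into the loadtester pod and running the hey command by hand is a quick way to rule out the generator itself.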

helm ls -A
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /Users/danwessels/.kube/config
WARNING: Kubernetes configuration file is world-readable. This is insecure. Location: /Users/danwessels/.kube/config
NAME     NAMESPACE    REVISION  UPDATED                                STATUS    CHART           APP VERSION
flagger  gloo-system  1         2021-08-13 16:52:35.827309 +1000 AEST  deployed  flagger-1.12.1  1.12.1
gloo     gloo-system  1         2021-08-13 16:51:37.555865 +1000 AEST  deployed  gloo-1.8.6

Boes-man commented 3 years ago

9m16s   Normal   Started                       pod/podinfo-primary-cf54546c6-svqn9       Started container podinfod
9m17s   Normal   SuccessfulCreate              replicaset/podinfo-primary-cf54546c6      Created pod: podinfo-primary-cf54546c6-99v9d
9m17s   Normal   SuccessfulCreate              replicaset/podinfo-primary-cf54546c6      Created pod: podinfo-primary-cf54546c6-svqn9
9m17s   Normal   ScalingReplicaSet             deployment/podinfo-primary                Scaled up replica set podinfo-primary-cf54546c6 to 2
8m16s   Warning  FailedGetResourceMetric       horizontalpodautoscaler/podinfo-primary   did not receive metrics for any ready pods
8m16s   Warning  FailedGetResourceMetric       horizontalpodautoscaler/podinfo-primary   failed to get cpu utilization: did not receive metrics for any ready pods
8m16s   Warning  FailedComputeMetricsReplicas  horizontalpodautoscaler/podinfo-primary   failed to compute desired number of replicas based on listed metrics for Deployment/test/podinfo-primary: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: did not receive metrics for any ready pods
11m     Normal   ScalingReplicaSet             deployment/podinfo                        Scaled up replica set podinfo-99dc84b6f to 1
7m43s   Normal   SuccessfulRescale             horizontalpodautoscaler/podinfo           New size: 2; reason: Current number of replicas below Spec.MinReplicas
11m     Normal   ScalingReplicaSet             deployment/podinfo                        Scaled up replica set podinfo-99dc84b6f to 2
6m55s   Warning  FailedGetResourceMetric       horizontalpodautoscaler/podinfo           did not receive metrics for any ready pods
6m55s   Warning  FailedGetResourceMetric       horizontalpodautoscaler/podinfo           failed to get cpu utilization: did not receive metrics for any ready pods
6m56s   Warning  FailedComputeMetricsReplicas  horizontalpodautoscaler/podinfo           failed to compute desired number of replicas based on listed metrics for Deployment/test/podinfo: invalid metrics (1 invalid out of 1), first error is: failed to get cpu utilization: did not receive metrics for any ready pods
8m47s   Normal   Synced                        canary/podinfo                            all the metrics providers are available!
9m17s   Warning  Synced                        canary/podinfo                            podinfo-primary.test not ready: waiting for rollout to finish: observed deployment generation less then desired generation
9m7s    Warning  Synced                        canary/podinfo                            podinfo-primary.test not ready: waiting for rollout to finish: 0 of 2 updated replicas are available
8m57s   Warning  Synced                        canary/podinfo                            podinfo-primary.test not ready: waiting for rollout to finish: 1 of 2 updated replicas are available
8m47s   Normal   ScalingReplicaSet             deployment/podinfo                        Scaled down replica set podinfo-99dc84b6f to 0

stefanprodan commented 3 years ago

I tried updating the e2e test to Gloo v1.8 and indeed routing is broken. Works fine with Gloo v1.6.

Boes-man commented 3 years ago

Thanks @stefanprodan, but it's still not working. I cloned the flagger repo and kicked off flagger/test/gloo/run.sh on a fresh cluster; it fails at the flagger install step.

NOTES:
Flagger installed
deployment.apps/flagger image updated
Waiting for deployment "flagger" rollout to finish: 0 out of 1 new replicas have been updated...
Waiting for deployment "flagger" rollout to finish: 0 out of 1 new replicas have been updated...
Waiting for deployment "flagger" rollout to finish: 0 out of 1 new replicas have been updated...
Waiting for deployment "flagger" rollout to finish: 0 out of 1 new replicas have been updated...
Waiting for deployment "flagger" rollout to finish: 0 of 1 updated replicas are available...

gloo-system      flagger-79ff7c8b8b-fnhpw                  0/1     ImagePullBackOff   0          2m2s
❯ kubectl -n gloo-system describe po/flagger-79ff7c8b8b-fnhpw
Name:         flagger-79ff7c8b8b-fnhpw
Namespace:    gloo-system
Priority:     0
Node:         microk8s-vm/192.168.64.2
Start Time:   Wed, 25 Aug 2021 21:32:49 +1000
Labels:       app.kubernetes.io/instance=flagger
              app.kubernetes.io/name=flagger
              pod-template-hash=79ff7c8b8b
Annotations:  appmesh.k8s.aws/sidecarInjectorWebhook: disabled
              cni.projectcalico.org/podIP: 10.1.254.75/32
              cni.projectcalico.org/podIPs: 10.1.254.75/32
              prometheus.io/port: 8080
              prometheus.io/scrape: true
Status:       Pending
IP:           10.1.254.75
IPs:
  IP:           10.1.254.75
Controlled By:  ReplicaSet/flagger-79ff7c8b8b
Containers:
  flagger:
    Container ID:  
    Image:         test/flagger:latest
    Image ID:      
    Port:          8080/TCP
    Host Port:     0/TCP
    Command:
      ./flagger
      -log-level=info
      -mesh-provider=gloo
      -metrics-server=http://flagger-prometheus:9090
      -enable-config-tracking=true
      -slack-user=flagger
    State:          Waiting
      Reason:       ImagePullBackOff
    Ready:          False
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  512Mi
    Requests:
      cpu:        10m
      memory:     32Mi
    Liveness:     exec [wget --quiet --tries=1 --timeout=4 --spider http://localhost:8080/healthz] delay=0s timeout=5s period=10s #success=1 #failure=3
    Readiness:    exec [wget --quiet --tries=1 --timeout=4 --spider http://localhost:8080/healthz] delay=0s timeout=5s period=10s #success=1 #failure=3
    Environment:  <none>
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-55xlg (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  kube-api-access-55xlg:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason     Age                  From               Message
  ----     ------     ----                 ----               -------
  Normal   Scheduled  2m52s                default-scheduler  Successfully assigned gloo-system/flagger-79ff7c8b8b-fnhpw to microk8s-vm
  Normal   Pulling    49s (x4 over 2m51s)  kubelet            Pulling image "test/flagger:latest"
  Warning  Failed     45s (x4 over 2m28s)  kubelet            Failed to pull image "test/flagger:latest": rpc error: code = Unknown desc = failed to pull and unpack image "docker.io/test/flagger:latest": failed to resolve reference "docker.io/test/flagger:latest": pull access denied, repository does not exist or may require authorization: server message: insufficient_scope: authorization failed
  Warning  Failed     45s (x4 over 2m28s)  kubelet            Error: ErrImagePull
  Warning  Failed     33s (x6 over 2m28s)  kubelet            Error: ImagePullBackOff
  Normal   BackOff    18s (x7 over 2m28s)  kubelet            Back-off pulling image "test/flagger:latest"
❯ glooctl version
Client: {"version":"1.8.8"}
Server: {"type":"Gateway","kubernetes":{"containers":[{"Tag":"1.8.9","Name":"gloo-envoy-wrapper","Registry":"quay.io/solo-io"},{"Tag":"1.8.9","Name":"gloo","Registry":"quay.io/solo-io"},{"Tag":"1.8.9","Name":"gateway","Registry":"quay.io/solo-io"}],"namespace":"gloo-system"}}

stefanprodan commented 3 years ago

Use ghcr.io/fluxcd/flagger:1.13.0; this release fixes the issues with Gloo 1.8.
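On an existing install, one way to pick up the fixed release is to point the running deployment at that image directly, e.g. (assuming Flagger is deployed in gloo-system as in this thread):

kubectl -n gloo-system set image deployment/flagger flagger=ghcr.io/fluxcd/flagger:1.13.0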

Boes-man commented 3 years ago

@stefanprodan, for the repo e2e test (flagger/test/gloo/run.sh) I commented out the kubectl -n gloo-system set image deployment/flagger flagger=test/flagger:latest line in install.sh, and then it works. In the tutorial docs, helm upgrade -i flagger flagger/flagger still installs 1.12.x, which fails; adding --set image.tag=1.13.0 does fix it.
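For anyone hitting this before the chart default moves past 1.12.x, the full Helm invocation with the tag pinned would look something like this (other values assumed from the Gloo tutorial; adjust to your setup):

helm upgrade -i flagger flagger/flagger \
  --namespace gloo-system \
  --set meshProvider=gloo \
  --set metricsServer=http://flagger-prometheus:9090 \
  --set image.tag=1.13.0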

Boes-man commented 3 years ago

FYI, I also noticed that running kubectl -n test describe canary/podinfo for the "Automated rollback" section doesn't show the events as expected. Tailing the Flagger pod logs does show them, however: kubectl -n gloo-system logs flagger-55f9868585-xq5wg -f. Thanks