fluxcd / flagger

Progressive delivery Kubernetes operator (Canary, A/B Testing and Blue/Green deployments)
https://docs.flagger.app
Apache License 2.0
4.81k stars 722 forks

GatewayAPI Session Affinity not honored #1647

Open ethankhall opened 2 months ago

ethankhall commented 2 months ago

Describe the bug

When configuring a Canary object to use session affinity with a Kubernetes Gateway API gateway, as described in Session Affinity, I ran a K6 test to verify that users stayed assigned to a version and weren't shifted back on a successful deploy.

I noticed that within 1 second, all the users were shifted to the next version.

I believe this is happening because the HTTPRoute being created doesn't pin the user to the primary version.

HTTPRoute

```yaml
spec:
  hostnames:
    - charmander.example.com
  parentRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: default-gateway
      namespace: istio-ingress
  rules:
    - backendRefs:
        - group: ""
          kind: Service
          name: charmander-primary
          port: 9898
          weight: 0
        - group: ""
          kind: Service
          name: charmander-canary
          port: 9898
          weight: 100
      matches:
        - headers:
            - name: Cookie
              type: RegularExpression
              value: .*flagger-cookie.*nROEvCteRd.*
          path:
            type: PathPrefix
            value: /
    - backendRefs:
        - group: ""
          kind: Service
          name: charmander-primary
          port: 9898
          weight: 95
        - filters:
            - responseHeaderModifier:
                add:
                  - name: Set-Cookie
                    value: flagger-cookie=nROEvCteRd; Max-Age=3600
              type: ResponseHeaderModifier
          group: ""
          kind: Service
          name: charmander-canary
          port: 9898
          weight: 5
      matches:
        - path:
            type: PathPrefix
            value: /
```
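To illustrate what "pinning the user to the primary version" could look like, here is a hypothetical additional rule (this is NOT what Flagger generated at the time of this report; the cookie name `flagger-primary-cookie` is made up for illustration). Users who already hold a primary cookie would be matched first and routed to `charmander-primary` regardless of the current weights:

```yaml
# Hypothetical rule, not generated by Flagger here: pin users holding a
# primary affinity cookie to charmander-primary during the rollout.
- backendRefs:
    - group: ""
      kind: Service
      name: charmander-primary
      port: 9898
      weight: 100
  matches:
    - headers:
        - name: Cookie
          type: RegularExpression
          value: .*flagger-primary-cookie.*   # hypothetical cookie name
      path:
        type: PathPrefix
        value: /
```

Without a rule like this, only canary-cookie holders are matched explicitly; everyone else re-rolls the weighted split on every request, and on promotion the weights move everyone at once.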

Note: charmander is a deployment of ghcr.io/stefanprodan/podinfo.

To Reproduce

K8s YAML and K6 script

```yaml
---
# Source: charmander/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: charmander
  namespace: charmander
  labels:
    app.kubernetes.io/name: charmander
    app.kubernetes.io/component: "web"
spec:
  minReadySeconds: 5
  replicas: 3
  revisionHistoryLimit: 5
  progressDeadlineSeconds: 60
  strategy:
    rollingUpdate:
      maxUnavailable: 1
    type: RollingUpdate
  selector:
    matchLabels:
      app.kubernetes.io/name: charmander
      app.kubernetes.io/component: "web"
  template:
    metadata:
      annotations:
        prometheus.io/scrape: "true"
        prometheus.io/port: "9797"
        unique-title: 'greetings from deploy v1'
      labels:
        app.kubernetes.io/name: charmander
        app.kubernetes.io/component: "web"
    spec:
      containers:
        - name: podinfod
          image: ghcr.io/stefanprodan/podinfo:6.5.0
          imagePullPolicy: IfNotPresent
          ports:
            - name: http
              containerPort: 9898
              protocol: TCP
            - name: http-metrics
              containerPort: 9797
              protocol: TCP
            - name: grpc
              containerPort: 9999
              protocol: TCP
          command:
            - ./podinfo
            - --port=9898
            - --port-metrics=9797
            - --grpc-port=9999
            - --grpc-service-name=podinfo
            - --level=info
            - --random-delay=false
            - --random-error=true
          env:
            - name: PODINFO_UI_COLOR
              value: "#34577c"
            - name: PODINFO_UI_MESSAGE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.annotations['unique-title']
          startupProbe:
            exec:
              command:
                - podcli
                - check
                - http
                - localhost:9898/healthz
            initialDelaySeconds: 30
            timeoutSeconds: 5
          resources:
            limits:
              cpu: 2000m
              memory: 512Mi
            requests:
              cpu: 100m
              memory: 64Mi
---
# Source: charmander/templates/canary.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: charmander-canary
  namespace: charmander
spec:
  # when set to true, deploy will auto succeed; only use during an emergency
  skipAnalysis: false
  # deployment reference
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: charmander
  # the maximum time in seconds for the canary deployment
  # to make progress before it is rolled back (default 600s)
  progressDeadlineSeconds: 120
  service:
    gatewayRefs:
      - group: gateway.networking.k8s.io
        kind: Gateway
        name: default-gateway
        namespace: istio-ingress
    hosts:
      - 'charmander.example.com'
    port: 9898
    targetPort: 9898
  analysis:
    interval: 1m
    maxWeight: 50
    metrics: []
    sessionAffinity:
      cookieName: flagger-cookie
      maxAge: 3600
    stepWeight: 10
    threshold: 5
```

And running the k6 script

```javascript
import http from 'k6/http';
import { check } from 'k6';

export const URL = "https://charmander.example.com/"

export const options = {
  // A number specifying the number of VUs to run concurrently.
  vus: 6,
  // A string specifying the total duration of the test run.
  duration: '600s',
  // Disable clearing cookies between iterations
  noCookiesReset: true
};

function parseRevision(resp) {
  try {
    return resp.json().message;
  } catch (e) {
    return null
  }
}

export function setup() {
  return { revision: null, changeCount: 0 };
}

export default function (data) {
  var resp = http.get(URL);
  var revision = parseRevision(resp);
  if (data.revision == null) {
    console.log(`VU initial version ${revision}`)
    data.revision = revision;
  }
  if (revision && revision !== data.revision) {
    data.changeCount++;
    console.log(data.revision + " : " + revision)
    data.revision = revision;
  }
  check(resp, { 'changeCount < 2': () => data.changeCount < 2 });
}

export function teardown(data) {
  console.log(data);
}
```

The output looks like

    scenarios: (100.00%) 1 scenario, 6 max VUs, 10m30s max duration (incl. graceful stop):
              * default: 6 looping VUs for 10m0s (gracefulStop: 30s)

INFO[0000] VU initial version greetings from deploy v2   source=console
INFO[0000] VU initial version greetings from deploy v1   source=console
INFO[0000] VU initial version greetings from deploy v1   source=console
INFO[0000] VU initial version greetings from deploy v2   source=console
INFO[0000] VU initial version greetings from deploy v1   source=console
INFO[0000] VU initial version greetings from deploy v1   source=console
INFO[0000] greetings from deploy v1 : greetings from deploy v2  source=console
INFO[0000] greetings from deploy v1 : greetings from deploy v2  source=console
INFO[0000] greetings from deploy v1 : greetings from deploy v2  source=console
INFO[0001] greetings from deploy v1 : greetings from deploy v2  source=console
INFO[0600] {"changeCount":0,"revision":null}             source=console

     ✓ changeCount < 2

     █ setup

     █ teardown

     checks.........................: 100.00% ✓ 63985      ✗ 0
     data_received..................: 27 MB   46 kB/s
     data_sent......................: 3.0 MB  4.9 kB/s
     http_req_blocked...............: avg=50.85µs min=0s      med=1µs     max=695.65ms p(90)=1µs     p(95)=1µs
     http_req_connecting............: avg=11.94µs min=0s      med=0s      max=86.31ms  p(90)=0s      p(95)=0s
     http_req_duration..............: avg=55.93ms min=33.96ms med=53.5ms  max=461.31ms p(90)=64.63ms p(95)=78.13ms
       { expected_response:true }...: avg=56.53ms min=33.96ms med=53.33ms max=461.31ms p(90)=66.94ms p(95)=87.43ms
     http_req_failed................: 35.18%  ✓ 22515      ✗ 41470
     http_req_receiving.............: avg=1.57ms  min=6µs     med=46µs    max=308.44ms p(90)=122µs   p(95)=413.79µs
     http_req_sending...............: avg=80.69µs min=8µs     med=43µs    max=26.45ms  p(90)=85µs    p(95)=130µs
     http_req_tls_handshaking.......: avg=32.48µs min=0s      med=0s      max=301.47ms p(90)=0s      p(95)=0s
     http_req_waiting...............: avg=54.28ms min=33.81ms med=53.06ms max=461.21ms p(90)=61.73ms p(95)=65.97ms
     http_reqs......................: 63985   106.637746/s
     iteration_duration.............: avg=56.24ms min=1.79µs  med=53.74ms max=772.84ms p(90)=64.99ms p(95)=78.55ms
     iterations.....................: 63985   106.637746/s
     vus............................: 6       min=6        max=6
     vus_max........................: 6       min=6        max=6

running (10m00.0s), 0/6 VUs, 63985 complete and 0 interrupted iterations
default ✓ [======================================] 6 VUs  10m0s

Expected behavior

While the test runs, each user should stay pinned to the version they were initially assigned, and the split between primary and canary should roughly match the configured weights.

Additional context

ethankhall commented 2 months ago

Maybe related to https://github.com/fluxcd/flagger/issues/1532