emissary-ingress / emissary

open source Kubernetes-native API gateway for microservices built on the Envoy Proxy
https://www.getambassador.io
Apache License 2.0

ir/cache mismatch - when modifying existing (and working) mappings #5279

Open kriptor opened 1 year ago

kriptor commented 1 year ago

Describe the bug I have a cluster running dev and stg namespaced services (two services per namespace). The setup works and all my mappings are serving traffic. Then I change the existing mappings for one environment to point to nonexistent services, to put that environment into maintenance mode (returning 503). At that point I get the IR MISMATCH and ENVOY CONFIG MISMATCH errors, and some mappings from the other, unchanged environment disappear from the http://127.0.0.1:8877/ambassador/v0/diag/ page. Those, of course, stop working (returning 404).

[emissary-ingress]: 2023-09-01 11:42:49 diagd 3.7.2 [P25TAEW] ERROR: CACHE: IR MISMATCH
[emissary-ingress]: 2023-09-01 11:42:49 diagd 3.7.2 [P25TAEW] ERROR: CACHE: ENVOY CONFIG MISMATCH
[emissary-ingress]: 2023-09-01 11:42:49 diagd 3.7.2 [P25TAEW] INFO: CACHE: check failed

To Reproduce

  1. I have an initial setup (everything works and there are no mismatch errors):
    • mapping for dev web site (pointing to service web.dev:80; hostname dev-web.example.com, prefix /, rewrite /)
    • mapping for dev api (pointing to service api.dev:8080; hostname dev-api.example.com, prefix /v1, rewrite /v1)
    • mapping for dev api webhooks (pointing to service api.dev:8080; hostname dev-api.example.com, prefix /webhooks, rewrite /webhooks)
    • mapping for stg web site (pointing to service web.stg:80; hostname stg-web.example.com, prefix /, rewrite /)
    • mapping for stg api (pointing to service api.stg:8080; hostname stg-api.example.com, prefix /v1, rewrite /v1)
    • mapping for stg api webhooks (pointing to service api.stg:8080; hostname stg-api.example.com, prefix /webhooks, rewrite /webhooks)
  2. I then want to set up a maintenance mode for the dev services by making them all return 503. What I do is change the dev mappings to point to nonexistent services within my k8s cluster; all other parts of the mappings stay the same (see the minimal sketch after these steps):
    • mapping for dev web site (pointing to service web-nonexistent.dev:80; hostname dev-web.example.com, prefix /, rewrite /)
    • mapping for dev api (pointing to service api-nonexistent.dev:8080; hostname dev-api.example.com, prefix /v1, rewrite /v1)
    • mapping for dev api webhooks (pointing to service api-nonexistent.dev:8080; hostname dev-api.example.com, prefix /webhooks, rewrite /webhooks)
    • mapping for stg web site (pointing to service web.stg:80; hostname stg-web.example.com, prefix /, rewrite /)
    • mapping for stg api (pointing to service api.stg:8080; hostname stg-api.example.com, prefix /v1, rewrite /v1)
    • mapping for stg api webhooks (pointing to service api.stg:8080; hostname stg-api.example.com, prefix /webhooks, rewrite /webhooks)
  3. By doing that I effectively put my dev environment into maintenance mode (503 responses), but for some reason I start getting the mismatch errors, and presumably because of that both the stg api and stg webhooks mappings disappear from the diag page.
  4. If I then do a rolling restart of Emissary, everything ends up the way it should have been without the restart: dev is in maintenance mode and stg is fully working.
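
For reference, here is a minimal sketch of one dev Mapping before and after step 2, trimmed to the relevant fields (only service changes; the full helm diff is in a comment below):

    apiVersion: getambassador.io/v3alpha1
    kind: Mapping
    metadata:
      name: dev-api.example.com
      namespace: ambassador
    spec:
      hostname: dev-api.example.com
      prefix: /v1
      rewrite: /v1
      # step 1 (working): route to the real dev service
      service: api.dev:8080
      # step 2 (maintenance mode): point at a service that does not exist,
      # so requests to this route fail and return 503
      # service: api-nonexistent.dev:8080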

Expected behavior I should be able to modify existing mappings without running into these strange mismatch errors and without needing to restart Emissary.

Versions (please complete the following information): Emissary-ingress 3.7.2 (per the diagd log lines above).

Additional context I believe there is a bug here (it looks unrelated to the actual cause of the mismatch errors, but it would be nice to see an econf diff instead of the IR diff twice): https://github.com/emissary-ingress/emissary/blob/v3.7.2/python/ambassador_diag/diagd.py#L553. The line should be: errors += self.json_diff("econf", e1, e2)
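
To make the suspected bug concrete, here is a hedged reconstruction of that check (based on this report, not copied verbatim from diagd.py):

    # ambassador_diag/diagd.py, near the linked line (reconstruction based
    # on this report, not verbatim source). The cache check apparently diffs
    # the two IRs twice instead of diffing the IRs once and the Envoy
    # configs once.
    errors += self.json_diff("ir", i1, i2)       # IR comparison
    errors += self.json_diff("ir", i1, i2)       # suspected bug: IRs again
    # proposed fix for the second call:
    # errors += self.json_diff("econf", e1, e2)  # Envoy config comparison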

kriptor commented 1 year ago

At that point I get the IR MISMATCH and ENVOY CONFIG MISMATCH errors, and some mappings from the other, unchanged environment disappear from the http://127.0.0.1:8877/ambassador/v0/diag/ page. Those, of course, stop working (returning 404).

To clarify: when I mentioned mappings disappearing from the diag page, I meant they disappear from the Ambassador Route Table on the diag page.

kriptor commented 1 year ago

This is the actual change that triggers the mismatch errors and, very likely, the more problematic disappearance of the analogous api.stg routes, causing 404s in the stg environment.

ambassador, dev-api.example.com, Mapping (getambassador.io) has changed:
  # Source: ambassador-setup/templates/mappings.yaml
  apiVersion: getambassador.io/v3alpha1
  kind: Mapping
  metadata:
    name: dev-api.example.com
    labels:
      helm.sh/chart: ambassador-setup-0.1.0
      app.kubernetes.io/name: ambassador-setup
      app.kubernetes.io/instance: ambassador-setup-dev
      app.kubernetes.io/version: "0.1.0"
      app.kubernetes.io/managed-by: Helm
      my-host: wildcard-subdomain
  spec:
    connect_timeout_ms: 1000
    docs:
      ignored: true
+   error_response_overrides:
+   - body:
+       content_type: application/json
+       text_format_source:
+         filename: /ambassador/ambassador-errorpages/503-maintenance-lockdown.json
+     on_status_code: 503
    hostname: dev-api.example.com
    precedence: 9
    prefix: /v1
    rewrite: /v1
-   service: api.dev:8080
+   service: api-nonexistent.dev:8080
    timeout_ms: 10000
ambassador, dev-api.example.com-webhooks, Mapping (getambassador.io) has changed:
  # Source: ambassador-setup/templates/mappings.yaml
  apiVersion: getambassador.io/v3alpha1
  kind: Mapping
  metadata:
    name: dev-api.example.com-webhooks
    labels:
      helm.sh/chart: ambassador-setup-0.1.0
      app.kubernetes.io/name: ambassador-setup
      app.kubernetes.io/instance: ambassador-setup-dev
      app.kubernetes.io/version: "0.1.0"
      app.kubernetes.io/managed-by: Helm
      my-host: wildcard-subdomain
  spec:
    connect_timeout_ms: 1000
    docs:
      ignored: true
+   error_response_overrides:
+   - body:
+       content_type: application/json
+       text_format_source:
+         filename: /ambassador/ambassador-errorpages/503-maintenance-lockdown.json
+     on_status_code: 503
    hostname: dev-api.example.com
    precedence: 8
    prefix: /webhooks
    rewrite: /webhooks
-   service: api.dev:8080
+   service: api-nonexistent.dev:8080
    timeout_ms: 10000
ambassador, dev-www.example.com, Mapping (getambassador.io) has changed:
  # Source: ambassador-setup/templates/mappings.yaml
  apiVersion: getambassador.io/v3alpha1
  kind: Mapping
  metadata:
    name: dev-www.example.com
    labels:
      helm.sh/chart: ambassador-setup-0.1.0
      app.kubernetes.io/name: ambassador-setup
      app.kubernetes.io/instance: ambassador-setup-dev
      app.kubernetes.io/version: "0.1.0"
      app.kubernetes.io/managed-by: Helm
      my-host: wildcard-subdomain
  spec:
    connect_timeout_ms: 1000
    docs:
      ignored: true
    error_response_overrides:
    - body:
        content_type: text/html
        text_format_source:
          filename: /ambassador/ambassador-errorpages/429.html
      on_status_code: 429
    - body:
        content_type: text/html
        text_format_source:
          filename: /ambassador/ambassador-errorpages/500.html
      on_status_code: 500
    - body:
        content_type: text/html
        text_format_source:
          filename: /ambassador/ambassador-errorpages/502.html
      on_status_code: 502
    - body:
        content_type: text/html
        text_format_source:
-         filename: /ambassador/ambassador-errorpages/503.html
+         filename: /ambassador/ambassador-errorpages/503-maintenance-lockdown.html
      on_status_code: 503
    - body:
        content_type: text/html
        text_format_source:
          filename: /ambassador/ambassador-errorpages/504.html
      on_status_code: 504
    hostname: dev-www.example.com
    precedence: 10
    prefix: /
    rewrite: /
-   service: web.dev:80
+   service: web-nonexistent.dev:80
    timeout_ms: 10000
h4ckroot commented 8 months ago

I am having the same issue! Whenever I update working mappings, the routes get screwed up, and I can see IR MISMATCH errors. @kriptor any luck with that?

juanjoku commented 2 months ago

Hello @kriptor,

Can you reproduce the error on a cluster containing only those six mappings? Or is it a cluster in which, in addition to these mappings, there are many others (hundreds)?

I am investigating a problem that I believe may be the same as yours (but with a lot of mappings).