VictoriaMetrics / operator

Kubernetes operator for Victoria Metrics

Operator duplicates values from VMAlertmanagerConfig to config secret #954

Closed: mariapoliss closed this issue 1 month ago

mariapoliss commented 1 month ago

Hello!

I found a problem with how the Secret containing the VMAlertmanager config is updated. I am running v0.44.0, the latest release at the moment. For example, if I create a VMAlertmanagerConfig like this:

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAlertmanagerConfig
metadata:
  name: test-amc
spec:
  inhibit_rules:
    - equal:
        - alertname
      source_matchers:
        - severity="critical"
      target_matchers:
        - severity="warning"
    - source_matchers:
        - alertname="QuietWeeklyNotifications"
      target_matchers:
        - alert_group="l2ci_weekly"
  receivers:
    - name: l2ci_receiver
      webhook_configs:
        - url: http://notification_stub_ci1:8080
    - name: blackhole
    - name: ca_em_receiver
      webhook_configs:
        - url: http://notification_stub_ci2:8080
  route:
    group_by:
      - alertname
      - l2ci_channel
    routes:
      - matchers:
          - alertname="QuietWeeklyNotifications"
        receiver: blackhole
      - matchers:
          - alertname="QuietDailyNotifications"
        receiver: blackhole
      - matchers:
          - alert_group=~"^l2ci.*"
        receiver: l2ci_receiver

The VictoriaMetrics operator watches test-amc and updates the secret whose name is set in the VMAlertmanager resource (spec.configSecret), and it does so on every reconcile cycle.

After the secret is updated, I see duplicated values (receivers, inhibit_rules and route.routes) in the secret's alertmanager.yaml data:

route:
  receiver: default-test-amc-blackhole
  routes:
    - routes:
        - matchers:
            - alertname="QuietWeeklyNotifications"
          receiver: default-test-amc-blackhole
          continue: false
        - matchers:
            - alertname="QuietDailyNotifications"
          receiver: default-test-amc-blackhole
          continue: false
        - matchers:
            - alert_group=~"^l2ci.*"
          receiver: default-test-amc-l2ci_receiver
          continue: false
      matchers:
        - namespace = "default"
      group_by:
        - alertname
        - l2ci_channel
      receiver: default-test-amc-blackhole
      continue: true
    - routes:
        - matchers:
            - alertname="QuietWeeklyNotifications"
          receiver: default-test-amc-blackhole
          continue: false
        - matchers:
            - alertname="QuietDailyNotifications"
          receiver: default-test-amc-blackhole
          continue: false
        - matchers:
            - alert_group=~"^l2ci.*"
          receiver: default-test-amc-l2ci_receiver
          continue: false
      matchers:
        - namespace = "default"
      group_by:
        - alertname
        - l2ci_channel
      receiver: default-test-amc-blackhole
      continue: true
inhibit_rules:
  - target_matchers:
      - severity="warning"
      - namespace = "default"
    source_matchers:
      - severity="critical"
      - namespace = "default"
    equal:
      - alertname
  - target_matchers:
      - alert_group="l2ci_weekly"
      - namespace = "default"
    source_matchers:
      - alertname="QuietWeeklyNotifications"
      - namespace = "default"
  - target_matchers:
      - severity="warning"
      - namespace = "default"
    source_matchers:
      - severity="critical"
      - namespace = "default"
    equal:
      - alertname
  - target_matchers:
      - alert_group="l2ci_weekly"
      - namespace = "default"
    source_matchers:
      - alertname="QuietWeeklyNotifications"
      - namespace = "default"
receivers:
  - name: default-test-amc-l2ci_receiver
    webhook_configs:
      - url: http://notification_stub_ci1:8080
  - name: default-test-amc-blackhole
  - name: default-test-amc-ca_em_receiver
    webhook_configs:
      - url: http://notification_stub_ci2:8080
  - name: default-test-amc-l2ci_receiver
    webhook_configs:
      - url: http://notification_stub_ci1:8080
  - name: default-test-amc-blackhole
  - name: default-test-amc-ca_em_receiver
    webhook_configs:
      - url: http://notification_stub_ci2:8080
templates: []

The root cause seems to be in the buildConfig function: https://github.com/VictoriaMetrics/operator/blob/v0.44.0/controllers/factory/alertmanager/config.go#L55. There is no check for duplicates among the existing receivers, inhibit rules and subroutes, so the generated entries are appended again on every reconcile.
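
For illustration, here is a minimal sketch of the kind of duplicate guard that appears to be missing; the type and function names below are hypothetical, not the operator's actual code:

package alertmanager

// receiver is a simplified stand-in for a parsed Alertmanager receiver;
// only the name matters for deduplication.
type receiver struct {
	Name string `yaml:"name"`
}

// appendUniqueReceivers adds only those generated receivers whose names are
// not already present in the existing list, so repeated reconcile cycles
// stay idempotent instead of appending another copy each time.
func appendUniqueReceivers(existing, generated []receiver) []receiver {
	seen := make(map[string]struct{}, len(existing))
	for _, r := range existing {
		seen[r.Name] = struct{}{}
	}
	for _, r := range generated {
		if _, ok := seen[r.Name]; ok {
			continue // already added by a previous reconcile
		}
		seen[r.Name] = struct{}{}
		existing = append(existing, r)
	}
	return existing
}

The same kind of check would be needed for inhibit_rules and the generated subroutes.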

Finally, I see error logs in vmalertmanager:

ts=2024-01-17T12:12:04.662Z caller=coordinator.go:118 level=error component=configuration msg="Loading configuration file failed" file=/etc/alertmanager/config/alertmanager.yaml err="notification config name \"test-amc-l2ci_receiver\" is not unique"

and in the operator after some time (several hours), because the secret has accumulated so many duplicated values that it exceeds the 1 MiB Secret size limit:

{"level":"error","ts":"2024-05-16T11:00:37Z","logger":"manager","msg":"Reconciler error","controller":"vmalertmanager","controllerGroup":"operator.victoriametrics.com","controllerKind":"VMAlertmanager","VMAlertmanager":{"name":"test-am","namespace":"test"},"namespace":"test","name":"test-am","reconcileID":"45f70f50-ccdd-4542-9c56-51835b8cdc99","error":"callback error: failed to check default Alertmanager config: Secret \"vmalertmanager-config\" is invalid: data: Too long: must have at most 1048576 bytes","stacktrace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\tsigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:329\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\tsigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\tsigs.k8s.io/controller-runtime@v0.17.2/pkg/internal/controller/controller.go:227"}

I created a test to check the behavior locally, and the problem reproduces there as well.

alertmanager_test.zip

Could you please investigate and fix the problem?

f41gh7 commented 1 month ago

Hello, thanks for reporting!

The issue is caused by a name clash between configSecret and the secret created by the operator itself. The operator builds the following name for its own secret: vmalertmanager-%CR_NAME-config. This may clash with an existing secret.

An example of a config that triggers the bug:

apiVersion: v1
kind: Secret
metadata:
  name: vmalertmanager-example-config
  labels:
    app: vm-operator
type: Opaque
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m
    route:
      receiver: 'blackhole'
    receivers:
      - name: 'blackhole'
---
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAlertmanager
metadata:
  name: example
spec:
  replicaCount: 1
  configSecret: vmalertmanager-example-config
  selectAllByDefault: true

I think the proper fix is to ignore any content of the secret if its name matches the name of the secret created by the operator.
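
Sketched in Go (the function names here are hypothetical, not the actual operator code), the guard could look like this:

package alertmanager

import "fmt"

// operatorConfigSecretName mirrors the naming scheme described above:
// vmalertmanager-%CR_NAME-config.
func operatorConfigSecretName(crName string) string {
	return fmt.Sprintf("vmalertmanager-%s-config", crName)
}

// shouldIgnoreConfigSecret reports whether the user-supplied configSecret
// must be ignored because its name clashes with the secret the operator
// manages itself; otherwise the operator would re-read its own generated
// output as user input on every reconcile.
func shouldIgnoreConfigSecret(configSecret, crName string) bool {
	return configSecret == operatorConfigSecretName(crName)
}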

It will be fixed in the next patch release.

mariapoliss commented 1 month ago

Thank you for the fast response!

One more detail: I set configSecret in my custom resource VMAlertmanager (not the default secret name), and the fix proposed now won't resolve the problem in the case where I set the secret name myself.

f41gh7 commented 1 month ago

> Thank you for the fast response!
>
> One more detail: I set configSecret in my custom resource VMAlertmanager (not the default secret name), and the fix proposed now won't resolve the problem in the case where I set the secret name myself.

You have to set the value of configSecret to a different name; that should solve the issue. The linked PR should resolve it as well.
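
For example (the secret name below is arbitrary; anything that does not collide with the operator-generated vmalertmanager-example-config works):

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAlertmanager
metadata:
  name: example
spec:
  replicaCount: 1
  # any name other than the operator-generated vmalertmanager-example-config
  configSecret: example-alertmanager-base-config
  selectAllByDefault: true

With a non-clashing name, the operator's own secret stays separate from the user-provided base config.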

Thanks for the test case; I added it to the PR.

f41gh7 commented 1 month ago

The changes were included in the v0.45.0 release.