DataDog / datadog-operator

Kubernetes Operator for Datadog Resources
Apache License 2.0

Duplicate SLO created for each `DatadogSLO` #1062

Open jeff-jsq opened 7 months ago

jeff-jsq commented 7 months ago

Describe what happened: I've confirmed I'm only running a single Datadog operator in my K8s cluster, but it seems each DatadogSLO creates multiple SLOs in Datadog.

Running 1.3.0 of the operator, creating an example DatadogSLO:

apiVersion: datadoghq.com/v1alpha1
kind: DatadogSLO
metadata:
  name: text-xyz
  namespace: test
spec:
  description: Error SLO for test-xyz
  name: Error SLO for test-xyz
  query:
    denominator: sum:trace.pyramid.request.hits{service:test-xyz, env:test}.as_count()
    numerator: sum:trace.pyramid.request.hits{service:test-xyz, env:test}.as_count()
      - sum:trace.pyramid.request.errors{service:test-xyz, env:test}.as_count()
  tags:
  - integration:kubernetes
  - service:test-xyz
  - env:test
  - team:sre
  - generated:kubernetes
  targetThreshold: 99500m
  timeframe: 7d
  type: metric

results in multiple SLOs being created in Datadog:

[Screenshot: CleanShot 2024-01-31 at 11 35 23@2x]

Deleting the DatadogSLO results in one of the SLOs being orphaned in Datadog.

Describe what you expected:

I expect a single DatadogSLO resource to result in a single SLO created in Datadog.

Steps to reproduce the issue:

Install the Datadog Operator via Helm (chart version 1.4.1) with the following values:

datadogCRDs:
  crds:
    datadogSLOs: true
apiKeyExistingSecret: datadog-secret
appKeyExistingSecret: datadog-secret
datadogMonitor:
  enabled: true
datadogSLO:
  enabled: true
site: datadoghq.com
watchNamespaces:
- ""

Kubectl apply the example DatadogSLO above.

Additional environment details (Operating System, Cloud provider, etc):

khewonc commented 7 months ago

Hi, thanks for reporting this. We'll look into this on our end to see why multiple SLOs are getting created.

paulbrassard-figure commented 1 month ago

I've also seen this issue using the 1.8.3 helm chart with the 1.7.0 operator.

Additionally, I was using Kyverno with a generate policy for DatadogSLOs and synchronization turned on. My target threshold was set to "99.0" and the datadog-operator controller would change it to "99", which caused Kyverno and the datadog-operator to fight back and forth changing it. The result was that I had around 40 duplicate SLOs as described in this issue. I only add all this to say that it seems that this problem gets exacerbated by updating the resource.
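
The back-and-forth described above can be illustrated with a small sketch. It assumes, without confirmation from the operator source, that the rewrite from "99.0" to "99" is a plain numeric canonicalization. The functions `canonical_threshold` and `count_rewrites` are hypothetical names used only for illustration; they model two controllers, one that canonicalizes the stored value and one that restores its desired value whenever it sees a difference.

```python
from decimal import Decimal


def canonical_threshold(value: str) -> str:
    """Hypothetical normalizer: drops trailing zeros, so "99.0" becomes "99"."""
    # format(..., "f") avoids scientific notation for values such as "100".
    return format(Decimal(value).normalize(), "f")


def differs_as_text(desired: str, observed: str) -> bool:
    """Compare the raw strings, as a naive sync policy would."""
    return desired != observed


def differs_as_number(desired: str, observed: str) -> bool:
    """Compare the numeric values, ignoring formatting differences."""
    return Decimal(desired) != Decimal(observed)


def count_rewrites(desired: str, rounds: int, differs) -> int:
    """Count how often the syncing side rewrites the field over several rounds."""
    stored = desired
    rewrites = 0
    for _ in range(rounds):
        # One controller stores the value in canonical form.
        observed = canonical_threshold(stored)
        # The other controller restores its desired value if it sees a diff.
        if differs(desired, observed):
            rewrites += 1
        stored = desired
    return rewrites


if __name__ == "__main__":
    print(count_rewrites("99.0", 5, differs_as_text))    # one rewrite per round
    print(count_rewrites("99.0", 5, differs_as_number))  # no rewrites
```

Under that assumption, writing the threshold in its canonical form in the source manifest ("99" rather than "99.0") would give neither side a textual difference to correct.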

levan-m commented 1 month ago

Thanks for reporting the issue @paulbrassard-figure!

As mentioned here, the fix addressed one specific case leading to duplication, namely concurrent reconciliation of the resource. Because the SLO Create API is not idempotent, we can't guarantee that duplication won't happen. It would be great if you could share more details about your setup and how to reproduce the issue with Kyverno and, if possible, without it.
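
For readers unfamiliar with why a non-idempotent create call leads to duplicates, here is a minimal sketch of one common mitigation: serialize reconciliation per resource and look up an existing object by a unique marker before creating. This is not the operator's actual implementation. `FakeSLOAPI`, `Reconciler`, `ensure_slo`, and the `k8s_uid:` tag are all hypothetical, and the API is an in-memory stand-in rather than the Datadog client.

```python
import threading
import uuid


class FakeSLOAPI:
    """In-memory stand-in for a non-idempotent create endpoint (hypothetical)."""

    def __init__(self):
        self.slos = {}

    def create(self, name, tags):
        # Every call creates a new object, even for identical input.
        slo_id = uuid.uuid4().hex
        self.slos[slo_id] = {"name": name, "tags": list(tags)}
        return slo_id

    def find_by_tag(self, tag):
        return [slo_id for slo_id, slo in self.slos.items() if tag in slo["tags"]]


class Reconciler:
    """Creates at most one SLO per resource UID within this process."""

    def __init__(self, api):
        self.api = api
        self._locks = {}
        self._guard = threading.Lock()

    def _lock_for(self, uid):
        # One lock per resource, so unrelated resources do not block each other.
        with self._guard:
            return self._locks.setdefault(uid, threading.Lock())

    def ensure_slo(self, uid, name, tags):
        marker = f"k8s_uid:{uid}"  # hypothetical marker tag
        with self._lock_for(uid):
            existing = self.api.find_by_tag(marker)
            if existing:
                return existing[0]
            return self.api.create(name, list(tags) + [marker])


if __name__ == "__main__":
    api = FakeSLOAPI()
    reconciler = Reconciler(api)
    workers = [
        threading.Thread(
            target=reconciler.ensure_slo,
            args=("uid-1", "Error SLO for test-xyz", ["env:test"]),
        )
        for _ in range(8)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    print(len(api.slos))
```

The lock only covers concurrent reconciles inside a single process. A lookup-before-create can still produce duplicates across processes, or if the lookup is eventually consistent, which is consistent with the point that duplication cannot be fully ruled out.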