DataDog / datadog-operator

Kubernetes Operator for Datadog Resources
Apache License 2.0

DatadogMonitor finalizer removed on deletion despite monitor still existing within DataDog. #1327

Open cehoffman opened 4 months ago

cehoffman commented 4 months ago

Output of the info page (if this is a bug)

{"Monitor ID":"149735579", "datadogmonitor":"multi/dynamic-pooled-cost-runner-develop-platform-failed", "error":"error deleting monitor: 503 Service Unavailable: upstream connect error or disconnect/reset before headers. retried and the latest reset reason: remote connection failure, transport failure reason: delayed connect error: 111", "level":"ERROR", "logger":"controllers.DatadogMonitor", "msg":"failed to finalize monitor", "ts":"2024-07-24T05:03:08Z"}
{"datadogmonitor":"multi/dynamic-pooled-cost-runner-develop-platform-failed", "level":"INFO", "logger":"controllers.DatadogMonitor", "msg":"Reconciling DatadogMonitor", "ts":"2024-07-24T05:04:08Z"}
{"datadogmonitor":"multi/dynamic-pooled-cost-runner-develop-platform-failed", "level":"INFO", "logger":"controllers.DatadogMonitor", "msg":"Reconciling DatadogMonitor", "ts":"2024-07-24T16:20:25Z"}
{"datadogmonitor":"multi/dynamic-pooled-cost-runner-develop-platform-failed", "level":"INFO", "logger":"controllers.DatadogMonitor", "msg":"Adding Finalizer for the DatadogMonitor", "ts":"2024-07-24T16:20:25Z"}
{"datadogmonitor":"multi/dynamic-pooled-cost-runner-develop-platform-failed", "level":"INFO", "logger":"controllers.DatadogMonitor", "msg":"Reconciling DatadogMonitor", "ts":"2024-07-24T16:22:10Z"}
{"Monitor ID":0, "Monitor Name":"dynamic-pooled-cost-runner-develop-platform-failed", "Monitor Namespace":"multi", "datadogmonitor":"multi/dynamic-pooled-cost-runner-develop-platform-failed", "level":"INFO", "logger":"controllers.DatadogMonitor", "msg":"Added required tags", "ts":"2024-07-24T16:22:10Z"}
{"datadogmonitor":"multi/dynamic-pooled-cost-runner-develop-platform-failed", "error":"error creating monitor: 400 Bad Request: {\"errors\":[\"Duplicate of an existing monitor_id:149735579 org_id:313359\"]}", "level":"ERROR", "logger":"controllers.DatadogMonitor", "msg":"error creating monitor", "ts":"2024-07-24T16:22:10Z"}

These logs start with a DatadogMonitor deletion being processed; the delete call to the DataDog API fails with a 503. Later, the DatadogMonitor resource is recreated, and creation fails because the previous incarnation (monitor 149735579) still exists within DataDog.
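For anyone cleaning up after this by hand, the ID of the orphaned monitor can be pulled out of the "Duplicate of an existing monitor_id" error. Below is a minimal sketch in Go. The helper `duplicateMonitorID` is hypothetical and not part of the operator, and the error format is inferred only from the log line above, so treat it as an observation rather than a documented API contract.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// duplicateRe matches the error text seen in the log above.
// Assumption: the message format is stable; it is not documented here.
var duplicateRe = regexp.MustCompile(`Duplicate of an existing monitor_id:(\d+)`)

// duplicateMonitorID is a hypothetical helper that extracts the ID of the
// already existing monitor from a "duplicate" creation error message.
func duplicateMonitorID(msg string) (int64, bool) {
	m := duplicateRe.FindStringSubmatch(msg)
	if m == nil {
		return 0, false
	}
	id, err := strconv.ParseInt(m[1], 10, 64)
	if err != nil {
		return 0, false
	}
	return id, true
}

func main() {
	// Error text copied from the operator log in this issue.
	msg := `error creating monitor: 400 Bad Request: {"errors":["Duplicate of an existing monitor_id:149735579 org_id:313359"]}`
	if id, ok := duplicateMonitorID(msg); ok {
		fmt.Println("orphaned monitor ID:", id)
	}
}
```

The extracted ID can then be used to delete the leftover monitor in DataDog before the DatadogMonitor is recreated.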

Describe what happened: We have some ephemeral applications that come and go at irregular times. Their related monitors, defined as DatadogMonitor resources, are created and deleted along with the application. Sometimes, if the operator encounters an error response from the DataDog API during deletion, the DatadogMonitor gets garbage collected while the monitor remains within DataDog. Once the application comes back into existence and its DatadogMonitors are recreated, some fail to create because the monitor already exists.

Describe what you expected: The operator should not allow a DatadogMonitor to be garbage collected until the monitor has been confirmed deleted from DataDog. In other words, the finalizer should stay on the resource until the delete call succeeds.
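To illustrate the expected behavior, here is a minimal, self-contained sketch in Go. It is not the operator's actual code. `Resource`, `MonitorAPI`, `finalize`, `flakyAPI`, `ErrNotFound`, and the finalizer name are all hypothetical stand-ins, and only the standard library is used. The point is that the finalizer is removed only after the remote delete succeeds (or the monitor is already gone), and any other error keeps the finalizer and asks for a requeue.

```go
package main

import (
	"errors"
	"fmt"
)

// finalizerName is a hypothetical finalizer string used only in this sketch.
const finalizerName = "example.com/monitor-finalizer"

// ErrNotFound stands in for a "monitor does not exist" response.
var ErrNotFound = errors.New("monitor not found")

// MonitorAPI is a hypothetical abstraction over the remote monitor API.
type MonitorAPI interface {
	DeleteMonitor(id int) error
}

// Resource is a simplified stand-in for a DatadogMonitor custom resource.
type Resource struct {
	MonitorID  int
	Finalizers []string
}

// finalize deletes the remote monitor and removes the finalizer only when
// the delete succeeded or the monitor is already gone. Any other error
// leaves the finalizer in place and requests a requeue.
func finalize(r *Resource, api MonitorAPI) (requeue bool, err error) {
	if err := api.DeleteMonitor(r.MonitorID); err != nil && !errors.Is(err, ErrNotFound) {
		return true, fmt.Errorf("error deleting monitor %d: %w", r.MonitorID, err)
	}
	kept := make([]string, 0, len(r.Finalizers))
	for _, f := range r.Finalizers {
		if f != finalizerName {
			kept = append(kept, f)
		}
	}
	r.Finalizers = kept
	return false, nil
}

// flakyAPI simulates a remote API that fails a fixed number of times.
type flakyAPI struct{ failures int }

func (f *flakyAPI) DeleteMonitor(id int) error {
	if f.failures > 0 {
		f.failures--
		return errors.New("503 Service Unavailable")
	}
	return nil
}

func main() {
	r := &Resource{MonitorID: 149735579, Finalizers: []string{finalizerName}}
	api := &flakyAPI{failures: 1}
	for attempt := 1; ; attempt++ {
		requeue, err := finalize(r, api)
		fmt.Printf("attempt %d: requeue=%v finalizers=%d err=%v\n",
			attempt, requeue, len(r.Finalizers), err)
		if !requeue {
			break
		}
	}
}
```

With this shape, a transient 503 like the one in the logs would leave the resource in a terminating state and retried, instead of letting it be garbage collected while the monitor still exists.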

Steps to reproduce the issue: We delete 51 monitors at the same time as part of the application teardown. It is unknown if this burst is an issue for the DataDog API. Creating and deleting a batch of monitors with unchanging details will cause this to happen intermittently. Only a few, > 5, will fail to delete at DataDog.

Additional environment details (Operating System, Cloud provider, etc): Google Cloud GKE 1.29.6

cehoffman commented 3 months ago

We've had to disable operator managed monitors in our more volatile environments due to this issue.

fanny-jiang commented 2 months ago

Hi @cehoffman, thanks for reporting this issue. I've created a card in our backlog to address this.