DataDog / terraform-provider-datadog

Terraform Datadog provider
https://www.terraform.io/docs/providers/datadog/
Mozilla Public License 2.0

Handle dependency between monitors, SLOs and dashboards #667

Open · Tony-Proum opened this issue 3 years ago

Tony-Proum commented 3 years ago

Terraform Version

Terraform v0.12.29

Affected Resource(s)

datadog_monitor
datadog_service_level_objective
datadog_dashboard

Terraform Configuration Files

resource "datadog_monitor" "availability" {
  name               = "Availability"
  type               = "metric alert"
  message            = "Monitor triggered."
  query              = "<any request>"
  lifecycle {
    ignore_changes = [silenced]
  }
}

resource "datadog_service_level_objective" "availability" {
  name        = "Availability"
  type        = "monitor"
  description = "Availability"
  monitor_ids = [
    datadog_monitor.availability.id
  ]
  thresholds {
    timeframe = "7d"
    target    = 99.95
    warning   = 99.96
  }
}

resource "datadog_dashboard" "service_level_objectives" {
  layout_type = "ordered"
  title       = "Service Level Objectives"
  widget {
    service_level_objective_definition {
      title             = "Availability SLO"
      view_type         = "detail"
      slo_id            = datadog_service_level_objective.availability.id
      show_error_budget = true
      view_mode         = "overall"
      time_windows      = ["7d", "30d", "90d"]
    }
  }
}

Actual Behavior

Modifying the monitor type is not possible in this configuration without a workaround: when the monitor's type is changed, Terraform needs to destroy the resource and recreate it (possibly due to an API restriction? The documentation doesn't mention it, but I didn't verify). The plan shows that the replacement is needed, and it makes sense, as illustrated here:

  # datadog_monitor.availability must be replaced
-/+ resource "datadog_monitor" "availability" {
      ~ evaluation_delay    = 0 -> (known after apply)
      ~ id                  = "269981" -> (known after apply)
... TRUNCATED ...
      - timeout_h           = 0 -> null
      ~ type                = "query alert" -> "log alert" # forces replacement
    }

  # datadog_service_level_objective.availability will be updated in-place
  ~ resource "datadog_service_level_objective" "availability" {
... TRUNCATED ...
      ~ monitor_ids = [
          - 269981,
        ] -> (known after apply)
... TRUNCATED ...
}

But at apply time, Terraform throws this type of error:

Error: error deleting monitor: 400 Bad Request: {"errors":["monitor [269981,Availability ] is referenced in slos: [...,Availability]"]}

Expected Behavior

In this particular case, a change to the datadog_service_level_objective monitor_ids attribute could force Terraform to replace the resource (this seems like the simplest solution for this problem).

Workaround

If this type of error is thrown, changing the resource name, e.g. from resource "datadog_service_level_objective" "availability" to resource "datadog_service_level_objective" "availability-v2", forces replacement of the resource and allows Terraform to handle this case.
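For example, a minimal sketch of that rename, keeping the same SLO definition as above (only the Terraform resource label changes, which triggers a destroy/create):

# Same definition as before; only the resource label changed, so Terraform
# plans a replacement of the SLO instead of trying to update it in place.
resource "datadog_service_level_objective" "availability-v2" {
  name        = "Availability"
  type        = "monitor"
  description = "Availability"
  monitor_ids = [
    datadog_monitor.availability.id
  ]
  thresholds {
    timeframe = "7d"
    target    = 99.95
    warning   = 99.96
  }
}

Note that the slo_id reference in the dashboard widget then has to be updated to the new resource label as well.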

Steps to Reproduce

  1. Use the example configuration above.

  2. terraform apply

  3. Modify the datadog_monitor type or the datadog_service_level_objective type.

  4. terraform apply

References

nmuesch commented 3 years ago

For transparency to anyone looking, conversation is happening in PR #668 🙂

nmuesch commented 3 years ago

Hey, moving the conversation back to this issue instead of the PR so I can better understand the issue you're facing and the proposed solution 🙂

I think I may have misunderstood the original issue a bit, apologies for that. To confirm, you're right that changing the monitor type will require the monitor to be deleted and recreated, and this is imposed at the API layer.

To restate the issue to confirm I follow, you'd like to update a monitor type, and have that new ID be referenced in the SLO definition. However, the monitor can't be deleted because it's a part of the SLO.

Given this, I was testing out the force_delete workflow, and I believe I understand your note about the usability workflow. You'd need to apply a configuration with the force_delete flag set on each monitor before actually changing the monitor definition to be of another type; otherwise, the monitor deletion still fails with a "still referenced in SLO" error. But this should allow you to modify the monitor type without any provider changes.
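For illustration, a minimal sketch of that two-step flow, reusing the example monitor from the original report (force_delete is an existing argument on the datadog_monitor resource):

# Step 1: apply with force_delete enabled, changing nothing else.
# Step 2: change the type (and query) and apply again; the monitor can now
# be deleted and recreated even though the SLO still references it.
resource "datadog_monitor" "availability" {
  name         = "Availability"
  type         = "metric alert" # changed to e.g. "log alert" in step 2
  message      = "Monitor triggered."
  query        = "<any request>"
  force_delete = true
}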

I believe your proposal is to ForceDelete the SLO if the underlying monitor_id changes. Does this solve the issue you're facing? I believe this would require the SLO to be deleted before any lifecycle changes happened to the monitor.

Tony-Proum commented 3 years ago

No worries, I'm not the best at writing issues or explaining things clearly without some kind of support (images, schemas, and other tools like that).

Forcing the deletion of the SLO solves the issue in all my tests; I hope to have covered most of the existing cases, and the PR contains an automated test for this workflow. The only thing I was worried about is the data that could be lost by this deletion. But as far as I could tell from my experiments, I didn't notice any issue, as all the data displayed in an SLO seems to be stored on the monitor resource (correct me if I'm wrong).

So, seems ok

Tony-Proum commented 3 years ago

I updated the PR with some more information: a link exists between dashboards and SLOs too. See more details here: https://github.com/DataDog/terraform-provider-datadog/pull/668#issuecomment-715238587

nmuesch commented 3 years ago

Hey @Tony-Proum again, I apologize for my delay on this one. I just wanted to leave a note to let you know I'm still taking a look here! I think there may be a few issues being hit and I am just trying to get a complete picture.

nmuesch commented 3 years ago

Hey @Tony-Proum I've been playing around with this some more and wanted to follow up. As mentioned in my previous note, I think there may be multiple issues at play here, and I want to ensure we can provide a solution while avoiding introducing any other edge cases or complexities.

I do have reservations about adding "forceNew" in additional places. If it's indeed the only viable option, it's something we can definitely add, but I think it should be one of the later solutions attempted. There have been edge cases in the past where forceNew was triggered in error on the monitor resource (for example, when switching between metric and query types) that caused users to lose monitor history unnecessarily.

I've been looking into this and testing with the config you provided. As you noted, there is a link both between monitors and the SLO, and between the SLO and dashboards. Both the monitor and the SLO have a force flag at the API layer that was introduced to allow this to be bypassed. (This is available today in the Go client, but it's not yet in Terraform for the SLO resource - https://github.com/DataDog/datadog-api-client-go/blob/master/api/v1/datadog/api_service_level_objectives.go#L367)

To propose an alternative: would having the force flag available in Terraform on the SLO resource also address your workflow? The steps there would be to apply the force = true field to the monitor and SLO resources prior to the changes, if you're comfortable with deleting a monitor/SLO that's used somewhere. Let me know what you think, and I appreciate you working to get this issue resolved!
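A rough sketch of what that could look like, assuming a hypothetical force_delete argument on the SLO resource mirroring the monitor's (not available in the provider at the time of this comment):

resource "datadog_service_level_objective" "availability" {
  name        = "Availability"
  type        = "monitor"
  description = "Availability"
  monitor_ids = [datadog_monitor.availability.id]
  thresholds {
    timeframe = "7d"
    target    = 99.95
    warning   = 99.96
  }
  # Hypothetical argument exposing the API's force flag, analogous to
  # force_delete on the monitor resource.
  force_delete = true
}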

Tony-Proum commented 3 years ago

@nmuesch I would prefer not to use force = true: when using multiple Terraform workspaces to split the configuration into small chunks, force deleting could lead to unexpected results (dashboards without an SLO, or an SLO without a monitor). I still "prefer" the forceNew option to safely handle changes in the infrastructure.

Also, I understand that this case is not really easy. I'm also not at ease with the potential edge cases that this solution could introduce, but each time I had to change some underlying resources (e.g. monitors or SLOs of an existing dashboard), the best way I found to force Terraform to do what I expect is to change all the resource names in the upper resources (in order to manually trigger a ForceNew).

I also tried to figure out where the data is stored, and it seems to be stored at the lower level (in the monitor resource or directly in metrics); dashboards and SLOs are just different viewpoints on that data. This hypothesis still has to be verified, but if so, it may not be a big risk to use the forceNew approach, and I really think this solution is the most compliant with the "Terraform way" of handling this type of dependency.

cullenmcdermott commented 3 years ago

Hi, I'm wondering if there's any update on this? We are hitting a similar issue when Terraform tries to delete a monitor that is still associated with an SLO. In our case, though, it happens when we test our Terraform modules: the test spins up the infra and all associated pieces (including Datadog monitors and SLO dashboards), and then we hit the error when destroying the test resources.

therve commented 3 years ago

The recommended approach is to add the force_delete flag on the resources. This way we skip the server-side validation that happens when deleting resources.

seanyu4296 commented 3 years ago

Do you mean force_delete @therve? https://registry.terraform.io/providers/DataDog/datadog/2.26.1/docs/resources/monitor

therve commented 3 years ago

Yes sorry!

thomasbrezinski commented 3 years ago

Hey there, I wanted to mention that this also applies to any composite monitor relationship.

Example:

resource "datadog_monitor" "composite_example" {
  type    = "composite"
  name    = "Example composite monitor"
  query   = "19443597 && 19443598"
  message = "composite monitor"
}

Error: error deleting monitor: 400 Bad Request: {"errors": ["monitor [19443597,Monitor example] is referenced in composite monitors: [37050226,Example composite monitor]"]}
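If the force_delete flag discussed above covers this case too, a minimal sketch (assuming the referenced monitor is managed in the same configuration; the definition below is hypothetical, standing in for monitor 19443597):

# Sketch only: with force_delete set, the referenced monitor can be deleted
# even while the composite monitor still points at it.
resource "datadog_monitor" "example" {
  name         = "Monitor example"
  type         = "metric alert"
  message      = "Example monitor"
  query        = "<any request>"
  force_delete = true
}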