coder / coder

Provision remote development environments via Terraform
https://coder.com
GNU Affero General Public License v3.0

OOM/OOD notifications documentation #16581

Closed stirby closed 3 weeks ago

stirby commented 1 month ago

We are adding native notifications that alert users when their workspace is running high on memory or disk usage, so they can act before the agent disconnects due to an out-of-memory (OOM) or out-of-disk (OOD) condition.

This notification requires more configuration than most opt-in alerts. We should inform users before its release at the start of March.

Resource alerting notifications allow template admins to set "high water mark" thresholds for memory and volume consumption in Terraform. When a threshold is exceeded in a workspace created from that template, the owner of the workspace is notified. To enable OOM/OOD notifications on a template, use the resources_monitoring[1] block on the coder_agent[2] resource in our Terraform provider. You can specify one or more volumes to monitor for OOD alerts; OOM alerts are reported per agent.

Here's an example configuration that warns the user when memory usage exceeds 90%, or when disk usage exceeds 80% on /volume1 or 95% on /volume2:

resource "coder_agent" "main" {
  arch = data.coder_provisioner.dev.arch
  os   = data.coder_provisioner.dev.os
  resources_monitoring {
    memory {
      enabled   = true
      threshold = 90
    }
    volume {
      path      = "/volume1"
      enabled   = true
      threshold = 80
    }
    volume {
      path      = "/volume2"
      enabled   = true
      threshold = 95
    }
  }
}
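
If a template needs different thresholds per deployment, the same block could be driven by Terraform variables instead of hard-coded values. The following is only a sketch of an alternative to the example above, not part of the provider documentation: the variable names memory_threshold and volume_thresholds are hypothetical, and the 0 to 100 validation assumes thresholds are percentages, as in the example.

```hcl
# Assumed to match the provisioner data source referenced in the example above.
data "coder_provisioner" "dev" {}

# Hypothetical variable: memory threshold as a percentage.
variable "memory_threshold" {
  type    = number
  default = 90

  validation {
    condition     = var.memory_threshold >= 0 && var.memory_threshold <= 100
    error_message = "memory_threshold must be a percentage between 0 and 100."
  }
}

# Hypothetical variable: map of volume path => threshold percentage.
variable "volume_thresholds" {
  type = map(number)
  default = {
    "/volume1" = 80
    "/volume2" = 95
  }
}

resource "coder_agent" "main" {
  arch = data.coder_provisioner.dev.arch
  os   = data.coder_provisioner.dev.os

  resources_monitoring {
    memory {
      enabled   = true
      threshold = var.memory_threshold
    }

    # Emit one volume block per entry in the map.
    dynamic "volume" {
      for_each = var.volume_thresholds
      content {
        path      = volume.key
        enabled   = true
        threshold = volume.value
      }
    }
  }
}
```

With the defaults shown, this should render the same three monitors as the hard-coded example; adding or removing a map entry adds or removes a monitored volume.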

[1] https://registry.terraform.io/providers/coder/coder/latest/docs/resources/agent#resources_monitoring-1
[2] https://registry.terraform.io/providers/coder/coder/latest/docs/resources/agent

stirby commented 1 month ago

Related https://github.com/coder/terraform-provider-coder/pull/331