GoogleCloudPlatform / cloud-foundation-fabric

End-to-end modular samples and landing zones toolkit for Terraform on GCP.
Apache License 2.0

Question: can you use health checks / outlier detection with serverless NEGs? #1783

Closed: yardenas closed this issue 10 months ago

yardenas commented 11 months ago

Hi everyone,

Some context: I'm trying to set up a multi-region deployment for Cloud Run, mainly for the purpose of automatic failover. My understanding is that whenever one of the NEGs (the one for a specific region) becomes unhealthy, traffic is redirected to the other NEGs.
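For reference, a minimal sketch of the per-region serverless NEGs this kind of failover setup is built on, written with the raw google provider resource instead of the Fabric module. It assumes `var.project_id`, `var.regions` (a map from key to region) and `var.run_svc_name` as used in the code below, and that each Cloud Run service is named `"${var.run_svc_name}-${each.key}"`; the NEG names are hypothetical.

```hcl
# One serverless NEG per region, each pointing at that region's Cloud Run
# service. Assumes var.regions is a map such as { ew1 = "europe-west1" }.
resource "google_compute_region_network_endpoint_group" "cloudrun" {
  for_each              = var.regions
  project               = var.project_id
  name                  = "cloudrun-neg-${each.key}" # hypothetical naming
  region                = each.value
  network_endpoint_type = "SERVERLESS"

  cloud_run {
    # Assumed to match the service name set in the cloud_run module call.
    service = "${var.run_svc_name}-${each.key}"
  }
}
```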

Now, to mark a NEG as unhealthy, I have two options:

  1. Use health checks
  2. Use outlier detection.

The documentation states:

Health checks are not supported for serverless backends. Therefore, backend services that contain serverless NEG backends cannot be configured with health checks. However, you can optionally enable outlier detection to identify unhealthy serverless services and route new requests to a healthy serverless service.

So it seems that I cannot use health checks but only outlier detection.

Now for the code (a modified version of the serverless Cloud Run blueprint, adapted to support multiple regions):

resource "random_uuid" "cloudrun_revision_id" {
  keepers = {
    first = timestamp()
  }
}

locals {
  gclb_create = var.custom_domain == null ? false : true
}

# Cloud Run service
module "cloud_run" {
  for_each      = var.regions
  source        = "github.com/GoogleCloudPlatform/cloud-foundation-fabric.git//modules/cloud-run?ref=v25.0.0"
  project_id    = var.project_id
  name          = "${var.run_svc_name}-${each.key}"
  revision_name = "${var.run_svc_name}-${random_uuid.cloudrun_revision_id.result}"
  region        = each.value
  containers = {
    default = {
      image = var.container_image
      options = {
        command  = null
        args     = null
        env      = {}
        env_from = null
      }
      ports         = null
      resources     = null
      volume_mounts = null
    }
  }
  iam = {
    "roles/run.invoker" = var.invoker_group
  }
  revision_annotations = {
    autoscaling         = var.autoscaling
    cloudsql_instances  = var.cloudsql_instances
    vpcaccess_connector = var.vpcaccess_connectors[each.key]
    vpcaccess_egress    = "all-traffic"
  }
  ingress_settings       = var.ingress_settings
  service_account_create = true
}

# Reserved static IP for the Load Balancer
resource "google_compute_global_address" "default" {
  count   = local.gclb_create ? 1 : 0
  project = var.project_id
  name    = "glb-ip"
}

resource "google_compute_ssl_policy" "profile" {
  name            = "prod-ssl-policy"
  profile         = "MODERN"
  min_tls_version = "TLS_1_2"
}

# Global L7 HTTPS Load Balancer in front of Cloud Run
module "glb" {
  source     = "github.com/GoogleCloudPlatform/cloud-foundation-fabric.git//modules/net-lb-app-ext?ref=v25.0.0"
  count      = local.gclb_create ? 1 : 0
  project_id = var.project_id
  name       = "external-load-balancer"
  address    = google_compute_global_address.default[0].address
  backend_service_configs = {
    default = {
      backends = [
        for k, v in var.regions : {
          backend = k
        }
      ]
      health_checks = []
      outlier_detection = {
        consecutive_errors = 10
      }
      port_name       = "http"
      security_policy = try(google_compute_security_policy.policy[0].name, null)
      iap_config = try({
        oauth2_client_id     = google_iap_client.iap_client[0].client_id,
        oauth2_client_secret = google_iap_client.iap_client[0].secret
      }, null)
    }
  }
  health_check_configs = {}
  neg_configs = {
    for k, v in var.regions :
    k => {
      cloudrun = {
        region = v
        target_service = {
          name = module.cloud_run[k].service_name
        }
      }
    }
  }
  protocol = "HTTPS"
  https_proxy_config = {
    ssl_policy = google_compute_ssl_policy.profile.self_link
  }
  ssl_certificates = {
    managed_configs = {
      default = {
        domains = [var.custom_domain]
      }
    }
  }
}

# Cloud Armor configuration
resource "google_compute_security_policy" "policy" {
  count   = local.gclb_create && var.security_policy.enabled ? 1 : 0
  name    = "cloud-run-policy"
  project = var.project_id
  rule {
    action   = "deny(403)"
    priority = 1000
    match {
      versioned_expr = "SRC_IPS_V1"
      config {
        src_ip_ranges = var.security_policy.ip_blacklist
      }
    }
    description = "Deny access to list of IPs"
  }
  rule {
    action   = "deny(403)"
    priority = 900
    match {
      expr {
        expression = "request.path.matches(\"${var.security_policy.path_blocked}\")"
      }
    }
    description = "Deny access to specific URL paths"
  }
  rule {
    action   = "allow"
    priority = "2147483647"
    match {
      versioned_expr = "SRC_IPS_V1"
      config {
        src_ip_ranges = ["*"]
      }
    }
    description = "Default rule"
  }
}

# Identity-Aware Proxy (IAP) or OAuth brand (see OAuth consent screen)
# Note:
# Only "Organization Internal" brands can be created programmatically
# via API. To convert it into an external brand please use the GCP
# Console.
# Brands can only be created once for a Google Cloud project and the
# underlying Google API doesn't support DELETE or PATCH methods.
# Destroying a Terraform-managed Brand will remove it from state but
# will not delete it from Google Cloud.
resource "google_iap_brand" "iap_brand" {
  count   = var.iap.enabled ? 1 : 0
  project = var.project_id
  # Support email displayed on the OAuth consent screen. The caller must be
  # the user with the associated email address, or if a group email is
  # specified, the caller can be either a user or a service account which
  # is an owner of the specified group in Cloud Identity.
  support_email     = var.iap.support_email
  application_title = var.iap.app_title
}

# IAP owned OAuth2 client
# Note:
# Only internal org clients can be created via declarative tools.
# External clients must be manually created via the GCP console.
# Warning:
# All arguments including secret will be stored in the raw state as plain-text.
resource "google_iap_client" "iap_client" {
  count        = var.iap.enabled ? 1 : 0
  display_name = var.iap.oauth2_client_name
  brand        = google_iap_brand.iap_brand[0].name
}

# IAM policy for IAP
# For simplicity we use the same email as support_email and authorized member
resource "google_iap_web_iam_member" "iap_iam" {
  count   = var.iap.enabled ? 1 : 0
  project = var.project_id
  role    = "roles/iap.httpsResourceAccessor"
  member  = var.iap.email
}

resource "google_project_service_identity" "iap_sa" {
  count    = var.iap.enabled ? 1 : 0
  provider = google-beta
  project  = var.project_id
  service  = "iap.googleapis.com"
}

Whenever I run this code I get the following error:

Invalid value for field 'resource.outlierDetection': '{  "consecutiveErrors": 10,  "maxEjectionPercent": 10,  "enforcingConsecutiveErrors": 100,  "enforci...'. Outlier detection is not supported., invalid

What am I missing here? Any help would be very much appreciated!

ludoo commented 11 months ago

@apichick is probably the best person to chime in on this.

ludoo commented 11 months ago

Can you paste the full error message, including the source file and line number?

yardenas commented 11 months ago
│ Error: Error creating BackendService: googleapi: Error 400: Invalid value for field 'resource.outlierDetection': '{  "consecutiveErrors": 10,  "maxEjectionPercent": 10,  "enforcingConsecutiveErrors": 100,  "enforci...'. Outlier detection is not supported., invalid
│ 
│   with module.application.module.glb[0].google_compute_backend_service.default["default"],
│   on .terraform/modules/application.glb/modules/net-lb-app-ext/backend-service.tf line 44, in resource "google_compute_backend_service" "default":
│   44: resource "google_compute_backend_service" "default" {
│ 
╵

Hope this helps

ludoo commented 11 months ago

You're using a Global Load Balancer; this might be the reason.


yardenas commented 11 months ago

I see, thanks for the info! @ludoo, are you aware of any other way to achieve automatic failover?

Thanks a lot for helping! 💪

ludoo commented 11 months ago

@apichick was chatting with me about getting this working with GLB; she might have code for that. Let's wait until she has time to chime in. :)

czka commented 10 months ago

FWIW, I guess that once outlier detection for serverless NEGs reaches GA (it's pre-GA at the time of writing), the outlier_detection block of google_compute_backend_service should support the EXTERNAL_MANAGED load balancing scheme as well (at least as long as IAP isn't enabled). See: https://github.com/hashicorp/terraform-provider-google/issues/15210
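A minimal sketch of the configuration described above, assuming the raw `google_compute_backend_service` resource rather than the net-lb-app-ext module, the `EXTERNAL_MANAGED` scheme instead of the classic `EXTERNAL` one, and a hypothetical `var.serverless_neg_ids` list holding the self links of the per-region serverless NEGs. Whether the API accepts the `outlier_detection` block depends on the pre-GA status mentioned in the comment above, and the thresholds are illustrative only.

```hcl
# Hypothetical inputs for this sketch.
variable "project_id" {
  type = string
}

variable "serverless_neg_ids" {
  description = "Self links of the per-region serverless NEGs (hypothetical)."
  type        = list(string)
}

resource "google_compute_backend_service" "cloudrun" {
  project               = var.project_id
  name                  = "cloudrun-backend"
  load_balancing_scheme = "EXTERNAL_MANAGED" # not the classic "EXTERNAL" scheme

  # No health_checks: they are not supported for serverless NEG backends.
  dynamic "backend" {
    for_each = toset(var.serverless_neg_ids)
    content {
      group = backend.value
    }
  }

  # Illustrative thresholds: eject a backend after 5 consecutive errors
  # (5xx responses), sweep every 10s, eject for at least 30s, and never
  # eject more than 50% of the backends at once.
  outlier_detection {
    consecutive_errors   = 5
    max_ejection_percent = 50
    interval {
      seconds = 10
    }
    base_ejection_time {
      seconds = 30
    }
  }
}
```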

ludoo commented 10 months ago

Closing this for now as I don't think it's a module issue from our side. Feel free to reopen if you still want to discuss, or if new evidence emerges.