Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

AKS provision failed with error code: OverlaymgrReconcileError #1714

Closed · smartpcr closed this issue 3 years ago

smartpcr commented 4 years ago

What happened: I was trying to provision an AKS cluster using Terraform, and the cluster was created successfully. I then deleted the cluster and tried to create it again, and the following error was returned:

Error waiting for creation of Managed Kubernetes Cluster "xddev-xd-aks" (Resource Group "xiaodong-rg"): Code="OverlaymgrReconcileError" Message="We are unable to serve this request due to an internal error, Correlation ID: c9eb5b90-d756-5857-b5d8-54b55b326f47, Operation ID: 0bff1ac9-b6fd-49f8-b181-754fff567e44, Timestamp: 2020-07-07T22:15:01Z."
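For reference, the provisioning state of the failed cluster can be checked from the CLI before retrying (a minimal sketch, assuming the Azure CLI is logged in to the same subscription and using the resource names from the error above):

```bash
# Show whether the cluster from the error above is stuck in a Failed state
az aks show \
  --resource-group xiaodong-rg \
  --name xddev-xd-aks \
  --query provisioningState \
  --output tsv
```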

What you expected to happen: The cluster should be created without error.

How to reproduce it (as minimally and precisely as possible):

1. main.tf

```terraform
terraform {
  backend "azurerm" {}
}

module "provider" {
  source = "github.com/smartpcr/bedrock/cluster/azure/provider"
}

data "azurerm_client_config" "current" {}

module "aks-gitops" {
  source = "github.com/smartpcr/bedrock/cluster/azure/aks-gitops"

  # log analytics
  log_analytics_resource_group_name     = var.log_analytics_resource_group_name
  log_analytics_resource_group_location = var.log_analytics_resource_group_location
  log_analytics_name                    = var.log_analytics_name

  # aks cluster
  subscription_id             = var.aks_subscription_id
  ssh_public_key              = var.ssh_public_key
  aks_resource_group_location = var.aks_resource_group_location
  aks_resource_group_name     = var.aks_resource_group_name
  service_principal_id        = var.service_principal_id
  service_principal_secret    = var.service_principal_secret
  server_app_id               = var.server_app_id
  server_app_secret           = var.server_app_secret
  client_app_id               = var.client_app_id
  tenant_id                   = var.tenant_id
  agent_vm_count              = var.agent_vm_count
  agent_vm_size               = var.agent_vm_size
  cluster_name                = var.cluster_name
  kubernetes_version          = var.kubernetes_version
  dns_prefix                  = var.dns_prefix
  service_cidr                = var.service_cidr
  dns_ip                      = var.dns_ip
  docker_cidr                 = var.docker_cidr
  oms_agent_enabled           = var.oms_agent_enabled
  dashboard_cluster_role      = var.dashboard_cluster_role

  # dev-space
  enable_dev_spaces = var.enable_dev_spaces
  dev_space_name    = var.dev_space_name

  # aks role assignment
  aks_owners             = var.aks_owners
  aks_contributors       = var.aks_contributors
  aks_readers            = var.aks_readers
  aks_owner_groups       = var.aks_owner_groups
  aks_contributor_groups = var.aks_contributor_groups
  aks_reader_groups      = var.aks_reader_groups

  # flux
  enable_flux               = var.enable_flux
  flux_recreate             = var.flux_recreate
  kubeconfig_recreate       = var.kubeconfig_recreate
  gc_enabled                = var.gc_enabled
  acr_enabled               = var.acr_enabled
  gitops_ssh_url            = var.gitops_ssh_url
  gitops_ssh_key            = var.gitops_ssh_key
  gitops_path               = var.gitops_path
  gitops_poll_interval      = var.gitops_poll_interval
  gitops_url_branch         = var.gitops_url_branch
  create_helm_operator      = var.create_helm_operator
  create_helm_operator_crds = var.create_helm_operator_crds
  git_label                 = var.git_label
}
```


2. aks terraform module
``` terraform
locals {
  msi_identity_type = "SystemAssigned"
}

module "azure-provider" {
  source = "../provider"
}

provider "azurerm" {
  subscription_id = var.subscription_id
}

resource "random_id" "workspace" {
  keepers = {
    group_name = var.log_analytics_resource_group_name
  }

  byte_length = 8
}

resource "azurerm_log_analytics_workspace" "workspace" {
  name                = "bedrock-k8s-workspace-${random_id.workspace.hex}"
  location            = var.log_analytics_resource_group_location
  resource_group_name = var.log_analytics_resource_group_name
  sku                 = "PerGB2018"
}

resource "azurerm_log_analytics_solution" "solution" {
  solution_name         = "ContainerInsights"
  location              = var.log_analytics_resource_group_location
  resource_group_name   = var.log_analytics_resource_group_name
  workspace_resource_id = azurerm_log_analytics_workspace.workspace.id
  workspace_name        = azurerm_log_analytics_workspace.workspace.name

  plan {
    publisher = "Microsoft"
    product   = "OMSGallery/ContainerInsights"
  }
}

resource "azurerm_virtual_network" "vnet" {
  name                = "aks-vnet"
  location            = var.aks_resource_group_location
  address_space       = [var.address_space]
  resource_group_name = var.aks_resource_group_name
  dns_servers         = []
  tags = {
    environment = "aks-vnet"
  }
}

resource "azurerm_subnet" "subnet" {
  name                 = "aks-subnet"
  virtual_network_name = "aks-vnet"
  resource_group_name  = var.aks_resource_group_name
  address_prefix       = var.subnet_prefix
  service_endpoints    = []
  depends_on           = [azurerm_virtual_network.vnet]
}

resource "azurerm_kubernetes_cluster" "cluster" {
  name                            = var.cluster_name
  location                        = var.aks_resource_group_location
  resource_group_name             = var.aks_resource_group_name
  dns_prefix                      = var.dns_prefix
  kubernetes_version              = var.kubernetes_version
  node_resource_group             = var.node_resource_group
  api_server_authorized_ip_ranges = var.api_auth_ips

  linux_profile {
    admin_username = var.admin_user

    ssh_key {
      key_data = var.ssh_public_key
    }
  }

  # The windows_profile block should be optional.  However, there is a bug in the Terraform Azure provider
  # that does not treat this block as optional -- even if no windows nodes are used.  If not present, any
  # change that should result in an update to the cluster causes a replacement.
  windows_profile {
    admin_username = "azureuser"
    admin_password = "Adm1nPa33++"
  }

  default_node_pool {
    name            = "default"
    node_count      = var.agent_vm_count
    vm_size         = var.agent_vm_size
    os_disk_size_gb = 30
    vnet_subnet_id  = azurerm_subnet.subnet.id
  }

  network_profile {
    network_plugin     = var.network_plugin
    network_policy     = var.network_policy
    service_cidr       = var.service_cidr
    dns_service_ip     = var.dns_ip
    docker_bridge_cidr = var.docker_cidr
  }

  role_based_access_control {
    enabled = true

    azure_active_directory {
      server_app_id     = var.server_app_id
      server_app_secret = var.server_app_secret
      client_app_id     = var.client_app_id
    }
  }

  dynamic "service_principal" {
    for_each = ! var.msi_enabled && var.service_principal_id != "" ? [{
      client_id     = var.service_principal_id
      client_secret = var.service_principal_secret
    }] : []
    content {
      client_id     = service_principal.value.client_id
      client_secret = service_principal.value.client_secret
    }
  }

  addon_profile {
    oms_agent {
      enabled                    = var.oms_agent_enabled
      log_analytics_workspace_id = azurerm_log_analytics_workspace.workspace.id
    }

    http_application_routing {
      enabled = var.enable_http_application_routing
    }

    kube_dashboard {
      enabled = true
    }

    azure_policy {
      enabled = true
    }
  }

  # This dynamic block enables managed service identity for the cluster
  # in the case that the following holds true:
  #   1: the msi_enabled input variable is set to true
  dynamic "identity" {
    for_each = var.msi_enabled ? [local.msi_identity_type] : []
    content {
      type = identity.value
    }
  }

  tags = var.tags

  depends_on = [azurerm_subnet.subnet]
}

data "external" "msi_object_id" {
  depends_on = [azurerm_kubernetes_cluster.cluster]
  program = [
    "${path.module}/aks_msi_client_id_query.sh",
    var.cluster_name,
    var.cluster_name,
    var.subscription_id
  ]
}
```
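The referenced `aks_msi_client_id_query.sh` is not included in the issue. Below is a hypothetical sketch of what such a script might look like; the argument order and the `msi_object_id` output key are assumptions, not the actual bedrock script:

```bash
#!/usr/bin/env bash
# Hypothetical sketch -- the real aks_msi_client_id_query.sh is not shown in this issue.
# Terraform's external data source expects a single JSON object of strings on stdout.
set -euo pipefail

cluster_name="$1"
resource_group="$2"   # the module above passes the cluster name for both arguments
subscription_id="$3"

# Look up the object ID of the cluster's kubelet managed identity (assumes MSI is enabled).
object_id=$(az aks show \
  --name "$cluster_name" \
  --resource-group "$resource_group" \
  --subscription "$subscription_id" \
  --query "identityProfile.kubeletidentity.objectId" \
  --output tsv)

# Emit JSON for data.external.msi_object_id to consume.
printf '{"msi_object_id": "%s"}\n' "$object_id"
```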

Anything else we need to know?:

Environment:

krishnadce commented 4 years ago

I also got the same error in the East US region with Kubernetes version 1.17.7. I then ran it again with 1.16.10 and it worked.

Error: waiting for creation of Managed Kubernetes Cluster "AKS-CLUSTEREASTUS" (Resource Group "RG-AKS-CLUSTEREASTUS"): Code="OverlaymgrReconcileError" Message="We are unable to serve this request due to an internal error, Correlation ID: <REDACTED>, Operation ID: <REDACTED>, Timestamp: 2020-07-10T10:10:11Z."
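If the failure is version-related, as the workaround above suggests, the versions currently offered in a region can be listed before pinning `kubernetes_version` (a quick check, assuming the Azure CLI; `eastus` stands in for the affected region):

```bash
# List the AKS versions (and upgrade paths) currently available in East US
az aks get-versions --location eastus --output table
```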

ohorvath commented 4 years ago

Same error here for 1.17.7.

github-actions[bot] commented 4 years ago

Action required from @Azure/aks-pm

alexeldeib commented 4 years ago

This is an internal error code. I'd recommend opening a support ticket with the unredacted operation/correlation IDs so support can see the full error in the backend.
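Before opening the ticket, the activity-log entries behind the correlation ID can be pulled so support gets the exact failed operation (a minimal sketch using the correlation ID from the original report; assumes the Azure CLI):

```bash
# List the activity-log events associated with the failed AKS create
az monitor activity-log list \
  --correlation-id c9eb5b90-d756-5857-b5d8-54b55b326f47 \
  --output table
```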

TomGeske commented 4 years ago

@smartpcr and @ohorvath: did you manage to open a support case with us?

ghost commented 4 years ago

Case is being worked with Microsoft Support; adding the stale label for automatic closure if no other reports are added.


eehret commented 4 years ago

Apologies if I'm not supposed to chime in at this point, but this is happening to me too in the Canada Central region.

Terraform v0.13.4

ohorvath commented 4 years ago

Happening again for us too. Today and yesterday at least 10 clusters failed with this error message. No Terraform involved, mostly in the Central US region.

ghost commented 4 years ago

Case is being worked with Microsoft Support; adding the stale label for automatic closure if no other reports are added.

ghost commented 3 years ago

This issue will now be closed because it hasn't had any activity for 15 days after being marked stale. @smartpcr, feel free to comment again within the next 7 days to reopen it, or open a new issue after that time if you still have a question, issue, or suggestion.