Azure / kube-egress-gateway

kube-egress-gateway provides fixed egress IPs for Kubernetes workloads running on Azure.
MIT License

Failed to update VMSS sku parameter not set #645

Closed wizedkyle closed 3 months ago

wizedkyle commented 3 months ago

Hey all,

Sorry for the second issue, I have gotten past the auth issues in #644 by using App Registration client ID and secret (would still prefer to use managed identities). However, when the controller is reconciling the StaticGatewayConfiguration it is getting a 400 from Azure when trying to update the VMSS (I have redacted the resource group name and subscription ID):

failed to update vmss(aks-egress-24028985-vmss): PUT https://management.azure.com/subscriptions/XXXX/resourceGroups/XXXX/providers/Microsoft.Compute/virtualMachineScaleSets/aks-egress-24028985-vmss
--------------------------------------------------------------------------------
RESPONSE 400: 400 Bad Request
ERROR CODE: InvalidParameter
--------------------------------------------------------------------------------
{
  "error": {
    "code": "InvalidParameter",
    "message": "Required parameter 'sku' is missing (null).",
    "target": "sku"
  }
}
--------------------------------------------------------------------------------

Stack trace from the controller:

[image: screenshot of the controller stack trace]

Looking at the latest API spec for VMSS Create or Update, there is no reference to sku being a required parameter.

jwtty commented 3 months ago

Hi, thank you for trying. Could you please share more of the error log?

wizedkyle commented 3 months ago

Hey @jwtty, what types of logs are you after? All I can see in the controller is the error above, and on the Azure side I see the failed attempts to update the resource.

jwtty commented 3 months ago

Hi @wizedkyle, did you create the vmss BEFORE running the controller? You have to manually create the vmss that serves as the gateway. I checked the Azure logs, and they show the vmss was not created before kube-egress-gateway-controller tried to update it.

BTW, aks integration with this feature is close to public preview. With aks integration, we are going to provision the vmss as an agentpool.

jwtty commented 3 months ago

There is a doc for creating the vmss beforehand: https://github.com/Azure/kube-egress-gateway/blob/main/docs/install.md#prerequisites

wizedkyle commented 3 months ago

@jwtty The VMSS was created before the controller deployment using Terraform and it was created as a Kubernetes node pool.

Here is the Terraform showing what was provisioned:

resource "azurerm_kubernetes_cluster_node_pool" "egress_gateway" {
  name = "egress"
  kubernetes_cluster_id = azurerm_kubernetes_cluster.cluster.id
  vm_size = "Standard_B4s_v2"
  node_count = 2
  enable_auto_scaling = false
  os_sku = "Ubuntu"
  vnet_subnet_id = azurerm_subnet.aks_cluster.id

  node_labels = {
    "kubeegressgateway.azure.com/mode" = "true"
  }

  node_taints = [
    "kubeegressgateway.azure.com/mode=true:NoSchedule"
  ]

  tags = local.common_tags
}

jwtty commented 3 months ago

Hi @wizedkyle, I think I identified a bug in the code. Just for confirmation, what value did you put in "config.azureCloudConfig.resourceGroup" in the helm chart? And what value did you put in "staticGatewayConfiguration.spec.gatewayVmssProfile.vmssResourceGroup"?

I think the issue is related to an incorrect resource group being provided. My hypothesis is that you put the "MC" resource group of the AKS cluster in the staticGatewayConfiguration spec but the non-MC resource group in the cloud config. Please help me check. If this is the case, you can mitigate it by putting the "MC" resource group in the cloud config too while I work on the bugfix.

wizedkyle commented 3 months ago

Hey @jwtty, in the helm chart, for "config.azureCloudConfig.resourceGroup" I put the name of the resource group that holds the AKS cluster resource, and for "staticGatewayConfiguration.spec.gatewayVmssProfile.vmssResourceGroup" I used the MC_ resource group name relating to the AKS cluster.

So based on that I would say your hypothesis is correct.

jwtty commented 3 months ago

Cool, appreciate your confirmation. So for mitigation, please set the resourceGroup in the config to the MC_ resource group and try again. Meanwhile, I have already made the fix for the issue.
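
For anyone on an image version that predates the fix, the mismatch discussed above can be spotted before deploying. The following is only a rough sketch, not part of kube-egress-gateway: the function name and the sample resource group names are hypothetical, and it assumes the helm values and the StaticGatewayConfiguration have already been loaded into plain dictionaries. The two field paths are the ones named in this thread.

```python
def find_resource_group_mismatch(helm_values: dict, gateway_config: dict):
    """Return (cloud_rg, vmss_rg) if the two settings differ, else None.

    Hypothetical helper for illustration; it only compares the two fields
    mentioned in this issue.
    """
    cloud_rg = helm_values["config"]["azureCloudConfig"]["resourceGroup"]
    vmss_rg = gateway_config["spec"]["gatewayVmssProfile"]["vmssResourceGroup"]
    # Azure resource group names are case-insensitive, so compare case-folded.
    if cloud_rg.casefold() != vmss_rg.casefold():
        return cloud_rg, vmss_rg
    return None


# Made-up values mirroring the setup described in this thread:
# the cloud config points at the cluster's own resource group,
# while the gateway spec points at the MC_ resource group.
helm_values = {"config": {"azureCloudConfig": {"resourceGroup": "my-aks-rg"}}}
gateway_config = {
    "spec": {
        "gatewayVmssProfile": {
            "vmssResourceGroup": "MC_my-aks-rg_my-cluster_eastus"
        }
    }
}

mismatch = find_resource_group_mismatch(helm_values, gateway_config)
if mismatch:
    print(
        "resource group mismatch: "
        f"cloud config={mismatch[0]!r}, vmss profile={mismatch[1]!r}"
    )
```

If the check reports a mismatch, the mitigation from this thread is to set the cloud config resource group to the MC_ one, or to upgrade to an image that contains the fix.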

jwtty commented 3 months ago

Please upgrade the images to version v0.0.14.