Azure / AKS-Construction

Accelerate your onboarding to AKS with a Helper Web App, Bicep templating, and CI/CD samples. Flexible and secure AKS baseline implementations in a Microsoft + community maintained reference implementation.
https://azure.github.io/AKS-Construction/
MIT License

Receiving Bicep errors when deploying using the latest version of AKS Construction (v0.10.0) #606

Closed: pjlewisuk closed this issue 1 year ago

pjlewisuk commented 1 year ago

Describe the bug: When deploying an AKS cluster using the latest 0.10.0 version of AKS-C, I receive an error like "InvalidTemplate","message":"Deployment template parse failed: 'Error converting value \"1\" to type 'System.String[]'. Path '[0]'.'.

To Reproduce: Go to https://azure.github.io/AKS-Construction and run the commands provided as part of the default configuration for "I want a managed environment" and "Cluster with additional security controls".

The only setting I changed was the region I'm deploying into: US East instead of West Europe.

I'm running the commands in zsh on WSL, but running the PowerShell commands in PowerShell returns the same error.

(I've redacted my IP address below, but I'm passing in a valid IP there).

az group create -l EastUS -n az-k8s-amru-rg
az deployment group create -g az-k8s-amru-rg  --template-uri https://github.com/Azure/AKS-Construction/releases/download/0.10.0/main.json --parameters \
        resourceName=az-k8s-amru \
        upgradeChannel=stable \
        AksPaidSkuForSLA=true \
        SystemPoolType=Standard \
        agentCountMax=20 \
        custom_vnet=true \
        enable_aad=true \
        AksDisableLocalAccounts=true \
        enableAzureRBAC=true \
        adminPrincipalId=$(az ad signed-in-user show --query id --out tsv) \
        registries_sku=Premium \
        acrPushRolePrincipalId=$(az ad signed-in-user show --query id --out tsv) \
        omsagent=true \
        retentionInDays=30 \
        networkPolicy=azure \
        azurepolicy=audit \
        availabilityZones="[\"1\",\"2\",\"3\"]" \
        authorizedIPRanges="[\"1.2.3.4/32\"]" \
        ingressApplicationGateway=true \
        appGWcount=0 \
        appGWsku=WAF_v2 \
        appGWmaxCount=10 \
        appgwKVIntegration=true \
        aksOutboundTrafficType=natGateway \
        createNatGateway=true \
        keyVaultAksCSI=true \
        keyVaultCreate=true \
        keyVaultOfficerRolePrincipalId=$(az ad signed-in-user show --query id --out tsv)

Full error received:

{
    "status": "Failed",
    "error": {
        "code": "DeploymentFailed",
        "target": "/subscriptions/1ef1298c-a01a-454b-ab6c-2d2203a00553/resourceGroups/az-k8s-amru-rg/providers/Microsoft.Resources/deployments/main",
        "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.",
        "details": [{
            "code": "InvalidTemplate",
            "message": "Deployment template parse failed: 'Error converting value \"1\" to type 'System.String[]'. Path '[0]'.'."
        }]
    }
}
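
As the error message suggests, the per-resource details can be pulled out of the deployment operations. A quick way to do that (the deployment is named main, per the error target):

az deployment operation group list -g az-k8s-amru-rg -n main \
        --query "[?properties.provisioningState=='Failed'].properties.statusMessage" -o jsonc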

By comparison, a "barebones" command like the one below succeeds:

az group create -l WestEurope -n az-k8s-80k3-rg
az deployment group create -g az-k8s-80k3-rg --template-uri https://github.com/Azure/AKS-Construction/releases/download/0.10.0/main.json --parameters \
        resourceName=az-k8s-80k3 \
        agentCount=1 \
        JustUseSystemPool=true \
        osDiskType=Managed \
        osDiskSizeGB=32 \
        availabilityZones="[\"1\",\"2\",\"3\"]"
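
Worth noting: the barebones command passes availabilityZones with exactly the same quoting, so shell escaping may not be the whole story. To take quoting out of the equation, the array parameters can also be supplied via an ARM parameters file. A sketch (azkc.parameters.json is just a made-up name):

cat > azkc.parameters.json <<'EOF'
{
    "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentParameters.json#",
    "contentVersion": "1.0.0.0",
    "parameters": {
        "availabilityZones": { "value": ["1", "2", "3"] },
        "authorizedIPRanges": { "value": ["1.2.3.4/32"] }
    }
}
EOF
az deployment group create -g az-k8s-amru-rg \
        --template-uri https://github.com/Azure/AKS-Construction/releases/download/0.10.0/main.json \
        --parameters @azkc.parameters.json \
        --parameters resourceName=az-k8s-amru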

Expected behavior: The cluster should get created without the errors :)


pjlewisuk commented 1 year ago

After some further investigation, the issue seems to be related to the network configuration. By default, the AKS Construction helper selects "Custom networking" on the "Networking Details" tab, and it seems the default CIDR / subnet configuration doesn't align with the default AKS CIDR / subnet configuration, which might mean the nodes can't communicate with the control plane?
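
One way to check that hypothesis is to compare the cluster's network profile against the address space of the VNet AKS-C created. A diagnostic sketch (the generated VNet name will vary, so list it first):

# Pod/service CIDRs and DNS service IP the cluster was configured with
az aks show -g az-k8s-amru-rg -n aks-az-k8s-amru --query networkProfile -o jsonc
# Address space and subnets of the custom VNet in the same resource group
az network vnet list -g az-k8s-amru-rg --query "[].{name:name, space:addressSpace.addressPrefixes}" -o table
az network vnet subnet list -g az-k8s-amru-rg --vnet-name <vnet-name> -o table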

Verified by changing the VNET option to "Default Networking" and running a similar (but slightly different) command:

az group create -l EastUS -n az-k8s-l4w8-rg
az deployment group create -g az-k8s-l4w8-rg  --template-uri https://github.com/Azure/AKS-Construction/releases/download/0.10.0/main.json --parameters \
        resourceName=az-k8s-l4w8 \
        upgradeChannel=stable \
        AksPaidSkuForSLA=true \
        SystemPoolType=Standard \
        agentCountMax=20 \
        enable_aad=true \
        AksDisableLocalAccounts=true \
        enableAzureRBAC=true \
        adminPrincipalId=$(az ad signed-in-user show --query id --out tsv) \
        registries_sku=Premium \
        acrPushRolePrincipalId=$(az ad signed-in-user show --query id --out tsv) \
        omsagent=true \
        retentionInDays=30 \
        networkPolicy=azure \
        azurepolicy=audit \
        authorizedIPRanges="[\"1.2.3.4/32\"]" \
        ingressApplicationGateway=true \
        aksOutboundTrafficType=natGateway \
        keyVaultAksCSI=true \
        keyVaultCreate=true \
        keyVaultOfficerRolePrincipalId=$(az ad signed-in-user show --query id --out tsv)

Note the absence of the following parameters:

custom_vnet=true
availabilityZones="[\"1\",\"2\",\"3\"]"
appGWcount=0
appGWsku=WAF_v2
appGWmaxCount=10
appgwKVIntegration=true
createNatGateway=true

This command deployed a cluster successfully.
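
For reference, the nodes joining can be confirmed with the usual credential pull (assuming AKS-C's aks- cluster name prefix, as seen in the errors above):

az aks get-credentials -g az-k8s-l4w8-rg -n aks-az-k8s-l4w8
kubectl get nodes -o wide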

pjlewisuk commented 1 year ago

Running a similar command but omitting the availabilityZones="[\"1\",\"2\",\"3\"]" parameter results in a different error: the npuser node pool fails to provision nodes that can communicate with the cluster.

az group create -l EastUS -n az-k8s-amru-rg
az deployment group create -g az-k8s-amru-rg  --template-uri https://github.com/Azure/AKS-Construction/releases/download/0.10.0/main.json --parameters \
        resourceName=az-k8s-amru \
        upgradeChannel=stable \
        AksPaidSkuForSLA=true \
        SystemPoolType=Standard \
        agentCountMax=20 \
        custom_vnet=true \
        enable_aad=true \
        AksDisableLocalAccounts=true \
        enableAzureRBAC=true \
        adminPrincipalId=$(az ad signed-in-user show --query id --out tsv) \
        registries_sku=Premium \
        acrPushRolePrincipalId=$(az ad signed-in-user show --query id --out tsv) \
        omsagent=true \
        retentionInDays=30 \
        networkPolicy=azure \
        azurepolicy=audit \
        authorizedIPRanges="[\"1.2.3.4/32\"]" \
        ingressApplicationGateway=true \
        appGWcount=0 \
        appGWsku=WAF_v2 \
        appGWmaxCount=10 \
        appgwKVIntegration=true \
        aksOutboundTrafficType=natGateway \
        createNatGateway=true \
        keyVaultAksCSI=true \
        keyVaultCreate=true \
        keyVaultOfficerRolePrincipalId=$(az ad signed-in-user show --query id --out tsv)

Error details:

{
    "status": "Failed",
    "error": {
        "code": "DeploymentFailed",
        "target": "/subscriptions/1ef1298c-a01a-454b-ab6c-2d2203a00553/resourceGroups/az-k8s-amru-rg/providers/Microsoft.Resources/deployments/main",
        "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.",
        "details": [{
            "code": "ResourceDeploymentFailure",
            "target": "/subscriptions/1ef1298c-a01a-454b-ab6c-2d2203a00553/resourceGroups/az-k8s-amru-rg/providers/Microsoft.Resources/deployments/main-userNodePool",
            "message": "The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.",
            "details": [{
                "code": "DeploymentFailed",
                "target": "/subscriptions/1ef1298c-a01a-454b-ab6c-2d2203a00553/resourceGroups/az-k8s-amru-rg/providers/Microsoft.Resources/deployments/main-userNodePool",
                "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.",
                "details": [{
                    "code": "ResourceDeploymentFailure",
                    "target": "/subscriptions/1ef1298c-a01a-454b-ab6c-2d2203a00553/resourceGroups/az-k8s-amru-rg/providers/Microsoft.ContainerService/managedClusters/aks-az-k8s-amru/agentPools/npuser01",
                    "message": "The resource write operation failed to complete successfully, because it reached terminal provisioning state 'Failed'.",
                    "details": [{
                        "code": "ReconcileVMSSAgentPoolFailed",
                        "message": "Unable to establish connection from agents to Kubernetes API server, please see https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/error-code-k8sapiserverconnfailvmextensionerror and https://aka.ms/aks-required-ports-and-addresses for more information. Details: VMSSAgentPoolReconciler retry failed: Category: ClientError; Code: VMExtensionProvisioningError; SubCode: K8SAPIServerConnFailVMExtensionError; Message: Unable to establish connection from agents to Kubernetes API server, please see https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/error-code-k8sapiserverconnfailvmextensionerror and https://aka.ms/aks-required-ports-and-addresses for more information. Details: instance 3 has extension error details : {vmssCSE error messages : {vmssCSE exit status=51, output=e/man/uk...
                        Purging old database entries in /usr/share/man/ro...
                        Processing manual pages under /usr/share/man/ro...
                        Purging old database entries in /usr/share/man/sv...
                        Processing manual pages under /usr/share/man/sv...
                        Purging old database entries in /usr/share/man/fr...
                        Processing manual pages under /usr/share/man/fr...
                        Purging old database entries in /usr/share/man/sl...
                        Processing manual pages under /usr/share/man/sl...
                        Purging old database entries in /usr/share/man/cs...
                        Processing manual pages under /usr/share/man/cs...
                        Purging old database entries in /usr/share/man/ja...
                        Processing manual pages under /usr/share/man/ja...
                        Purging old database entries in /usr/share/man/de...
                        Processing manual pages under /usr/share/man/de...
                        Purging old database entries in /usr/share/man/nl...
                        Processing manual pages under /usr/share/man/nl...
                        Purging old database entries in /usr/share/man/ko...
                        Processing manual pages under /usr/share/man/ko...
                        Purging old database entries in /usr/share/man/hu...
                        Processing manual pages under /usr/share/man/hu...
                        Purging old database entries in /usr/share/man/sr...
                        Processing manual pages under /usr/share/man/sr...
                        Purging old database entries in /usr/share/man/it...
                        Processing manual pages under /usr/share/man/it...
                        Purging old database entries in /usr/share/man/da...
                        Processing manual pages under /usr/share/man/da...
                        Purging old database entries in /usr/share/man/pl...
                        Processing manual pages under /usr/share/man/pl...
                        Purging old database entries in /usr/share/man/tr...
                        Processing manual pages under /usr/share/man/tr...
                        Purging old database entries in /usr/share/man/pt_BR...
                        Processing manual pages under /usr/share/man/pt_BR...
                        Purging old database entries in /usr/share/man/zh_TW...
                        Processing manual pages under /usr/share/man/zh_TW...
                        Processing manual pages under /usr/local/man...
                        /usr/bin/mandb: can't update index cache /var/cache/man/oldlocal/2222: No such file or directory
                        4 man subdirectories contained newer manual pages.
                        0 manual pages were added.
                        0 stray cats were added.
                        634 old database entries were purged.
                        ++ date
                        + echo 'man-db finished updates at Wed Jul 5 15:28:55 UTC 2023'
                        man-db finished updates at Wed Jul 5 15:28:55 UTC 2023
                        + systemctl restart --no-block apt-daily.timer apt-daily-upgrade.timer
                        + systemctl restart --no-block apt-daily.service
                        + aptmarkWALinuxAgent unhold
                        + echo 'Custom script finished. API server connection check code:' 51
                        Custom script finished. API server connection check code: 51
                        ++ date
                        ++ date
                        ++ hostname
                        ++ hostname
                        + echo Wed Jul 5 15:28:55 UTC 2023,aks-npuser01-40741532-vmss000003, startAptmarkWALinuxAgent unhold
                        Wed Jul 5 15:28:55 UTC 2023,aks-npuser01-40741532-vmss000003, startAptmarkWALinuxAgent unhold
                        + wait_for_apt_locks
                        + fuser /var/lib/dpkg/lock /var/lib/apt/lists/lock /var/cache/apt/archives/lock
                        + echo Wed Jul 5 15:28:55 UTC 2023,aks-npuser01-40741532-vmss000003, endcustomscript
                        + mkdir -p /opt/azure/containers
                        + touch /opt/azure/containers/provision.complete
                        + exit 51, error=}};instance 4 has extension error details : {vmssCSE error messages : {vmssCSE exit status=51, output= true ']'
                        + UU_CONFIG_DIR=/etc/apt/apt.conf.d/99periodic
                        ++ dirname /etc/apt/apt.conf.d/99periodic
                        + mkdir -p /etc/apt/apt.conf.d
                        + touch /etc/apt/apt.conf.d/99periodic
                        + chmod 0644 /etc/apt/apt.conf.d/99periodic
                        + echo 'APT::Periodic::Update-Package-Lists \"1\";'
                        + echo 'APT::Periodic::Unattended-Upgrade \"1\";'
                        + systemctl unmask apt-daily.service apt-daily-upgrade.service
                        Removed /etc/systemd/system/apt-daily.service.
                        Removed /etc/systemd/system/apt-daily-upgrade.service.
                        + systemctl enable apt-daily.service apt-daily-upgrade.service
                        Updating index cache for path `/usr/share/man/man7'. Wait...
                        Updating index cache for path `/usr/share/man/man8'. Wait...
                        Updating index cache for path `/usr/share/man/man3'. Wait...
                        Updating index cache for path `/usr/share/man/man1'. Wait...
                        Purging old database entries in /usr/share/man...
                        Processing manual pages under /usr/share/man...
                        /usr/bin/mandb: warning: /usr/share/man/man1/usbip.8.gz: ignoring bogus filename
                        /usr/bin/mandb: warning: /usr/share/man/man1/usbipd.8.gz: ignoring bogus filename
                        The unit files have no installation config (WantedBy=, RequiredBy=, Also=,
                        Alias= settings in the [Install] section, and DefaultInstance= for template
                        units). This means they are not meant to be enabled using systemctl.

                        Possible reasons for having this kind of units are:
                        • A unit may be statically enabled by being symlinked from another unit's
                         .wants/ or .requires/ directory.
                        • A unit's purpose may be to act as a helper for some other unit which has
                         a requirement dependency on it.
                        • A unit may be started when needed via activation (socket, path, timer,
                         D-Bus, udev, scripted systemctl call, ...).
                        • In case of template units, the unit is meant to be enabled with some
                         instance name specified.
                        + systemctl enable apt-daily.timer apt-daily-upgrade.timer
                        done.
                        Created symlink /etc/systemd/system/timers.target.wants/apt-daily.timer → /lib/systemd/system/apt-daily.timer.
                        Created symlink /etc/systemd/system/timers.target.wants/apt-daily-upgrade.timer → /lib/systemd/system/apt-daily-upgrade.timer.
                        + systemctl restart --no-block apt-daily.timer apt-daily-upgrade.timer
                        + systemctl restart --no-block apt-daily.service
                        + aptmarkWALinuxAgent unhold
                        + echo 'Custom script finished. API server connection check code:' 51
                        Custom script finished. API server connection check code: 51
                        ++ date
                        ++ date
                        ++ hostname
                        ++ hostname
                        + echo Wed Jul 5 15:29:00 UTC 2023,aks-npuser01-40741532-vmss000004, startAptmarkWALinuxAgent unhold
                        Wed Jul 5 15:29:00 UTC 2023,aks-npuser01-40741532-vmss000004, startAptmarkWALinuxAgent unhold
                        + wait_for_apt_locks
                        + fuser /var/lib/dpkg/lock /var/lib/apt/lists/lock /var/cache/apt/archives/lock
                        + echo Wed Jul 5 15:29:00 UTC 2023,aks-npuser01-40741532-vmss000004, endcustomscript
                        + mkdir -p /opt/azure/containers
                        + touch /opt/azure/containers/provision.complete
                        + exit 51
                        + retrycmd_if_failure 120 5 25 apt-mark unhold walinuxagent
                        + retries=120
                        + wait_sleep=5
                        + timeout=25
                        + shift
                        + shift
                        + shift
                        ++ seq 1 120, error=}};instance 5 has extension error details : {vmssCSE error messages : {vmssCSE exit status=51, output=s in /usr/share/man/sr...
                        Processing manual pages under /usr/share/man/sr...
                        Purging old database entries in /usr/share/man/it...
                        Processing manual pages under /usr/share/man/it...
                        Purging old database entries in /usr/share/man/da...
                        Processing manual pages under /usr/share/man/da...
                        Purging old database entries in /usr/share/man/pl...
                        Processing manual pages under /usr/share/man/pl...
                        Purging old database entries in /usr/share/man/tr...
                        Processing manual pages under /usr/share/man/tr...
                        Purging old database entries in /usr/share/man/pt_BR...
                        Processing manual pages under /usr/share/man/pt_BR...
                        Purging old database entries in /usr/share/man/zh_TW...
                        Processing manual pages under /usr/share/man/zh_TW...
                        Processing manual pages under /usr/local/man...
                        /usr/bin/mandb: can't update index cache /var/cache/man/oldlocal/2226: No such file or directory
                        0 man subdirectories contained newer manual pages.
                        0 manual pages were added.
                        0 stray cats were added.
                        0 old database entries were purged.
                        ++ date
                        + echo 'man-db finished updates at Wed Jul 5 15:28:43 UTC 2023'
                        man-db finished updates at Wed Jul 5 15:28:43 UTC 2023
                        + systemctl enable apt-daily.service apt-daily-upgrade.service
                        The unit files have no installation config (WantedBy=, RequiredBy=, Also=,
                        Alias= settings in the [Install] section, and DefaultInstance= for template
                        units). This means they are not meant to be enabled using systemctl.

                        Possible reasons for having this kind of units are:
                        • A unit may be statically enabled by being symlinked from another unit's
                         .wants/ or .requires/ directory.
                        • A unit's purpose may be to act as a helper for some other unit which has
                         a requirement dependency on it.
                        • A unit may be started when needed via activation (socket, path, timer,
                         D-Bus, udev, scripted systemctl call, ...).
                        • In case of template units, the unit is meant to be enabled with some
                         instance name specified.
                        + systemctl enable apt-daily.timer apt-daily-upgrade.timer
                        Created symlink /etc/systemd/system/timers.target.wants/apt-daily.timer → /lib/systemd/system/apt-daily.timer.
                        Created symlink /etc/systemd/system/timers.target.wants/apt-daily-upgrade.timer → /lib/systemd/system/apt-daily-upgrade.timer.
                        + systemctl restart --no-block apt-daily.timer apt-daily-upgrade.timer
                        + systemctl restart --no-block apt-daily.service
                        + aptmarkWALinuxAgent unhold
                        + echo 'Custom script finished. API server connection check code:' 51
                        Custom script finished. API server connection check code: 51
                        ++ date
                        ++ date
                        ++ hostname
                        ++ hostname
                        + echo Wed Jul 5 15:28:44 UTC 2023,aks-npuser01-40741532-vmss000005, startAptmarkWALinuxAgent unhold
                        Wed Jul 5 15:28:44 UTC 2023,aks-npuser01-40741532-vmss000005, startAptmarkWALinuxAgent unhold
                        + wait_for_apt_locks
                        + fuser /var/lib/dpkg/lock /var/lib/apt/lists/lock /var/cache/apt/archives/lock
                        + echo Wed Jul 5 15:28:44 UTC 2023,aks-npuser01-40741532-vmss000005, endcustomscript
                        + mkdir -p /opt/azure/containers
                        + touch /opt/azure/containers/provision.complete
                        + exit 51, error=}}; InnerMessage: ; Dependency: Microsoft.Compute/VirtualMachineScaleSet; AKSTeam: "
                    }]
                }]
            }]
        }]
    }
}

The key part of the error seems to be:

Unable to establish connection from agents to Kubernetes API server, please see https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/error-code-k8sapiserverconnfailvmextensionerror and https://aka.ms/aks-required-ports-and-addresses for more information. Details: VMSSAgentPoolReconciler retry failed: Category: ClientError; Code: VMExtensionProvisioningError; SubCode: K8SAPIServerConnFailVMExtensionError; Message: Unable to establish connection from agents to Kubernetes API server, please see https://learn.microsoft.com/en-us/troubleshoot/azure/azure-kubernetes/error-code-k8sapiserverconnfailvmextensionerror and https://aka.ms/aks-required-ports-and-addresses for more information.

So with this command, the cluster creation proceeds further than with the original command, but then fails due to what appears to be a networking configuration issue that prevents some or all of the npuser pool nodes from communicating with the control plane...

I was watching the cluster using kubectl get nodes -w and the virtual machine scale set in the portal: I could see the VMs get provisioned into the scale set, but they never appeared as part of the cluster in kubectl get nodes.
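
If anyone wants to dig into the same symptom, the connectivity check can be reproduced from inside a failing instance. A sketch (the MC_ node resource group follows the default AKS naming convention, and the scale set name and instance id come from the error above):

# The API server FQDN the agents are trying to reach
az aks show -g az-k8s-amru-rg -n aks-az-k8s-amru --query fqdn -o tsv
# Probe it from one of the failed VMSS instances
az vmss run-command invoke \
        -g MC_az-k8s-amru-rg_aks-az-k8s-amru_eastus \
        -n aks-npuser01-40741532-vmss \
        --instance-id 3 \
        --command-id RunShellScript \
        --scripts "curl -vk --max-time 10 https://<api-server-fqdn>:443/"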


pjlewisuk commented 1 year ago

Turning off "Azure Application Gateway Ingress Controller add-on" (i.e. changing "Ingress Controllers" to "Not required"), but leaving "Custom Networking" turned on generates this command:

az group create -l EastUS -n az-k8s-8376-rg
az deployment group create -g az-k8s-8376-rg  --template-uri https://github.com/Azure/AKS-Construction/releases/download/0.10.0/main.json --parameters \
    resourceName=az-k8s-8376 \
    upgradeChannel=stable \
    AksPaidSkuForSLA=true \
    SystemPoolType=Standard \
    agentCountMax=20 \
    custom_vnet=true \
    enable_aad=true \
    AksDisableLocalAccounts=true \
    enableAzureRBAC=true \
    adminPrincipalId=$(az ad signed-in-user show --query id --out tsv) \
    registries_sku=Premium \
    acrPushRolePrincipalId=$(az ad signed-in-user show --query id --out tsv) \
    omsagent=true \
    retentionInDays=30 \
    networkPolicy=azure \
    azurepolicy=audit \
    authorizedIPRanges="[\"1.2.3.4/32\"]" \
    aksOutboundTrafficType=natGateway \
    createNatGateway=true \
    keyVaultAksCSI=true \
    keyVaultCreate=true \
    keyVaultOfficerRolePrincipalId=$(az ad signed-in-user show --query id --out tsv)

i.e. the following parameters have been removed:

availabilityZones="[\"1\",\"2\",\"3\"]"
ingressApplicationGateway=true
appGWcount=0
appGWsku=WAF_v2
appGWmaxCount=10
appgwKVIntegration=true

This results in the same error as the previous configuration.
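
Both failing variants combine custom_vnet=true with aksOutboundTrafficType=natGateway and authorizedIPRanges, so one thing worth checking is whether the NAT gateway is actually attached to the AKS subnet, and whether its public IP is covered by the authorized ranges: with a public API server, the nodes' traffic egresses via the NAT gateway's IP. A diagnostic sketch (not a confirmed root cause; the VNet and subnet names are hypothetical):

# Is the NAT gateway associated with the AKS node subnet?
az network vnet subnet show -g az-k8s-8376-rg --vnet-name <vnet-name> -n <aks-subnet-name> \
        --query "{natGateway:natGateway.id, prefix:addressPrefix}" -o jsonc
# The NAT gateway's outbound public IP, which may need to appear in authorizedIPRanges
az network nat gateway list -g az-k8s-8376-rg --query "[].publicIpAddresses[].id" -o tsv
az network public-ip show --ids <public-ip-id> --query ipAddress -o tsv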