Azure Kubernetes Service (Azure/AKS)
https://azure.github.io/AKS/

As per Microsoft Docs, agentPoolProfiles is not a mandatory parameter when creating AKS via ARM template, but my ARM deployment fails saying missing parameter: agentPoolProfiles. #2495

Open preetisingh110 opened 3 years ago

preetisingh110 commented 3 years ago

What happened: Tried creating AKS via ARM template. As per the docs, agentPoolProfiles is not a mandatory parameter, hence I didn't pass agentPoolProfiles; rather, I was creating agentPools via the resources array section of the ARM template. But my deployment failed saying missing parameter: agentPoolProfiles.

What you expected to happen: I expected the deployment to proceed.

How to reproduce it (as minimally and precisely as possible): Let me know if you want my ARM template.

ghost commented 3 years ago

Hi preetisingh110, AKS bot here 👋 Thank you for posting on the AKS repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

1. If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2. Please abide by the AKS repo Guidelines and Code of Conduct.
3. If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4. Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5. Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6. If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

ghost commented 3 years ago

Triage required from @Azure/aks-pm

qpetraroia commented 3 years ago

@preetisingh110, if you could pass your ARM template to us, that would be great. Thanks!

danijam commented 2 years ago

@qpetraroia not the original poster but I believe I have hit the same issue as described.

Here is my Bicep file that reproduces the same error.

param adminUsername string = 'joeblogs'

resource aks 'Microsoft.ContainerService/managedClusters@2021-08-01' = {
  name: 'ring0'
  location: 'uksouth'
  sku: {
    name: 'Basic'
    tier: 'Free'
  }
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: 'jdaksring0'
    enableRBAC: true
    kubernetesVersion: '1.21.2'
    nodeResourceGroup: 'aks-platform-ring0-nodepool'
    windowsProfile: {
      adminUsername: adminUsername
      adminPassword: 'superSecret123!'
    }
  }

  resource linuxpool 'agentPools' = {
    name: 'linux'
    properties: {
      count: 3
      mode: 'System'
      osDiskType: 'Ephemeral'
      osType: 'Linux'
      type: 'VirtualMachineScaleSets'
      vmSize: 'standard_d2s_v3'
    }
  }

  resource winpool 'agentPools' = {
    name: 'win'
    properties: {
      count: 3
      mode: 'System'
      osDiskType: 'Ephemeral'
      osType: 'Windows'
      type: 'VirtualMachineScaleSets'
      vmSize: 'standard_d2s_v3'
    }
  }
}

Error:

{
  "code": "DeploymentFailed",
  "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.",
  "details": [
    {
      "code": "InvalidParameter",
      "message": "Required parameter agentPoolProfiles is missing (null)."
    }
  ]
}

hansmbakker commented 2 years ago

Having the same error here.

I'm not sure whether it succeeds after an initial deployment with an inline agent pool, but when rebuilding from scratch it does fail.

In my version, the agent pool resources are not nested, as opposed to https://github.com/Azure/AKS/issues/2495#issuecomment-959181070
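For illustration, a non-nested pool declaration uses the full resource type with a parent reference instead of sitting inside the cluster body; a minimal sketch, where the pool name and size are illustrative and aks refers to the managedClusters resource declared elsewhere in the file:

resource winpool 'Microsoft.ContainerService/managedClusters/agentPools@2021-08-01' = {
  name: 'win'
  parent: aks // the managedClusters resource declared elsewhere
  properties: {
    count: 3
    mode: 'User'
    osType: 'Windows'
    type: 'VirtualMachineScaleSets'
    vmSize: 'standard_d2s_v3'
  }
}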

@justindavies @yizhang4321 could you have a look at this?

craigktreasure commented 2 years ago

We're hitting this now as well when deploying new clusters. Deploying on top of existing clusters is fine. We moved our pool resources out of agentPoolProfiles after the initial deployment so that we could update the pools, since there are many changes you can't make to existing pools.

Same error as others.

{
    "code": "DeploymentFailed",
    "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.",
    "details": [
        {
            "code": "InvalidParameter",
            "message": "Required parameter agentPoolProfiles is missing (null)."
        }
    ]
}

kaarthis commented 2 years ago

The AKS dev team is looking into this and will provide an update here by next week.

tonychen15 commented 2 years ago

@craigktreasure Could you elaborate on your failure scenario a little bit? I'd prefer that you provide us with the steps you executed and the parameters for each step. Without this detailed information, we can't conclude whether the failure was caused by an AKS issue or a test-scenario problem.

craigktreasure commented 2 years ago

@tonychen15 I'm not sure what you're after specifically. I think between the OP and https://github.com/Azure/AKS/issues/2495#issuecomment-959181070 it's mostly there. It might help to clarify how I ended up in this situation. I can only speak for myself, but I wouldn't be surprised if my experience is similar to others'.

Note: The examples below have been stripped down to make the point and are not expected to work in practice.

You start by creating your cluster with something simple (not complete):

resource cluster_resource 'Microsoft.ContainerService/managedClusters@2021-09-01' = {
  name: clusterName
  location: regionLocation
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: clusterKubernetesVersion
    dnsPrefix: '${clusterName}-dns'
    agentPoolProfiles: [
      {
        name: 'agentpool'
        vmSize: 'Standard_DS2_v2'
        osDiskSizeGB: 128
        osDiskType: 'Managed'
        type: 'VirtualMachineScaleSets'
        enableAutoScaling: true
        mode: 'System'
        osType: 'Linux'
      }
    ]
    servicePrincipalProfile: {
      clientId: 'msi'
    }
  }
}

Then, you decide you want to change your vmSize or some other aspect of the pool and redeploy, which generally results in a failure indicating that you can't change various aspects of an agent pool after creating it. Lesson learned.

So, you then think: OK, I'll simply add another agent pool and migrate things over to the new one, right? Wrong. You're then greeted with:

A new agent pool was introduced. Adding agent pools to an existing cluster is not allowed through managed cluster operations. 
For agent pool specific change, please use per agent pool operations: https://aka.ms/agent-pool-rest-api.

Maybe I missed something, but I didn't find the help URL (https://aka.ms/agent-pool-rest-api) specified there to be all that helpful for my situation.

After some documentation surfing, you learn the magic of the Microsoft.ContainerService/managedClusters/agentPools resource type. This makes you start to feel like the agentPoolProfiles property is pretty much useless, since you can't modify or add to the silly thing once you've deployed.

So, I then defined new pools using the Microsoft.ContainerService/managedClusters/agentPools resource type and ended up with something that looks like this:

resource cluster_resource 'Microsoft.ContainerService/managedClusters@2021-09-01' = {
  name: clusterName
  location: regionLocation
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: clusterKubernetesVersion
    // agentPoolProfiles: Agent pools are defined below using managedClusters/agentPools.
    servicePrincipalProfile: {
      clientId: 'msi'
    }
  }
}

resource systempool1_pool 'Microsoft.ContainerService/managedClusters/agentPools@2021-09-01' = {
  name: 'systempool1'
  parent: cluster_resource
  properties: {
    count: 2
    enableAutoScaling: true
    kubeletDiskType: 'OS'
    mode: 'System'
    nodeTaints: [
      // Prevent application pods from running on this pool.
      // https://docs.microsoft.com/azure/aks/use-system-pools#system-and-user-node-pools
      'CriticalAddonsOnly=true:NoSchedule'
    ]
    osDiskSizeGB: 128
    osDiskType: 'Managed'
    osType: 'Linux'
    type: 'VirtualMachineScaleSets'
    vmSize: 'Standard_D4s_v3'
  }
}

resource general1_pool 'Microsoft.ContainerService/managedClusters/agentPools@2021-09-01' = {
  name: 'general1'
  parent: cluster_resource
  properties: {
    count: 1
    enableAutoScaling: true
    kubeletDiskType: 'OS'
    mode: 'User'
    osDiskSizeGB: 128
    osDiskType: 'Managed'
    osType: 'Linux'
    type: 'VirtualMachineScaleSets'
    vmSize: 'Standard_D4s_v3'
  }
}

This can be deployed again and again to a cluster that already exists. I then manually cordoned, drained, and deleted the old agent pool.
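For reference, that manual step can be done with kubectl and the Azure CLI; a sketch, assuming the old pool is named 'agentpool' and relying on the agentpool node label that AKS applies (resource group and cluster names are placeholders):

# Cordon, then drain, the nodes of the old pool
kubectl cordon -l agentpool=agentpool
kubectl drain -l agentpool=agentpool --ignore-daemonsets --delete-emptydir-data

# Delete the old pool
az aks nodepool delete --resource-group myResourceGroup --cluster-name myCluster --name agentpool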

However, when you go to deploy a new cluster with something like what I ended up with at the end, you get the errors reported above. Suddenly the agentPoolProfiles property is required again, expecting a pool to be defined. It leaves us quite confused as to how to use these two methods of defining agent pools.

tonychen15 commented 2 years ago

@craigktreasure I appreciate you providing such detailed information on how this issue happened.

Sorry to hear that the Agent Pool REST API document didn't help you in your situation. I believe the following how-to guide may help more in a similar situation: https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools.

Back to the issue itself. It seems that the failure happened in the create phase, not the update scenario. The second key point is that two agent pool resource definitions were provided but no agentPoolProfiles field was provided. If my understanding is correct, then I guess the wording "agentPoolProfiles is not a mandatory parameter" is causing some misunderstanding for customers.

If customers don't provide any parameters for agentPoolProfiles, then AKS will automatically allocate a default system pool and agentPoolProfiles for them. In that case, the sentence is correct. However, once customers provide some agentPool resource definitions but no agentPoolProfiles structure, the AKS code will double-check agentPoolProfiles' existence, based on the fact that agentPoolProfiles is a wrapper structure for agentPool resources. My first thought is that there is a gap between Bicep/ARM templates and AKS on whether agentPoolProfiles is needed in the latter case. We will first discuss this with the AKS product team and then come up with a solution to overcome the confusion.
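To illustrate the distinction being described (a sketch based on the behaviour above, not an official example): a cluster defined with no pool information at all deploys successfully, because AKS fills in a default system pool, whereas adding agentPools child resources without agentPoolProfiles triggers the error:

resource aks 'Microsoft.ContainerService/managedClusters@2021-09-01' = {
  name: 'example'
  location: 'uksouth'
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: 'example-dns'
    // No agentPoolProfiles and no agentPools child resources:
    // AKS allocates a default system pool automatically.
  }
}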

fschmied commented 2 years ago

@tonychen15 Just want to add that I ran into this problem exactly the same way as @craigktreasure described.

I think it's hard to get the design of ARM/Bicep templates right with regard to AKS clusters when you need to support both fresh and update deployments, and this is mainly caused by the inconsistent handling of agentPoolProfiles between create and update operations.

In an ideal ARM/Bicep IaC world, the agentPoolProfiles property would represent exactly the desired state configuration. I.e., when a new agent pool is listed, it should be created. When an existing agent pool is no longer listed, it should be removed (and its nodes cordoned and drained; if not possible, the deployment should fail). If you could make this work, most things would become very intuitive.

In the current situation, one option might be to provide better documentation on how AKS via ARM/Bicep is meant to be done. One thing that works for me is to define only the system node pool in the agentPoolProfiles property and every other node pool as a child resource. This approach still requires manual removal of replaced pools and doesn't support replacing the system node pool, but at least it supports new and updating deployments at the same time, and I haven't found a better one yet.
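A minimal sketch of that layout (names, API versions, and sizes are illustrative): only the system pool sits in agentPoolProfiles, and every other pool is a separate agentPools resource:

resource aks 'Microsoft.ContainerService/managedClusters@2021-09-01' = {
  name: 'example'
  location: 'uksouth'
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: 'example-dns'
    agentPoolProfiles: [
      {
        // Only the system pool is defined inline; it is created with the cluster.
        name: 'system'
        mode: 'System'
        osType: 'Linux'
        type: 'VirtualMachineScaleSets'
        count: 2
        vmSize: 'Standard_D4s_v3'
      }
    ]
  }
}

resource userpool 'Microsoft.ContainerService/managedClusters/agentPools@2021-09-01' = {
  name: 'user1'
  parent: aks
  properties: {
    // User pools as child resources can be added, changed, and removed on later deployments.
    mode: 'User'
    osType: 'Linux'
    type: 'VirtualMachineScaleSets'
    count: 1
    vmSize: 'Standard_D4s_v3'
  }
}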

craigktreasure commented 2 years ago

@kaarthis or @tonychen15, any update on this?

ericyew commented 2 years ago

Managed to get this working. You only need the system pool in agentPoolProfiles. The other node pools can be added as "Microsoft.ContainerService/managedClusters/agentPools" resources, as per @fschmied.

craigktreasure commented 1 year ago

@kaarthis or @tonychen15, still looking for an update on this?

ghost commented 1 year ago

@craigktreasure: @kaarthis will start the effort to update the Microsoft Docs to make this clear to customers. In a very simple create-cluster scenario, i.e. when customers only need one system pool, they do not need to provide any agent pool profile; AKS will create a default agent pool profile for them. However, when customers try to create several node pools at the same time, an agent pool profile is needed to indicate common properties for these agent pools. Without it, AKS will respond with an alert.

nicklasfrahm commented 1 year ago

The way that node pools are handled today is far from optimal. Ideally, I would expect agentPoolProfiles to be a required attribute. In addition, it should also allow creating new node pools directly by updating this property.

Our use case is that we want to declaratively manage agentPoolProfiles as described in https://github.com/pulumi/pulumi-azure-native/issues/579 without having to create our own logic on top to cater for the fact that we may need to create the node pools as separate REST resources.

ericsuhong commented 1 year ago

We are also facing a similar issue.

You cannot rely on the agentPoolProfiles property to model agent pools declaratively because ARM doesn't allow you to change this property in many ways. For example, you cannot add/delete node pools, cannot upgrade OS versions, and cannot even change the maxSurge property!

Message: Updating property MaxSurge of a virtual-machine-scale-set agent pool is not allowed through the managed cluster API. Use the agent pool API (https://aka.ms/agent-pool-rest-api) to update property in agent pool agentpool2

It looks like using the "Microsoft.ContainerService/managedClusters/agentPools" resource to model agent pools separately may allow us to manage agent node pools declaratively to some degree, but I am worried that this approach is "hacky" and may cause some inconsistency with AKS's control plane.

@tonche @kaarthis @craigktreasure Can you check whether using "Microsoft.ContainerService/managedClusters/agentPools" to model agent pools directly is a safe operation and can be used as an interim solution?

nicklasfrahm commented 1 year ago

I would be fine with using Microsoft.ContainerService/managedClusters/agentPools, but then AKS should allow you to create a managedCluster with an empty agentPoolProfiles property so you can manage all node pools homogeneously. The issue is that the current implementation is inconsistent.
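For clarity, the homogeneous layout being asked for would look roughly like this (a sketch of the desired behaviour; today a create with no pools defined this way is rejected with the "Required parameter agentPoolProfiles is missing" error reported earlier):

resource aks 'Microsoft.ContainerService/managedClusters@2022-09-01' = {
  name: 'example'
  location: 'westeurope'
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: 'example-dns'
    // Desired: leave this empty and manage every pool, including the
    // system pool, as separate managedClusters/agentPools resources.
    agentPoolProfiles: []
  }
}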

The issue we are getting today is:

Diagnostics:
  pulumi:pulumi:Stack (Platform-clusters-prd):
    error: update failed

  azure-native:containerservice/v20220901:ManagedCluster (ns-azweu-dev-c.akscluster-r.akscluster):
    error: Code="BadRequest" Message="A new agent pool was introduced. Adding agent pools to an existing cluster is not allowed through managed cluster operations. For agent pool specific change, please use per agent pool operations: https://aka.ms/agent-pool-rest-api" Target="agentPoolProfiles"

jeffbeagley commented 1 year ago

We just ran into this as well, and I agree with the "workaround" by @fschmied.

The documentation from the Bicep examples is not clear, and I think it is incorrect for this to be labeled "agentpool".

Consider the documentation at https://learn.microsoft.com/en-us/azure/aks/use-multiple-node-pools, as there is a clear distinction between "system node" and "user node" pools. IMHO, "agentPoolProfiles" should not be required and should be relabeled "systemPoolProfile"; if it is not provided, the managedClusters resource should deploy the minimum VM set necessary for those dependencies to run. Any subsequent "user node pool" would then be provided by the agentPools resource.

If that is the original intent, then the label should be renamed in the ARM/Bicep template and the documentation should be updated. We discovered this right before deploying into a customer's environment: we went to update the servers right before deploying to production and were met with the "cannot change this property" error.

jeffbeagley commented 1 year ago

Still no update on this? Running into this again with another customer of mine.

chzbrgr71 commented 1 year ago

Checking in on this @kaarthis @yizhang4321

Gil-Shvarzman commented 1 year ago

I'm experiencing the same issue, but from the Azure Portal.

dogsbody-josh commented 9 months ago

We also ran into this issue, having adopted a customer's existing Bicep code and their agentPoolProfiles setup. We needed to expand their infrastructure and replace the existing system pool, because we also needed to change the vmSize used in the pools, and that's not possible to alter on an existing node pool.

I haven't seen exact instructions in this thread to address this need, nor in the documentation. I believe this is a continuation of the method @craigktreasure outlined above.

The process of replacing the system pool while maintaining a valid Bicep setup (and, I assume, other IaC) isn't actually too tricky; it's just a two-step process using the methods described above, with a small change.

Example: start with one node pool of mode 'System', perhaps like this:

resource k8sCluster 'Microsoft.ContainerService/managedClusters@2021-03-01' = {
  name: name
  location: location
  tags:{
    displayName: 'AKS Cluster'
  }
  identity:{
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: k8sVersion
    enableRBAC: true    
    nodeResourceGroup: '${resourceGroup().name}-k8s-nrg'
    dnsPrefix: dnsPrefix
    agentPoolProfiles: [
      {
        name: 'agentpool'
        mode: 'System'
        osType: 'Linux'
        type: 'VirtualMachineScaleSets'
        count: k8sNodeCount
        vmSize: k8sClusterSize
        osDiskSizeGB: k8sOsDiskSizeGB
        orchestratorVersion: k8sVersion
        enableAutoScaling: agentpoolAutoScaling
        minCount: AutoScalingMin
        maxCount: AutoScalingMax
      }
     ]
  }
}

To replace this pool, grow the cluster with another system-mode node pool, using the agentPools resource mentioned by others and documented here.

Here's an example file showing the two node pools:

resource k8sCluster 'Microsoft.ContainerService/managedClusters@2021-03-01' = {
  name: name
  location: location
  tags:{
    displayName: 'AKS Cluster'
  }
  identity:{
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: k8sVersion
    enableRBAC: true    
    nodeResourceGroup: '${resourceGroup().name}-k8s-nrg'
    dnsPrefix: dnsPrefix
    agentPoolProfiles: [
      {
        name: 'agentpool'
        mode: 'System'
        osType: 'Linux'
        type: 'VirtualMachineScaleSets'
        count: k8sNodeCount
        vmSize: k8sClusterSize
        osDiskSizeGB: k8sOsDiskSizeGB
        orchestratorVersion: k8sVersion
        enableAutoScaling: agentpoolAutoScaling
        minCount: AutoScalingMin
        maxCount: AutoScalingMax
      }
     ]
  }
}

resource systempool 'Microsoft.ContainerService/managedClusters/agentPools@2023-07-02-preview' = {
  name: 'systempool'
  parent: k8sCluster
  properties: {
    mode: 'System'
    osType: 'Linux'
    type: 'VirtualMachineScaleSets'
    count: systemNodeCount
    vmSize: systemVmSize
    osDiskType: 'Ephemeral'
    osDiskSizeGB: systemOsDiskSizeGB
    orchestratorVersion: k8sVersion
    enableAutoScaling: systempoolAutoScaling
    minCount: systempoolAutoScalingMin
    maxCount: systempoolAutoScalingMax
  }
}

You're free to change all aspects of this second system node pool: change the vmSize or whatever. It's a new pool, so it can be set up however you like.

Once that's been deployed to your infrastructure and created, refactor the Bicep/IaC file itself to place the new 'systempool' into the original agentPoolProfiles section, like so:

resource k8sCluster 'Microsoft.ContainerService/managedClusters@2021-03-01' = {
  name: name
  location: location
  tags:{
    displayName: 'AKS Cluster'
  }
  identity:{
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: k8sVersion
    enableRBAC: true    
    nodeResourceGroup: '${resourceGroup().name}-k8s-nrg'
    dnsPrefix: dnsPrefix
    agentPoolProfiles: [
      {    
        name: 'systempool'
        mode: 'System'
        osType: 'Linux'
        type: 'VirtualMachineScaleSets'
        count: systemNodeCount
        vmSize: systemVmSize
        osDiskType: 'Ephemeral'
        osDiskSizeGB: systemOsDiskSizeGB
        orchestratorVersion: k8sVersion
        enableAutoScaling: systempoolAutoScaling
        minCount: systempoolAutoScalingMin
        maxCount: systempoolAutoScalingMax
      }
     ]
  }
}

Hopefully that example is clear enough: you just replace the original values of the original 'agentpool' within the agentPoolProfiles section of the managedClusters resource with the values of the newly created 'systempool'.

You then remove the entire section that created the new systempool, i.e. from the middle code block above you'd remove the resource systempool 'Microsoft.ContainerService/managedClusters/agentPools@2023-07-02-preview' section entirely, so you'd be left with just the final code block above as your final Bicep code. I imagine the same can be achieved with other IaC, but I haven't tested it.

This method retains a valid Bicep deployment matching the infrastructure, but essentially orphans the original system node pool (agentpool) out of your code.

One thing to watch out for: I've used params in the examples to set lots of values. If this is how you do things, make sure you remove the unused ones from the files after making changes, and make sure you remove them from deploy parameter files/variables you might have in a pipeline library, etc.

Once this is done, you can trigger another deploy and essentially nothing should happen, because the pool already exists; it's just now in the 'required' section of the template. You are then free to delete the original agentpool manually.

I was sceptical this would work, but it has worked for 3 separate clusters so far. This method seems 'obvious' enough that I wonder whether this is how Azure intended it to work but didn't document it (not anywhere I have found, anyway).

craigktreasure commented 9 months ago

@dogsbody-josh Yep. That's exactly the dance we're all doing at this point.

It would be awesome if we could change aspects of the pool and then watch as AKS rotates out the old machines in the pool for new ones, but that's not how it works today. I imagine it's a limitation of the scale sets. I'm sure they (AKS) could work around those limitations, but it would be more work.

kek-Sec commented 4 months ago

This issue has still not been fixed; we are forced to do this pointless dance of moving agent pool profiles in and out of the main Kubernetes resource.

joachimnielandt commented 3 months ago

I want to confirm I am also dancing pointlessly. However, the following approach has simplified the issue somewhat for me (abbreviated Bicep). It allows me to redeploy when the cluster is already present, and to define the system pool properties only once.

// enable this for a first run (upon creation of resources) - could be externalised to parameter
var firstRun = false

var aksSystemNodePoolName = 'npsystem'
var aksSystemNodePoolProperties = {
  ...
  vmSize: 'Standard_B2ps_v2'
  ...
  nodeTaints: [
    'CriticalAddonsOnly=true:NoSchedule'
  ]
}

resource aksSystemNodePool 'Microsoft.ContainerService/managedClusters/agentPools@2023-07-01' = if(!firstRun) {
  parent: aks
  name: aksSystemNodePoolName
  properties: aksSystemNodePoolProperties
}

resource aksUserNodePool 'Microsoft.ContainerService/managedClusters/agentPools@2023-07-01' = {
  parent: aks
  name: 'npuser'
  properties: {
    ...
    vmSize: 'Standard_E4as_v5'
    ...
  }
}

resource aks 'Microsoft.ContainerService/managedClusters@2024-01-01' = {
  name: aksName
  properties: {
    ...
    agentPoolProfiles: firstRun ? [ union( aksSystemNodePoolProperties, {name: aksSystemNodePoolName}) ] : []
    ...
  }
  ...
}