preetisingh110 opened this issue 3 years ago (status: Open)
Hi preetisingh110, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.
I might be just a bot, but I'm told my suggestions are normally quite good, as such:
1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!
Triage required from @Azure/aks-pm
@preetisingh110, if you could pass your arm template to us, that would be great. Thanks!
@qpetraroia not the original poster but I believe I have hit the same issue as described.
Here is my bicep file that reproduces the same error.
```bicep
param adminUsername string = 'joeblogs'

resource aks 'Microsoft.ContainerService/managedClusters@2021-08-01' = {
  name: 'ring0'
  location: 'uksouth'
  sku: {
    name: 'Basic'
    tier: 'Free'
  }
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: 'jdaksring0'
    enableRBAC: true
    kubernetesVersion: '1.21.2'
    nodeResourceGroup: 'aks-platform-ring0-nodepool'
    windowsProfile: {
      adminUsername: adminUsername
      adminPassword: 'superSecret123!'
    }
  }

  resource linuxpool 'agentPools' = {
    name: 'linux'
    properties: {
      count: 3
      mode: 'System'
      osDiskType: 'Ephemeral'
      osType: 'Linux'
      type: 'VirtualMachineScaleSets'
      vmSize: 'standard_d2s_v3'
    }
  }

  resource winpool 'agentPools' = {
    name: 'win'
    properties: {
      count: 3
      mode: 'System'
      osDiskType: 'Ephemeral'
      osType: 'Windows'
      type: 'VirtualMachineScaleSets'
      vmSize: 'standard_d2s_v3'
    }
  }
}
```
Error:
```json
{
  "code": "DeploymentFailed",
  "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.",
  "details": [
    {
      "code": "InvalidParameter",
      "message": "Required parameter agentPoolProfiles is missing (null)."
    }
  ]
}
```
Having the same error here.
I'm not sure whether it succeeds after an initial deployment with an inline agent pool, but when rebuilding from scratch, this fails.
In my version, I have the agent pool resources not-nested as opposed to https://github.com/Azure/AKS/issues/2495#issuecomment-959181070
@justindavies @yizhang4321 could you have a look at this?
We're hitting this now as well when deploying new clusters; deploying on top of existing clusters is fine. We moved our pool resources out of agentPoolProfiles after the initial deployment so that we could update the pools, since there are many changes you can't make to existing pools.
Same error as others.
```json
{
  "code": "DeploymentFailed",
  "message": "At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/DeployOperations for usage details.",
  "details": [
    {
      "code": "InvalidParameter",
      "message": "Required parameter agentPoolProfiles is missing (null)."
    }
  ]
}
```
The AKS dev team is looking into this and will provide an update here by next week.
@craigktreasure Could you elaborate on your failure scenario a little bit? I'd prefer that you provide us with the steps you executed and the parameters used for each step. Without this detailed information, we can't conclude whether the failure was caused by an AKS issue or a test scenario problem.
@tonychen15 I'm not sure what you're after specifically. I think between the OP and https://github.com/Azure/AKS/issues/2495#issuecomment-959181070 it's mostly there. It might help to clarify how I ended up in this situation. I can only speak for myself, but I wouldn't be surprised if my experience is similar to others.
Note: The examples below have been stripped down to make the point and are not expected to work in practice.
You start by creating your cluster with something simple (not complete):
```bicep
resource cluster_resource 'Microsoft.ContainerService/managedClusters@2021-09-01' = {
  name: clusterName
  location: regionLocation
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: clusterKubernetesVersion
    dnsPrefix: '${clusterName}-dns'
    agentPoolProfiles: [
      {
        name: 'agentpool'
        vmSize: 'Standard_DS2_v2'
        osDiskSizeGB: 128
        osDiskType: 'Managed'
        type: 'VirtualMachineScaleSets'
        enableAutoScaling: true
        mode: 'System'
        osType: 'Linux'
      }
    ]
    servicePrincipalProfile: {
      clientId: 'msi'
    }
  }
}
```
Then, you decide you want to change your vmSize or some other aspect of the pool and redeploy, which generally results in a failure indicating that you can't change various aspects of an agent pool after creating it. Lesson learned.
So, you then think: OK, I'll simply add another agent pool and migrate things over to the new one, right? Wrong. You're then greeted with:
```
A new agent pool was introduced. Adding agent pools to an existing cluster is not allowed through managed cluster operations.
For agent pool specific change, please use per agent pool operations: https://aka.ms/agent-pool-rest-api.
```
Maybe I missed something, but I didn't find the help url (https://aka.ms/agent-pool-rest-api) specified to be all that helpful for my situation.
After some documentation surfing, you learn the magic of the Microsoft.ContainerService/managedClusters/agentPools resource type. This makes you start to feel like the agentPoolProfiles property is pretty much useless, since you can't modify or add to the silly thing once you've deployed.
So, I then defined new pools using the Microsoft.ContainerService/managedClusters/agentPools resource type and ended up with something that looks like this:
```bicep
resource cluster_resource 'Microsoft.ContainerService/managedClusters@2021-09-01' = {
  name: clusterName
  location: regionLocation
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: clusterKubernetesVersion
    // agentPoolProfiles: Agent pools are defined below using managedClusters/agentPools.
    servicePrincipalProfile: {
      clientId: 'msi'
    }
  }
}

resource systempool1_pool 'Microsoft.ContainerService/managedClusters/agentPools@2021-09-01' = {
  name: 'systempool1'
  parent: cluster_resource
  properties: {
    count: 2
    enableAutoScaling: true
    kubeletDiskType: 'OS'
    mode: 'System'
    nodeTaints: [
      // Prevent application pods from running on this pool.
      // https://docs.microsoft.com/azure/aks/use-system-pools#system-and-user-node-pools
      'CriticalAddonsOnly=true:NoSchedule'
    ]
    osDiskSizeGB: 128
    osDiskType: 'Managed'
    osType: 'Linux'
    type: 'VirtualMachineScaleSets'
    vmSize: 'Standard_D4s_v3'
  }
}

resource general1_pool 'Microsoft.ContainerService/managedClusters/agentPools@2021-09-01' = {
  name: 'general1'
  parent: cluster_resource
  properties: {
    count: 1
    enableAutoScaling: true
    kubeletDiskType: 'OS'
    mode: 'User'
    osDiskSizeGB: 128
    osDiskType: 'Managed'
    osType: 'Linux'
    type: 'VirtualMachineScaleSets'
    vmSize: 'Standard_D4s_v3'
  }
}
```
This can be deployed again and again to a cluster that already exists. I then manually cordoned, drained, and deleted the old agent pool.
However, when you go to deploy a new cluster with something like what I ended up with at the end, you get the errors reported above. Suddenly the agentPoolProfiles property is required again, expecting a pool to be defined. It leaves us quite confused as to how to use these two methods of defining agent pools.
@craftyhouse I appreciate you providing such detailed information on how this issue happened.
Sorry to hear that the Agent Pool REST API document didn't help in your situation. I believe the following how-to guide may help more in a similar situation: https://docs.microsoft.com/en-us/azure/aks/use-multiple-node-pools.
Back to the issue itself. It seems the failure happened in the create scenario, not the update scenario. The second key point is that two agent pool resource definitions were provided but no agentPoolProfiles field. If my understanding is correct, then I guess the wording "agentPoolProfiles is not a mandatory parameter" is causing some misunderstanding for customers.
If customers don't provide any parameters for agentPoolProfiles, AKS will automatically allocate a default system pool and agentPoolProfiles for them; in that case, the sentence is correct. However, once customers provide some agentPool resource definitions without an agentPoolProfiles structure, AKS will double-check agentPoolProfiles's existence, because agentPoolProfiles is a wrapper structure for agentPool resources. My first thought is that there is a gap between Bicep/ARM templates and AKS on whether agentPoolProfiles is needed in the latter case. We will first discuss this with the AKS product team and then come up with a solution to overcome the confusion.
@tonychen15 Just want to add that I ran into this problem exactly the same way as @craigktreasure described.
I think it's hard to get the design of ARM/Bicep templates right with regards to AKS clusters when you need to support both fresh and update deployments. This is mainly caused by not being able to modify or remove pools via the agentPoolProfiles property (instead needing separate child resources and manual/imperative draining/removal).
In an ideal ARM/Bicep IaC world, the agentPoolProfiles property would represent exactly the desired state configuration. I.e., when a new agent pool is listed, it should be created. When an existing agent pool is no longer listed, it should be removed (and its nodes cordoned and drained; if that's not possible, the deployment should fail). If you could make this work, most things would become very intuitive.
In the current situation, one option might be to provide better documentation on how AKS via ARM/Bicep is meant to be done. One thing that works for me is to define only the system node pool in the agentPoolProfiles property and every other node pool as child resources. This approach still requires manual removal of replaced pools, and it doesn't support replacing the system node pool, but at least it supports new and updating deployments at the same time, and I haven't found a better approach yet.
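For concreteness, the workaround described above (only the system pool inline, everything else as child resources) might be sketched like this. This is an illustrative, untested sketch; the names, VM sizes, and API version are placeholders, not the commenter's actual template:

```bicep
// Sketch of the workaround: only the system pool lives in agentPoolProfiles;
// all user pools are separate child resources that can be changed freely.
resource cluster 'Microsoft.ContainerService/managedClusters@2021-09-01' = {
  name: 'mycluster'                // illustrative name
  location: resourceGroup().location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: 'mycluster-dns'     // illustrative
    agentPoolProfiles: [
      {
        // The one pool AKS insists on at create time.
        name: 'system'
        mode: 'System'
        osType: 'Linux'
        type: 'VirtualMachineScaleSets'
        count: 2
        vmSize: 'Standard_D4s_v3'
      }
    ]
  }
}

// Every other pool as a child resource, so it can be added or updated later
// without touching agentPoolProfiles.
resource userPool 'Microsoft.ContainerService/managedClusters/agentPools@2021-09-01' = {
  name: 'user1'
  parent: cluster
  properties: {
    mode: 'User'
    osType: 'Linux'
    type: 'VirtualMachineScaleSets'
    count: 1
    vmSize: 'Standard_D4s_v3'
  }
}
```

The trade-off noted above still applies: replacing the system pool itself, or removing a retired user pool, remains a manual step outside the template.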
@kaarthis or @tonychen15, any update on this?
Managed to get this working. You only need the system pool in agentPoolProfiles; the other node pools can be added as "Microsoft.ContainerService/managedClusters/agentPools" resources, as per fschmied.
@kaarthis or @tonychen15, still looking for an update on this?
@craigktreasure @kaarthis will start the effort to update the Microsoft docs to make this clear to customers. In a very simple create-cluster scenario, i.e. when customers only need one system pool, they do not need to specify any agent pool profile; AKS will create a default one for them. However, when customers try to create a couple of node pools at the same time, an agent pool profile is needed to indicate some common properties for these agent pools. Without it, AKS will respond with an alert.
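As I read the explanation above, the simplest create is the one where agentPoolProfiles is omitted entirely and no child agentPools resources are declared; AKS is then said to supply a default system pool. A hedged, illustrative sketch of that case (names and API version are placeholders):

```bicep
// Simplest case per the comment above: no agentPoolProfiles and no child
// agentPools resources. AKS is said to create a default system pool itself.
resource cluster 'Microsoft.ContainerService/managedClusters@2021-09-01' = {
  name: 'simplecluster'            // illustrative
  location: resourceGroup().location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: 'simplecluster-dns' // illustrative
    // agentPoolProfiles intentionally omitted: the "one default system pool" case.
  }
}
```

The error in this issue appears once child agentPools resources are added to a template like this while agentPoolProfiles stays absent.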
The way that node pools are handled today is far from optimal. Ideally, I would expect agentPoolProfiles to be a required attribute. In addition, it should also allow for the capability to create new node pools directly by updating this property. Our use case is that we want to declaratively manage agentPoolProfiles as described in https://github.com/pulumi/pulumi-azure-native/issues/579, without having to create our own logic on top to cater for the fact that we may need to create the node pools as separate REST resources.
We are also facing a similar issue.
You cannot rely on the agentPoolProfiles property to model agent pools declaratively, because ARM doesn't allow you to change this property in many ways. For example, you cannot add/delete node pools, cannot upgrade OS versions, and cannot even change the maxSurge property!
```
Message: Updating property MaxSurge of a virtual-machine-scale-set agent pool is not allowed through the managed cluster API. Use the agent pool API (https://aka.ms/agent-pool-rest-api) to update property in agent pool agentpool2
```
It looks like using the "Microsoft.ContainerService/managedClusters/agentPools" resource to model agent pools separately may allow us to manage agent node pools declaratively to some level, but I am worried that this approach is "hacky" and may cause some inconsistency with AKS's control plane.
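To illustrate, the maxSurge change that the error above rejects through the managed cluster API can, as I understand it, be expressed on the per-pool child resource instead, via its upgradeSettings block. An untested sketch; the cluster name, pool properties, and API version are illustrative, not taken from any template in this thread:

```bicep
// Reference the existing cluster rather than redeclaring it.
resource cluster 'Microsoft.ContainerService/managedClusters@2021-09-01' existing = {
  name: 'mycluster'                // illustrative: your existing cluster name
}

// Sketch: updating maxSurge through the per-pool resource rather than
// through managedClusters.properties.agentPoolProfiles.
resource pool 'Microsoft.ContainerService/managedClusters/agentPools@2021-09-01' = {
  name: 'agentpool2'               // the pool named in the error above
  parent: cluster
  properties: {
    mode: 'User'
    osType: 'Linux'
    type: 'VirtualMachineScaleSets'
    count: 3
    vmSize: 'Standard_D4s_v3'
    upgradeSettings: {
      maxSurge: '33%'              // the property that cannot be changed inline
    }
  }
}
```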
@tonche @kaarthis @craigktreasure Can you check whether using "Microsoft.ContainerService/managedClusters/agentPools" to model agent pools directly is a safe operation and can be used as an interim solution?
I would be fine with using Microsoft.ContainerService/managedClusters/agentPools, but then AKS should allow you to create a managedCluster with an empty agentPoolProfiles property so you can manage all node pools homogeneously. The issue is that the current implementation is inconsistent.
The issue we are getting today is:
```
Diagnostics:
  pulumi:pulumi:Stack (Platform-clusters-prd):
    error: update failed

  azure-native:containerservice/v20220901:ManagedCluster (ns-azweu-dev-c.akscluster-r.akscluster):
    error: Code="BadRequest" Message="A new agent pool was introduced. Adding agent pools to an existing cluster is not allowed through managed cluster operations. For agent pool specific change, please use per agent pool operations: https://aka.ms/agent-pool-rest-api" Target="agentPoolProfiles"
```
We just ran into this as well and I agree with the "workaround" by @fschmied .
The documentation from the Bicep examples is not clear, and I think the labeling is incorrect: it shouldn't be labeled "agentpool". Consider the documentation at https://learn.microsoft.com/en-us/azure/aks/use-multiple-node-pools, where there is a clear distinction between "system node" and "user node" pools. IMHO, the agentPoolProfiles property should not be required and should be relabeled "systemPoolProfile"; if it is not provided, the managedClusters resource should deploy the minimum VM set necessary for those dependencies to run. Any subsequent user node pool would then be provided by the agentPools resource.
If that is the original intent, then the label should be renamed in the ARM/Bicep template and the documentation updated. We discovered this right before deploying into a customer's environment: we went to update the servers right before deploying to production and were met with the "cannot change this property" error.
Still no update on this? Running into this again on another customer of mine.
Checking in on this @kaarthis @yizhang4321
I'm experiencing the same issue, but from the Azure Portal.
We also ran into this issue, having adopted a customer's existing Bicep code and their agentPoolProfiles setup. We needed to expand their infrastructure and replace the existing system pool, because we also needed to change the vmSize used in the pools, and that's not possible to alter for an existing node pool.
I haven't seen exact instructions in this thread to address this need, nor in the documentation. I believe this is a continuation of the method @craigktreasure outlined above.
The process of replacing the system pool while maintaining a valid Bicep setup (and, I assume, other IaC) isn't actually too tricky; it's just a two-step process using the methods described above, with a small change.
Example - start with one node pool of mode 'system', perhaps like this:
```bicep
resource k8sCluster 'Microsoft.ContainerService/managedClusters@2021-03-01' = {
  name: name
  location: location
  tags: {
    displayName: 'AKS Cluster'
  }
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: k8sVersion
    enableRBAC: true
    nodeResourceGroup: '${resourceGroup().name}-k8s-nrg'
    dnsPrefix: dnsPrefix
    agentPoolProfiles: [
      {
        name: 'agentpool'
        mode: 'System'
        osType: 'Linux'
        type: 'VirtualMachineScaleSets'
        count: k8sNodeCount
        vmSize: k8sClusterSize
        osDiskSizeGB: k8sOsDiskSizeGB
        orchestratorVersion: k8sVersion
        enableAutoScaling: agentpoolAutoScaling
        minCount: AutoScalingMin
        maxCount: AutoScalingMax
      }
    ]
  }
}
```
To replace this pool, grow the cluster with another system-mode node pool, using the agentPools resource mentioned by others and documented here.
Here's an example file showing the two node pools:
```bicep
resource k8sCluster 'Microsoft.ContainerService/managedClusters@2021-03-01' = {
  name: name
  location: location
  tags: {
    displayName: 'AKS Cluster'
  }
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: k8sVersion
    enableRBAC: true
    nodeResourceGroup: '${resourceGroup().name}-k8s-nrg'
    dnsPrefix: dnsPrefix
    agentPoolProfiles: [
      {
        name: 'agentpool'
        mode: 'System'
        osType: 'Linux'
        type: 'VirtualMachineScaleSets'
        count: k8sNodeCount
        vmSize: k8sClusterSize
        osDiskSizeGB: k8sOsDiskSizeGB
        orchestratorVersion: k8sVersion
        enableAutoScaling: agentpoolAutoScaling
        minCount: AutoScalingMin
        maxCount: AutoScalingMax
      }
    ]
  }
}

resource systempool 'Microsoft.ContainerService/managedClusters/agentPools@2023-07-02-preview' = {
  name: 'systempool'
  parent: k8sCluster
  properties: {
    mode: 'System'
    osType: 'Linux'
    type: 'VirtualMachineScaleSets'
    count: systemNodeCount
    vmSize: systemVmSize
    osDiskType: 'Ephemeral'
    osDiskSizeGB: systemOsDiskSizeGB
    orchestratorVersion: k8sVersion
    enableAutoScaling: systempoolAutoScaling
    minCount: systempoolAutoScalingMin
    maxCount: systempoolAutoScalingMax
  }
}
```
You're free to change all aspects of this second system node pool: change the vmSize or whatever. It's a new pool, so it can be set up however you like.
Once that's been deployed to your infrastructure and created, refactor the Bicep/IaC file itself to place the new 'systempool' into the original agentPoolProfiles section, like so:
```bicep
resource k8sCluster 'Microsoft.ContainerService/managedClusters@2021-03-01' = {
  name: name
  location: location
  tags: {
    displayName: 'AKS Cluster'
  }
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    kubernetesVersion: k8sVersion
    enableRBAC: true
    nodeResourceGroup: '${resourceGroup().name}-k8s-nrg'
    dnsPrefix: dnsPrefix
    agentPoolProfiles: [
      {
        name: 'systempool'
        mode: 'System'
        osType: 'Linux'
        type: 'VirtualMachineScaleSets'
        count: systemNodeCount
        vmSize: systemVmSize
        osDiskType: 'Ephemeral'
        osDiskSizeGB: systemOsDiskSizeGB
        orchestratorVersion: k8sVersion
        enableAutoScaling: systempoolAutoScaling
        minCount: systempoolAutoScalingMin
        maxCount: systempoolAutoScalingMax
      }
    ]
  }
}
```
Hopefully that example is clear enough: you just replace the original values of the original 'agentpool' within the agentPoolProfiles section of the managedClusters resource with the new values of the newly created 'systempool'.
You then remove the entire section that created the new systempool, i.e. from the middle code block above you'd remove the resource systempool 'Microsoft.ContainerService/managedClusters/agentPools@2023-07-02-preview' block entirely. So you'd be left with just the final code block above as your final Bicep code. I imagine something similar can be achieved with other IaC, but I haven't tested it.
This method will retain a valid Bicep deployment matching the infrastructure, but will essentially orphan the original system node pool (agentpool) out of your code.
One thing to watch out for: I've used params in the examples to set lots of values. If this is how you do things, make sure you remove the unused ones from the files after making changes, and make sure you remove them from deploy parameter files/variables you might have in a pipeline library etc.
Once this is done, you can trigger another deploy and essentially nothing should happen, because the pool already exists, it's just that it's now in the 'required' section of the template. You are then free to delete the original agentpool manually.
I was sceptical this would work, but it has worked for 3 separate clusters so far. This method seems 'obvious' enough that I wonder whether this is how Azure intended it to work but didn't document it; not anywhere I have found, anyway.
@dogsbody-josh Yep. That's exactly the dance we're all doing at this point.
It would be awesome if we could change the aspects of the pool and then watch as AKS just rotates out the old machines in the pool with new ones, but that's not how it works today. I imagine it's a limitation of the scale sets. I'm sure they (AKS) could work around those limitations, but it would be more work.
This issue has still not been fixed; we are forced to do this pointless dance of moving agent pool profiles in and out of the main Kubernetes resource.
I want to confirm I am also dancing pointlessly. However, the following approach has simplified the issue somewhat for me (abbreviated Bicep). It allows me to redeploy when the cluster is already present, and to define the system pool properties only once.
```bicep
// enable this for a first run (upon creation of resources) - could be externalised to a parameter
var firstRun = false

var aksSystemNodePoolName = 'npsystem'
var aksSystemNodePoolProperties = {
  ...
  vmSize: 'Standard_B2ps_v2'
  ...
  nodeTaints: [
    'CriticalAddonsOnly=true:NoSchedule'
  ]
}

resource aksSystemNodePool 'Microsoft.ContainerService/managedClusters/agentPools@2023-07-01' = if (!firstRun) {
  parent: aks
  name: aksSystemNodePoolName
  properties: aksSystemNodePoolProperties
}

resource aksUserNodePool 'Microsoft.ContainerService/managedClusters/agentPools@2023-07-01' = {
  parent: aks
  name: 'npuser'
  properties: {
    ...
    vmSize: 'Standard_E4as_v5'
    ...
  }
}

resource aks 'Microsoft.ContainerService/managedClusters@2024-01-01' = {
  name: aksName
  properties: {
    ...
    agentPoolProfiles: firstRun ? [union(aksSystemNodePoolProperties, { name: aksSystemNodePoolName })] : []
    ...
  }
  ...
}
```
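Restated end-to-end for anyone trying the same trick: the idea is that a single property object feeds both agentPoolProfiles (on the first run) and an agentPools child resource (on every later run), so the system pool is defined once. The following is my own untested sketch of that pattern with illustrative names and sizes filled in, not the commenter's actual file:

```bicep
// Untested sketch of the firstRun pattern above; names and sizes are placeholders.
var firstRun = false
var systemPoolName = 'npsystem'
var systemPoolProperties = {
  mode: 'System'
  osType: 'Linux'
  type: 'VirtualMachineScaleSets'
  count: 2
  vmSize: 'Standard_D4s_v3'
  nodeTaints: [
    'CriticalAddonsOnly=true:NoSchedule'
  ]
}

resource aks 'Microsoft.ContainerService/managedClusters@2024-01-01' = {
  name: 'mycluster'                // illustrative
  location: resourceGroup().location
  identity: {
    type: 'SystemAssigned'
  }
  properties: {
    dnsPrefix: 'mycluster-dns'     // illustrative
    // First run: the system pool is created inline. Later runs: the child
    // resource below owns it and the inline list is left empty.
    agentPoolProfiles: firstRun ? [union(systemPoolProperties, { name: systemPoolName })] : []
  }
}

resource systemPool 'Microsoft.ContainerService/managedClusters/agentPools@2024-01-01' = if (!firstRun) {
  parent: aks
  name: systemPoolName
  properties: systemPoolProperties
}
```

Whether AKS accepts the empty agentPoolProfiles list on updates is exactly the inconsistency discussed earlier in this thread, so treat this as a pattern to verify, not a guarantee.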
What happened: Tried creating AKS via ARM template. As per the docs, agentPoolProfiles is not a mandatory parameter, hence I didn't pass agentPoolProfiles; rather, I was creating agentPools via the resources array section of the ARM template. But my deployment failed, saying missing parameter: agentPoolProfiles.
What you expected to happen: I expected the deployment should proceed.
How to reproduce it (as minimally and precisely as possible): Let me know if you want my arm template.