AKS NodeOSUpgrade Channel for Nodeimage

denniszielke commented 3 years ago

What happened: VMs have installed patches and now all need to be restarted. The problem is that I, as a customer, have to implement a reliable process that properly restarts one VM after another.

What you expected to happen: Basically I want something like a nodepool restart as an API (similar to node image upgrade, just without the image upgrade). That is, iterating over all VMs of a node pool and restarting each one while respecting PDBs, waiting for the VM to become available before restarting the next one. Maybe even surge support?

ghost commented 3 years ago

Hi denniszielke, AKS bot here :wave: Thank you for posting on the AKS Repo, I'll do my best to get a kind human from the AKS team to assist you.

I might be just a bot, but I'm told my suggestions are normally quite good, as such:

1) If this case is urgent, please open a Support Request so that our 24/7 support team may help you faster.
2) Please abide by the AKS repo Guidelines and Code of Conduct.
3) If you're having an issue, could it be described on the AKS Troubleshooting guides or AKS Diagnostics?
4) Make sure you're subscribed to the AKS Release Notes to keep up to date with all that's new on AKS.
5) Make sure there isn't a duplicate of this issue already reported. If there is, feel free to close this one and '+1' the existing issue.
6) If you have a question, do take a look at our AKS FAQ. We place the most common ones there!

dennis-benzinger-hybris commented 3 years ago

And AFAIK there's no nice way to get the VMSSs. You have to look into the MC resource group and know the naming pattern.

ghost commented 3 years ago

Triage required from @Azure/aks-pm

ghost commented 3 years ago

Action required from @Azure/aks-pm

phealy commented 3 years ago

The general way I see this handled is with Kured, which can reboot the nodes when /var/run/reboot-required is set. Because it respects draining nodes and PodDisruptionBudget it should do so safely.

You can also find the nodepool name by looking in .spec.providerID on the node objects, like so:

$ kubectl get nodes -o json | jq -r '.items[].spec.providerID | . |= split("/")[10]' | sort -u

aks-nodepool1-42757005-vmss
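
For anyone who prefers not to depend on jq, the same lookup can be sketched in plain shell. This is only a minimal sketch, assuming the providerID layout that AKS uses for VMSS-backed nodes (`azure:///subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachineScaleSets/<vmss>/virtualMachines/<n>`); the function names `vmss_from_provider_id` and `list_vmss_names` are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch: derive the VMSS name from a node's .spec.providerID without jq.
# Assumption: the providerID follows the VMSS layout described above.

# Hypothetical helper: print the VMSS name for one providerID.
vmss_from_provider_id() {
  # Field 11 when split on "/" (jq's index 10, because jq counts from 0).
  printf '%s\n' "$1" | cut -d/ -f11
}

# Hypothetical helper: read providerIDs (one per line) on stdin
# and print the unique VMSS names.
list_vmss_names() {
  while IFS= read -r id; do
    if [ -n "$id" ]; then
      vmss_from_provider_id "$id"
    fi
  done | sort -u
}
```

Feeding it from a cluster might look like `kubectl get nodes -o jsonpath='{range .items[*]}{.spec.providerID}{"\n"}{end}' | list_vmss_names`.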

Let me know if this doesn't answer your question!

ghost commented 3 years ago

Thanks for reaching out. I'm closing this issue as it was marked with "Answer Provided" and it hasn't had activity for 2 days.

denniszielke commented 3 years ago

Sorry this is not solved. Kured has tons of issues and is no longer officially recommended for AKS. @palma21 can you comment?

DavidDavids commented 2 years ago

I have a large customer looking for something similar. I think the idea would best be phrased as "Scheduled, Rolling Node Restarts in a NodePool". The customer wants to give the underlying VMs the chance to pick up OS-level patches regularly, but their application has to stay up 24/7, 365 days a year.

kaarthis commented 2 years ago

We are discussing various options on the table and should have an update in the next sprint.

rouke-broersma commented 2 years ago

> The general way I see this handled is with Kured, which can reboot the nodes when /var/run/reboot-required is set. Because it respects draining nodes and PodDisruptionBudget it should do so safely.
>
> You can also find the nodepool name by looking in .spec.providerID on the node objects, like so:
>
> $ kubectl get nodes -o json | jq -r '.items[].spec.providerID | . |= split("/")[10]' | sort -u
>
> aks-nodepool1-42757005-vmss
>
> Let me know if this doesn't answer your question!

Kured does not work with autoscaling in AKS. It results in a reboot loop: the new node that is spun up to replace the drained node does not have the patches, so Kured patches and reboots it on the next scheduled run, which creates yet another unpatched node, and so on.

kaarthis commented 2 years ago

This is actually a feature in consideration as another minor Auto upgrade channel / Nodepool property that can help with getting security OS patches at a regular cadence and which should be able to work with the other major K8s upgrades (Stable, patch) etc.

srmars commented 2 years ago

Please let us know if you have any update on this feature request.

kaarthis commented 2 years ago

Soliciting feedback here... We are thinking of a Security channel (similar to the Auto upgrade channels) that can work in parallel with the Auto upgrade channels and has the following two options: None / Managed-Patching.

What Managed-Patching means is

a) Within a maintenance schedule that the customer chooses (see "Use Planned Maintenance for your Azure Kubernetes Service (AKS) cluster (preview) - Azure Kubernetes Service | Microsoft Docs"), so the customer can choose non-business hours to avoid or mitigate disruptions. Note: the above doc will come with specific options for the security channel soon; see "Upgrade Maintenance Schedule · Issue #2983 · Azure/AKS (github.com)".
b) All the security updates (OS security fixes or kernel updates, e.g. automated OS patches on Linux) will be applied and a managed reboot process will be executed, but only if necessary (i.e. if kernel patching was done or the reboot flag is set on the node). Of course, steps will be taken to ensure this process works with the autoscaler within this maintenance schedule/period. The present implementation can work with scaling up; scale-down will be deactivated during the security updates.
c) New nodes coming up in this period should be fully patched.

None means ‘Disable all Security Updates’ which are on by default today in all Linux nodes (This is called Unattended Upgrades for reference).

Questions:

Which of these options works for your needs ?
What in your opinion should be the default option?
Would you need one more option for the security channel - i.e Enable just Automatic Security updates (Unattended Upgrades in Linux) without controlled Reboots or in other words let customers control reboots as necessary ? If so WHY?
We are thinking of the security channel at cluster level, Do you need it at Nodepool level ? If so WHY?
Any other options to manage security channel that you maybe thinking with rationale?

Do write back to us and it will help us immensely as we go through the implementation here.

kaarthis commented 2 years ago

@rouke-broersma @sri53 @DavidDavids @denniszielke @dennis-menge for viz on above.

kaarthis commented 2 years ago

Any feedback here ?

rouke-broersma commented 2 years ago

Sorry I forgot to reply. As long as this works correctly with autoscaling clusters this works for us.

kaarthis commented 2 years ago

Updated with more questions @rouke-broersma , @dennis-benzinger-hybris @denniszielke @DavidDavids @sri53 and others.

denniszielke commented 2 years ago

Yes, I like it too and it is more clear in terms of operational responsibility than what we have today.

rouke-broersma commented 2 years ago

> Which of these options works for your needs? What in your opinion should be the default option?

The current default is a good balance between security and intrusiveness as a default (node patches but no reboot). I don't think no updates should be the default. I don't think unscheduled reboots should be the default.

> Would you need one more option for the security channel - i.e. Enable just Automatic Security updates (Unattended Upgrades in Linux) without controlled Reboots or in other words let customers control reboots as necessary? If so WHY?

I would not use it but like I said for the previous question this is to me the best default so it should exist as an option.

> We are thinking of the security channel at cluster level. Do you need it at Nodepool level? If so WHY?

At the moment we do not need it but I can imagine we might have applications running that are not yet cloud native / highly available. These applications are likely to have a specific node pool due to probably needing a specific VM size that can fit this bigger monolithic application. If we can set the security channel at the node pool we could set this node pool to a maintenance schedule or to no-reboot while having a better patching strategy for the node pools that run cloud native highly available workloads.

> Any other options to manage security channel that you may be thinking of, with rationale?

Managed patching without a maintenance schedule ie on-demand patching? The current description makes it seem like managed patching would only be available together with a schedule and I don't think that's necessary. Many workloads could handle on demand patching fine. However we are eagerly awaiting GA of maintenance schedules so we can take advantage of these auto upgrade/security channels :)

kaarthis commented 2 years ago

Thanks for the detailed feedback @rouke-broersma. Good to know you may not always need a schedule here. If so, does the current cadence of nightly patching and on-demand reboot (on detection) work, i.e. without any schedules? Good to know the nodepool-level feedback; we will see if there are concrete asks from customers who want that, and why.

rouke-broersma commented 2 years ago

Ondemand reboot (automatic with auto scaling disabled) would work for most but not all of our use cases. Some customers are more cloud native ready than others. We might decide to use schedules for all clusters just for the sake of keeping the clusters more the same but for a lot of customers we could definitely choose to forego a schedule.

That does bring me to a new question. What would happen to new nodes? Ideally they would be fully patched by the time they are ready to receive workloads.

kaarthis commented 2 years ago

Yes we want to ensure they are fully patched esp if opting in for the 'Managed-patching' experience.

akanso commented 2 years ago

another advantage I see to this feature apart from OS updates and Security patches, is to rejuvenate the AKS cluster by restarting all the nodes and consequently all the pods/containers. It would also help test the resiliency of the apps deployed in the cluster and see if they can survive the entire cluster nodes restarts without outages (e.g. without dropping any requests)

srmars commented 2 years ago

Hi @kaarthis Apologies for the late response.

Below is exactly what I am looking for, as mentioned by @denniszielke:

> Basically I want something like a nodepool restart as an api (similar to node image upgrade - just without the image upgrade). That is iterating all vms of a node pool - doing a vm restart, which is respecting PDBs, waits for vms to become available until it restarts the next vm

Use case: updating the DNS IP in the AKS cluster VNet. Currently, to get the new DNS IPs to all the AKS nodes we need to do a nodepool upgrade with image only. Instead, I am looking for an API that restarts the nodes, without an image upgrade, so that all the AKS nodes pick up the new DNS IPs.

mac2000 commented 2 years ago

Another good use case is the current ongoing issue in Azure with DNS on Ubuntu 18.04:

Azure customers running Canonical Ubuntu 18.04 experiencing DNS errors

> Starting at approximately 06:00 UTC on 30 Aug 2022, a number of customers running Ubuntu 18.04 (bionic) VMs who have Ubuntu Unattended-Upgrades enabled would receive systemd version 237-3ubuntu10.54. A bug in this version will lead to DNS resolution errors. Reports of this issue are confined to this single Ubuntu version. This bug and a potential fix have been highlighted on the Canonical / Ubuntu website, which we encourage impacted customers to read: https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119 A potential workaround customers can consider is to reboot impacted VM instances so that they receive a fresh DHCP lease and new DNS resolver(s). If you are running a VM with Ubuntu 18.04 image, and you are experiencing connectivity issues, you can evaluate the above mitigation options. Due to this issue, there is downstream impact to other Azure services and customers are also receiving communications directly via Azure Service Health. A large portion of impact has been to Azure Kubernetes Service (AKS) in multiple regions, and other Azure services reliant on AKS. For AKS resources, our engineering team deployed an auto-remediation for AKS clusters in all regions and noticed a substantial progress with impacted clusters recovering to a healthy state. However, it was determined that there were a subset of cases where the AKS nodes were not covered by the auto-remediation detection and not fixed as a result. Engineers have developed and are in the process of deploying a fix to these nodes that were not mitigated in the earlier iteration. We are making steady progress on this deployment and will continue to share an update. Also please note, the offending package Ubuntu updates have been removed until further investigation is completed. The next update will be provided in 3 hours, or as events warrant.

The interesting part there is:

> A potential workaround customers can consider is to reboot impacted VM instances so that they receive a fresh DHCP lease and new DNS resolver(s)

But there is no way to reboot nodes in node pool 🤷‍♂️

There are some workarounds like this one: Restart a node in AKS aka:

az vmss restart -g my-group-name -n my-vm-name --instance-ids 2

But in my case it just kept running and did nothing, so it did not work out.

Another possible workaround is:

turn off node autoscale and scale them manually x2
drain old nodes one by one waiting till everything migrated to new nodes
turn on node autoscale and wait till old nodes removed
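
The drain step of this workaround can be sketched roughly as follows. It is only a sketch under stated assumptions: `drain_nodes_one_by_one` and the `KUBECTL` override are hypothetical names, the 30 minute timeout is an arbitrary placeholder, and turning the autoscaler off and on plus the manual scale-out are left out. It cordons every old node first so that evicted pods cannot land on another old node, then drains them in sequence; `kubectl drain` uses the eviction API, so PodDisruptionBudgets are respected.

```shell
#!/usr/bin/env bash
# Sketch: cordon all old nodes, then drain them one at a time.
# Assumption: KUBECTL may be overridden, e.g. KUBECTL="echo kubectl" for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# Hypothetical helper: pass the names of the OLD nodes as arguments.
drain_nodes_one_by_one() {
  local node
  # Cordon everything first so evicted pods only land on the new nodes.
  # $KUBECTL is intentionally unquoted so "echo kubectl" splits into two words.
  for node in "$@"; do
    $KUBECTL cordon "$node" || return 1
  done
  # Drain sequentially; each drain blocks until the node is empty,
  # the timeout is hit, or an eviction fails.
  for node in "$@"; do
    $KUBECTL drain "$node" --ignore-daemonsets --delete-emptydir-data --timeout=30m || return 1
  done
}
```

Running it with `KUBECTL="echo kubectl"` first prints the commands without touching the cluster, which is a cheap way to check the node list before doing it for real.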

srmars commented 2 years ago

If this feature were available now, it would have saved a lot of time during the ongoing DNS issue :)

zioproto commented 2 years ago

This feature would have been useful to mitigate the DNS Ubuntu issue of August 30th 2022 that also affected AKS.

The first proposed workaround was "reboot your instances": https://bugs.launchpad.net/ubuntu/+source/systemd/+bug/1988119/

kaarthis commented 2 years ago

Hi All ,

@sri53 , @zioproto @mac2000

We are deciding a name for the newly proposed channel which will have 4 options none | unmanaged | security-patch | node-image

None - Customer chooses to do nothing. Opt out of all security updates.
Unmanaged - This is the current default, which denotes nightly Unattended Upgrades.
Security-Patch - New and upcoming. AKS takes on the onus of rolling out the UU packages from Canonical in the customer's maintenance window of choice. No need to maintain Kured, as this will determine if a re-image / reboot is needed and perform it. The idea is to keep disruption minimal.
Node-Image - This denotes the existing Auto upgrade Node image option, which is getting ported over to this security channel.

Hope this clarifies. Also, for a suitable name for this channel we are deciding between NodeOSUpgradeChannel and NodeUpgradeChannel. Which one is preferred, given we already have an AutoUpgrade channel (that will remain mainly for K8s upgrades)?

rouke-broersma commented 2 years ago

@kaarthis what do we choose if we want both node image updates and validated security patches?

kaarthis commented 2 years ago

Why would you need that? Security patching is applied on the same VHD, while Node image typically installs a new VHD, so the two are mutually exclusive in that sense. However, you can choose, say, Node image or Security-Patch alongside a K8s auto upgrade channel like Stable or Rapid, which was previously not possible.

rouke-broersma commented 2 years ago

@kaarthis because security patches are probably available well before a new vhd is. Or am I misunderstanding?

dennis-menge commented 2 years ago

Hey @kaarthis, sorry for the very late reply, my notification settings were messed up.

I like the approach you are suggesting and it makes total sense, I just want to propose one addition.

Can we still expose some kind of Nodepool API to manually trigger a VMSS restart (which respects Surge Rate Settings) to be able to deal with special situations? Hereby, I am not only referring to the recent DNS problem, but also to the original problem.

A lot of customers will not be able to easily opt into the Security-Patch channel because their applications are not yet resilient enough to survive such cluster operations without downtime / interruption. Hence, they are forced to stay on the Unmanaged channel for the time being.

By introducing (or rather exposing) a public Rolling Restart API on the Node pool resources, we can at least give them the capability to be able to schedule a coordinated restart (i.e., to apply certain UU that require a restart).

kaarthis commented 2 years ago

Thank you @dennis-menge. Speaking to some large customers from @denniszielke, it looks like there is going to be good traction and adoption of the newly proposed channel. Hence we are going ahead with the Security channel solution at this point. After this goes live, if there is still some hesitancy and concerns, we are happy to revisit the Restart API at VMSS level.

ondrejhlavacek commented 2 years ago

I can offer a different point of view, maybe a bit unrelated. We're running a SaaS platform on AKS. There's nothing like "outside business hours", our operating times are 24/7. It wasn't that difficult to make the app resilient over the years, migrating to Kubernetes didn't pose too many new challenges from this point of view. Maybe the opposite, it provided us with a predictable and testable infrastructure behaviour.

Having latest security updates is a must for us, we're handling customers data. If they are installed one instance at a time or all instances at the same time doesn't really matter to us. Autoscaling will spin up enough nodes to transfer the workload and then scale them down once the process is finished.

What we're dealing with is that the autoupgrade channel doesn't wait long enough for the drain to finish. Some of our workload cannot be interrupted (evicted) and has to finish on the node it's currently running. It can take hours for it to finish. Manually invoked node drains work like a charm. Details here - https://github.com/Azure/AKS/issues/3008#issue-1267623517.

kaarthis commented 1 year ago

Happy new year everyone!! We are coming up with a preview of the NodeOSUpgrade channel it has the following 4 options and will work in tandem with Auto upgrade channel existing today --

4 options on NodeOSUpgrade Channel are --- None, Unmanaged, SecurityPatch, NodeImage

None - No security patches at all; Unattended upgrades set to off.
Unmanaged - The present setting of 'Unattended upgrades', i.e. nightly Canonical security patches with no maintenance window on them.
SecurityPatch - This allows AKS to roll out Canonical security updates on a weekly cadence (default) or, better, in a customer-configured maintenance schedule. There won't be a need to maintain Kured, as AKS will decide to reimage nodes as and when a 'Kernel reboot' is needed for the patches within the maintenance window.
NodeImage - This provides a fresh node image (VHD) to the VM weekly (default), with all the up-to-date security patches, or on a schedule and cadence of your choice if given.

Preview is planned tentatively in Q1 CY23.

However, the more pertinent question for everyone here is: on GA, what should be an agreeable default between the SecurityPatch and NodeImage options in the NodeOSUpgradeChannel? Please let us know what would work for you.
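
For illustration, picking one of the four options above on an existing cluster could look roughly like the sketch below. The `--node-os-upgrade-channel` flag is the one used with `az aks create` later in this thread; using it with `az aks update` and needing the `aks-preview` extension while the feature is in preview are assumptions here, and `node_os_channel_cmd` is a hypothetical helper that only validates the value and prints the command rather than calling Azure.

```shell
#!/usr/bin/env bash
# Sketch: validate a NodeOSUpgrade channel name and print the az command for it.
# Assumption: the flag is accepted by "az aks update" as well as "az aks create".

# Hypothetical helper: args are resource group, cluster name, channel.
node_os_channel_cmd() {
  local rg="$1" cluster="$2" channel="$3"
  case "$channel" in
    None|Unmanaged|SecurityPatch|NodeImage) ;;
    *)
      echo "unknown channel: $channel" >&2
      return 1
      ;;
  esac
  # Print only; pipe the output to a shell to actually run it.
  echo "az aks update --resource-group $rg --name $cluster --node-os-upgrade-channel $channel"
}
```

Printing instead of executing keeps the sketch safe to try; `node_os_channel_cmd testcluster testcluster NodeImage` shows the command that would switch the cluster from the repro in this thread to the NodeImage channel.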

rouke-broersma commented 1 year ago

Shouldn't it just remain the same as it is now? @kaarthis which means Unmanaged. Otherwise this would be a breaking change.

And my question still remains open. What if we want SecurityPatch + NodeImage updates? Currently we use Unmanaged + NodeImage (using auto upgrade channel). This means we receive all security updates that do not require a node reboot automatically + a fresh image whenever it is available. If nodeimage upgrade is removed from the auto update channel, and we cannot select security patch + node image in the node os upgrade channel, we will be losing instead of gaining functionality.

Or will node image still be available in the auto upgrade channel?

kaarthis commented 1 year ago

Reality is, running unmanaged UU is the key thing we want to reduce, as it has led to multiple Sev 1s due to Canonical updates without a maintenance window or control over them. From that perspective, running a weekly node image upgrade is good enough in terms of security cover. If you wish to keep UU on, you can select the NodeImage option in the nodeOSUpgrade channel and turn UU back on with a daemon set (not recommended but possible). As for your question: AFAIK node image would be available in the nodeOSUpgrade channel alone going forward. Potentially it would mean a choice of SecurityPatch + Auto upgrade channel (K8s) or NodeImage + Auto upgrade channel (K8s). Of course there are options to run UU with a daemonset as described above.

rouke-broersma commented 1 year ago

> Reality is, running unmanaged UU is the key thing we want to reduce, as it has led to multiple Sev 1s due to Canonical updates without a maintenance window or control over them. From that perspective, running a weekly node image upgrade is good enough in terms of security cover. If you wish to keep UU on, you can select the NodeImage option in the nodeOSUpgrade channel and turn UU back on with a daemon set (not recommended but possible). As for your question: AFAIK node image would be available in the nodeOSUpgrade channel alone going forward. Potentially it would mean a choice of SecurityPatch + Auto upgrade channel (K8s) or NodeImage + Auto upgrade channel (K8s). Of course there are options to run UU with a daemonset as described above.

Can you guarantee the node image updates will actually be weekly going forward? In our experience they are sometimes weeks late, which means we would miss critical security patches for weeks unless we use UU or managed patches.

We are fine with managed security patches in-image + node image updates when they are available. Would this combo not be possible as an option? We like that our nodes are regularly refreshed instead of only patched in-image.

zioproto commented 1 year ago

@kaarthis I am testing --node-os-upgrade-channel NodeImage for a customer.

my simple repro environment is the following:

az group create --name testcluster --location eastus
az aks create \
 --location eastus \
 --name testcluster \
 --enable-addons monitoring \
 -g testcluster \
 --network-plugin azure  \
 --kubernetes-version 1.25.4  \
 --node-vm-size Standard_DC4s_v2 \
 --node-count 2 \
 --auto-upgrade-channel rapid \
 --node-os-upgrade-channel  NodeImage

The rapid channel upgraded my cluster to 1.25.5, but I had no more quota for Standard_DC4s_v2, so AKS failed to create a surge node, and the node upgrade to 1.25.5 failed.

I can see it clearly in the ActivityLogs:

az monitor activity-log list --namespace Microsoft.ContainerService  --offset 48h -g testcluster -o json | jq 'map(select(.status.value == "Failed")) | .[] | {eventTimestamp, operationName, status, properties}'

Where I get:

{
  "eventTimestamp": "2023-03-07T00:14:12+00:00",
  "operationName": {
    "localizedValue": "Create or Update Agent Pool",
    "value": "Microsoft.ContainerService/managedClusters/agentpools/write"
  },
  "status": {
    "localizedValue": "Failed",
    "value": "Failed"
  },
  "properties": {
    "message": "Upgrade Failed, error: {\n  \"code\": \"ReconcileVMSSAgentPoolFailed\",\n  \"message\": \"Code=\\\"OperationNotAllowed\\\" Message=\\\"Operation could not be completed as it results in exceeding approved standardDCSv2Family Cores quota. Additional details - Deployment Model: Resource Manager, Location: eastus, Current Limit: 16, Current Usage: 16, Additional Required: 4, (Minimum) New Limit Required: 20. Submit a request for Quota increase at https://aka.ms/ProdportalCRP/#blade/Microsoft_Azure_Capacity/UsageAndQuota.ReactView/Parameters/[REDACTED] by specifying parameters listed in the ‘Details’ section for deployment to succeed. Please read more about quota limits at https://docs.microsoft.com/en-us/azure/azure-supportability/per-vm-quota-requests\\\"\"\n }"
  }
}

I have now increased the quota.

I did not specify any maintenance window.

Question 1: Will the node-os-upgrade-channel attempt the upgrade again? What is the retry logic?

Question 2: My understanding is that --node-os-upgrade-channel NodeImage will upgrade the image also within the same AKS version, and not only when the control plane upgrades a patch version from 1.25.4 to 1.25.5. I mean, there will be multiple image versions for AKS version 1.25.5, for example. But how can I track this?

I am checking the node image used by the VMSS in this way:

az vmss show --name aks-nodepool1-30501256-vmss -g MC_TESTCLUSTER_TESTCLUSTER_EASTUS -o json | jq .virtualMachineProfile.storageProfile.imageReference.id

And I get an output like:

"/subscriptions/109a5e88-712a-48ae-9078-9ca8b3c81345/resourceGroups/AKS-Ubuntu/providers/Microsoft.Compute/galleries/AKSUbuntu/images/2204gen2containerd/versions/2023.02.15"

The name of this image does not have any reference to the AKS version it belongs to. Is there a better way to track these image versions to understand what changed?

Thanks

rouke-broersma commented 1 year ago

@zioproto the node image name does not include the AKS version because a node image version can support multiple AKS versions. They're not 1-1 related.
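
Since the AKS version is not encoded in the image, one way to at least record which image a pool is on is to split the image reference into its definition and version parts. A minimal sketch, assuming the gallery image ID layout shown in the output above (`.../images/<definition>/versions/<version>`); `image_definition` and `image_version` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: split a gallery image ID into image definition and version.
# Assumption: the ID ends in /images/<definition>/versions/<version>.

# Hypothetical helper: print the text between "/images/" and "/versions/".
image_definition() {
  printf '%s\n' "$1" | sed -n 's|.*/images/\([^/]*\)/versions/.*|\1|p'
}

# Hypothetical helper: print everything after the last "/versions/".
image_version() {
  printf '%s\n' "$1" | sed -n 's|.*/versions/\([^/]*\)$|\1|p'
}
```

For the ID above this gives `2204gen2containerd` and `2023.02.15`. The value AKS itself records for a pool should also be visible with `az aks nodepool show --resource-group testcluster --cluster-name testcluster --name nodepool1 --query nodeImageVersion -o tsv`, which can be compared across runs to see when the channel rolled a new image.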

ghost commented 1 year ago

Thanks for reaching out. I'm closing this issue as it was marked with "Answer Provided" and it hasn't had activity for 2 days.

zioproto commented 1 year ago

@kaarthis I am afraid the msftbot closed the issue by mistake. Should this be open until the feature goes GA ?

kaarthis commented 1 year ago

Reopening this until GA.

kaarthis commented 1 year ago

NodeImage went GA. We will work on 'SecurityPatch' separately and track it as a separate issue.

AKS NodeOSUpgrade Channel for Nodeimage #2181