Azure / AKS

Azure Kubernetes Service
https://azure.github.io/AKS/

[Feature] Azure VM scheduled event handling #3719

Open dtotopus opened 1 year ago

dtotopus commented 1 year ago

Azure VM Instabilities and Scheduled Events Impacting Service Stability

Issue Description

We are experiencing instabilities with AKS caused by the underlying Azure virtual machines (VMs), specifically by scheduled events that freeze the VMs. These instabilities hurt our service stability and operational efficiency: some of our nodes freeze for several seconds a few times a week.

Source of Information

We have referred to the official Azure documentation on scheduled events, Azure VM Scheduled Events (https://learn.microsoft.com/en-us/azure/virtual-machines/linux/scheduled-events), for a better understanding of the issue.

Expected Behavior

Scheduled events are intended to provide advance notice of underlying maintenance tasks or events impacting Azure VMs. These events should be handled seamlessly by the Azure infrastructure, ensuring minimal or no service disruption for the customers. We expect the VMs to be able to handle scheduled events without freezing or negatively impacting service availability.

Current Behavior

Unfortunately, our experience with scheduled events on AKS has been far from satisfactory. The following issues have been observed:

VM Freezes

When a scheduled event such as a Freeze is triggered, the affected Azure VMs stop responding, which may cause a partial or complete service outage. By default AKS does not handle such events: it will detect the frozen nodes but will not evict pods from them, so some traffic keeps being served to the frozen endpoints. This can be addressed by tracking VM events through the Azure API, but that doesn't make things much better: freezes are usually short (6-30 seconds), so any operation such as draining the node takes longer than the freeze itself and causes more chaos. Draining would also affect workloads that cannot be stopped immediately, so we can't drain the node in advance in all cases. Basically, this maintenance "feature" renders regular instances as unstable as spot ones.
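
As an illustration of what tracking VM events through the Azure API can look like, here is a minimal sketch, not the tooling actually used in this report. It assumes the Instance Metadata Service Scheduled Events endpoint described in the Azure documentation linked above, and the field names visible in the log attached to this issue. The function names `fetch_scheduled_events` and `freeze_events_for` are hypothetical.

```python
import json
import urllib.request

# Instance Metadata Service endpoint for Scheduled Events, reachable only from inside an Azure VM.
IMDS_URL = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"


def fetch_scheduled_events(url=IMDS_URL, timeout=5.0):
    """Fetch the current Scheduled Events document (hypothetical helper)."""
    # The metadata service rejects requests that lack the "Metadata: true" header.
    request = urllib.request.Request(url, headers={"Metadata": "true"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def freeze_events_for(document, node_name):
    """Return the Freeze events in the document that list node_name under Resources."""
    return [
        event
        for event in document.get("Events", [])
        if event.get("EventType") == "Freeze"
        and node_name in event.get("Resources", [])
    ]
```

A watcher on each node could call `fetch_scheduled_events()` periodically and pass the result to `freeze_events_for` with its own VM name. As noted above, knowing about a freeze does not by itself leave enough time to drain the node.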

Inadequate Notification Time

The minimum notification period provided (15 minutes) before a scheduled event is insufficient to take the necessary preventive measures. This short notice severely limits our ability to prepare for and mitigate any impact on our services, as some of our services run tasks that need more than 30 minutes of stable uptime. In contrast, AWS sends an email days or weeks before a maintenance event (which is also very rare) and in some cases allows you to change the event date.

Lack of Flexible Control

We are unable to control or influence the timing of scheduled events. As a result, these events may occur during peak business hours or critical operational periods, exacerbating the impact on our services. It is possible to develop a scheduler that automatically handles failover for such events, but that approach feels wrong: in the case of AKS we would be writing a node scheduler on top of Kubernetes, which is already an orchestrator. And, as before, it wouldn't meet the stability requirements of long-running jobs or databases.

Impact

The current state of Azure VM freezes and the inability of AKS to track scheduled events affect our service stability, availability, and customer satisfaction. The ongoing VM freezes during scheduled events, along with the short notification time, pose challenges and hinder our operational efficiency. It is essential for us to have a reliable and predictable VM environment to ensure the continuous delivery of our services.

Expected Solution

Improved Handling of Scheduled Events

We expect either AKS to handle scheduled freezes on its own, by marking the affected nodes as visibly unavailable and temporarily stopping traffic to them, or Azure to develop a mechanism that ensures VMs can go through scheduled events without freezing or causing service interruptions. It is crucial that these events are processed seamlessly in the background, preventing any negative impact on our services.

Extended Notification Time

We request an increase in the notification time provided for scheduled events. A minimum of a few days' prior notice would significantly help us prepare for upcoming maintenance tasks or events and reduce service disruptions.

Additional Information

Below is the log of VM freeze events we experienced in one week:

2023-06-05T22:06:51+03:00   {"@timestamp":"2023-06-05T19:06:51.819Z","log.level":"info","message":"Received new VM event","Event":{"Description":"Host server is undergoing maintenance.","DurationInSeconds":30,"EventId":"1D78D2B0-B806-4123-9B2A-58556820F059","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["aks-spot-22319102-vmss_1"]},"log":{"logger":"app"}}
2023-06-05T19:30:43+03:00   {"@timestamp":"2023-06-05T16:30:43.620Z","log.level":"debug","message":"Heartbeat report","Event":{"DocumentIncarnation":2,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":30,"EventId":"B2BC520E-BDA2-44A0-BF75-0C320524BB47","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["aks-testspot-38041100-vmss_25"]}]},"log":{"logger":"app"}}
2023-06-05T19:26:59+03:00   {"@timestamp":"2023-06-05T16:26:59.748Z","log.level":"info","message":"Received new VM event","Event":{"Description":"Host server is undergoing maintenance.","DurationInSeconds":30,"EventId":"B2BC520E-BDA2-44A0-BF75-0C320524BB47","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["aks-testspot-38041100-vmss_25"]},"log":{"logger":"app"}}
2023-06-02T22:18:26+03:00   {"@timestamp":"2023-06-02T19:18:26.205Z","log.level":"debug","message":"Heartbeat report","Event":{"DocumentIncarnation":32,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"3474C1DD-0917-49F4-85F7-84B9E366719B","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["regular-node-30975042-vmss_6"]}]},"log":{"logger":"app"}}
2023-06-02T22:18:25+03:00   {"@timestamp":"2023-06-02T19:18:25.048Z","log.level":"debug","message":"Heartbeat report","Event":{"DocumentIncarnation":32,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"3474C1DD-0917-49F4-85F7-84B9E366719B","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["regular-node-30975042-vmss_6"]}]},"log":{"logger":"app"}}
2023-06-02T22:08:21+03:00   {"@timestamp":"2023-06-02T19:08:21.062Z","log.level":"debug","message":"Heartbeat report","Event":{"DocumentIncarnation":32,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"3474C1DD-0917-49F4-85F7-84B9E366719B","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["regular-node-30975042-vmss_6"]}]},"log":{"logger":"app"}}
2023-06-02T22:08:19+03:00   {"@timestamp":"2023-06-02T19:08:19.886Z","log.level":"debug","message":"Heartbeat report","Event":{"DocumentIncarnation":32,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"3474C1DD-0917-49F4-85F7-84B9E366719B","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["regular-node-30975042-vmss_6"]}]},"log":{"logger":"app"}}
2023-06-02T22:07:53+03:00   {"@timestamp":"2023-06-02T19:07:53.654Z","log.level":"info","message":"Received new VM event","Event":{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"3474C1DD-0917-49F4-85F7-84B9E366719B","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["regular-node-30975042-vmss_6"]},"log":{"logger":"app"}}
2023-06-02T22:07:38+03:00   {"@timestamp":"2023-06-02T19:07:38.716Z","log.level":"info","message":"Received new VM event","Event":{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"3474C1DD-0917-49F4-85F7-84B9E366719B","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["regular-node-30975042-vmss_6"]},"log":{"logger":"app"}}
2023-06-02T16:55:40+03:00   {"@timestamp":"2023-06-02T13:55:40.194Z","log.level":"debug","message":"Heartbeat report","Event":{"DocumentIncarnation":29,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"737A957B-E803-4FFC-99AA-E2D3FC53EBC2","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["regular-node-30975042-vmss_1"]}]},"log":{"logger":"app"}}
2023-06-02T16:55:40+03:00   {"@timestamp":"2023-06-02T13:55:40.015Z","log.level":"debug","message":"Heartbeat report","Event":{"DocumentIncarnation":29,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"737A957B-E803-4FFC-99AA-E2D3FC53EBC2","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["regular-node-30975042-vmss_1"]}]},"log":{"logger":"app"}}
2023-06-02T16:48:41+03:00   {"@timestamp":"2023-06-02T13:48:41.493Z","log.level":"info","message":"Received new VM event","Event":{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"737A957B-E803-4FFC-99AA-E2D3FC53EBC2","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["regular-node-30975042-vmss_1"]},"log":{"logger":"app"}}
2023-06-02T16:48:20+03:00   {"@timestamp":"2023-06-02T13:48:20.239Z","log.level":"info","message":"Received new VM event","Event":{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"737A957B-E803-4FFC-99AA-E2D3FC53EBC2","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["regular-node-30975042-vmss_1"]},"log":{"logger":"app"}}
2023-06-02T12:58:42+03:00   {"@timestamp":"2023-06-02T09:58:42.366Z","log.level":"debug","message":"Heartbeat report","Event":{"DocumentIncarnation":16,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":30,"EventId":"465D3B0F-D7F2-4239-AC11-1B9800E73DBC","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["spot-node-34525998-vmss_6"]}]},"log":{"logger":"app"}}
2023-06-02T12:54:40+03:00   {"@timestamp":"2023-06-02T09:54:40.359Z","log.level":"debug","message":"Heartbeat report","Event":{"DocumentIncarnation":16,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":30,"EventId":"465D3B0F-D7F2-4239-AC11-1B9800E73DBC","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["spot-node-34525998-vmss_6"]}]},"log":{"logger":"app"}}
2023-06-02T12:50:38+03:00   {"@timestamp":"2023-06-02T09:50:38.339Z","log.level":"debug","message":"Heartbeat report","Event":{"DocumentIncarnation":16,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":30,"EventId":"465D3B0F-D7F2-4239-AC11-1B9800E73DBC","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["spot-node-34525998-vmss_6"]}]},"log":{"logger":"app"}}
2023-06-02T12:46:36+03:00   {"@timestamp":"2023-06-02T09:46:36.251Z","log.level":"debug","message":"Heartbeat report","Event":{"DocumentIncarnation":16,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":30,"EventId":"465D3B0F-D7F2-4239-AC11-1B9800E73DBC","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["spot-node-34525998-vmss_6"]}]},"log":{"logger":"app"}}
2023-06-02T12:46:09+03:00   {"@timestamp":"2023-06-02T09:46:09.010Z","log.level":"info","message":"Received new VM event","Event":{"Description":"Host server is undergoing maintenance.","DurationInSeconds":30,"EventId":"465D3B0F-D7F2-4239-AC11-1B9800E73DBC","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["spot-node-34525998-vmss_6"]},"log":{"logger":"app"}}
2023-06-02T12:18:21+03:00   {"@timestamp":"2023-06-02T09:18:21.014Z","log.level":"debug","message":"Heartbeat report","Event":{"DocumentIncarnation":13,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":30,"EventId":"B92C67A3-C003-4C0B-A3F0-FE9E13F62490","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["aks-spot-22319102-vmss_8"]}]},"log":{"logger":"app"}}
2023-06-02T12:14:19+03:00   {"@timestamp":"2023-06-02T09:14:19.102Z","log.level":"debug","message":"Heartbeat report","Event":{"DocumentIncarnation":13,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":30,"EventId":"B92C67A3-C003-4C0B-A3F0-FE9E13F62490","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["aks-spot-22319102-vmss_8"]}]},"log":{"logger":"app"}}
2023-06-02T12:10:17+03:00   {"@timestamp":"2023-06-02T09:10:17.195Z","log.level":"debug","message":"Heartbeat report","Event":{"DocumentIncarnation":13,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":30,"EventId":"B92C67A3-C003-4C0B-A3F0-FE9E13F62490","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["aks-spot-22319102-vmss_8"]}]},"log":{"logger":"app"}}
2023-06-02T12:06:15+03:00   {"@timestamp":"2023-06-02T09:06:15.278Z","log.level":"debug","message":"Heartbeat report","Event":{"DocumentIncarnation":13,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":30,"EventId":"B92C67A3-C003-4C0B-A3F0-FE9E13F62490","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["aks-spot-22319102-vmss_8"]}]},"log":{"logger":"app"}}
2023-06-02T12:06:08+03:00   {"@timestamp":"2023-06-02T09:06:08.222Z","log.level":"info","message":"Received new VM event","Event":{"Description":"Host server is undergoing maintenance.","DurationInSeconds":30,"EventId":"B92C67A3-C003-4C0B-A3F0-FE9E13F62490","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["aks-spot-22319102-vmss_8"]},"log":{"logger":"app"}}
2023-05-31T19:39:33+03:00   {"@timestamp":"2023-05-31T16:39:33.501Z","log.level":"info","message":"Heartbeat report","Event":{"DocumentIncarnation":2,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"32504B35-D66B-4D0A-8C64-C9DDBBD0EA13","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["spot-node-34525998-vmss_24"]}]}}
2023-05-31T19:37:32+03:00   {"@timestamp":"2023-05-31T16:37:32.466Z","log.level":"info","message":"Heartbeat report","Event":{"DocumentIncarnation":2,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"32504B35-D66B-4D0A-8C64-C9DDBBD0EA13","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["spot-node-34525998-vmss_24"]}]}}
2023-05-31T19:35:31+03:00   {"@timestamp":"2023-05-31T16:35:31.461Z","log.level":"info","message":"Heartbeat report","Event":{"DocumentIncarnation":2,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"32504B35-D66B-4D0A-8C64-C9DDBBD0EA13","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["spot-node-34525998-vmss_24"]}]}}
2023-05-31T19:33:30+03:00   {"@timestamp":"2023-05-31T16:33:30.467Z","log.level":"info","message":"Heartbeat report","Event":{"DocumentIncarnation":2,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"32504B35-D66B-4D0A-8C64-C9DDBBD0EA13","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["spot-node-34525998-vmss_24"]}]}}
2023-05-31T19:31:29+03:00   {"@timestamp":"2023-05-31T16:31:29.439Z","log.level":"info","message":"Heartbeat report","Event":{"DocumentIncarnation":2,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"32504B35-D66B-4D0A-8C64-C9DDBBD0EA13","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["spot-node-34525998-vmss_24"]}]}}
2023-05-31T19:29:28+03:00   {"@timestamp":"2023-05-31T16:29:28.407Z","log.level":"info","message":"Heartbeat report","Event":{"DocumentIncarnation":2,"Events":[{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"32504B35-D66B-4D0A-8C64-C9DDBBD0EA13","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["spot-node-34525998-vmss_24"]}]}}
2023-05-31T19:27:49+03:00   {"@timestamp":"2023-05-31T16:27:49.569Z","log.level":"warning","message":"Received new VM event","Event":{"Description":"Host server is undergoing maintenance.","DurationInSeconds":6,"EventId":"32504B35-D66B-4D0A-8C64-C9DDBBD0EA13","EventSource":"Platform","EventStatus":"Started","EventType":"Freeze","NotBefore":"","ResourceType":"VirtualMachine","Resources":["spot-node-34525998-vmss_24"]}}
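
Because the same event is repeated in many "Heartbeat report" lines, the log is easier to read once collapsed to one record per `EventId`. The following is a minimal sketch of that reduction, assuming each line has the shape shown above (a local timestamp, whitespace, then a JSON payload). The function name `summarize_freezes` and the output keys are hypothetical.

```python
import json


def summarize_freezes(log_lines):
    """Collapse repeated log lines into one record per Freeze EventId."""
    summary = {}
    for line in log_lines:
        # Everything from the first "{" onward is the JSON payload.
        _, brace, payload = line.partition("{")
        if not brace:
            continue  # no JSON payload on this line
        record = json.loads("{" + payload)
        event = record.get("Event", {})
        # "Heartbeat report" lines wrap events in an "Events" list;
        # "Received new VM event" lines carry a single event directly.
        for item in event.get("Events", [event]):
            if item.get("EventType") != "Freeze":
                continue
            summary[item["EventId"]] = {
                "resources": item["Resources"],
                "duration_s": item["DurationInSeconds"],
            }
    return summary
```

Applied to the 30 lines above, this should reduce the log to seven distinct events, with reported durations of either 6 or 30 seconds.
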
seguler commented 1 year ago

Thanks for sharing this.

What is your application SLO? How do the freeze events impact your application SLO? Could you elaborate on what impact you're observing?

We have a feature called node autodrain that can evict applications for certain scheduled events but we do not currently evict for freeze events. Freeze events for the most part don't have any impact on the applications. In some cases, you may see a few seconds of impact (VM freeze, IO latency, some percent TCP drop).

In my opinion, such failures are very normal and expected per the publicly documented VM SLAs. We all should build our applications to be resilient to them (e.g. use TCP timeouts, automatic retries, etc.).
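
To make the "timeouts and automatic retries" suggestion concrete, here is a minimal sketch of a retry wrapper with exponential backoff. It is an illustration only, not code from AKS or from either party in this thread. The name `call_with_retries` and its parameters are hypothetical, and `operation` stands for any call that already sets its own network timeout.

```python
import time


def call_with_retries(operation, attempts=7, base_delay=0.5, sleep=time.sleep):
    """Run operation(); on OSError, retry with exponential backoff.

    OSError covers TimeoutError and ConnectionError, the errors a client
    typically sees while the remote node is frozen.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except OSError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Waits grow as 0.5 s, 1 s, 2 s, ... between attempts.
            sleep(base_delay * (2 ** attempt))
```

With the defaults, the six waits between seven attempts add up to 31.5 seconds, slightly longer than the 30-second freezes in the log above. Whether that budget is acceptable depends on the caller's own latency requirements.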

We do plan to provide an option for customers to choose to automatically evict pods during a freeze event. However, in our internal tests, we actually observed more impact to our application SLO when we enabled eviction during freeze events. Freeze events are frequent, and they hit every node eventually. Evicting and running away from them did not reduce impact in our case. The impact was many times greater when we tried to run away from them. Fwiw, we rarely saw an impact from freeze events due to all the retries and resiliency mechanisms in our applications.

I do like your idea that perhaps we should update readiness probes during the freeze so that pods don't receive traffic. However, if the node was in bad shape, the load balancer would probe it down after 10 seconds anyway. Is the impact you're not happy about shorter than 10 seconds?
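
The readiness-probe idea could be sketched as follows. This is a hypothetical illustration, not an existing AKS feature: a per-node helper reads the Scheduled Events document and a pod's HTTP readiness endpoint returns 503 while a Freeze event lists the node. The function names are assumptions. The `EventStatus` values are the ones in the Azure Scheduled Events documentation, and Kubernetes HTTP probes treat status codes from 200 to 399 as success.

```python
def node_ready(document, node_name):
    """Return False while a pending or ongoing Freeze event lists this node."""
    for event in document.get("Events", []):
        if (
            event.get("EventType") == "Freeze"
            and event.get("EventStatus") in {"Scheduled", "Started"}
            and node_name in event.get("Resources", [])
        ):
            return False
    return True


def readiness_status(document, node_name):
    """Map the readiness decision to an HTTP status code for a readiness probe."""
    return 200 if node_ready(document, node_name) else 503
```

One caveat is visible in the log above: event 32504B35-D66B-4D0A-8C64-C9DDBBD0EA13 is still reported as Started in heartbeats from 19:27:49 to 19:39:33, roughly 12 minutes, although its DurationInSeconds is 6. A gate like this may therefore keep pods out of rotation far longer than the freeze itself.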

seguler commented 1 year ago

Also, what would you do if you knew that freeze events were scheduled in your node pool for the next 3 days and that 1/3 of the nodes would be impacted each day, one zone per day? Would you schedule your apps to work around the impact?

dtotopus commented 1 year ago

Well, we're not sure, to be honest, as we're still confused that this happens so often in general and needs additional handling on AKS. And as this applies to pretty much all Azure VMs, we're now wondering not only about AKS but about being able to run any stateful workloads, like databases, on Azure. We will gather more info, but so far it looks like Azure has made some interesting design decisions which I've never seen on e.g. AWS. But maybe I'm mistaken and it's normal practice nowadays.

Snippet from the Amazon docs: "AWS can schedule events for your instances, such as a reboot, stop/start, or retirement. These events do not occur frequently. If one of your instances will be affected by a scheduled event, AWS sends an email to the email address that's associated with your AWS account prior to the scheduled event. The email provides details about the event, including the start and end date. Depending on the event, you might be able to take action to control the timing of the event" https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-instances-status-check_sched.html

ashwajce commented 2 months ago

We are suffering from this issue. The strange thing is that we do not see in the event logs when the freeze was over, which makes troubleshooting even more difficult.

ashwajce commented 2 months ago

Duplicate. Refer to: https://github.com/Azure/AKS/issues/3463