amazonlinux / amazon-linux-2023

Amazon Linux 2023
https://aws.amazon.com/linux/amazon-linux-2023/

[Bug] - EC2 server became very unstable on Feb 2 18:00 UTC #626

Closed tayloraswift closed 3 months ago

tayloraswift commented 5 months ago

Describe the bug

i have a web server running on a t2.micro instance that had been running the latest version of Amazon Linux 2023 prior to 2023.3.20240131. for some reason at exactly 18:00 UTC today, CPU usage jumped to 10 percent and the web server became completely unresponsive for nearly two hours before recovering on its own. this was not accompanied by any anomalies in incoming network activity.


an older version of this server experienced frequent problems similar to this one while running an outdated version of Amazon Linux 2023. problems tended to coincide with notifications of pending OS updates, which were never installed. the problems disappeared upon migrating the application to a new instance running an updated version of Amazon Linux 2023. therefore my best hypothesis is a scheduled software update check is causing the node to run out of memory, which freezes all processes in place until the clog dissipates.

is this a plausible hypothesis? what can be done to prevent this from occurring in the future?
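One way to test the out-of-memory hypothesis is to sample the kernel's own memory accounting around the time of the stall. The following is a minimal sketch, not something from this thread: it parses `/proc/meminfo` and reports how much RAM the kernel estimates is still available. The function names are hypothetical, and what counts as "too low" is left to the reader.

```python
from pathlib import Path


def parse_meminfo(text: str) -> dict[str, int]:
    """Parse /proc/meminfo-style text into {field: number}.

    Most fields are reported in kB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def available_fraction(fields: dict[str, int]) -> float:
    """Fraction of total RAM the kernel estimates is available without swapping."""
    return fields["MemAvailable"] / fields["MemTotal"]


if __name__ == "__main__":
    meminfo = Path("/proc/meminfo")
    if meminfo.exists():
        info = parse_meminfo(meminfo.read_text())
        print(f"MemAvailable: {available_fraction(info):.1%} of MemTotal")
        print(f"SwapTotal: {info.get('SwapTotal', 0)} kB")
```

Running this from a cron job or a loop and logging the output would show whether available memory collapses at the same time the server becomes unresponsive.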

elsaco commented 5 months ago

Today is Groundhog Day and at exactly 18:00 UTC Punxsutawney Phil sneezed causing a ripple effect in the Matrix!

tayloraswift commented 5 months ago

haha, that’s funny! but this cost us a lot of traffic and rankings, and could cause our store to become delisted from search results. we have little understanding of why this happened, and no idea if it will happen again when the next update comes up.

is there a way to disable the automatic software update checks?

nmeyerhans commented 5 months ago

We use systemd timers to manage scheduled tasks. You can see the active timers with systemctl list-timers. If you think one of these is contributing to resource contention on your system, you can disable it using systemctl. For example, sudo systemctl disable --now update-motd.timer would disable the periodic check for a new distro release.
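To confirm when a given timer last fired and when it will fire next, the properties can be read with `systemctl show`. This is a rough sketch assuming a systemd host: `parse_show_output` and `timer_state` are hypothetical helper names, while the property names are ones systemd exposes for timer units.

```python
import subprocess


def parse_show_output(text: str) -> dict[str, str]:
    """Parse the KEY=VALUE lines printed by `systemctl show`."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key] = value
    return props


def timer_state(unit: str) -> dict[str, str]:
    """Ask systemd for a few properties of a timer unit (requires systemctl)."""
    result = subprocess.run(
        [
            "systemctl", "show", unit,
            "--property=ActiveState,UnitFileState,LastTriggerUSec,NextElapseUSecRealtime",
        ],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_show_output(result.stdout)
```

Calling `timer_state("update-motd.timer")` and comparing `LastTriggerUSec` with the time of the incident is one way to check whether the timer actually lines up with the stall before disabling it.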

tayloraswift commented 5 months ago

thank you!

stewartsmith commented 5 months ago

Be sure you're running off a recent AL2023 AMI if running on small instance types. We added zram-based swap for instance types with less than 800MB RAM. It's certainly easy to get into really tight memory situations without it. IIRC t2.micro fits into this category.

You can enable zram swap for other instance types too, and likely end up ahead with performance and latency.

If not aware, zram swap is where a RAM disk that does transparent compression is used as a swap device.
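To see how much the transparent compression is buying, the kernel exposes per-device statistics in `/sys/block/zramN/mm_stat`. The sketch below is illustrative only: it reads the first three fields (original data size, compressed data size, total memory used, all in bytes) and computes a compression ratio. The helper names and the choice of `zram0` are assumptions.

```python
from pathlib import Path
from typing import Optional


def parse_mm_stat(text: str) -> dict[str, int]:
    """Name the first three whitespace-separated fields of a zram mm_stat line."""
    values = [int(v) for v in text.split()]
    names = ("orig_data_size", "compr_data_size", "mem_used_total")
    return dict(zip(names, values))


def compression_ratio(stat: dict[str, int]) -> Optional[float]:
    """Uncompressed bytes stored per compressed byte, or None if nothing is stored."""
    if stat["compr_data_size"] == 0:
        return None
    return stat["orig_data_size"] / stat["compr_data_size"]


if __name__ == "__main__":
    path = Path("/sys/block/zram0/mm_stat")  # assumes the first zram device
    if path.exists():
        stat = parse_mm_stat(path.read_text())
        print(stat, compression_ratio(stat))
```

A ratio well above 1 means swapped-out pages occupy less RAM than they would uncompressed, which is where the extra headroom on small instances comes from.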

stewartsmith commented 5 months ago

> We use systemd timers to manage scheduled tasks. You can see the active timers with systemctl list-timers. If you think one of these is contributing to resource contention on your system, you can disable it using systemctl. For example, sudo systemctl disable --now update-motd.timer would disable the periodic check for a new distro release.

Thinking about this at the start of a new week, my guess is we should look at the priority settings for this timer, as we could make systemd set up the cgroups more appropriately to have less of an impact (although it would possibly take longer to run, which is fine, as this is not a time-critical thing).

That could help alleviate this condition here.

I'm thinking about something like:

# slightly less than default 100, as other things are likely to be more critical for application setup,
# but we don't want to starve getting out of the way
StartupCPUWeight=90
# Run with less priority than other tasks by default
CPUWeight=50
# limit to a quarter of a core (systemd expects a percentage here)
CPUQuota=25%
# On startup, limit IO, but not by heaps
StartupIOWeight=90
# Run with less priority for IO than other tasks
IOWeight=50

Thoughts?
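For anyone who wants to try limits like these locally before a package update ships, they could be applied as a systemd drop-in. The sketch below only renders the drop-in text and works out what the CPU quota means in practice. It assumes the timer activates a unit named update-motd.service and that the drop-in would live at /etc/systemd/system/update-motd.service.d/override.conf, neither of which is confirmed in this thread. CPUQuota is written with the % suffix that systemd expects.

```python
# Proposed resource-control settings, taken from the comment above.
PROPOSED = {
    "StartupCPUWeight": "90",
    "CPUWeight": "50",
    "CPUQuota": "25%",
    "StartupIOWeight": "90",
    "IOWeight": "50",
}


def render_dropin(settings: dict[str, str]) -> str:
    """Render settings as the text of a [Service] drop-in file."""
    lines = ["[Service]"]
    lines += [f"{key}={value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


def quota_ms_per_period(cpu_quota: str, period_ms: int = 100) -> float:
    """CPU time (ms) allowed per scheduling period for an 'NN%' quota.

    100 ms is systemd's default CPUQuotaPeriodSec.
    """
    return float(cpu_quota.rstrip("%")) / 100 * period_ms


if __name__ == "__main__":
    print(render_dropin(PROPOSED), end="")
    print(f"# quota: {quota_ms_per_period(PROPOSED['CPUQuota'])} ms of CPU per 100 ms")
```

After writing the rendered text to the drop-in path, `sudo systemctl daemon-reload` would be needed for systemd to pick it up.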

stewartsmith commented 4 months ago

https://github.com/amazonlinux/update-motd/commit/bcdcc798f996396cf4c5c296ae573f3530f2fca0 implements these limits

stewartsmith commented 4 months ago

I'm going to assume that the above will help with this issue. If it doesn't, please update us (after the package update has shipped).

tayloraswift commented 4 months ago

i’ll have to migrate the application to a new instance (still haven’t figured out how to upgrade the OS on the same node without taking the site offline), but i’ll let you know if we’re still running into this with update-motd.timer enabled. thanks for your help!

stewartsmith commented 4 months ago

Check that the zram swap is enabled. Without it, it's quite likely you're going to run out of memory. If you're on an instance type with more memory where it's not enabled by default, consider manually enabling it. It's quite helpful for memory-constrained environments.
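A quick way to do that check is to look for a zram device in `/proc/swaps`. The following is a small sketch with a hypothetical helper name. It assumes the usual `/proc/swaps` layout of a header line followed by one line per swap area (filename, type, size, used, priority).

```python
from pathlib import Path


def zram_swaps(text: str) -> list[tuple[str, int, int]]:
    """Return (device, size_kb, used_kb) for each zram-backed entry in /proc/swaps text."""
    found = []
    for line in text.splitlines()[1:]:  # first line is the column header
        cols = line.split()
        if len(cols) >= 4 and cols[0].startswith("/dev/zram"):
            found.append((cols[0], int(cols[2]), int(cols[3])))
    return found


if __name__ == "__main__":
    swaps = Path("/proc/swaps")
    if swaps.exists():
        devices = zram_swaps(swaps.read_text())
        print(devices if devices else "no zram swap device is active")
```

An empty result means no zram swap is active on the instance.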

stewartsmith commented 3 months ago

There should be some updates in AL2023.4 (https://docs.aws.amazon.com/linux/al2023/release-notes/relnotes-2023.4.20240319.html) that lower the priority of the scheduled tasks, which should help in this situation, so I'm resolving this issue. Please reopen if you still observe issues.