datamade / how-to

📚 Doing all sorts of things, the DataMade way

Risk and Reach SSL cert failed to renew after server silently ran out of space #156

Closed · hancush closed this issue 3 years ago

hancush commented 3 years ago

Description

I received a notification that the R&R SSL cert was set to expire. I shelled into the server and attempted to confirm that we'd installed the auto-renew crontab, but found that there was no disk space left when I tried to tab complete. I ran df to confirm disk use was the problem.

ubuntu@ip-10-0-0-22:~$ ls /etc
-bash: cannot create temp file for here-document: No space left on device
-bash: cannot create temp file for here-document: No space left on device
^C
ubuntu@ip-10-0-0-22:~$ df -h
Filesystem      Size  Used Avail Use% Mounted on
udev            985M     0  985M   0% /dev
tmpfs           200M   21M  179M  11% /run
/dev/xvda1      7.7G  7.7G     0 100% /
tmpfs           996M  8.0K  996M   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs           996M     0  996M   0% /sys/fs/cgroup
tmpfs           200M     0  200M   0% /run/user/0
/dev/loop4       29M   29M     0 100% /snap/amazon-ssm-agent/2012
/dev/loop1       98M   98M     0 100% /snap/core/10126
/dev/loop0       98M   98M     0 100% /snap/core/10185
/dev/loop2       56M   56M     0 100% /snap/core18/1932
/dev/loop5       29M   29M     0 100% /snap/amazon-ssm-agent/2333
tmpfs           200M     0  200M   0% /run/user/1000

Second challenge: without disk space, sort couldn't write its temporary files, so I couldn't sort du output to identify the largest files. I Googled an alternative and settled on find with a minimum file size. At first I tried 10M, but that yielded a lot of files, so I upped it to 20M.

ubuntu@ip-10-0-0-22:~$ sudo du -a / 2>/dev/null | sort -n -r | head -n 20
sort: write failed: /tmp/sortjchOnF: No space left on device
ubuntu@ip-10-0-0-22:~$ sudo find / -size +20M -exec ls -lh {} +
# list of large files with human readable names
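Once a little space is freed, the two approaches can be combined to get a sorted, human-readable listing. A minimal sketch (TARGET is a placeholder for illustration, not from the issue; on the server it would be / and run with sudo):

```shell
# List files over 20M with human-readable sizes, smallest to largest.
# -xdev stays on one filesystem; 2>/dev/null hides permission errors.
TARGET=${TARGET:-.}
find "$TARGET" -xdev -type f -size +20M -exec du -h {} + 2>/dev/null | sort -h
```

sort -h understands human-readable suffixes (K, M, G), so the du -h output sorts correctly without converting back to raw byte counts.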

I noticed several large system journal files, so I Googled again and found that journalctl has easy cleanup commands. I ran one to free up more than half a gig.

ubuntu@ip-10-0-0-22:~$ sudo journalctl --vacuum-size=100M
# list of journal files removed
Vacuuming done, freed 690.7M of archived journals from /var/log/journal/6040c336eb29459c8a881c3850d1864e.

Many of the remaining large directories appear to be old versions of the Linux AWS kernel headers:

ubuntu@ip-10-0-0-22:~$ sudo du -ahx / 2>/dev/null | sort -n -r | head -n 20
1020K   /var/lib/apt/lists/security.ubuntu.com_ubuntu_dists_bionic-security_restricted_binary-amd64_Packages
1020K   /usr/src/linux-aws-headers-4.15.0-1039/tools/testing/selftests
1020K   /usr/src/linux-aws-headers-4.15.0-1035/tools/testing/selftests
1020K   /usr/src/linux-aws-headers-4.15.0-1032/tools/testing/selftests
1020K   /usr/src/linux-aws-5.3-headers-5.3.0-1035/fs
1020K   /usr/src/linux-aws-5.3-headers-5.3.0-1034/fs
1020K   /usr/src/linux-aws-5.3-headers-5.3.0-1033/fs
1020K   /usr/src/linux-aws-5.3-headers-5.3.0-1032/fs
1020K   /usr/src/linux-aws-5.3-headers-5.3.0-1030/fs
1020K   /usr/src/linux-aws-5.3-headers-5.3.0-1028/fs
1020K   /usr/src/linux-aws-5.3-headers-5.3.0-1023/fs
1020K   /usr/src/linux-aws-5.3-headers-5.3.0-1019/fs
1020K   /usr/src/linux-aws-5.3-headers-5.3.0-1017/fs
1020K   /usr/share/locale/uk
1020K   /usr/lib/ruby/vendor_ruby
1020K   /usr/include/x86_64-linux-gnu/bits
1020K   /lib/modules/4.15.0-1032-aws/kernel/drivers/net/ethernet/mellanox/mlxsw
1020K   /lib/modules/4.15.0-1032-aws/kernel/drivers/net/ethernet/broadcom/bnx2x/bnx2x.ko
1016K   /var/lib/postgresql/10/main/base/16385/18202
1016K   /usr/src/linux-aws-headers-4.15.0-1065/arch/arm64/include

I think we should be able to clean up old versions and reclaim further space with apt-get autoremove, but I want to double check that with @fgregg to ensure it doesn't cause any unintended side effects.
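One low-risk way to check that (a sketch, assuming Debian/Ubuntu apt; nothing here is removed) is to simulate the autoremove first and review the list:

```shell
# -s (--simulate) prints what autoremove WOULD do without changing anything,
# so the old header packages can be reviewed before a real run.
apt-get -s autoremove
```

Simulation mode doesn't require root, which makes it a safe first step on a production box.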

Some next steps:

fgregg commented 3 years ago

auto-remove should be good!

hancush commented 3 years ago
ubuntu@ip-10-0-0-22:~$ sudo apt-get autoremove
Reading package lists... Done
Building dependency tree
Reading state information... Done
The following packages have unmet dependencies:
 linux-headers-5.4.0-1029-aws : Depends: linux-aws-5.4-headers-5.4.0-1029 but it is not installed
You might want to run 'apt --fix-broken install' to correct these.
E: Unmet dependencies. Try 'apt --fix-broken install' with no packages (or specify a solution).

Incredible that there's a version of the headers we haven't installed, har har. I'm hesitant to screw too much with things on a production server without a second set of eyeballs. Can we pair on this at some point?

fgregg commented 3 years ago

yes. let's see if we need all of r&d

hancush commented 3 years ago

Annual server maintenance checklist. Update app / server inventory, especially for legacy setup.

hancush commented 3 years ago

Ran apt --fix-broken install. Kept the local version of each package whenever apt prompted that a file may have changed. I was then able to successfully run apt-get autoremove and reclaim more than 2 GB of space. We probably don't want to automate this, but it would be good to clean up apt packages on an annual basis, hence the annual maintenance checklist.

We probably also want to cap the system journal at a max size.
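One way to make that cap persistent (a sketch; the 100M figure just mirrors the vacuum command above, not a decided policy) is in journald's config:

```ini
# /etc/systemd/journald.conf — cap total archived journal size
[Journal]
SystemMaxUse=100M
```

followed by `sudo systemctl restart systemd-journald` for the setting to take effect.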

Finally, we should turn on disk use alarms so we receive alerts if servers run out of space.
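A minimal cron-based sketch of such an alert (hypothetical, not something the issue specifies — CloudWatch alarms would be the managed alternative on AWS; THRESHOLD and the delivery mechanism are placeholders):

```shell
#!/bin/sh
# Warn when the root filesystem crosses a usage threshold.
THRESHOLD=90  # percent; placeholder value
usage=$(df --output=pcent / | tail -n 1 | tr -dc '0-9')
if [ "$usage" -ge "$THRESHOLD" ]; then
  echo "Disk usage on $(hostname) is at ${usage}%"  # pipe to mail/Slack here
fi
```

Dropped into /etc/cron.daily/ (or run via crontab), this would have surfaced the problem long before the cert renewal failed.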

hancush commented 3 years ago

Revised work list:

hancush commented 3 years ago

Created a revised inventory of applications deployed on legacy AWS infrastructure: https://docs.google.com/spreadsheets/d/1_c1_v4IJ5wLpjUt0p0Feq3LXskItu5Ml9J6gZucSqpw/edit?usp=sharing (excepting most of the sites on Staging). Very gratifying to see how many static sites we've migrated to Netlify, and definitely see some opportunities to migrate even more (for example, SSCE and SFM are both slated to migrate to Heroku this year).

hancush commented 3 years ago

I'll create an annual server maintenance issue and schedule the next round of work (2021).