cloyne / network


RAID Recovery on Server2 #127

Closed Robit closed 1 year ago

Robit commented 1 year ago

TLDR: After moving the servers back to Cloyne, the server hosting cloyne.org, the mailserver, and other critical things suffered a disk failure. After a nontrivial reconstruction of the backend storage, data is now once again replicated between two disks, preventing data loss in the case of a single disk failure.

When the servers were moved back to Cloyne in early September, server2 failed to boot. The failure was traced to a failed disk in the RAID array mounted at /srv.
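
For reference, isolating a failure like this usually comes down to the array status and the kernel log; a sketch of the sort of commands involved (device and array names here are illustrative, not necessarily the ones used):

    # Check the state of all md arrays; a failed member shows up as [U_] or (F)
    cat /proc/mdstat

    # Inspect the affected array in detail
    sudo mdadm --detail /dev/md1

    # Look for I/O errors from the suspect disk in the kernel log
    sudo dmesg | grep -i sda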

As described in https://github.com/cloyne/network/wiki/Cloyne's-Setup:-Domains-and-Servers, server2 has two raid arrays for data storage.

/srv: /dev/md1
/srv/mnt: /dev/md0 (used for daily local backup of files and databases, using the tozd/rdiff-backup Docker image)

md0 : active raid1 sdd1[1] sde1[0]
      488253248 (0.5tb) blocks super 1.2 [2/2] [UU]
md1 : active raid1 sdb1[1] sda1[2]
      1465006080 (1.5tb) blocks super 1.2 [2/2] [UU]
      bitmap: 9/11 pages [36KB], 65536KB chunk

In this case, the md1 array had suffered a disk failure but could still be mounted in a degraded state using only one of the mirrored disks. For some time, md1 ran degraded:

md1 : active raid1 sda1[1]
      1465006080 blocks super 1.2 [2/1] [U_]
      bitmap: 11/11 pages [44KB], 65536KB chunk

However, this is unsustainable for long-term use, as it leaves the server vulnerable to data loss from a single disk failure, especially given the under-tested nature of the md0 backup array. In addition, no spare 1.5 TB disks were available to add to md1. While the filesystem on the array was only 180 GB, shrinking the RAID to accept a smaller drive would have been nontrivial. Instead, I used two spare 512 GB disks to construct a new RAID 1 array and copied the data over. The documentation at https://www.digitalocean.com/community/tutorials/how-to-create-raid-arrays-with-mdadm-on-ubuntu-16-04 was very useful for this task.
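
A minimal sketch of that approach, following the tutorial above (device names such as /dev/sdf1 and /dev/sdg1, the array name /dev/md2, and the mount point /mnt/newsrv are placeholders rather than the actual ones used):

    # Create a new RAID 1 (mirror) array from the two spare 512 GB disks
    sudo mdadm --create /dev/md2 --level=1 --raid-devices=2 /dev/sdf1 /dev/sdg1

    # Put a filesystem on it and mount it somewhere temporary
    sudo mkfs.ext4 /dev/md2
    sudo mkdir -p /mnt/newsrv
    sudo mount /dev/md2 /mnt/newsrv

    # Copy the contents of the degraded array, preserving permissions and attributes
    sudo rsync -aHAX /srv/ /mnt/newsrv/

    # Record the new array in mdadm.conf so it assembles on boot
    sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
    sudo update-initramfs -u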

To ensure data preservation, I also created a tarball of the current /srv contents on the md0 array as a backup in case of any partitioning mishaps or other issues. This proved useful when I accidentally recreated both arrays as RAID 0 (striping data, with no redundancy) during the process. As a result, both arrays had to be reconstructed sequentially, using the other array as temporary storage for the contents of the array being rebuilt.
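
A minimal sketch of that safety tarball (the archive path on the md0 backup mount is a placeholder):

    # Archive the whole /srv tree onto the backup array, preserving permissions
    sudo tar -czpf /srv/mnt/srv-backup.tar.gz -C / srv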

Another issue worth noting: during the process, the rdiff-backup container responsible for backing up the filesystem ran and backed up the tarball containing the server contents, inflating the archive size and filling the backup disk. This was fixed by using rdiff-backup-delete (https://github.com/rdiff-backup/rdiff-backup/blob/master/docs/rdiff-backup-delete.1.adoc) to delete the file from the backup and by disabling the rdiff-backup container for the rest of the operation.
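
A rough sketch of that cleanup, where the repository path and the container name are placeholders rather than the actual ones used:

    # Remove the accidentally backed-up tarball (and its history) from the rdiff-backup repository
    rdiff-backup-delete /srv/mnt/backup/srv-backup.tar.gz

    # Stop the backup container until the recovery is finished
    docker stop rdiff-backup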

The new RAID array statistics look similar to the old ones, except that md1 is now 0.5 TB instead of 1.5 TB. Both disks from the old array were removed from the drive bay and labelled accordingly (one as probably degraded, the other as fine). The drive labels were also updated to match the new disks.

Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10]
md1 : active raid1 sdb[1] sda[0]
      488255488 blocks super 1.2 [2/2] [UU]
      bitmap: 3/4 pages [12KB], 65536KB chunk

md0 : active raid1 sdc[0] sdd[1]
      488255488 blocks super 1.2 [2/2] [UU]
      bitmap: 4/4 pages [16KB], 65536KB chunk

unused devices: <none>
Robit commented 1 year ago

The server has been running fine for a few days since the recovery. Closing the issue.