Warn about storage errors

cockpit-project / cockpit

Cockpit is a web-based graphical interface for servers.

http://www.cockpit-project.org/

GNU Lesser General Public License v2.1

11.22k stars 1.11k forks source link

Warn about storage errors #15460

Open garrett opened 3 years ago

garrett commented 3 years ago

Similar to warning about disks being out of space, we should also warn about storage errors.

The scope is different (hence using another issue for this), but should use the same framework.

Examples of errors (that are not space-related):

error: disk could not be mounted for x/y/z reason(s)
warning: SMART errors that don't stop functioning of disk
error: SMART found bad sectors (disk is dying; data loss may have happened
warning: RAID disk not mounted, RAID is degraded
error: RAID could not be mounted
error: Could not mount remote filesystem (NFS, SMB/CIFS, AFP, etc.)

And there are tons more.

We probably want to split these into separate groups and handle them accordingly. That is:

local, simple disk-based storage
local, complex storage (RAID)
network storage

And then consider if they're errors, warnings, or just informational (but important to know).

garrett commented 3 years ago

I would like these errors to show up on the health section of the overview page. They should all link to the details page of the specific item (disk, RAID, NFS mount, etc.) so it could be inspected further, or some action could be taken (such as checking the disk).

We probably also want to represent these issues on the storage page too, on each relevant disk, if we don't already.

ne20002 commented 3 years ago

I second that. Having the warnings (and warnings for disks, s.m.a.r.t and raid) all visible on the overview page would be great.

I also would like to have an option, to 'accept' warnings so they do not come up every time and can be distinguished from newly or unresolved ones. E.g.: on my screenshot in #15886 I have a warning for an unmounted partition which is ok in this case (it's an efi partion on the second SSD, where the system uses on of the two for boot and mount it, but the other is marked as a warning; it's just the fallback for a system having the /boot and / on a raid 1). Maybe having 'accepted' warnings with a white icon instead of the yello one?

garrett commented 3 years ago

I also would like to have an option, to 'accept' warnings so they do not come up every time and can be distinguished from newly or unresolved ones.

Cockpit shouldn't store its own internal state; it should reflect what's on the system. So, whatever warnings that are shown and are dismissable in Cockpit should exist elsewhere in the system and be dismissable there too.

E.g.: on my screenshot in #15886 I have a warning for an unmounted partition which is ok in this case (it's an efi partion on the second SSD, where the system uses on of the two for boot and mount it, but the other is marked as a warning; it's just the fallback for a system having the /boot and / on a raid 1).

I also have some warnings on my local home server, but they're due to using systemd automount instead of automount, so Cockpit shows a warning that they won't be mounted on next boot. (However, they will be via systemd instead of normal automount.) In this case, we should probably fix the issue directly and "teach" Cockpit to understand systemd automount.

In your case, are you referring to disks using standard devices, like /dev/sda, /dev/sdb, etc.? Or are you using something unique like UUID or LABEL (as long as you keep labels unique)? If I don't use something unique on my server, then the SSD could mount before the spinning disk RAID, or vice versa. With a unique way to address each in /etc/fstab, it doesn't matter which actually shows up first.

In other words, is Linux randomly picking a partition from one of your disks for boot? Or am I misunderstanding this?

ne20002 commented 3 years ago

In other words, is Linux randomly picking a partition from one of your disks for boot? Or am I misunderstanding this?

Correct

I have two NVME SSD and both are bootable. Both have

an efi partition for booting
a partition as raid 1 for system
a partition as raid 1 for swap
a partition as raid 1 used as lvm cache for the disk raid (3 disks as raid 5)
a partition as raid 1 for data

The boot is using one of the efi partitions. The one of the second SSD is a fallback if the first SSD dies. Doesn't make much sense to have system and swap on a raid 1 if I can't boot if one of the SSD dies. Linux mounts then one of the two partitions into /boot/efi. As they are identical, the other is not mounted and the error is shown.

The setup works fine (tried by removing the first SSD, sync afterwards was faster then expected).

I thought of not mounting the two partitions but I'm unsure if the mount is needed e.g. for updates. I assume it wouldn't need to be mounted for boot process.

So this 'warning' is not an issue and I'd like to have it distinguished from newly uppopping issues. If the icon on the overview page would have a little text that may be enough to see what's the issue behind.

garrett commented 3 years ago

The one of the second SSD is a fallback if the first SSD dies. Doesn't make much sense to have system and swap on a raid 1 if I can't boot if one of the SSD dies.

Aha; gotcha. How do you make sure the partitions are in sync?

I'm not sure how we would tell Cockpit to ignore an error like this. Perhaps use x-cockpit-ignore-mount-errors in fstab or something? :smirk: (Not sure if we should do something like that. And it would have to be in the UI, not some magic string to add. I'd have to talk with @mvollmer about this or any other solution that could work.) This is similar to how you have partitions show up in Nautilus, with x-gvfs-show as an option in /etc/fstab, by-the-way.

It's definitely interesting to consider cases like this though.

ne20002 commented 3 years ago

Aha; gotcha. How do you make sure the partitions are in sync?

Maybe that's the cause of the problem. I have the setup from a tutorial ... and it stated to copy the mounted partition with dd to the second disk. I had a look on the details and here we are:

the two efi partitions have the same (file system) UUID but different PARTUUID. As for the GPT the PARTUUID is relevant, the problem started with the copy of the source / mounted efi partition with dd.
Instead of simply copy the partition with dd, I need to ensure that the UUID will be different. So for that I will have only one partition mounted and that it is always the same. And the script will copy the partition from time to time (weekly?) with dd plus updating the UUID afterwards with fsutil. That way I will get updates on the efi partition (if any), and have a copy of the partition with a different UUID. The file system UUID shall no tbe needed for booting (but I'll check that).

garrett commented 3 years ago

I looked into this a little, and it appears a lot of people seem to RAID1 their /boot/ partition and handle their normal RAID (of whatever level) for data separately. Linux is supposed to understand the boot partition of a RAID1 as a normal partition during boot, and when the RAID1 for /boot/ is mounted, things get sync'd up to both drives automatically.

At least that's my understanding from the following page: https://serverfault.com/questions/634482/setting-up-a-bootable-multi-device-raid-1-using-linux-software-raid/703080#703080

/dev/sda1 + /dev/sdb1 = /dev/md0 Software RAID-1 with metadata 0.90 /dev/sda2 + /dev/sdb2 = /dev/md### Software RAID-1 with LVM on top

And also from the Arch Wiki, which is quite valuable for information regardless of distro https://wiki.archlinux.org/title/RAID#Standard_RAID_levels:

RAID1

[...] Please note that with a software implementation, the RAID 1 level is the only option for the boot partition, because bootloaders reading the boot partition do not understand RAID, but a RAID 1 component partition can be read as a normal partition.

...but all of this seems to talk about an MD software RAID1 for /boot/; specifically not hardware and not LVM.

(And I'm not telling you to modify your system accordingly. A working, booting system is a good thing to have. :wink:)

Even if most people set up RAIDs with /boot like that (and I don't know if they do, or if they use a script like you do), it probably would be good to have some sort of way to tell Cockpit to not show warnings under certain circumstances.

ne20002 commented 3 years ago

Hi Thank, you. This is nearly exactly what I have.

one md raid 1, as pv_system, vg_system with two lv_boot (/boot) and lv_root (/)
one md raid 1 for swap
one md raid 1 for lvm cache of disk array
one md raid 1 for extra fast data

My system is working. The problem not covered with the mentioned approach is the efi boot partition. This can't be put on a raid.So it's usually on one of the ssds. In my setup, I have it on nvme0nip1 as installed during installation.

Now: the motherboard allows a boot order and this is ssd0, ssd1. In case ssd0 dies, the efi boot partition is gone and the system wouldn't be able to boot. To work around this, a second efi boot partition is created on ssd1 and synced with the one from ssd0 manually.

This works. Having system shut down, removed ssd0 ... boot .. all well (raids degraded, but system booted).

As on the article: After setting up your RAID-1 on /boot, you have to install GRUB onto each drive in the RAID-1 array.

And this is what I basically did by coping the efi partition. The problem now has been, that both efi partitions had the same UUID, therefore in fstab is only one entry. Trying to mount both partitions with the same path /boot/efi resulted in the error shown.

I have now created different UUID. So the error is gone, but only one partition is mounted in /boot/efi. So I still have to sync.

I will have a look if reinstalling GRUB, where GRUB then should handle this, on both disks will give me the same solution. But even then, only one can be mounted on /boot/efi. ;)

Anyway, back to the original issue:

I believe that there are situations where a 'system' will print errors even though this error is not an error in the exact situation. So having an option to ignore errors like that and preventing them from always showing in the overview as errors would be handy.

But it's two things: 'bring all error/warnings onto the overview page' and 'optimize listing of errors with ability to ignore certain errors' Having the first is absolutely ok for most users, I assume.