TritonDataCenter / smartos-live

For more information, please see http://smartos.org/. For any questions that aren't answered there, please join the SmartOS discussion list: https://smartos.topicbox.com/groups/smartos-discuss

smartmontools should be available in the live CD #703

Open jcea opened 7 years ago

jcea commented 7 years ago

Utilities like "smartctl" SHOULD be available in the global zone, with no extra installation, to verify the health of the hard disks. Periodic SMART validation should be on any SysAdmin checklist, alongside the regular "zpool scrub".

Please include "smartctl" in the live image, usable from the global zone.

laris commented 7 years ago

+1

mloftis commented 7 years ago

+1 I always end up dragging in pkg basically just for smartmontools.

poige commented 6 years ago

What about other monitoring facilities, say, mcelog?

jussisallinen commented 6 years ago

As discussed on the SmartOS mailing list, there are certain constraints, and pkgsrc is the correct place for additional software:

https://www.listbox.com/member/archive/184463/2018/01/sort/time_rev/page/5/entry/2:150/20180119061102:6CCB5424-FD09-11E7-9E41-F2C8B0E33A87/

bahamat commented 6 years ago

FMA already reads SMART data from disks and acts accordingly.

rostwald commented 2 years ago

Sorry for reviving this old issue...

> FMA already reads SMART data from disks and acts accordingly.

And how can one show the SMART data then?

I'm currently in a situation where one disk of a mirror vdev on a remote host has failed and 'diskinfo -P' doesn't show serial numbers for any of the disks. Without the ability to show SMART data, or at least a serial number for a given drive, I can't get the disk replaced. I can't just tell the on-site tech to pull each drive and diagnose it on another host to find the failed one...

jperkin commented 2 years ago

You should be able to get serial numbers from iostat -E.

jasonbking commented 2 years ago

If the drive isn’t even reporting its serial number, it’s probably so far gone that SMART queries aren’t likely to respond either. I’m guessing there is no enclosure device (diskinfo would show the location if so).

In that scenario, usually I end up resorting to running dd on the good drives to eliminate them by looking at the activity. It’s crude but can be effective.
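A minimal sketch of that elimination approach, assuming a POSIX shell and a stock `dd`. The function name `blink_by_reading` and the device path in the usage note are hypothetical, not something from this thread. It only reads from the device and discards the data, so nothing is written to the disk.

```shell
# Read from one device so that its activity LED lights up.
# Data is discarded to /dev/null; nothing is written to the disk.
#   $1: path to read from (on a real host, a raw disk device node)
#   $2: number of 1 MiB blocks to read (default: 1024)
blink_by_reading() {
    dev=$1
    blocks=${2:-1024}
    # bs is given in bytes to avoid relying on size suffixes.
    dd if="$dev" of=/dev/null bs=1048576 count="$blocks" 2>/dev/null
}
```

Usage would look something like `blink_by_reading /dev/rdsk/c2t3d0p0 4096` (path assumed, adjust to the actual device nodes on the host), run against each known-good drive in turn while someone watches the LEDs. The drive that never shows activity is the likely suspect.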

On May 10, 2022, at 11:51 AM, Jonathan Perkin @.***> wrote:

 You should be able to get serial numbers from iostat -E.


rostwald commented 2 years ago

Actually, 'diskinfo -P' doesn't return a serial number for any drive except the USB flash drive used for booting, on all 5 of our SmartOS nodes.

# diskinfo -P
DISK                    VID      PID              SERIAL               FLT LOC LOCATION
c0t5000C50041F9400Fd0   SEAGATE  ST2000NM0001     -                    -   -   -
c0t5000C50041F934DBd0   SEAGATE  ST2000NM0001     -                    -   -   -
c0t5000C50041F93BCFd0   SEAGATE  ST2000NM0001     -                    -   -   -
c0t5000C50041F93EFBd0   SEAGATE  ST2000NM0001     -                    -   -   -
c1t00A075012BF9E606d0   NVMe     Micron_7300_MTFDHBA960TDF -                    -   -   -
c2t00A075012BF9E5F4d0   NVMe     Micron_7300_MTFDHBA960TDF -                    -   -   -
c3t0d0                  Kingston DataTraveler 3.0 50E549C20249BFC0D9B12403 -   -   -
c4t0d0                  Single   Flash Reader     058F63356336         -   -   -

---

# diskinfo -P
DISK                    VID      PID              SERIAL               FLT LOC LOCATION
c1t0d0                  SanDisk' Cruzer Fit       4C530000061127117191 -   -   -
c2t0d0                  INTEL    SSDSC2KB480G7    -                    -   -   -
c2t3d0                  INTEL    SSDSC2KB480G7    -                    -   -   -

---

# diskinfo -P
DISK                    VID      PID              SERIAL               FLT LOC LOCATION
c1t0d0                  SanDisk  Ultra Fit        4C530001111231118393 -   -   -
c2t2d0                  INTEL    SSDSC2KB240G7    -                    -   -   -
c2t3d0                  INTEL    SSDSC2KB240G7    -                    -   -   -

(and so on for the other 2 nodes with 2 Intel DC S4500 SSDs)

@jperkin with 'iostat -E' I at least get some error statistics and the serial number to identify the disk, thanks!

# iostat -E sd1
sd1       Soft Errors: 1 Hard Errors: 12 Transport Errors: 220 
Vendor: ATA      Product: INTEL SSDSC2KB48 Revision: 0142 Serial No: PHYS8251011L480 
Size: 480.10GB <480103981056 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0 
Illegal Request: 33 Predictive Failure Analysis: 0 
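
To hand the on-site tech a short list instead of the raw output, the text above can be condensed with a small filter. This is only a sketch: the function name `summarize_iostat_errors` is made up for illustration, and it assumes the output keeps the layout shown above (a device line with the three error counters, followed by a line containing "Serial No:").

```shell
# Condense `iostat -E` style text (read from stdin) to one line per device.
# Assumes the layout shown in this thread; other layouts are not handled.
summarize_iostat_errors() {
    awk '
        # Device line: "sd1  Soft Errors: 1 Hard Errors: 12 Transport Errors: 220"
        $2 == "Soft" && $3 == "Errors:" {
            dev = $1; soft = $4; hard = $7; transport = $10
        }
        # The serial number is whatever follows the literal "Serial No:".
        /Serial No:/ {
            serial = $0
            sub(/.*Serial No:[ \t]*/, "", serial)
            sub(/[ \t]+$/, "", serial)
            printf "%s serial=%s soft=%s hard=%s transport=%s\n", dev, serial, soft, hard, transport
        }
    '
}
```

It would be used as `iostat -E | summarize_iostat_errors`. These are the driver's error counters, not SMART attributes, so this only helps with identifying a drive that is already misbehaving.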

I still think smartmontools (or any other tool to show/monitor SMART data) would be beneficial to monitor drives and alert on various indicators BEFORE the drive actually dies... As is often the case, the drive in this node didn't just die but periodically stalled the system, going silent for several seconds every now and then.