Open jcea opened 7 years ago
+1
+1 I always end up dragging in pkg basically just for smartmontools.
What about other monitoring facilities, say, mcelog?
As discussed on the SmartOS mailing list there are certain constraints and pkgsrc is the correct place for additional software:
FMA already reads SMART data from disks and acts accordingly.
Sorry for reviving this old issue...
FMA already reads SMART data from disks and acts accordingly.
And how can one show the SMART data then?
I'm currently in the situation that on a remote host one disk of a mirror vdev failed and 'diskinfo -P' doesn't show serial numbers for any of the disks. So without the ability to show SMART data or at least a serial number for a given drive, I can't get the disk replaced. I can't just tell the on-site tech to rip out each drive and diagnose it on another host to find the failed one...
You should be able to get serial numbers from iostat -E.
If the drive isn’t even reporting its serial number, it’s probably so far gone that SMART queries aren’t likely to respond either. I’m guessing there is no enclosure device (diskinfo would show the location if so).
In that scenario, usually I end up resorting to running dd on the good drives to eliminate them by looking at the activity. It’s crude but can be effective.
On May 10, 2022, at 11:51 AM, Jonathan Perkin @.***> wrote:
You should be able to get serial numbers from iostat -E.
Actually, on all 5 of our SmartOS nodes 'diskinfo -P' doesn't return a serial for any drive except the USB flash drive used for booting.
# diskinfo -P
DISK VID PID SERIAL FLT LOC LOCATION
c0t5000C50041F9400Fd0 SEAGATE ST2000NM0001 - - - -
c0t5000C50041F934DBd0 SEAGATE ST2000NM0001 - - - -
c0t5000C50041F93BCFd0 SEAGATE ST2000NM0001 - - - -
c0t5000C50041F93EFBd0 SEAGATE ST2000NM0001 - - - -
c1t00A075012BF9E606d0 NVMe Micron_7300_MTFDHBA960TDF - - - -
c2t00A075012BF9E5F4d0 NVMe Micron_7300_MTFDHBA960TDF - - - -
c3t0d0 Kingston DataTraveler 3.0 50E549C20249BFC0D9B12403 - - -
c4t0d0 Single Flash Reader 058F63356336 - - -
---
# diskinfo -P
DISK VID PID SERIAL FLT LOC LOCATION
c1t0d0 SanDisk' Cruzer Fit 4C530000061127117191 - - -
c2t0d0 INTEL SSDSC2KB480G7 - - - -
c2t3d0 INTEL SSDSC2KB480G7 - - - -
---
# diskinfo -P
DISK VID PID SERIAL FLT LOC LOCATION
c1t0d0 SanDisk Ultra Fit 4C530001111231118393 - - -
c2t2d0 INTEL SSDSC2KB240G7 - - - -
c2t3d0 INTEL SSDSC2KB240G7 - - - -
(and so on for the other 2 nodes with 2 Intel DC S4500 SSDs)
@jperkin with 'iostat -E' I at least get some error statistics and the serial number to identify the disk, thanks!
# iostat -E sd1
sd1 Soft Errors: 1 Hard Errors: 12 Transport Errors: 220
Vendor: ATA Product: INTEL SSDSC2KB48 Revision: 0142 Serial No: PHYS8251011L480
Size: 480.10GB <480103981056 bytes>
Media Error: 0 Device Not Ready: 0 No Device: 0 Recoverable: 0
Illegal Request: 33 Predictive Failure Analysis: 0
I still think smartmontools (or any other tool to show/monitor SMART data) would be beneficial to monitor drives and alert on various indicators BEFORE the drive actually dies... The drive in this node, as is often the case, didn't just die but periodically stalled the system by going silent for several seconds every now and then.
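In the meantime, the serial number and error counters in 'iostat -E' output like the sample above can be pulled out with a small script. This is only a rough sketch, not a substitute for SMART monitoring. It assumes the field layout shown in the sample (other drivers may print different fields), and the names `parse_iostat_e` and `suspicious` as well as the threshold are made up for illustration.

```python
import re

# Matches the first line of each device block, e.g.
# "sd1 Soft Errors: 1 Hard Errors: 12 Transport Errors: 220"
HEADER = re.compile(
    r"^(?P<dev>\S+)\s+Soft Errors:\s*(?P<soft>\d+)\s+"
    r"Hard Errors:\s*(?P<hard>\d+)\s+Transport Errors:\s*(?P<transport>\d+)"
)
# Matches the serial number field on the "Vendor: ..." line.
SERIAL = re.compile(r"Serial No:\s*(?P<serial>\S+)")


def parse_iostat_e(text):
    """Parse 'iostat -E' style text into a list of per-device dicts."""
    disks = []
    current = None
    for line in text.splitlines():
        header = HEADER.match(line.strip())
        if header:
            current = {
                "device": header["dev"],
                "soft": int(header["soft"]),
                "hard": int(header["hard"]),
                "transport": int(header["transport"]),
                "serial": None,  # stays None if the drive reports no serial
            }
            disks.append(current)
            continue
        if current is not None:
            serial = SERIAL.search(line)
            if serial:
                current["serial"] = serial["serial"]
    return disks


def suspicious(disks, threshold=0):
    """Return devices whose hard + transport error count exceeds threshold.

    The threshold is an arbitrary illustration, not a recommended value.
    """
    return [d for d in disks if d["hard"] + d["transport"] > threshold]
```

On a real host the input text could come from something like `subprocess.run(["iostat", "-E"], capture_output=True, text=True).stdout`. The counters are cumulative since boot, so comparing successive readings is probably more meaningful than a fixed threshold.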
Utilities like "smartctl" SHOULD be available in the global zone, with no extra installation, to verify the health of the hard disks. Periodic SMART validation should be on any SysAdmin checklist, alongside the regular "zpool scrub".
Please include "smartctl" in the live image, usable from the global zone.