DegradedArray on server3 #115

Open clonm opened 5 years ago

clonm commented 5 years ago

Looks like a disk failed. See also #114. We need to work out the serial number of the failed disk and swap it out.

clonm commented 5 years ago

Output of lsblk -fs:

NAME      FSTYPE      LABEL     UUID                                   MOUNTPOINT
sda1      ext4                  6cf195d0-3d61-44f1-adfe-32e2be0ef589   /
sda2      swap                  f9d85b58-993d-4f14-8b9b-a7473cc6e5ed   [SWAP]
vg0-srv   ext4                  234aaade-0a00-4034-94fd-327e4656de6a   /srv
├─md0     LVM2_member           a3cbng-LNOD-tFIN-b7Jk-7I4Q-AsBt-J8ALjQ 
│ ├─sdb1  linux_raid_ server3:0 6343cc3a-1818-08ad-6e15-c8d9b83e4c2f   
│ │ └─sdb                                                              
│ └─sdc1  linux_raid_ server3:0 6343cc3a-1818-08ad-6e15-c8d9b83e4c2f   
│   └─sdc                                                              
├─md1     LVM2_member           2QUdr1-zHAf-YdIZ-HnLs-Cd1e-FApB-XrYsZb 
│ ├─sdd1  linux_raid_ server3:1 2408dcd6-a135-1e73-4cb1-e6a2a64a2aa6   
│ │ └─sdd                                                              
│ └─sde1  linux_raid_ server3:1 2408dcd6-a135-1e73-4cb1-e6a2a64a2aa6   
│   └─sde                                                              
├─md2     LVM2_member           lft9mX-bY4A-0LVs-tBr4-ei9M-Nwb0-0rge10 
│ ├─sdf1  linux_raid_ server3:2 f4d64eb3-fda4-f669-3f53-ee91ddb8ebf9   
│ │ └─sdf                                                              
│ └─sdh1  linux_raid_ server3:2 f4d64eb3-fda4-f669-3f53-ee91ddb8ebf9   
│   └─sdh                                                              
├─md3     LVM2_member           RrgDwD-8TsR-UblH-Qvjf-U5M2-C2Up-3Kr7E2 
│ ├─sdi1  linux_raid_ server3:3 94cb4f84-8506-bfd2-233f-4c6710d70379   
│ │ └─sdi                                                              
│ └─sdk1  linux_raid_ server3:3 94cb4f84-8506-bfd2-233f-4c6710d70379   
│   └─sdk                                                              
├─md5     LVM2_member           HLKlVn-3wuK-yh2Y-P0Ux-hLvq-9Rz3-1EIUDr 
│ └─sdj1  linux_raid_ server3:5 a5a813fd-d87a-a785-02f8-8bfde370b59d   
│   └─sdj                                                              
├─md6     LVM2_member           dCLeqW-x55j-zf2h-3Vjx-sIV3-FqOC-KmBeK6 
│ ├─sdl1  linux_raid_ server3:6 8c1bcb52-e684-f0f5-aae0-28928b9144cb   
│ │ └─sdl                                                              
│ └─sdm1  linux_raid_ server3:6 8c1bcb52-e684-f0f5-aae0-28928b9144cb   
│   └─sdm                                                              
└─md7     LVM2_member           ryStBk-SRMn-v2fG-2bEr-NHev-uRTO-MccbS7 
  ├─sdn1  linux_raid_ server3:7 53c6b4d3-ecd7-4f8f-514d-8002f73c57be   
  │ └─sdn                                                              
  └─sdo1  linux_raid_ server3:7 53c6b4d3-ecd7-4f8f-514d-8002f73c57be   
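
For reference, a minimal Python sketch of picking the short array out of a tree like the one above. It assumes every md device here is meant to be a two-disk mirror, which the tree suggests but does not state. `raid_members`, `degraded`, and `SAMPLE` are illustrative names, not part of any tool.

```python
import re

# Excerpt of the lsblk -fs tree above (md3 and md5 only).
SAMPLE = """\
├─md3     LVM2_member           RrgDwD-8TsR-UblH-Qvjf-U5M2-C2Up-3Kr7E2
│ ├─sdi1  linux_raid_ server3:3 94cb4f84-8506-bfd2-233f-4c6710d70379
│ │ └─sdi
│ └─sdk1  linux_raid_ server3:3 94cb4f84-8506-bfd2-233f-4c6710d70379
│   └─sdk
├─md5     LVM2_member           HLKlVn-3wuK-yh2Y-P0Ux-hLvq-9Rz3-1EIUDr
│ └─sdj1  linux_raid_ server3:5 a5a813fd-d87a-a785-02f8-8bfde370b59d
│   └─sdj
"""

TREE_CHARS = "├└│─ "  # box-drawing characters lsblk uses for indentation


def raid_members(lsblk_text):
    """Map each md device to the RAID member partitions listed beneath it."""
    arrays = {}
    current = None
    for line in lsblk_text.splitlines():
        fields = line.lstrip(TREE_CHARS).split()
        if not fields:
            continue
        name = fields[0]
        if re.fullmatch(r"md\d+", name):
            current = name
            arrays[current] = []
        elif current and len(fields) > 1 and fields[1].startswith("linux_raid"):
            # lsblk truncates the FSTYPE column to "linux_raid_"
            arrays[current].append(name)
    return arrays


def degraded(arrays, expected=2):
    """Return arrays with fewer members than expected (assumed: 2 per mirror)."""
    return {md: members for md, members in arrays.items() if len(members) < expected}


if __name__ == "__main__":
    print(degraded(raid_members(SAMPLE)))  # {'md5': ['sdj1']}
```

On a live system, /proc/mdstat or mdadm --detail would report the degraded state directly; the sketch only shows how to read it off pasted output.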

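So sdg is bad: md5 shows only one member (sdj1), and sdg is the only disk from sdb through sdo that does not appear in the tree. Here's the relevant part of the lshw output: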

                   description: SCSI Disk
                   product: 9650SE-16M DISK
                   vendor: AMCC
                   physical id: 0.5.0
                   bus info: scsi@2:0.5.0
                   logical name: /dev/sdg
                   version: 4.10
                   serial: 3QD0A7EC57C4970016E9
                   size: 698GiB (749GB)
                   capabilities: partitioned partitioned:dos
                   configuration: ansiversion=5 logicalsectorsize=512 sectorsize=512 signature=00078bb4
                      description: Linux raid autodetect partition
                      physical id: 1
                      bus info: scsi@2:0.5.0,1
                      logical name: /dev/sdg1
                      capacity: 698GiB
                      capabilities: primary multi
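
A rough Python sketch for pulling the identifying fields out of an lshw stanza like the one above, in case this needs repeating for other disks. It keeps the first occurrence of each key, on the assumption that the disk's own fields come before those of the nested partition, as they do here. `first_fields` and `SAMPLE` are made-up names for illustration.

```python
# Excerpt of the lshw stanza above.
SAMPLE = """\
                   description: SCSI Disk
                   product: 9650SE-16M DISK
                   vendor: AMCC
                   bus info: scsi@2:0.5.0
                   logical name: /dev/sdg
                   serial: 3QD0A7EC57C4970016E9
                      description: Linux raid autodetect partition
                      bus info: scsi@2:0.5.0,1
                      logical name: /dev/sdg1
"""


def first_fields(lshw_text):
    """Collect 'key: value' pairs, keeping the first occurrence of each key."""
    fields = {}
    for line in lshw_text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:
            fields.setdefault(key, value)
    return fields


if __name__ == "__main__":
    info = first_fields(SAMPLE)
    print(info["logical name"], info["serial"], info["bus info"])
```

Note that the vendor and product fields describe the 9650SE controller's exported unit, so the serial shown by lshw is not necessarily the one printed on the physical drive.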

So it's disk 11. Hence, the output of smartctl -a -d 3ware,11 /dev/twa0:

Model Family:     Western Digital Red
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WMC4N0228885
LU WWN Device Id: 5 0014ee 0ae542bbf
Firmware Version: 80.00A80
User Capacity:    3,000,592,982,016 bytes [3.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5400 rpm
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2 (minor revision not indicated)
SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 3.0 Gb/s)
Local Time is:    Sat Oct 13 12:02:14 2018 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                                        was never started.
                                        Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                                        without error or no self-test has ever 
                                        been run.
Total time to complete Offline 
data collection:                (41940) seconds.
Offline data collection
capabilities:                    (0x7b) SMART execute Offline immediate.
                                        Auto Offline data collection on/off support.
                                        Suspend Offline collection upon new
                                        command.
                                        Offline surface scan supported.
                                        Self-test supported.
                                        Conveyance Self-test supported.
                                        Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                                        power-saving mode.
                                        Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                                        General Purpose Logging supported.
Short self-test routine 
recommended polling time:        (   2) minutes.
Extended self-test routine
recommended polling time:        ( 420) minutes.
Conveyance self-test routine
recommended polling time:        (   5) minutes.
SCT capabilities:              (0x703d) SCT Status supported.
                                        SCT Error Recovery Control supported.
                                        SCT Feature Control supported.
                                        SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   238   177   021    Pre-fail  Always       -       3100
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       53
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   100   253   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   057   057   000    Old_age   Always       -       31693
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       53
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       52
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0022   119   109   000    Old_age   Always       -       31
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Extended offline    Completed without error       00%      4760         -

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
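
A small Python sketch for reading the serial number and the failure-related counters out of smartctl -a output like the above. The choice of attributes in `WATCHED` is an assumption about what is worth flagging, not something smartctl defines; `parse_smart`, `suspicious`, and `SAMPLE` are illustrative names.

```python
import re

# Excerpt of the smartctl output above.
SAMPLE = """\
Device Model:     WDC WD30EFRX-68EUZN0
Serial Number:    WD-WMC4N0228885
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   057   057   000    Old_age   Always       -       31693
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
"""

# Assumed set of attributes whose nonzero raw value hints at a failing disk.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Reallocated_Event_Count",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
}

# Columns: ID, name, flag, value, worst, thresh, type, updated, when_failed, raw
ATTR_ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+0x[0-9a-f]{4}\s+(\d+)\s+(\d+)\s+(\d+)"
    r"\s+(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def parse_smart(text):
    """Return (serial number, {attribute name: raw value})."""
    serial = None
    attrs = {}
    for line in text.splitlines():
        if line.startswith("Serial Number:"):
            serial = line.split(":", 1)[1].strip()
        match = ATTR_ROW.match(line)
        if match:
            attrs[match.group(2)] = int(match.group(9))
    return serial, attrs


def suspicious(attrs):
    """Watched attributes with a nonzero raw value."""
    return {name: raw for name, raw in attrs.items() if name in WATCHED and raw > 0}


if __name__ == "__main__":
    serial, attrs = parse_smart(SAMPLE)
    print(serial, suspicious(attrs))  # WD-WMC4N0228885 {}
```

For the output above this flags nothing, which matches the PASSED health result and the empty error log.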