AnalogJ / scrutiny

Hard Drive S.M.A.R.T Monitoring, Historical Trends & Real World Failure Thresholds
MIT License
5.36k stars, 171 forks

[BUG] failed disk while its not failed #690

Open Hr46ph opened 2 months ago

Hr46ph commented 2 months ago

Describe the bug

All 3 NVMe disks show as failed and I have no idea why. I might have a clue for one of them, but not for the other two.

The only places that show the failure are the dashboard and, when I click a disk, the 'status' label.

Expected behavior

Healthy disks, because there is nothing wrong with them.

Screenshots

[image]

[image]

Log Files

I have 2 of these:

# smartctl -a /dev/nvme0 
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.10.9-arch1-2] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Lexar SSD NM620 2TB
Serial Number:                     
Firmware Version:                   9846
PCI Vendor/Subsystem ID:            0x1e4b
IEEE OUI Identifier:                0xcaf25b
Total NVM Capacity:                 2,048,408,248,320 [2.04 TB]
Unallocated NVM Capacity:           0
Controller ID:                      0
NVMe Version:                       1.4
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,048,408,248,320 [2.04 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            caf25b 0430001008
Local Time is:                      Sat Sep 14 13:33:51 2024 CEST
Firmware Updates (0x14):            2 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x02):         Cmd_Eff_Lg
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     90 Celsius
Critical Comp. Temp. Threshold:     95 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     6.50W       -        -    0  0  0  0        0       0
 1 +     5.80W       -        -    1  1  1  1        0       0
 2 +     3.60W       -        -    2  2  2  2        0       0
 3 -   0.7460W       -        -    3  3  3  3     5000   10000
 4 -   0.7260W       -        -    4  4  4  4     8000   45000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        36 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    0%
Data Units Read:                    38,469,207 [19.6 TB]
Data Units Written:                 24,861,201 [12.7 TB]
Host Read Commands:                 274,542,336
Host Write Commands:                495,212,791
Controller Busy Time:               543
Power Cycles:                       85
Power On Hours:                     6,610
Unsafe Shutdowns:                   60
Media and Data Integrity Errors:    0
Error Information Log Entries:      3
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               36 Celsius
Temperature Sensor 2:               42 Celsius
Thermal Temp. 1 Transition Count:   61
Thermal Temp. 1 Total Time:         74

Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged

And this is number 3:

# smartctl -a /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.6.49-1-lts] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       SAMSUNG MZVLB256HAHQ-00000
Serial Number:                     
Firmware Version:                   EXD7201Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 256,060,514,304 [256 GB]
Unallocated NVM Capacity:           0
Controller ID:                      4
NVMe Version:                       1.2
Number of Namespaces:               1
Namespace 1 Size/Capacity:          256,060,514,304 [256 GB]
Namespace 1 Utilization:            79,673,511,936 [79.6 GB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            002538 8a81b268d0
Local Time is:                      Sat Sep 14 13:38:17 2024 CEST
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     81 Celsius
Critical Comp. Temp. Threshold:     82 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     7.02W       -        -    0  0  0  0        0       0
 1 +     6.30W       -        -    1  1  1  1        0       0
 2 +     3.50W       -        -    2  2  2  2        0       0
 3 -   0.0760W       -        -    3  3  3  3      210    1200
 4 -   0.0050W       -        -    4  4  4  4     2000    8000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        29 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    36%
Data Units Read:                    33,151,051 [16.9 TB]
Data Units Written:                 92,050,787 [47.1 TB]
Host Read Commands:                 1,311,000,162
Host Write Commands:                2,155,486,718
Controller Busy Time:               6,461
Power Cycles:                       378
Power On Hours:                     5,271
Unsafe Shutdowns:                   159
Media and Data Integrity Errors:    0
Error Information Log Entries:      20,405
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               29 Celsius
Temperature Sensor 2:               30 Celsius

Error Information (NVMe Log 0x01, 16 of 64 entries)
Num   ErrCount  SQId   CmdId  Status  PELoc          LBA  NSID    VS  Message
  0      20405     0  0x0010  0x4004      -            0     0     -  Invalid Field in Command

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged

Hr46ph commented 2 months ago

Not sure if any of the other logging is relevant, so I figured I'd wait for a response before supplying more info. Let me know what you need.

Thanks!

chris114782 commented 1 month ago

Getting exactly the same issue:

[image: dashboard]

[image: detail]

# sudo smartctl -a /dev/nvme0
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.8.0-45-generic] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       CT1000T700SSD5
Serial Number:                      **redacted**
Firmware Version:                   PACR5101
PCI Vendor/Subsystem ID:            0xc0a9
IEEE OUI Identifier:                0x00a075
Controller ID:                      0
NVMe Version:                       2.0
Number of Namespaces:               1
Namespace 1 Size/Capacity:          1,000,204,886,016 [1.00 TB]
Namespace 1 Formatted LBA Size:     512
Namespace 1 IEEE EUI-64:            00a075 **redacted**
Local Time is:                      Fri Oct  4 20:09:17 2024 BST
Firmware Updates (0x12):            1 Slot, no Reset required
Optional Admin Commands (0x0017):   Security Format Frmw_DL Self_Test
Optional NVM Commands (0x005e):     Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat Timestmp
Log Page Attributes (0x3e):         Cmd_Eff_Lg Ext_Get_Lg Telmtry_Lg Pers_Ev_Lg Log0_FISE_MI
Maximum Data Transfer Size:         128 Pages
Warning  Comp. Temp. Threshold:     87 Celsius
Critical Comp. Temp. Threshold:     89 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +    11.50W       -        -    0  0  0  0      800    1000
 1 +     8.00W       -        -    0  0  0  0      800    1000
 2 +     6.00W       -        -    0  0  0  0      800    1000
 3 -   0.1440W       -        -    0  0  0  0     3000    3000
 4 -   0.1440W       -        -    0  0  0  0     3000    3000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         1
 1 -    4096       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        33 Celsius
Available Spare:                    100%
Available Spare Threshold:          5%
Percentage Used:                    4%
Data Units Read:                    225,608,936 [115 TB]
Data Units Written:                 51,358,910 [26.2 TB]
Host Read Commands:                 6,634,833,183
Host Write Commands:                613,483,504
Controller Busy Time:               5,734
Power Cycles:                       32
Power On Hours:                     8,830
Unsafe Shutdowns:                   4
Media and Data Integrity Errors:    0
Error Information Log Entries:      0
Warning  Comp. Temperature Time:    0
Critical Comp. Temperature Time:    0

Error Information (NVMe Log 0x01, 16 of 16 entries)
No Errors Logged

Self-test Log (NVMe Log 0x06)
Self-test status: No self-test in progress
No Self-tests Logged

Running master#57dc547

retnag commented 2 days ago

I'm assuming you're running Scrutiny using Docker. Have a look at: https://github.com/AnalogJ/scrutiny/issues/26#issuecomment-696817130
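
For anyone comparing the `smartctl -a` text pasted above against what the dashboard reports, here is a minimal sketch that pulls the NVMe SMART/Health fields out of that text and checks the ones the drive itself flags as critical. This is not Scrutiny's logic; the function names (`parse_nvme_health`, `first_int`, `health_problems`) and the choice of checks are assumptions made for illustration only.

```python
import re


def parse_nvme_health(text):
    """Collect 'Key: value' pairs from the SMART/Health section of `smartctl -a` text."""
    fields = {}
    in_health = False
    for line in text.splitlines():
        if line.startswith("SMART/Health Information"):
            in_health = True
            continue
        if in_health:
            if not line.strip():
                break  # a blank line ends the section
            key, sep, value = line.partition(":")
            if sep:
                fields[key.strip()] = value.strip()
    return fields


def first_int(value):
    """Parse the leading number of a value such as '0x00', '100%' or '20,405'."""
    token = re.match(r"(0x[0-9a-fA-F]+|[\d,]+)", value).group(1)
    if token.startswith("0x"):
        return int(token, 16)
    return int(token.replace(",", ""))


def health_problems(fields):
    """Return human-readable problems; an empty list means these checks passed."""
    problems = []
    if first_int(fields["Critical Warning"]) != 0:
        problems.append("Critical Warning is non-zero")
    spare = first_int(fields["Available Spare"])
    threshold = first_int(fields["Available Spare Threshold"])
    if spare < threshold:
        problems.append("Available Spare is below its threshold")
    if first_int(fields["Media and Data Integrity Errors"]) != 0:
        problems.append("Media and Data Integrity Errors is non-zero")
    return problems
```

Saving one of the outputs above to a file and calling `health_problems(parse_nvme_health(open("nvme0.txt").read()))` (the file name is hypothetical) would list any of these three conditions that trip. Note that "Error Information Log Entries" is deliberately not checked here, since the sketch only covers the fields named in the code.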