Napsty / check_smart

Monitoring Plugin to check hard drives, solid state drives and NVMe drives using SMART
https://www.claudiokuenzler.com/monitoring-plugins/check_smart.php
GNU General Public License v3.0

Percent_Lifetime_Remain threshold unset with -w #92

Closed ymartin-ovh closed 1 year ago

ymartin-ovh commented 1 year ago

Hello

It seems there is an issue with the handling of the -w option. When I give a threshold for a particular smartctl item (not lifetime), the Percent_Lifetime_Remain threshold is no longer set to 90%:

warning => ./check_smart -i auto -g '/dev/sda' -w Reallocated_Sector_Ct=250 -l
ok => ./check_smart -i auto -g '/dev/sda' -w Reallocated_Sector_Ct=250,Percent_Lifetime_Remain=90 -l
ok => ./check_smart -i auto -g '/dev/sda' -l
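
For readers unfamiliar with the -w syntax, the following is a minimal Python sketch of how a comma-separated threshold string like the ones above could be split into attribute/value pairs. The function name parse_warn_option is hypothetical, and this is not the plugin's actual Perl code; it only illustrates the format.

```python
def parse_warn_option(value):
    """Split 'Name=Number,Name=Number' into an ordered list of (name, int) pairs.

    Hypothetical helper for illustration; not taken from check_smart.pl.
    """
    pairs = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate empty items such as a trailing comma
        name, sep, raw = item.partition("=")
        name, raw = name.strip(), raw.strip()
        if not sep or not name or not raw.isdigit():
            raise ValueError(f"invalid threshold definition: {item!r}")
        pairs.append((name, int(raw)))
    return pairs


print(parse_warn_option("Reallocated_Sector_Ct=250,Percent_Lifetime_Remain=90"))
```

Keeping the pairs in an ordered list, rather than collapsing them immediately, preserves the order in which thresholds were given.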

Before working on a patch, can you tell me whether this behaviour is normal or not?

Regards

Napsty commented 1 year ago

Can you show the current Percent_Lifetime_Remain value?

ymartin-ovh commented 1 year ago

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
...
202 Percent_Lifetime_Remain 0x0030   099   099   001    Old_age   Offline      -       1
...

Napsty commented 1 year ago

Agree, simply adding a -l parameter should be enough to check for the Percent_Lifetime_Remain attribute. Need to check why this didn't work.

Napsty commented 1 year ago

@ymartin-ovh can you please try with https://raw.githubusercontent.com/Napsty/check_smart/issue-92/check_smart.pl ? does it work?

ymartin-ovh commented 1 year ago

Hello

Your patch fixes the warning threshold when it is not given, but it introduces a new bug (as you set the value unconditionally):

ok (threshold set to 90%):
./check_smart.pl -i auto -g '/dev/sda' -w Reallocated_Sector_Ct=250 -l
./check_smart.pl --skip-load-cycles -l -i auto -g '/dev/{sdb,sda}'

ko => ./check_smart.pl --skip-load-cycles -l -i auto -g '/dev/{sdb,sda}' -w Percent_Lifetime_Remain=85
OK: [/dev/sdb] - Device is clean [/dev/sdb] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90) --- [/dev/sda] - Device is clean [/dev/sda] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90)|

ymartin-ovh commented 1 year ago

Before the patch, I got 85% =>
/usr/lib/nagios/ovh/check_smart --skip-load-cycles -l -i auto -g '/dev/{sdb,sda}' -w Percent_Lifetime_Remain=85
OK: [/dev/sdb] - Device is clean [/dev/sdb] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 85) --- [/dev/sda] - Device is clean [/dev/sda] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 85)|

Napsty commented 1 year ago

Can you please run with --debug as it's easier for me to find out what happens in the background, thx. You can combine with --hide-sn to hide sensitive serial numbers.

ymartin-ovh commented 1 year ago

./check_smart.pl --skip-load-cycles -l -i auto -g '/dev/{sdb,sda}' -w Percent_Lifetime_Remain=85 --debug --hide-sn
Found /dev/sdb
Found /dev/sda
###########################################################
(debug) CHECK 1: getting overall SMART health status for /dev/sdb 
###########################################################

(debug) executing:
sudo /usr/sbin/smartctl -d auto -Hi /dev/sdb

(debug) output:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.124-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

 === START OF INFORMATION SECTION ===
 Model Family:     Micron 5100 Pro / 52x0 / 5300 SSDs
 Device Model:     Micron_5300_MTFDDAK480TDS
 Serial Number:    22263A2BB86F
 LU WWN Device Id: 5 00a075 13a2bb86f
 Firmware Version: D3MU001
 User Capacity:    480,103,981,056 bytes [480 GB]
 Sector Sizes:     512 bytes logical, 4096 bytes physical
 Rotation Rate:    Solid State Device
 Form Factor:      2.5 inches
 TRIM Command:     Available, deterministic, zeroed
 Device is:        In smartctl database [for details use: -P show]
 ATA Version is:   ACS-4 (minor revision not indicated)
 SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
 Local Time is:    Mon Sep 18 11:59:31 2023 CEST
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled

 === START OF READ SMART DATA SECTION ===
 SMART overall-health self-assessment test result: PASSED

(debug) parsing line:
Device Model:     Micron_5300_MTFDDAK480TDS

(debug) found model:  Micron_5300_MTFDDAK480TDS

(debug) parsing line:
Serial Number:    22263A2BB86F

(debug) Hiding serial number

(debug) found serial number <HIDDEN>

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK
###########################################################
(debug) CHECK 2: getting silent SMART health check
###########################################################

(debug) executing:
sudo /usr/sbin/smartctl -d auto -q silent -A /dev/sdb

(debug) exit code:
0

(debug) zero exit code, status OK

###########################################################
(debug) CHECK 3: getting detailed statistics from attributes
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing
###########################################################

(debug) executing:
sudo /usr/sbin/smartctl -d auto -A /dev/sdb

(debug) output:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.124-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

 === START OF READ SMART DATA SECTION ===
 SMART Attributes Data Structure revision number: 16
 Vendor Specific SMART Attributes with Thresholds:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
   5 Reallocated_Sector_Ct   0x0032   100   100   001    Old_age   Always       -       0
   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       6549
  12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       27
 170 Reserved_Block_Pct      0x0033   100   100   010    Pre-fail  Always       -       0
 171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
 172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
 173 Avg_Block-Erase_Count   0x0032   098   098   000    Old_age   Always       -       129
 174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       26
 183 SATA_Int_Downshift_Ct   0x0032   100   100   000    Old_age   Always       -       0
 184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       155
 194 Temperature_Celsius     0x0022   066   057   000    Old_age   Always       -       34 (Min/Max 16/43)
 195 Hardware_ECC_Recovered  0x0032   100   100   000    Old_age   Always       -       0
 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
 197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
 199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
 202 Percent_Lifetime_Remain 0x0030   098   098   001    Old_age   Offline      -       2
 206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
 246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       166575160859
 247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       5213407814
 248 Bckgnd_Program_Page_Cnt 0x0032   100   100   000    Old_age   Always       -       373948235
 180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   000    Pre-fail  Always       -       2161
 210 RAIN_Success_Recovered  0x0032   100   100   000    Old_age   Always       -       0
 211 Integ_Scan_Complete_Cnt 0x0032   100   100   000    Old_age   Always       -       63
 212 Integ_Scan_Folding_Cnt  0x0032   100   100   000    Old_age   Always       -       1

(debug) Raw Check List ATA: Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count,Erase_Fail_Count_Total
(debug) Raw Check List NVMe: Media_and_Data_Integrity_Errors
(debug) Exclude List for Checks: 
(debug) Exclude List for Perfdata: 
(debug) Warning Thresholds:
Percent_Lifetime_Remain=90

(debug) Raw_Read_Error_Rate not in raw check list (raw value: 0)

(debug) Reallocated_Sector_Ct is OK (0)

(debug) Power_On_Hours not in raw check list (raw value: 6549)

(debug) Power_Cycle_Count not in raw check list (raw value: 27)

(debug) Reserved_Block_Pct not in raw check list (raw value: 0)

(debug) Program_Fail_Count not in raw check list (raw value: 0)

(debug) Erase_Fail_Count not in raw check list (raw value: 0)

(debug) Avg_Block-Erase_Count not in raw check list (raw value: 129)

(debug) Unexpect_Power_Loss_Ct not in raw check list (raw value: 26)

(debug) SATA_Int_Downshift_Ct not in raw check list (raw value: 0)

(debug) End-to-End_Error not in raw check list (raw value: 0)

(debug) Reported_Uncorrect is OK (0)

(debug) Command_Timeout not in raw check list (raw value: 155)

(debug) Temperature_Celsius not in raw check list (raw value: 34)

(debug) Hardware_ECC_Recovered not in raw check list (raw value: 0)

(debug) Reallocated_Event_Count is OK (0)

(debug) Current_Pending_Sector is OK (0)

(debug) Offline_Uncorrectable is OK (0)

(debug) UDMA_CRC_Error_Count not in raw check list (raw value: 0)

(debug) Percent_Lifetime_Remain is non-zero (2) but less than 90

(debug) Write_Error_Rate not in raw check list (raw value: 0)

(debug) Total_LBAs_Written not in raw check list (raw value: 166575160859)

(debug) Host_Program_Page_Count not in raw check list (raw value: 5213407814)

(debug) Bckgnd_Program_Page_Cnt not in raw check list (raw value: 373948235)

(debug) Unused_Rsvd_Blk_Cnt_Tot not in raw check list (raw value: 2161)

(debug) RAIN_Success_Recovered not in raw check list (raw value: 0)

(debug) Integ_Scan_Complete_Cnt not in raw check list (raw value: 63)

(debug) Integ_Scan_Folding_Cnt not in raw check list (raw value: 1)

(debug) gathered perfdata:

###########################################################
(debug) LOCAL STATUS: OK, FINAL STATUS: OK
###########################################################

###########################################################
(debug) CHECK 1: getting overall SMART health status for /dev/sda 
###########################################################

(debug) executing:
sudo /usr/sbin/smartctl -d auto -Hi /dev/sda

(debug) output:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.124-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

 === START OF INFORMATION SECTION ===
 Model Family:     Micron 5100 Pro / 52x0 / 5300 SSDs
 Device Model:     Micron_5300_MTFDDAK480TDS
 Serial Number:    22263A2BB83E
 LU WWN Device Id: 5 00a075 13a2bb83e
 Firmware Version: D3MU001
 User Capacity:    480,103,981,056 bytes [480 GB]
 Sector Sizes:     512 bytes logical, 4096 bytes physical
 Rotation Rate:    Solid State Device
 Form Factor:      2.5 inches
 TRIM Command:     Available, deterministic, zeroed
 Device is:        In smartctl database [for details use: -P show]
 ATA Version is:   ACS-4 (minor revision not indicated)
 SATA Version is:  SATA 3.3, 6.0 Gb/s (current: 6.0 Gb/s)
 Local Time is:    Mon Sep 18 11:59:31 2023 CEST
 SMART support is: Available - device has SMART capability.
 SMART support is: Enabled

 === START OF READ SMART DATA SECTION ===
 SMART overall-health self-assessment test result: PASSED

(debug) parsing line:
Device Model:     Micron_5300_MTFDDAK480TDS

(debug) found model:  Micron_5300_MTFDDAK480TDS

(debug) parsing line:
Serial Number:    22263A2BB83E

(debug) Hiding serial number

(debug) found serial number <HIDDEN>

(debug) parsing line:
SMART overall-health self-assessment test result: PASSED

(debug) found string 'PASSED'; status OK
###########################################################
(debug) CHECK 2: getting silent SMART health check
###########################################################

(debug) executing:
sudo /usr/sbin/smartctl -d auto -q silent -A /dev/sda

(debug) exit code:
0

(debug) zero exit code, status OK

###########################################################
(debug) CHECK 3: getting detailed statistics from attributes
(debug) information contains a few more potential trouble spots
(debug) plus, we can also use the information for perfdata/graphing
###########################################################

(debug) executing:
sudo /usr/sbin/smartctl -d auto -A /dev/sda

(debug) output:
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-5.15.124-ovh-vps-grsec-zfs-classid] (local build)
 Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

 === START OF READ SMART DATA SECTION ===
 SMART Attributes Data Structure revision number: 16
 Vendor Specific SMART Attributes with Thresholds:
 ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x002f   100   100   050    Pre-fail  Always       -       0
   5 Reallocated_Sector_Ct   0x0032   100   100   001    Old_age   Always       -       0
   9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       6549
  12 Power_Cycle_Count       0x0032   100   100   001    Old_age   Always       -       27
 170 Reserved_Block_Pct      0x0033   100   100   010    Pre-fail  Always       -       0
 171 Program_Fail_Count      0x0032   100   100   000    Old_age   Always       -       0
 172 Erase_Fail_Count        0x0032   100   100   001    Old_age   Always       -       0
 173 Avg_Block-Erase_Count   0x0032   098   098   000    Old_age   Always       -       129
 174 Unexpect_Power_Loss_Ct  0x0032   100   100   000    Old_age   Always       -       26
 183 SATA_Int_Downshift_Ct   0x0032   100   100   000    Old_age   Always       -       0
 184 End-to-End_Error        0x0032   100   100   000    Old_age   Always       -       0
 187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
 188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       155
 194 Temperature_Celsius     0x0022   065   057   000    Old_age   Always       -       35 (Min/Max 16/43)
 195 Hardware_ECC_Recovered  0x0032   100   100   000    Old_age   Always       -       0
 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
 197 Current_Pending_Sector  0x0032   100   100   000    Old_age   Always       -       0
 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
 199 UDMA_CRC_Error_Count    0x0032   100   100   000    Old_age   Always       -       0
 202 Percent_Lifetime_Remain 0x0030   098   098   001    Old_age   Offline      -       2
 206 Write_Error_Rate        0x000e   100   100   000    Old_age   Always       -       0
 246 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       166523290925
 247 Host_Program_Page_Count 0x0032   100   100   000    Old_age   Always       -       5211799331
 248 Bckgnd_Program_Page_Cnt 0x0032   100   100   000    Old_age   Always       -       377450209
 180 Unused_Rsvd_Blk_Cnt_Tot 0x0033   100   100   000    Pre-fail  Always       -       2161
 210 RAIN_Success_Recovered  0x0032   100   100   000    Old_age   Always       -       0
 211 Integ_Scan_Complete_Cnt 0x0032   100   100   000    Old_age   Always       -       63
 212 Integ_Scan_Folding_Cnt  0x0032   100   100   000    Old_age   Always       -       0

(debug) Raw Check List ATA: Current_Pending_Sector,Reallocated_Sector_Ct,Program_Fail_Cnt_Total,Uncorrectable_Error_Cnt,Offline_Uncorrectable,Runtime_Bad_Block,Reported_Uncorrect,Reallocated_Event_Count,Erase_Fail_Count_Total
(debug) Raw Check List NVMe: Media_and_Data_Integrity_Errors
(debug) Exclude List for Checks: 
(debug) Exclude List for Perfdata: 
(debug) Warning Thresholds:
Percent_Lifetime_Remain=90

(debug) Raw_Read_Error_Rate not in raw check list (raw value: 0)

(debug) Reallocated_Sector_Ct is OK (0)

(debug) Power_On_Hours not in raw check list (raw value: 6549)

(debug) Power_Cycle_Count not in raw check list (raw value: 27)

(debug) Reserved_Block_Pct not in raw check list (raw value: 0)

(debug) Program_Fail_Count not in raw check list (raw value: 0)

(debug) Erase_Fail_Count not in raw check list (raw value: 0)

(debug) Avg_Block-Erase_Count not in raw check list (raw value: 129)

(debug) Unexpect_Power_Loss_Ct not in raw check list (raw value: 26)

(debug) SATA_Int_Downshift_Ct not in raw check list (raw value: 0)

(debug) End-to-End_Error not in raw check list (raw value: 0)

(debug) Reported_Uncorrect is OK (0)

(debug) Command_Timeout not in raw check list (raw value: 155)

(debug) Temperature_Celsius not in raw check list (raw value: 35)

(debug) Hardware_ECC_Recovered not in raw check list (raw value: 0)

(debug) Reallocated_Event_Count is OK (0)

(debug) Current_Pending_Sector is OK (0)

(debug) Offline_Uncorrectable is OK (0)

(debug) UDMA_CRC_Error_Count not in raw check list (raw value: 0)

(debug) Percent_Lifetime_Remain is non-zero (2) but less than 90

(debug) Write_Error_Rate not in raw check list (raw value: 0)

(debug) Total_LBAs_Written not in raw check list (raw value: 166523290925)

(debug) Host_Program_Page_Count not in raw check list (raw value: 5211799331)

(debug) Bckgnd_Program_Page_Cnt not in raw check list (raw value: 377450209)

(debug) Unused_Rsvd_Blk_Cnt_Tot not in raw check list (raw value: 2161)

(debug) RAIN_Success_Recovered not in raw check list (raw value: 0)

(debug) Integ_Scan_Complete_Cnt not in raw check list (raw value: 63)

(debug) Integ_Scan_Folding_Cnt not in raw check list (raw value: 0)

(debug) gathered perfdata:

###########################################################
(debug) LOCAL STATUS: OK, FINAL STATUS: OK
###########################################################

(debug) final status/output: OK
(debug) drives  ok: [/dev/sdb] - Device is clean [/dev/sdb] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90) [/dev/sda] - Device is clean [/dev/sda] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90)
(debug) drives nok: 
(debug)   msg_list: [/dev/sdb] - Device is clean [/dev/sdb] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90)^[/dev/sda] - Device is clean [/dev/sda] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90)

OK: [/dev/sdb] - Device is clean [/dev/sdb] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90) --- [/dev/sda] - Device is clean [/dev/sda] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90)|

Napsty commented 1 year ago

To me it looks like the correct behaviour. Both your drives sda and sdb have a Percent_Lifetime_Remain value of 2:

 202 Percent_Lifetime_Remain 0x0030   098   098   001    Old_age   Offline      -       2
 202 Percent_Lifetime_Remain 0x0030   098   098   001    Old_age   Offline      -       2

The attribute list can be seen in the debug output.

So to test the warning threshold, you must set it equal to or lower than 2:

./check_smart.pl --skip-load-cycles -l -i auto -g '/dev/{sdb,sda}' -w Percent_Lifetime_Remain=2 --debug --hide-sn

Please try that and comment here again with your findings.

PS: I just noticed that --hide-sn didn't properly work. But that's another issue to look at ;-)
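
The comparison described in this comment (the raw value is reported as OK until it reaches the warning threshold) can be sketched in Python as follows. This is a minimal illustration, not the plugin's Perl code: the function name evaluate_attribute is hypothetical, and the wording of the WARNING message is an assumption. The two OK messages mirror the output shown in this thread.

```python
def evaluate_attribute(name, raw_value, threshold):
    """Return (status, message) for one raw SMART value and its warning threshold.

    Hypothetical helper; it warns once raw_value reaches the threshold.
    """
    if raw_value == 0:
        return "OK", f"{name} is OK (0)"
    if raw_value >= threshold:
        # The wording of this message is an assumption, not plugin output.
        return "WARNING", f"{name} is non-zero ({raw_value})"
    return "OK", f"{name} is non-zero ({raw_value}) (but less than threshold {threshold})"


# Raw value 2 stays OK with the default threshold 90, but warns with threshold 2.
for threshold in (90, 2):
    print(threshold, evaluate_attribute("Percent_Lifetime_Remain", 2, threshold))
```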

ymartin-ovh commented 1 year ago

No, there is an issue in your patch:

./check_smart.pl --skip-load-cycles -l -i auto -g '/dev/{sdb,sda}' -w Percent_Lifetime_Remain=85
OK: [/dev/sdb] - Device is clean [/dev/sdb] - Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90) --- [/dev/sda] -

I set 85, but the output mentions 90 => but less than threshold 90

Also, in SMART, the raw value of this attribute is inverted relative to the actual remaining lifetime percentage. This is explained in the drive datasheet and also in the check_smart Perl code.
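
To make the inversion concrete, here is a small Python sketch that reads one attribute line from the smartctl -A output above and derives a remaining-lifetime figure. It assumes that, for this drive, the raw value counts the percentage of lifetime already used, which is consistent with VALUE 098 and RAW_VALUE 2 in the output but is not confirmed for other models. Both function names are hypothetical.

```python
def parse_attribute_line(line):
    """Parse one row of the smartctl -A attribute table into a small dict.

    Columns: ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
    """
    fields = line.split()
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),  # normalized value, e.g. "098" -> 98
        "raw": int(fields[9]),    # first token of RAW_VALUE
    }


def lifetime_remaining_percent(raw_used):
    """Assumption: the raw value is the percentage of lifetime already used."""
    return 100 - raw_used


line = "202 Percent_Lifetime_Remain 0x0030   098   098   001    Old_age   Offline      -       2"
attr = parse_attribute_line(line)
print(attr["name"], "raw:", attr["raw"], "remaining:", lifetime_remaining_percent(attr["raw"]))
```

Under that assumption, a warning threshold of 90 on the raw value corresponds to 10% of lifetime remaining.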

Napsty commented 1 year ago

I set 85, but the output mentions 90 => but less than threshold 90

Ah yes, now I see it.

Napsty commented 1 year ago

Let me try to comprehend the issue correctly.

When you enable the Percent_Lifetime_Remain check with -l, the check works and alerts automatically when the value reaches 90. If the value is below 90, the plugin outputs the value but stays below the warning level:

$ ./check_smart.pl -d /dev/sda -i auto --debug -l
[...]
(debug) Warning Thresholds:
Percent_Lifetime_Remain=90
[...]
(debug) Percent_Lifetime_Remain is non-zero (2) but less than 90
[...]
OK: Drive  Samsung SSD 850 EVO 500GB S/N XXX: no SMART errors detected.  Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90)|Reallocated_Sector_Ct=0 Power_On_Hours=26002 Power_Cycle_Count=934 Wear_Leveling_Count=35 Used_Rsvd_Blk_Cnt_Tot=0 Program_Fail_Cnt_Total=0 Erase_Fail_Count_Total=0 Runtime_Bad_Block=0 Uncorrectable_Error_Cnt=0 Airflow_Temperature_Cel=32 ECC_Error_Rate=0 CRC_Error_Count=0 Percent_Lifetime_Remain=2 POR_Recovery_Count=12 Total_LBAs_Written=41523523747

But when you want to override the Percent_Lifetime_Remain threshold (let's say with 50), your own threshold is overwritten again with 90:

$ ./check_smart.pl -d /dev/sda -i auto --debug -l -w "Percent_Lifetime_Remain=50"
[...]
(debug) Warning Thresholds:
Percent_Lifetime_Remain=90
[...]
(debug) Percent_Lifetime_Remain is non-zero (2) but less than 90
[...]
OK: Drive  Samsung SSD 850 EVO 500GB S/N XXX: no SMART errors detected.  Percent_Lifetime_Remain is non-zero (2) (but less than threshold 90)|Reallocated_Sector_Ct=0 Power_On_Hours=26002 Power_Cycle_Count=934 Wear_Leveling_Count=35 Used_Rsvd_Blk_Cnt_Tot=0 Program_Fail_Cnt_Total=0 Erase_Fail_Count_Total=0 Runtime_Bad_Block=0 Uncorrectable_Error_Cnt=0 Airflow_Temperature_Cel=32 ECC_Error_Rate=0 CRC_Error_Count=0 Percent_Lifetime_Remain=2 POR_Recovery_Count=12 Total_LBAs_Written=41523523747

Is that the problem this issue is about? Or did I misunderstand something?

Note: I faked the smartctl output on this drive, as the Samsung SSDs don't have a Percent_Lifetime_Remain attribute.

ymartin-ovh commented 1 year ago

Initially, my issue was that when -w is used with another threshold definition such as Reallocated_Sector_Ct, Percent_Lifetime_Remain=90 is not pushed to the warn_list (see: https://github.com/Napsty/check_smart/blob/master/check_smart.pl#L231)

Napsty commented 1 year ago

when -w is used with another threshold definition like Reallocated_Sector_Ct, Percent_Lifetime_Remain=90 is not pushed in the warn_list

Yep, but this should now work.

$ ./check_smart.pl -d /dev/sda -i auto --debug -l -w "Uncorrectable_Error_Cnt=10,Reallocated_Sector_Ct=10"
[...]
(debug) Warning Thresholds:
Percent_Lifetime_Remain=90
Reallocated_Sector_Ct=10
Uncorrectable_Error_Cnt=10
[...]

Can you confirm with the latest version? -> https://raw.githubusercontent.com/Napsty/check_smart/issue-92/check_smart.pl

ymartin-ovh commented 1 year ago

when -w is used with another threshold definition like Reallocated_Sector_Ct, Percent_Lifetime_Remain=90 is not pushed in the warn_list

Yep, but this should now work.

$ ./check_smart.pl -d /dev/sda -i auto --debug -l -w "Uncorrectable_Error_Cnt=10,Reallocated_Sector_Ct=10"
[...]
(debug) Warning Thresholds:
Percent_Lifetime_Remain=90
Reallocated_Sector_Ct=10
Uncorrectable_Error_Cnt=10
[...]

Can you confirm with the latest version? -> https://raw.githubusercontent.com/Napsty/check_smart/issue-92/check_smart.pl

No, your patch overwrites the user-given value because the default is pushed at the tail of the warn_list. The default value should be at the head of the list to do this properly. In the end, I have provided a fix in #93.
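
The head-versus-tail point can be illustrated with a short Python sketch. It assumes that threshold entries are applied in list order and that a later entry for the same attribute replaces an earlier one; that assumption is inferred from this discussion, not verified against the Perl code. All names below are hypothetical.

```python
DEFAULT_LIFETIME = ("Percent_Lifetime_Remain", 90)


def effective_thresholds(warn_list):
    """Apply entries in order; a later entry for the same attribute wins (assumption)."""
    thresholds = {}
    for name, value in warn_list:
        thresholds[name] = value
    return thresholds


def build_warn_list_tail(user_pairs):
    """Behaviour reported as buggy: the default is appended after the user's values."""
    return list(user_pairs) + [DEFAULT_LIFETIME]


def build_warn_list_head(user_pairs):
    """Approach suggested here: the default goes first, so user values override it."""
    return [DEFAULT_LIFETIME] + list(user_pairs)


user = [("Percent_Lifetime_Remain", 85)]
print("tail:", effective_thresholds(build_warn_list_tail(user)))
print("head:", effective_thresholds(build_warn_list_head(user)))
```

With the default at the head, a user who only sets another attribute still gets the default of 90 for Percent_Lifetime_Remain, while an explicit Percent_Lifetime_Remain value takes precedence.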

Regards

Napsty commented 1 year ago

Thx for the PR. Please set your if condition in line 231: https://github.com/Napsty/check_smart/blob/master/check_smart.pl#L231

This way the Percent_Lifetime_Remain threshold is only set once and added to the warn_list array from the beginning.

ymartin-ovh commented 1 year ago

The if condition at line 231 is no longer needed, as it is implemented at line 240 in #93.

Napsty commented 1 year ago

Just tested it locally, lgtm

  1. Using -l: sets Percent_Lifetime_Remain=90 in warn_list ✔
  2. Setting a different threshold with -l -w "Percent_Lifetime_Remain=70,CRC_Error_Count=10" works ✔
  3. Setting only another attribute's threshold with -l -w "CRC_Error_Count=10" uses the default threshold of 90 again for Percent_Lifetime_Remain ✔

Napsty commented 1 year ago

Fixed with #93