FireDrunk / ZFSmond

Tiny ZFS Web Interface written in AngularJS and Flask Restful
GNU General Public License v3.0

Exceptions returned from smartctl not caught/handled (crash) #5

Closed louwrentius closed 9 years ago

louwrentius commented 9 years ago

File "/usr/local/lib/python2.7/dist-packages/pySMART/device.py", line 479, in update
    self.serial = line.split(':')[1].split()[0].rstrip()
IndexError: list index out of range

FireDrunk commented 9 years ago

Hmm, the error is in the pySMART library, which makes it hard to fix. Perhaps I can contact the original author, but I doubt he will fix it.

louwrentius commented 9 years ago

No, I think you need to put a try/except IndexError around your code to catch this error. I notice that Debian Wheezy is also using an older smartctl version.

FireDrunk commented 9 years ago

I will put the try/except in, but that won't fix the underlying issue; it will only stop the code from crashing, and it still won't yield results.
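A minimal sketch of the guard discussed here, not the code from the actual commit. The helper `safe_load`, the logger name, and `broken_loader` are hypothetical; `loader` stands in for whatever builds the device object in the calling code, and `broken_loader` only simulates the parse failure from the traceback above. As noted, this stops the crash, but the affected disk still yields no data.

```python
import logging

log = logging.getLogger("zfsmond")  # logger name is an assumption


def safe_load(loader, name):
    """Call loader(name); return None instead of crashing on a parse error."""
    try:
        return loader(name)
    except IndexError:
        # The library could not parse a smartctl line; report it and carry on.
        log.warning("could not parse smartctl output for %s", name)
        return None


def broken_loader(name):
    """Hypothetical stand-in that fails the same way as the reported traceback."""
    return "Serial Number:".split(':')[1].split()[0]


print(safe_load(broken_loader, "/dev/sdd"))  # None
```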

FireDrunk commented 9 years ago

Fixed in commit: https://github.com/FireDrunk/ZFSmond/commit/fcb239da8767c24f525b1d8561cc3f3a9665598d

mth309 commented 9 years ago

Hey guys, I'm the author of pySMART. I just happened to stumble into this thread while Googling, and I'd love to fix the issue you found. I can see where this would crash if smartctl prints "Serial Number:" and then only whitespace (or nothing) following. I've never seen that behavior before; I've seen smartctl skip printing some of the lines when a value doesn't exist, so it seemed logical to assume that if it printed a line there'd be a non-whitespace/non-null value to parse on that line...

Louwrentius, do you happen to have the output of smartctl for the device that was causing the crash? Do you know what version of smartctl you were using, on which OS? It looked like maybe Debian Wheezy from your post above, which might have 5.41 by default? The minimum version I've tested with on Linux is 5.42, but I doubt that's related to this issue. It seems like maybe your device is actually reporting an all-whitespace serial number to smartctl (?), and my code crashes because I never expected to have to parse that. :) I just want to be sure this is what's really going on, and most likely I'll make the other line parsings more robust to prevent these kinds of problems in the future. Thanks!

louwrentius commented 9 years ago

Hi,

No problem: here is the requested output. This box is Ubuntu.

root@server:/usr/src/zfsmond# dpkg -l | grep -i smart

ii libatasmart4 0.18-3 ATA S.M.A.R.T. reading and parsing library

ii smartmontools 5.41+svn3365-1 control and monitor storage systems using S.M.A.R.T.

root@server:/usr/src/zfsmond# cat /etc/u

ucf.conf udev/ ufw/ updatedb.conf update-manager/ update-motd.d/

root@server:/usr/src/zfsmond# cat /etc/debian_version

wheezy/sid

root@server:/usr/src/zfsmond#

root@server:/usr/src/zfsmond# show disk -smp

-----------------------------------------------------------------------

| Dev | Model | GB | /dev/disk/by-path |

-----------------------------------------------------------------------

| sda | ST250LM004 HN-M250MBB | 250 | pci-0000:00:1f.2-scsi-0:0:0:0 |

| sdb | SAMSUNG HM250JI | 250 | pci-0000:00:1f.2-scsi-1:0:0:0 |

| sdc | OCZ-VERTEX2 | 60 | pci-0000:00:1f.2-scsi-2:0:0:0 |

| sdd | ST2000DM001-9YN164 | 2000 | pci-0000:03:04.0-scsi-0:0:0:0 |

| sde | ST2000DM001-1CH164 | 2000 | pci-0000:03:04.0-scsi-0:0:1:0 |

| sdf | ST2000DM001-9YN164 | 2000 | pci-0000:03:04.0-scsi-0:0:2:0 |

| sdg | ST2000DM001-9YN164 | 2000 | pci-0000:03:04.0-scsi-0:0:3:0 |

| sdh | ST2000DM001-1ER164 | 2000 | pci-0000:03:04.0-scsi-0:0:4:0 |

| sdi | ST2000DM001-9YN164 | 2000 | pci-0000:03:04.0-scsi-0:0:5:0 |

| zd0 | | 536 | |

| zd16 | | 536 | |

| zd32 | | 536 | |

-----------------------------------------------------------------------

root@server:/usr/src/zfsmond# smart -a -d ata /dev/disk/by-path/pci-0000:03:04.0-scsi-0:0:0:0

pci-0000:03:04.0-scsi-0:0:0:0 pci-0000:03:04.0-scsi-0:0:0:0-part1 pci-0000:03:04.0-scsi-0:0:0:0-part9

root@server:/usr/src/zfsmond# smart -a -d ata /dev/disk/by-path/pci-0000:03:04.0-scsi-0:0:0:0

The program 'smart' is currently not installed. You can install it by typing:

apt-get install smartpm-core

root@server:/usr/src/zfsmond# smartctl -a -d ata /dev/disk/by-path/pci-0000:03:04.0-scsi-0:0:0:0

smartctl 5.41 2011-06-09 r3365 x86_64-linux-3.2.0-68-generic

Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF INFORMATION SECTION ===

Device Model: ST2000DM001-9YN164

Serial Number: Z1E0RR08

LU WWN Device Id: 5 000c50 04d49e7ef

Firmware Version: CC4C

User Capacity: 2,000,398,934,016 bytes [2.00 TB]

Sector Sizes: 512 bytes logical, 4096 bytes physical

Device is: Not in smartctl database [for details use: -P showall]

ATA Version is: 8

ATA Standard is: ATA-8-ACS revision 4

Local Time is: Sat Jun 13 14:08:21 2015 CEST

SMART support is: Available - device has SMART capability.

SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===

SMART overall-health self-assessment test result: PASSED

General SMART Values:

Offline data collection status: (0x00) Offline data collection activity

Self-test execution status: ( 0) The previous self-test routine completed

Total time to complete Offline

data collection: ( 575) seconds.

Offline data collection

capabilities: (0x73) SMART execute Offline immediate.

SMART capabilities: (0x0003) Saves SMART data before entering

Error logging capability: (0x01) Error logging supported.

Short self-test routine

recommended polling time: ( 1) minutes.

Extended self-test routine

recommended polling time: ( 224) minutes.

Conveyance self-test routine

recommended polling time: ( 2) minutes.

SCT capabilities: (0x3085) SCT Status supported.

SMART Attributes Data Structure revision number: 10

Vendor Specific SMART Attributes with Thresholds:

ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE

183 Runtime_Bad_Block 0x0032 100 100 000 Old_age Always

184 End-to-End_Error 0x0032 100 100 099 Old_age Always

187 Reported_Uncorrect 0x0032 100 100 000 Old_age Always

188 Command_Timeout 0x0032 100 100 000 Old_age Always

189 High_Fly_Writes 0x003a 099 099 000 Old_age Always

190 Airflow_Temperature_Cel 0x0022 069 057 045 Old_age Always

191 G-Sense_Error_Rate 0x0032 100 100 000 Old_age Always

192 Power-Off_Retract_Count 0x0032 100 100 000 Old_age Always

193 Load_Cycle_Count 0x0032 056 056 000 Old_age Always

194 Temperature_Celsius 0x0022 031 043 000 Old_age Always

197 Current_Pending_Sector 0x0012 100 100 000 Old_age Always

198 Offline_Uncorrectable 0x0010 100 100 000 Old_age Offline

199 UDMA_CRC_Error_Count 0x003e 200 200 000 Old_age Always

240 Head_Flying_Hours 0x0000 100 253 000 Old_age Offline

241 Total_LBAs_Written 0x0000 100 253 000 Old_age Offline

242 Total_LBAs_Read 0x0000 100 253 000 Old_age Offline

SMART Error Log Version: 1

No Errors Logged

SMART Self-test log structure revision number 1

No self-tests have been logged. [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1

Selective self-test flags (0x0):

If Selective self-test is pending on power-up, resume after 0 minute delay.


mth309 commented 9 years ago

Thank you for providing all of that information. The only thing that still has me confused is that your example doesn’t appear to show a scenario which the existing code can’t handle. As shown below, pySMART should parse the 'Z1E0RR08' serial number out of that file just fine. I searched the file for the word “Serial” to see if it came up on a second line, and maybe that was the line causing the crash, but it’s only in there once.
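The manual search described above could be automated along these lines. This is only a sketch: `find_unparseable` is a hypothetical helper, not part of pySMART, and it reuses the split-based expression from the traceback to flag every line containing a label that the expression cannot handle. The sample lines are taken from the smartctl output posted earlier.

```python
def find_unparseable(output, label="Serial"):
    """Return lines containing `label` that the split-based parse cannot handle."""
    bad = []
    for line in output.splitlines():
        if label in line:
            try:
                # Same expression as in the reported traceback.
                line.split(':')[1].split()[0]
            except IndexError:
                bad.append(line)
    return bad


sample = "Device Model: ST2000DM001-9YN164\nSerial Number: Z1E0RR08\n"
print(find_unparseable(sample))                        # []
print(find_unparseable(sample + "Serial Number: \n"))  # ['Serial Number: ']
```

Running it over the output of every disk in the box, rather than a single one, would show whether a different device produces the offending line.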

Here are the basic tests that I expect pySMART to pass that seem to include the file you provided:

>>> line = "Serial Number: Z1E0RR08"  # Test w/ space delimiter
>>> if 'Serial Number' in line or 'Serial number' in line:
...     serial = line.split(':')[1].split()[0].rstrip()
...
>>> serial
'Z1E0RR08'  # Correct

>>> line = "Serial Number:\tZ1E0RR08"  # Test w/ tab delimiter
>>> print(line)
Serial Number: Z1E0RR08
>>> if 'Serial Number' in line or 'Serial number' in line:
...     serial = line.split(':')[1].split()[0].rstrip()
...
>>> serial
'Z1E0RR08'  # Correct

>>> line = "Serial Number: \t\t \t Z1E0RR08"  # Test crazy mixture of spaces & tabs
>>> if 'Serial Number' in line or 'Serial number' in line:
...     serial = line.split(':')[1].split()[0].rstrip()
...
>>> serial
'Z1E0RR08'  # Correct

Now here is a test that based on your problem report I expected to see and fail. Specifically, a device reporting an all-whitespace (or null) value to the right of the colon:

>>> line = "Serial Number: "  # Test w/ all whitespace
>>> if 'Serial Number' in line or 'Serial number' in line:
...     serial = line.split(':')[1].split()[0].rstrip()
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
IndexError: list index out of range  # The crash you reported

In order to fix this, I could easily wrap all parsing statements in a try/except:

>>> serial = None  # serial is initialized to None in Device.__init__()
>>> line = "Serial Number: "  # Test whitespace again
>>> if 'Serial Number' in line or 'Serial number' in line:
...     try:  # easy fix?
...         serial = line.split(':')[1].split()[0].rstrip()
...     except IndexError:
...         pass  # No need to do anything, just don’t crash
...
>>> print serial
None  # Prints fine when used later

This is more robust regardless, so I’ll probably go through and do it anyway, but my concern is that for the example file you provided this doesn’t seem necessary? I want to be sure that I’m correcting the issue you experienced, as opposed to just what I assumed the issue might be. For example, if there’s a parseable serial number being printed, but somehow crashing my parsing statement, I’d rather fix the parsing statement to correctly extract it than just fall back on “None”. Ideally, the combination of both fixes would be best, but I’d need to see a “valid” serial number line that confuses my parser.

Thank you, Marc
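One way the parsing statement itself could be made tolerant of blank values, so that no try/except is needed for this case. This is a sketch under assumptions, not pySMART's actual code: `parse_first_token` is a hypothetical name, and it keeps the original behavior of returning only the first whitespace-delimited token after the colon.

```python
def parse_first_token(line):
    """Return the first token after the first colon, or None if there is none."""
    # partition never raises: a missing colon simply yields an empty value.
    _, _, value = line.partition(':')
    tokens = value.split()
    return tokens[0] if tokens else None


print(parse_first_token("Serial Number: Z1E0RR08"))   # Z1E0RR08
print(parse_first_token("Serial Number:\tZ1E0RR08"))  # Z1E0RR08
print(parse_first_token("Serial Number: "))           # None
```

Returning `None` matches the default that `serial` is initialized to, so later code that prints or compares the value behaves the same as with the try/except variant.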
