AnalogJ / scrutiny

Hard Drive S.M.A.R.T Monitoring, Historical Trends & Real World Failure Thresholds
MIT License
5.3k stars 170 forks source link

[BUG] Duplicate /dev/sdc entry after restarting the container #178

Closed ltctceplrm closed 2 years ago

ltctceplrm commented 3 years ago

Describe the bug I've got one specific hard drive (/dev/sdc) that has some difficulty being picked up by smartmontools but gets noticed correctly right away by scrutiny, I've got that hard drive another in my docker compose file but whenever I restart the container /dev/sdc is duplicates, one with SMART data and one without. I can fix this by removing the empty /dev/sdc entry in the database and starting scrutiny up but next reboot it's again duplicated.

Expected behavior Since /dev/sdc is already in the database it shouldn't duplicate it after restarting the container.

Screenshots Screenshot of the webui Screenshot of how the database looks

Log Files Since this is the opposite of a missing device and the SMART data works fine I'm not sure if you want me to run it in collector mode or not? But looking through docker stats I see the following mentions of /dev/sdc with the creation date for the duplicated entry:

time="2021-05-06T18:47:15+02:00" level=info msg="Verifying required tools" type=metrics
time="2021-05-06T18:47:15+02:00" level=info msg="Executing command: smartctl --scan -j" type=metrics

 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                               dev-0.3.12

time="2021-05-06T18:47:15+02:00" level=info msg="Executing command: smartctl --info -j /dev/sda" type=metrics
time="2021-05-06T18:47:15+02:00" level=info msg="Generating WWN" type=metrics
time="2021-05-06T18:47:15+02:00" level=info msg="Executing command: smartctl --info -j -d sat /dev/sdb" type=metrics
time="2021-05-06T18:47:16+02:00" level=info msg="Generating WWN" type=metrics
time="2021-05-06T18:47:16+02:00" level=info msg="Executing command: smartctl --info -j -d sat /dev/sdc" type=metrics
time="2021-05-06T18:47:25+02:00" level=error msg="Could not retrieve device information for sdc: exit status 2" type=metrics
time="2021-05-06T18:47:25+02:00" level=info msg="Sending detected devices to API, for filtering & validation" type=metrics

2021/05/06 18:47:25 /scrutiny/webapp/backend/pkg/web/handler/register_devices.go:34
[135.502ms] [rows:1] INSERT INTO `devices` (`created_at`,`updated_at`,`deleted_at`,`wwn`,`host_id`,`device_name`,`manufacturer`,`model_name`,`interface_type`,`interface_speed`,`serial_number`,`firmware`,`rotation_speed`,`capacity`,`form_factor`,`smart_support`,`device_protocol`,`device_type`) VALUES ("2021-05-06 18:47:25.36","2021-05-06 18:47:25.36",NULL,"0x5000cca369e325d4","","sda","","HARD DRIVE XXX","","6.0 Gb/s","MN1240F32H9XND","MN6OA5Q0",7200,2000398934016,"3.5 inches",false,"ATA","scsi") ON CONFLICT (`wwn`) DO UPDATE SET `host_id`=`excluded`.`host_id`,`device_name`=`excluded`.`device_name`,`device_type`=`excluded`.`device_type`

2021/05/06 18:47:25 /scrutiny/webapp/backend/pkg/web/handler/register_devices.go:34
[104.440ms] [rows:1] INSERT INTO `devices` (`created_at`,`updated_at`,`deleted_at`,`wwn`,`host_id`,`device_name`,`manufacturer`,`model_name`,`interface_type`,`interface_speed`,`serial_number`,`firmware`,`rotation_speed`,`capacity`,`form_factor`,`smart_support`,`device_protocol`,`device_type`) VALUES ("2021-05-06 18:47:25.424","2021-05-06 18:47:25.424",NULL,"","","sdc","","","","","","",0,0,"",false,"","sat") ON CONFLICT (`wwn`) DO UPDATE SET `host_id`=`excluded`.`host_id`,`device_name`=`excluded`.`device_name`,`device_type`=`excluded`.`device_type`
time="2021-05-06T18:47:25+02:00" level=info msg="127.0.0.1 - 47a3adb31669 [06/May/2021:18:47:25 +0200] \"POST /api/devices/register\" 200 1353 \"\" \"Go-http-client/1.1\" (288ms)" clientIP=127.0.0.1 hostname=47a3adb31669 latency=288 method=POST path=/api/devices/register referer= respLength=1353 statusCode=200 userAgent=Go-http-client/1.1
time="2021-05-06T18:47:25+02:00" level=info msg="Collecting smartctl results for sda\n" type=metrics
time="2021-05-06T18:47:25+02:00" level=info msg="Executing command: smartctl -x -j /dev/sda" type=metrics
time="2021-05-06T18:47:27+02:00" level=info msg="Publishing smartctl results for 0x5000cca369e325d4\n" type=metrics

2021/05/06 18:47:27 /scrutiny/webapp/backend/pkg/web/handler/upload_device_metrics.go:45
[100.598ms] [rows:1] INSERT INTO `smarts` (`created_at`,`updated_at`,`deleted_at`,`device_wwn`,`test_date`,`smart_status`,`temp`,`power_on_hours`,`power_cycle_count`) VALUES ("2021-05-06 18:47:27.771","2021-05-06 18:47:27.771",NULL,"0x5000cca369e325d4","2021-05-06 18:47:25","passed",29,17198,9299)
time="2021-05-06T18:47:27+02:00" level=info msg="127.0.0.1 - 47a3adb31669 [06/May/2021:18:47:27 +0200] \"POST /api/device/0x5000cca369e325d4/smart\" 200 16 \"\" \"Go-http-client/1.1\" (232ms)" clientIP=127.0.0.1 hostname=47a3adb31669 latency=232 method=POST path=/api/device/0x5000cca369e325d4/smart referer= respLength=16 statusCode=200 userAgent=Go-http-client/1.1
time="2021-05-06T18:47:27+02:00" level=info msg="Collecting smartctl results for sdb\n" type=metrics
time="2021-05-06T18:47:27+02:00" level=info msg="Executing command: smartctl -x -j -d sat /dev/sdb" type=metrics
time="2021-05-06T18:47:28+02:00" level=error msg="smartctl returned an error code (64) while processing sdb\n" type=metrics
time="2021-05-06T18:47:28+02:00" level=error msg="smartctl detected a error log with errors" type=metrics
time="2021-05-06T18:47:28+02:00" level=info msg="Publishing smartctl results for 0x5000cca291ea6731\n" type=metrics
time="2021-05-06T18:47:28+02:00" level=info msg="127.0.0.1 - 47a3adb31669 [06/May/2021:18:47:28 +0200] \"POST /api/device/0x5000cca291ea6731/smart\" 200 16 \"\" \"Go-http-client/1.1\" (179ms)" clientIP=127.0.0.1 hostname=47a3adb31669 latency=179 method=POST path=/api/device/0x5000cca291ea6731/smart referer= respLength=16 statusCode=200 userAgent=Go-http-client/1.1
time="2021-05-06T18:47:28+02:00" level=info msg="Collecting smartctl results for sdc\n" type=metrics
time="2021-05-06T18:47:28+02:00" level=info msg="Executing command: smartctl -x -j -d sat /dev/sdc" type=metrics
time="2021-05-06T18:47:32+02:00" level=error msg="smartctl returned an error code (2) while processing sdc\n" type=metrics
time="2021-05-06T18:47:32+02:00" level=error msg="smartctl could not open device" type=metrics
time="2021-05-06T18:47:32+02:00" level=info msg="Publishing smartctl results for \n" type=metrics

2021/05/06 18:47:32 /scrutiny/webapp/backend/pkg/web/handler/upload_device_metrics.go:31 WHERE conditions required
[0.122ms] [rows:0] UPDATE `devices` SET `created_at`="2021-05-06 18:47:25.424",`updated_at`="2021-05-06 18:47:32.335",`device_name`="sdc",`device_type`="sat"
time="2021-05-06T18:47:32+02:00" level=error msg="An error occurred while updating device data from smartctl metrics WHERE conditions required"
time="2021-05-06T18:47:32+02:00" level=error msg="127.0.0.1 - 47a3adb31669 [06/May/2021:18:47:32 +0200] \"POST /api/device//smart\" 500 17 \"\" \"Go-http-client/1.1\" (1ms)" clientIP=127.0.0.1 hostname=47a3adb31669 latency=1 method=POST path=/api/device//smart referer= respLength=17 statusCode=500 userAgent=Go-http-client/1.1
time="2021-05-06T18:47:32+02:00" level=info msg="Main: Completed" type=metrics

Please let me know if I let any identifying info in the logs, it's sometimes hard to read.

My docker compose entry is:

scrutiny: container_name: scrutiny image: hotio/scrutiny ports:

  • "8070:8080" environment:
  • PGID=${PGID}
  • PUID=${PUID}
  • TZ=${TZ}
  • UMASK=002
  • INTERVAL=86400
  • API_ENDPOINT=http://localhost:8080
  • MODE=both logging: driver: json-file options: max-file: ${DOCKERLOGGING_MAXFILE} max-size: ${DOCKERLOGGING_MAXSIZE} restart: unless-stopped
    volumes:
  • ${DOCKERCONFDIR}/scrutiny:/config
  • /run/udev:/run/udev:ro cap_add:
  • SYS_RAWIO devices:
  • /dev/sda
  • /dev/sdb
  • /dev/sdc
spupuz commented 3 years ago

I after changing some failed disk I have the same error, would be better to analyze disks based on dev-by-and not /dev/sdX. Additionally, would be better been able to remove removed device or at lease hide them

AnalogJ commented 3 years ago

Scrutiny uses the disk WWN/UUID as the primary identifier, the device name (/dev/dba) is only used for visual purposes.

I definitely want to figure out where these ghost/empty device entries are coming from (a couple of other people have submitted similar issues). Would you be willing to give me full logs from your collector runs

also, can you run the following commands:

smartctl --scan -j
smartctl --info -j -d sat /dev/sdc
smartctl -x -j /dev/sdc

Thanks!

ltctceplrm commented 3 years ago

Thank you for checking this out, here's what I get when I run the commands: smartctl --scan -j

{ "json_format_version": [ 1, 0 ], "smartctl": { "version": [ 7, 2 ], "svn_revision": "5155", "platform_info": "x86_64-linux-5.10.0-8-amd64", "build_info": "(local build)", "argv": [ "smartctl", "--scan", "-j" ], "exit_status": 0 }, "devices": [ { "name": "/dev/sda", "info_name": "/dev/sda", "type": "scsi", "protocol": "SCSI" }, { "name": "/dev/sdb", "info_name": "/dev/sdb [SAT]", "type": "sat", "protocol": "ATA" }, { "name": "/dev/sdg", "info_name": "/dev/sdg [SAT]", "type": "sat", "protocol": "ATA" } ] }

smartctl --info -j -d sat /dev/sdc

{ "json_format_version": [ 1, 0 ], "smartctl": { "version": [ 7, 2 ], "svn_revision": "5155", "platform_info": "x86_64-linux-5.10.0-8-amd64", "build_info": "(local build)", "argv": [ "smartctl", "--info", "-j", "-d", "sat", "/dev/sdc" ], "exit_status": 2 }, "device": { "name": "/dev/sdc", "info_name": "/dev/sdc [SAT]", "type": "sat", "protocol": "ATA" } }

smartctl -x -j /dev/sdc

{ "json_format_version": [ 1, 0 ], "smartctl": { "version": [ 7, 2 ], "svn_revision": "5155", "platform_info": "x86_64-linux-5.10.0-8-amd64", "build_info": "(local build)", "argv": [ "smartctl", "-x", "-j", "/dev/sdc" ], "messages": [ { "string": "/dev/sdc: Unknown USB bridge [0x058f:0x6362 (0x100)]", "severity": "error" } ], "exit_status": 1 } }

For the collector logs do I just need to add this to docker compose?

DEBUG=true
COLLECTOR_LOG_FILE=/tmp/collector.log
AnalogJ commented 2 years ago

This should be fixed in the v0.4.x releases. We check for empty data (when smartctl silently fails) and we correctly ignore "ghost" devices.

Also the UI now allows you to identify your devices with different titles:

fixed in v0.4.7 🎉