AnalogJ / scrutiny

Hard Drive S.M.A.R.T Monitoring, Historical Trends & Real World Failure Thresholds
MIT License

[BUG] failed to run collector on esxi 7 #388

Open sjtuross opened 1 year ago

sjtuross commented 1 year ago

Describe the bug

I got smartctl from smartmontools-linux-x86_64-static-7.4-r5414.tar.gz, the latest CI build as of now from https://builds.smartmontools.org, and put scrutiny-collector-metrics-linux-amd64 and collector.yaml in the same folder. I ran the quick checks below and everything looked good.

[root@esxi2:/vmfs/volumes/5f79c6f5-a7338bdc-85f3-6cb3114d162c/TEMP/smartmontools] ./scrutiny-collector-metrics-linux-amd64 -v
2022/10/31 13:58:39 No configuration file found at /opt/scrutiny/config/collector.yaml. Using Defaults.
scrutiny-collector-metrics version 0.5.0
[root@esxi2:/vmfs/volumes/5f79c6f5-a7338bdc-85f3-6cb3114d162c/TEMP/smartmontools] ./smartctl -d sat -a /dev/disks/t10.ATA_____ST8000AS00022D1NA17Z_________________________________Z840R2T2
smartctl pre-7.4 2022-10-18 r5414 [x86_64-linux-7.0.3] (CircleCI)
Copyright (C) 2002-22, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Seagate Archive HDD (SMR)
Device Model:     ST8000AS0002-1NA17Z
Serial Number:    Z840R2T2
LU WWN Device Id: 5 000c50 0929e645d
Firmware Version: RT17
User Capacity:    8,001,563,222,016 bytes [8.00 TB]
Sector Sizes:     512 bytes logical, 4096 bytes physical
Rotation Rate:    5980 rpm
Device is:        In smartctl database 7.3/5413
ATA Version is:   ACS-2, ACS-3 T13/2161-D revision 3b
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Mon Oct 31 13:59:31 2022 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x82) Offline data collection activity
                    was completed without error.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (    0) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   1) minutes.
Extended self-test routine
recommended polling time:    ( 942) minutes.
Conveyance self-test routine
recommended polling time:    (   2) minutes.
SCT capabilities:          (0x30b5) SCT Status supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   108   099   006    Pre-fail  Always       -       19517840
  3 Spin_Up_Time            0x0003   091   091   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       32
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   078   060   030    Pre-fail  Always       -       77638867
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       9605
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       32
183 Runtime_Bad_Block       0x0032   100   100   000    Old_age   Always       -       0
184 End-to-End_Error        0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Command_Timeout         0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   054   045    Old_age   Always       -       32 (Min/Max 29/35)
191 G-Sense_Error_Rate      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   097   097   000    Old_age   Always       -       7601
193 Load_Cycle_Count        0x0032   079   079   000    Old_age   Always       -       43461
194 Temperature_Celsius     0x0022   032   046   000    Old_age   Always       -       32 (0 12 0 0 0)
195 Hardware_ECC_Recovered  0x001a   108   099   000    Old_age   Always       -       19517840
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
240 Head_Flying_Hours       0x0000   100   253   000    Old_age   Offline      -       6754 (186 17 0)
241 Total_LBAs_Written      0x0000   100   253   000    Old_age   Offline      -       42506965708
242 Total_LBAs_Read         0x0000   100   253   000    Old_age   Offline      -       76619276138

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

This is collector.yaml

version: 1
host:
  id: "ESXi2"
devices:
  - device: /dev/disks/t10.ATA_____ST8000AS00022D1NA17Z_________________________________Z840R2T2
    type: 'sat'
api:
  endpoint: 'http://example.com:8080'
commands:
  metrics_smartctl_bin: './smartctl'

However, when I run ./scrutiny-collector-metrics-linux-amd64 run --config ./collector.yaml, it crashes with the log below.

BTW, this is the ESXi kernel version.

[root@esxi2:/vmfs/volumes/5f79c6f5-a7338bdc-85f3-6cb3114d162c/TEMP/smartmontools] uname -a
VMkernel esxi2 7.0.3 #1 SMP Release build-20328353 Aug 22 2022 19:41:06 x86_64 x86_64 x86_64 ESXi

Expected behavior

The collector runs on ESXi 7.


Log Files

[root@esxi2:/vmfs/volumes/5f79c6f5-a7338bdc-85f3-6cb3114d162c/TEMP/smartmontools] ./scrutiny-collector-metrics-linux-amd64 run --config ./collector.yaml
2022/10/31 13:47:23 No configuration file found at /opt/scrutiny/config/collector.yaml. Using Defaults.

 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                        linux.amd64-0.5.0

2022/10/31 13:47:23 Loading configuration file: /vmfs/volumes/MX1T/TEMP/smartmontools/collector.yaml
runtime: epollwait on fd 4 failed with 38
fatal error: runtime: netpoll failed

runtime stack:
runtime.throw({0x8a7b24?, 0xc00005f928?})
    /opt/hostedtoolcache/go/1.18.3/x64/src/runtime/panic.go:992 +0x71
runtime.netpoll(0x4e4c3ee12c363?)
    /opt/hostedtoolcache/go/1.18.3/x64/src/runtime/netpoll_epoll.go:130 +0x34e
runtime.sysmon()
    /opt/hostedtoolcache/go/1.18.3/x64/src/runtime/proc.go:5131 +0x2d5
runtime.mstart1()
    /opt/hostedtoolcache/go/1.18.3/x64/src/runtime/proc.go:1418 +0x93
runtime.mstart0()
    /opt/hostedtoolcache/go/1.18.3/x64/src/runtime/proc.go:1376 +0x79
runtime.mstart()
    /opt/hostedtoolcache/go/1.18.3/x64/src/runtime/asm_amd64.s:367 +0x5

goroutine 1 [runnable]:
encoding/json.(*encodeState).error(...)
    /opt/hostedtoolcache/go/1.18.3/x64/src/encoding/json/encode.go:338
encoding/json.unsupportedTypeEncoder(0x82eb60?, {0x82eb60?, 0xc0001e62a0?, 0xc0001f2000?}, {0x0?, 0x0?})
    /opt/hostedtoolcache/go/1.18.3/x64/src/encoding/json/encode.go:720 +0x97
encoding/json.(*encodeState).reflectValue(0x824fe0?, {0x82eb60?, 0xc0001e62a0?, 0xc0001bf450?}, {0x70?, 0xf4?})
    /opt/hostedtoolcache/go/1.18.3/x64/src/encoding/json/encode.go:360 +0x78
encoding/json.interfaceEncoder(0xc00018b200, {0x824fe0?, 0xc000130540?, 0xc0001bf4e0?}, {0xd7?, 0x9d?})
    /opt/hostedtoolcache/go/1.18.3/x64/src/encoding/json/encode.go:716 +0xc8
encoding/json.arrayEncoder.encode({0xc0001bf4f0?}, 0xc00018b200, {0x80d800?, 0xc0001242e8?, 0xbe7240?}, {0x30?, 0xf5?})
    /opt/hostedtoolcache/go/1.18.3/x64/src/encoding/json/encode.go:916 +0xb5
encoding/json.sliceEncoder.encode({0xc00012e2d8?}, 0xc00018b200, {0x80d800?, 0xc0001242e8?, 0x80d800?}, {0x88?, 0x0?})
    /opt/hostedtoolcache/go/1.18.3/x64/src/encoding/json/encode.go:889 +0x2ca
encoding/json.(*encodeState).reflectValue(0x824fe0?, {0x80d800?, 0xc0001242e8?, 0xc0001e7140?}, {0x90?, 0x43?})
    /opt/hostedtoolcache/go/1.18.3/x64/src/encoding/json/encode.go:360 +0x78
encoding/json.interfaceEncoder(0xc00018b200, {0x824fe0?, 0xc0001e8b20?, 0xc000119e01?}, {0x40?, 0x8a?})
    /opt/hostedtoolcache/go/1.18.3/x64/src/encoding/json/encode.go:716 +0xc8
encoding/json.mapEncoder.encode({0xc00012e2a8?}, 0xc00018b200, {0x830120?, 0xc0001e69c0?, 0x830120?}, {0x50?, 0x54?})
    /opt/hostedtoolcache/go/1.18.3/x64/src/encoding/json/encode.go:814 +0x583
encoding/json.(*encodeState).reflectValue(0x0?, {0x830120?, 0xc0001e69c0?, 0x40d687?}, {0x78?, 0x0?})
    /opt/hostedtoolcache/go/1.18.3/x64/src/encoding/json/encode.go:360 +0x78
encoding/json.(*encodeState).marshal(0xc0001e6ab0?, {0x830120?, 0xc0001e69c0?}, {0x0?, 0x0?})
    /opt/hostedtoolcache/go/1.18.3/x64/src/encoding/json/encode.go:332 +0xfa
encoding/json.Marshal({0x830120, 0xc0001e69c0})
    /opt/hostedtoolcache/go/1.18.3/x64/src/encoding/json/encode.go:161 +0x45
encoding/json.MarshalIndent({0x830120?, 0xc0001e69c0?}, {0x0, 0x0}, {0x94f5f8, 0x1})
    /opt/hostedtoolcache/go/1.18.3/x64/src/encoding/json/encode.go:176 +0x4a
main.main.func2(0xc0001d43c0?)
    /home/runner/work/scrutiny/scrutiny/collector/cmd/collector-metrics/collector-metrics.go:132 +0x38e
github.com/urfave/cli/v2.(*Command).Run(0xc00012aea0, 0xc000132a40)
    /home/runner/work/scrutiny/scrutiny/vendor/github.com/urfave/cli/v2/command.go:164 +0x5bb
github.com/urfave/cli/v2.(*App).RunContext(0xc0001d6000, {0x9543b0?, 0xc000134020}, {0xc000132040, 0x4, 0x4})
    /home/runner/work/scrutiny/scrutiny/vendor/github.com/urfave/cli/v2/app.go:306 +0xbc5
github.com/urfave/cli/v2.(*App).Run(...)
    /home/runner/work/scrutiny/scrutiny/vendor/github.com/urfave/cli/v2/app.go:215
main.main()
    /home/runner/work/scrutiny/scrutiny/collector/cmd/collector-metrics/collector-metrics.go:182 +0x7d9
AnalogJ commented 1 year ago

Can you run the collector in debug mode (with --debug on the CLI)? It seems to be failing to JSON-encode your config file when logging.

Everything in your collector config file looks pretty basic, except your device name.

/dev/disks/t10.ATA_____ST8000AS00022D1NA17Z_________________________________Z840R2T2

But I don't see any special characters there, so it should be fine too. Just in case, could you try quoting it in the config file?

sjtuross commented 1 year ago

Thank you for your reply. I tried quoting the device identifier, but it still crashed. This time I also got another error (occasionally) at the bottom.

I searched for the error epollwait on fd 4 failed with 38 and found this Go issue: https://github.com/golang/go/issues/24980. Do you think it's related?

Also https://groups.google.com/g/golang-nuts/c/R1vvk2pZW7w?pli=1 about the other error.
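For context, the number at the end of "epollwait on fd 4 failed with 38" is the errno the Go runtime got back from the syscall. On Linux x86-64, errno 38 is ENOSYS, which would be consistent with VMkernel not implementing epoll_wait the way the Go runtime expects, although that is an inference and not something confirmed here. The sketch below decodes that value along with errno 28 (ENOSPC), whose text is "no space left on device". The table and the describeErrno helper are hand-written for illustration and are not part of Scrutiny or the Go runtime.

```go
package main

import "fmt"

// linuxErrnoNames lists a few Linux x86-64 errno values relevant to this
// thread. It is a hand-written illustration, not an exhaustive table.
var linuxErrnoNames = map[int]string{
	28: "ENOSPC (no space left on device)",
	38: "ENOSYS (function not implemented)",
}

// describeErrno returns a readable name for a Linux errno number,
// or a fallback string when the number is not in the table.
func describeErrno(n int) string {
	if name, ok := linuxErrnoNames[n]; ok {
		return name
	}
	return fmt.Sprintf("errno %d (not in this table)", n)
}

func main() {
	// The value printed by "runtime: epollwait on fd 4 failed with 38".
	fmt.Println("epollwait:", describeErrno(38))
	// The errno whose message is "no space left on device".
	fmt.Println("fork/exec:", describeErrno(28))
}
```

If the ENOSYS reading is right, the crash would come from the Go runtime itself rather than from anything in collector.yaml.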

version: 1
host:
  id: "ESXi2"
devices:
  - device: '/dev/disks/t10.ATA_____ST8000AS00022D1NA17Z_________________________________Z840R2T2'
    type: 'sat'
api:
  endpoint: 'https://scrutiny.rossconsulting.cn'
commands:
  metrics_smartctl_bin: './smartctl'
[root@esxi2:/vmfs/volumes/5f79c6f5-a7338bdc-85f3-6cb3114d162c/TEMP/smartmontools] /vmfs/volumes/MX1T/TEMP/smartmontools/scrutiny-collector-metrics-linux-amd64 run --debug --config ./collector.yaml
2022/11/01 02:02:37 No configuration file found at /opt/scrutiny/config/collector.yaml. Using Defaults.

 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                        linux.amd64-0.5.0

2022/11/01 02:02:37 Loading configuration file: /vmfs/volumes/MX1T/TEMP/smartmontools/collector.yaml
runtime: epollwait on fd 4 failed with 38
fatal error: runtime: netpoll failed

runtime stack:
runtime.throw({0x8a7b24?, 0xc000061928?})
    /opt/hostedtoolcache/go/1.18.3/x64/src/runtime/panic.go:992 +0x71
runtime.netpoll(0x50ce312969043?)
    /opt/hostedtoolcache/go/1.18.3/x64/src/runtime/netpoll_epoll.go:130 +0x34e
runtime.sysmon()
    /opt/hostedtoolcache/go/1.18.3/x64/src/runtime/proc.go:5131 +0x2d5
runtime.mstart1()
    /opt/hostedtoolcache/go/1.18.3/x64/src/runtime/proc.go:1418 +0x93
runtime.mstart0()
    /opt/hostedtoolcache/go/1.18.3/x64/src/runtime/proc.go:1376 +0x79
runtime.mstart()
    /opt/hostedtoolcache/go/1.18.3/x64/src/runtime/asm_amd64.s:367 +0x5

goroutine 1 [runnable]:
gopkg.in/yaml%2ev2.yaml_parser_scan_plain_scalar(0xc00017e300, 0xc000195240)
    /home/runner/work/scrutiny/scrutiny/vendor/gopkg.in/yaml.v2/scannerc.go:2570 +0x140c
gopkg.in/yaml%2ev2.yaml_parser_fetch_plain_scalar(0xc00017e300)
    /home/runner/work/scrutiny/scrutiny/vendor/gopkg.in/yaml.v2/scannerc.go:1435 +0x7d
gopkg.in/yaml%2ev2.yaml_parser_fetch_next_token(0xc00017e300)
    /home/runner/work/scrutiny/scrutiny/vendor/gopkg.in/yaml.v2/scannerc.go:813 +0x65e
gopkg.in/yaml%2ev2.yaml_parser_fetch_more_tokens(0xc00017e300)
    /home/runner/work/scrutiny/scrutiny/vendor/gopkg.in/yaml.v2/scannerc.go:642 +0x19b
gopkg.in/yaml%2ev2.peek_token(...)
    /home/runner/work/scrutiny/scrutiny/vendor/gopkg.in/yaml.v2/parserc.go:47
gopkg.in/yaml%2ev2.yaml_parser_parse_document_start(0xc00017e300, 0xc00017e510, 0x1)
    /home/runner/work/scrutiny/scrutiny/vendor/gopkg.in/yaml.v2/parserc.go:200 +0x55
gopkg.in/yaml%2ev2.yaml_parser_state_machine(0x1dfdae7108?, 0x1000000000030?)
    /home/runner/work/scrutiny/scrutiny/vendor/gopkg.in/yaml.v2/parserc.go:101 +0x54
gopkg.in/yaml%2ev2.yaml_parser_parse(0x1dfdae7108?, 0x600?)
    /home/runner/work/scrutiny/scrutiny/vendor/gopkg.in/yaml.v2/parserc.go:72 +0x8c
gopkg.in/yaml%2ev2.(*parser).peek(0xc00017e300)
    /home/runner/work/scrutiny/scrutiny/vendor/gopkg.in/yaml.v2/decode.go:105 +0x30
gopkg.in/yaml%2ev2.(*parser).parse(0xc00017e300)
    /home/runner/work/scrutiny/scrutiny/vendor/gopkg.in/yaml.v2/decode.go:143 +0x45
gopkg.in/yaml%2ev2.unmarshal({0xc0001c2000, 0x100, 0x600}, {0x807b00?, 0xc00000e120}, 0x0)
    /home/runner/work/scrutiny/scrutiny/vendor/gopkg.in/yaml.v2/yaml.go:142 +0x305
gopkg.in/yaml%2ev2.Unmarshal(...)
    /home/runner/work/scrutiny/scrutiny/vendor/gopkg.in/yaml.v2/yaml.go:81
github.com/spf13/viper.(*Viper).unmarshalReader(0xc000020b40, {0x9517c0, 0xc00000e118}, 0xc0001bc0f0)
    /home/runner/work/scrutiny/scrutiny/vendor/github.com/spf13/viper/viper.go:1490 +0x57c
github.com/spf13/viper.(*Viper).MergeConfig(0xc00002a6c0?, {0x9517c0, 0xc00000e118})
    /home/runner/work/scrutiny/scrutiny/vendor/github.com/spf13/viper/viper.go:1389 +0x45
github.com/analogj/scrutiny/collector/pkg/config.(*configuration).ReadConfig(0xc000070840, {0x35e4c488e92?, 0x6?})
    /home/runner/work/scrutiny/scrutiny/collector/pkg/config/config.go:88 +0x125
main.main.func2(0xc0001a83c0?)
    /home/runner/work/scrutiny/scrutiny/collector/cmd/collector-metrics/collector-metrics.go:97 +0xaa
github.com/urfave/cli/v2.(*Command).Run(0xc000021200, 0xc00002ca40)
    /home/runner/work/scrutiny/scrutiny/vendor/github.com/urfave/cli/v2/command.go:164 +0x5bb
github.com/urfave/cli/v2.(*App).RunContext(0xc0001ac000, {0x9543b0?, 0xc000022098}, {0xc00001e0a0, 0x5, 0x5})
    /home/runner/work/scrutiny/scrutiny/vendor/github.com/urfave/cli/v2/app.go:306 +0xbc5
github.com/urfave/cli/v2.(*App).Run(...)
    /home/runner/work/scrutiny/scrutiny/vendor/github.com/urfave/cli/v2/app.go:215
main.main()
    /home/runner/work/scrutiny/scrutiny/collector/cmd/collector-metrics/collector-metrics.go:182 +0x7d9
[root@esxi2:/vmfs/volumes/5f79c6f5-a7338bdc-85f3-6cb3114d162c/TEMP/smartmontools] /vmfs/volumes/MX1T/TEMP/smartmontools/scrutiny-collector-metrics-linux-amd64 run --debug --config ./collector.yaml
2022/11/01 02:02:36 No configuration file found at /opt/scrutiny/config/collector.yaml. Using Defaults.

 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                        linux.amd64-0.5.0

2022/11/01 02:02:36 Loading configuration file: /vmfs/volumes/MX1T/TEMP/smartmontools/collector.yaml
DEBU[0000] json: unsupported type: map[interface {}]interface {}  type=metrics
INFO[0000] Verifying required tools                      type=metrics
INFO[0000] Executing command: ./smartctl --scan --json   type=metrics
ERRO[0000] Error scanning for devices: fork/exec ./smartctl: no space left on device  type=metrics
2022/11/01 02:02:36 ERROR: fork/exec ./smartctl: no space left on device
AnalogJ commented 1 year ago

Hey @sjtuross, the error fork/exec ./smartctl: no space left on device is incredibly suspicious. Can you confirm that the disk you're using has space available and is writable?

sjtuross commented 1 year ago

I am certain there is space left on the device and that it is writable by ESXi. All VMs can read and write fine. Probably a Go app can't run properly on ESXi, which is not real Linux.

AnalogJ commented 1 year ago

This could be related: http://woshub.com/vmware-esxi-no-space-left-device/

AnalogJ commented 1 year ago

Did a bit more reading on this. Could also be related to missing swap volume or exhausted inodes

AnalogJ commented 1 year ago

I'm going to close this issue for now. Feel free to comment or open a new issue if you think there's something I can fix in the Scrutiny codebase to support ESXi.

sjtuross commented 1 year ago

@AnalogJ I realized today that the no space left error is probably because smartctl can't find any devices. As shown below, there are no devices listed in the output. Is there a way to disable the scan and only use the devices specified in collector.yaml?

[root@esxi:/vmfs/volumes/5f79c6f5-a7338bdc-85f3-6cb3114d162c/TEMP/smartmontools] ./smartctl --scan --json
{
  "json_format_version": [
    1,
    0
  ],
  "smartctl": {
    "version": [
      7,
      4
    ],
    "pre_release": true,
    "svn_revision": "5414",
    "platform_info": "x86_64-linux-7.0.3",
    "build_info": "(CircleCI)",
    "argv": [
      "smartctl",
      "--scan",
      "--json"
    ],
    "exit_status": 0
  }
}
AnalogJ commented 1 year ago

huh, interesting.

You can customize the flags sent to the scan command using https://github.com/AnalogJ/scrutiny/blob/master/example.collector.yaml#L81

Removing the scan completely would be a lot of work. However, if we can get it working on ESXi with some additional flags, I'd be happy to update our docs and/or add an ESXi troubleshooting guide.

Would you be willing to do some testing with the smartctl --scan command with additional flags?

sjtuross commented 1 year ago

Yes, I'm willing to try. What additional flags do you suggest?

AnalogJ commented 1 year ago

Here's a smartctl reference - https://linux.die.net/man/8/smartctl

And here are a couple of people using smartctl plus an internal ESXi tool to retrieve SMART information:

It seems you/we may be able to write a wrapper around esxcli storage core device list that returns results in a similar format as smartctl --scan --json

sjtuross commented 1 year ago

I found a way to format the output of esxcli storage core device list, which should make it easier to parse in a scan wrapper.

The csv formatter allows selecting specific fields. In the example below, DevfsPath is the device identifier. Filtering on Vendor=ATA excludes USB and iSCSI devices.

[root@esxi:/vmfs/volumes/5f79c6f5-a7338bdc-85f3-6cb3114d162c/TEMP/smartmontools] esxcli --formatter=csv --format-param=fields="DevfsPath,Vendor,Model" storage core device list
DevfsPath,Vendor,Model,
/vmfs/devices/disks/naa.6589cfc000000f7b3137fe00cd6d09ca,FreeNAS ,iSCSI Disk      ,
/vmfs/devices/disks/naa.6589cfc000000ff1bc98d51b11b6fdfa,TrueNAS ,iSCSI Disk      ,
/vmfs/devices/disks/mpx.vmhba32:C0:T0:L0,SanDisk ,Cruzer Blade    ,
/vmfs/devices/disks/naa.6589cfc000000b6136c98e7be4a2025f,FreeNAS ,iSCSI Disk      ,
/vmfs/devices/disks/naa.6589cfc000000bf70912866cff56e19a,TrueNAS ,iSCSI Disk      ,
/vmfs/devices/disks/t10.ATA_____WDC_WD180EMFZ2D11AFXA0___________________3WHDKL1J____________,ATA     ,WDC WD180EMFZ-11,
[root@esxi:/vmfs/volumes/5f79c6f5-a7338bdc-85f3-6cb3114d162c/TEMP/smartmontools]
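As a rough sketch of the scan wrapper idea, the Go program below reads that CSV on stdin, keeps the rows whose Vendor is ATA, and prints a "devices" array shaped like smartctl --scan --json output. Several things here are assumptions: the field names (name, info_name, type, protocol) are taken from smartctl's JSON scan output as I understand it, the type is hard-coded to sat to match the collector.yaml earlier in this thread, and csvToScan, scanDevice and scanOutput are hypothetical names, not Scrutiny code.

```go
package main

import (
	"encoding/csv"
	"encoding/json"
	"fmt"
	"io"
	"os"
	"strings"
)

// scanDevice mirrors one entry of the "devices" array that
// `smartctl --scan --json` prints (field names are an assumption).
type scanDevice struct {
	Name     string `json:"name"`
	InfoName string `json:"info_name"`
	Type     string `json:"type"`
	Protocol string `json:"protocol"`
}

// scanOutput is the minimal top-level object for the sketch.
type scanOutput struct {
	Devices []scanDevice `json:"devices"`
}

// csvToScan parses esxcli CSV text and keeps rows whose Vendor is "ATA".
func csvToScan(csvText string) (scanOutput, error) {
	out := scanOutput{Devices: []scanDevice{}}
	r := csv.NewReader(strings.NewReader(csvText))
	r.FieldsPerRecord = -1 // tolerate the trailing comma on each row
	rows, err := r.ReadAll()
	if err != nil {
		return out, err
	}
	if len(rows) == 0 {
		return out, nil
	}
	// Locate columns by header name so field order does not matter.
	col := map[string]int{}
	for i, h := range rows[0] {
		col[strings.TrimSpace(h)] = i
	}
	pathIdx, okPath := col["DevfsPath"]
	vendorIdx, okVendor := col["Vendor"]
	if !okPath || !okVendor {
		return out, fmt.Errorf("missing DevfsPath or Vendor column")
	}
	for _, row := range rows[1:] {
		if len(row) <= pathIdx || len(row) <= vendorIdx {
			continue // skip short or malformed rows
		}
		// esxcli pads values with spaces, so trim before comparing.
		if strings.TrimSpace(row[vendorIdx]) != "ATA" {
			continue
		}
		path := strings.TrimSpace(row[pathIdx])
		out.Devices = append(out.Devices, scanDevice{
			Name: path, InfoName: path, Type: "sat", Protocol: "ATA",
		})
	}
	return out, nil
}

func main() {
	data, err := io.ReadAll(os.Stdin)
	if err != nil {
		fmt.Fprintln(os.Stderr, "read error:", err)
		os.Exit(1)
	}
	scan, err := csvToScan(string(data))
	if err != nil {
		fmt.Fprintln(os.Stderr, "parse error:", err)
		os.Exit(1)
	}
	enc := json.NewEncoder(os.Stdout)
	enc.SetIndent("", "  ")
	if err := enc.Encode(scan); err != nil {
		fmt.Fprintln(os.Stderr, "encode error:", err)
		os.Exit(1)
	}
}
```

The intended use would be to pipe the esxcli command above into this program. Since this thread is about Go binaries misbehaving on ESXi, the same logic might have to be ported to a shell script or run on another host; the sketch is only meant to pin down the input and output shapes.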

It's also possible to format the output as JSON, but --format-param is not supported with that formatter.

esxcli --debug --formatter=json storage core device list
[
  {
    "AttachedFilters": [],
    "DIXEnabled": false,
    "DIXGuardType": "NO GUARD SUPPORT",
    "DevfsPath": "/vmfs/devices/disks/naa.6589cfc000000f7b3137fe00cd6d09ca",
    "Device": "naa.6589cfc000000f7b3137fe00cd6d09ca",
    "DeviceMaxQueueDepth": 128,
    "DeviceType": "Direct-Access ",
    "DisplayName": "FreeNAS iSCSI Disk (naa.6589cfc000000f7b3137fe00cd6d09ca)",
    "DriveType": "unknown",
    "EmulatedDIXDIFEnabled": false,
    "HasSettableDisplayName": true,
    "IsBootDevice": false,
    "IsBootUSBDevice": false,
    "IsLocal": false,
    "IsLocalSASDevice": false,
    "IsOffline": false,
    "IsPerenniallyReserved": false,
    "IsPseudo": false,
    "IsRDMCapable": true,
    "IsRemovable": false,
    "IsSAS": false,
    "IsSSD": true,
    "IsSharedClusterwide": true,
    "IsUSB": false,
    "IsVVOLPE": false,
    "Model": "iSCSI Disk      ",
    "MultipathPlugin": "NMP",
    "NoofoutstandingIOswithcompetingworlds": 32,
    "NumberofPhysicalDrives": "unknown",
    "OtherUIDs": [
      "vml.010000000061633166366264386263343230303600695343534920"
    ],
    "PIActivated": false,
    "PIProtectionMask": "NO PROTECTION",
    "PIType": 0,
    "ProtectionEnabled": false,
    "QueueFullSampleSize": 0,
    "QueueFullThreshold": 0,
    "RAIDLevel": "unknown",
    "Revision": "0123",
    "SCSILevel": 7,
    "Size": 1048576,
    "Status": "on",
    "SupportedGuardTypes": [
      "NO GUARD SUPPORT"
    ],
    "ThinProvisioningStatus": "yes",
    "VAAIStatus": "supported",
    "Vendor": "FreeNAS "
  },
  {
    "AttachedFilters": [],
    "DIXEnabled": false,
    "DIXGuardType": "NO GUARD SUPPORT",
    "DevfsPath": "/vmfs/devices/disks/naa.6589cfc000000ff1bc98d51b11b6fdfa",
    "Device": "naa.6589cfc000000ff1bc98d51b11b6fdfa",
    "DeviceMaxQueueDepth": 128,
    "DeviceType": "Direct-Access ",
    "DisplayName": "TrueNAS iSCSI Disk (naa.6589cfc000000ff1bc98d51b11b6fdfa)",
    "DriveType": "unknown",
    "EmulatedDIXDIFEnabled": false,
    "HasSettableDisplayName": true,
    "IsBootDevice": false,
    "IsBootUSBDevice": false,
    "IsLocal": false,
    "IsLocalSASDevice": false,
    "IsOffline": false,
    "IsPerenniallyReserved": false,
    "IsPseudo": false,
    "IsRDMCapable": true,
    "IsRemovable": false,
    "IsSAS": false,
    "IsSSD": true,
    "IsSharedClusterwide": true,
    "IsUSB": false,
    "IsVVOLPE": false,
    "Model": "iSCSI Disk      ",
    "MultipathPlugin": "NMP",
    "NoofoutstandingIOswithcompetingworlds": 32,
    "NumberofPhysicalDrives": "unknown",
    "OtherUIDs": [
      "vml.010000000038303631356630613434663230313200695343534920"
    ],
    "PIActivated": false,
    "PIProtectionMask": "NO PROTECTION",
    "PIType": 0,
    "ProtectionEnabled": false,
    "QueueFullSampleSize": 0,
    "QueueFullThreshold": 0,
    "RAIDLevel": "unknown",
    "Revision": "0123",
    "SCSILevel": 7,
    "Size": 204800,
    "Status": "on",
    "SupportedGuardTypes": [
      "NO GUARD SUPPORT"
    ],
    "ThinProvisioningStatus": "yes",
    "VAAIStatus": "supported",
    "Vendor": "TrueNAS "
  },
  {
    "AttachedFilters": [],
    "DIXEnabled": false,
    "DIXGuardType": "NO GUARD SUPPORT",
    "DevfsPath": "/vmfs/devices/disks/mpx.vmhba32:C0:T0:L0",
    "Device": "mpx.vmhba32:C0:T0:L0",
    "DeviceMaxQueueDepth": 1,
    "DeviceType": "Direct-Access ",
    "DisplayName": "Local USB Direct-Access (mpx.vmhba32:C0:T0:L0)",
    "DriveType": "unknown",
    "EmulatedDIXDIFEnabled": false,
    "HasSettableDisplayName": false,
    "IsBootDevice": true,
    "IsBootUSBDevice": true,
    "IsLocal": true,
    "IsLocalSASDevice": false,
    "IsOffline": false,
    "IsPerenniallyReserved": false,
    "IsPseudo": false,
    "IsRDMCapable": false,
    "IsRemovable": true,
    "IsSAS": false,
    "IsSSD": false,
    "IsSharedClusterwide": false,
    "IsUSB": true,
    "IsVVOLPE": false,
    "Model": "Cruzer Blade    ",
    "MultipathPlugin": "NMP",
    "NoofoutstandingIOswithcompetingworlds": 1,
    "NumberofPhysicalDrives": "unknown",
    "OtherUIDs": [
      "vml.010000000032303034343331373431303535333430324345464372757a6572"
    ],
    "PIActivated": false,
    "PIProtectionMask": "NO PROTECTION",
    "PIType": 0,
    "ProtectionEnabled": false,
    "QueueFullSampleSize": 0,
    "QueueFullThreshold": 0,
    "RAIDLevel": "unknown",
    "Revision": "0103",
    "SCSILevel": 2,
    "Size": 15267,
    "Status": "on",
    "SupportedGuardTypes": [
      "NO GUARD SUPPORT"
    ],
    "ThinProvisioningStatus": "unknown",
    "VAAIStatus": "unsupported",
    "Vendor": "SanDisk "
  },
  {
    "AttachedFilters": [],
    "DIXEnabled": false,
    "DIXGuardType": "NO GUARD SUPPORT",
    "DevfsPath": "/vmfs/devices/disks/naa.6589cfc000000b6136c98e7be4a2025f",
    "Device": "naa.6589cfc000000b6136c98e7be4a2025f",
    "DeviceMaxQueueDepth": 128,
    "DeviceType": "Direct-Access ",
    "DisplayName": "FreeNAS iSCSI Disk (naa.6589cfc000000b6136c98e7be4a2025f)",
    "DriveType": "unknown",
    "EmulatedDIXDIFEnabled": false,
    "HasSettableDisplayName": true,
    "IsBootDevice": false,
    "IsBootUSBDevice": false,
    "IsLocal": false,
    "IsLocalSASDevice": false,
    "IsOffline": false,
    "IsPerenniallyReserved": false,
    "IsPseudo": false,
    "IsRDMCapable": true,
    "IsRemovable": false,
    "IsSAS": false,
    "IsSSD": true,
    "IsSharedClusterwide": true,
    "IsUSB": false,
    "IsVVOLPE": false,
    "Model": "iSCSI Disk      ",
    "MultipathPlugin": "NMP",
    "NoofoutstandingIOswithcompetingworlds": 32,
    "NumberofPhysicalDrives": "unknown",
    "OtherUIDs": [
      "vml.010000000030303063323966653034646230310000695343534920"
    ],
    "PIActivated": false,
    "PIProtectionMask": "NO PROTECTION",
    "PIType": 0,
    "ProtectionEnabled": false,
    "QueueFullSampleSize": 0,
    "QueueFullThreshold": 0,
    "RAIDLevel": "unknown",
    "Revision": "0123",
    "SCSILevel": 7,
    "Size": 4194304,
    "Status": "on",
    "SupportedGuardTypes": [
      "NO GUARD SUPPORT"
    ],
    "ThinProvisioningStatus": "yes",
    "VAAIStatus": "supported",
    "Vendor": "FreeNAS "
  },
  {
    "AttachedFilters": [],
    "DIXEnabled": false,
    "DIXGuardType": "NO GUARD SUPPORT",
    "DevfsPath": "/vmfs/devices/disks/naa.6589cfc000000bf70912866cff56e19a",
    "Device": "naa.6589cfc000000bf70912866cff56e19a",
    "DeviceMaxQueueDepth": 128,
    "DeviceType": "Direct-Access ",
    "DisplayName": "TrueNAS iSCSI Disk (naa.6589cfc000000bf70912866cff56e19a)",
    "DriveType": "unknown",
    "EmulatedDIXDIFEnabled": false,
    "HasSettableDisplayName": true,
    "IsBootDevice": false,
    "IsBootUSBDevice": false,
    "IsLocal": false,
    "IsLocalSASDevice": false,
    "IsOffline": false,
    "IsPerenniallyReserved": false,
    "IsPseudo": false,
    "IsRDMCapable": true,
    "IsRemovable": false,
    "IsSAS": false,
    "IsSSD": true,
    "IsSharedClusterwide": true,
    "IsUSB": false,
    "IsVVOLPE": false,
    "Model": "iSCSI Disk      ",
    "MultipathPlugin": "NMP",
    "NoofoutstandingIOswithcompetingworlds": 32,
    "NumberofPhysicalDrives": "unknown",
    "OtherUIDs": [
      "vml.010000000038303631356630613434663230313100695343534920"
    ],
    "PIActivated": false,
    "PIProtectionMask": "NO PROTECTION",
    "PIType": 0,
    "ProtectionEnabled": false,
    "QueueFullSampleSize": 0,
    "QueueFullThreshold": 0,
    "RAIDLevel": "unknown",
    "Revision": "0123",
    "SCSILevel": 7,
    "Size": 3145728,
    "Status": "on",
    "SupportedGuardTypes": [
      "NO GUARD SUPPORT"
    ],
    "ThinProvisioningStatus": "yes",
    "VAAIStatus": "supported",
    "Vendor": "TrueNAS "
  },
  {
    "AttachedFilters": [],
    "DIXEnabled": false,
    "DIXGuardType": "NO GUARD SUPPORT",
    "DevfsPath": "/vmfs/devices/disks/t10.ATA_____WDC_WD180EMFZ2D11AFXA0___________________3WHDKL1J____________",
    "Device": "t10.ATA_____WDC_WD180EMFZ2D11AFXA0___________________3WHDKL1J____________",
    "DeviceMaxQueueDepth": 31,
    "DeviceType": "Direct-Access ",
    "DisplayName": "Local ATA Disk (t10.ATA_____WDC_WD180EMFZ2D11AFXA0___________________3WHDKL1J____________)",
    "DriveType": "unknown",
    "EmulatedDIXDIFEnabled": false,
    "HasSettableDisplayName": true,
    "IsBootDevice": false,
    "IsBootUSBDevice": false,
    "IsLocal": true,
    "IsLocalSASDevice": false,
    "IsOffline": false,
    "IsPerenniallyReserved": false,
    "IsPseudo": false,
    "IsRDMCapable": false,
    "IsRemovable": false,
    "IsSAS": false,
    "IsSSD": false,
    "IsSharedClusterwide": false,
    "IsUSB": false,
    "IsVVOLPE": false,
    "Model": "WDC WD180EMFZ-11",
    "MultipathPlugin": "HPP",
    "NoofoutstandingIOswithcompetingworlds": 31,
    "NumberofPhysicalDrives": "unknown",
    "OtherUIDs": [
      "vml.0100000000335748444b4c314a202020202020202020202020574443205744"
    ],
    "PIActivated": false,
    "PIProtectionMask": "NO PROTECTION",
    "PIType": 0,
    "ProtectionEnabled": false,
    "QueueFullSampleSize": 0,
    "QueueFullThreshold": 0,
    "RAIDLevel": "unknown",
    "Revision": "0A81",
    "SCSILevel": 5,
    "Size": 17166336,
    "Status": "on",
    "SupportedGuardTypes": [
      "NO GUARD SUPPORT"
    ],
    "ThinProvisioningStatus": "unknown",
    "VAAIStatus": "unsupported",
    "Vendor": "ATA     "
  }
]


mehmetaydogduu commented 8 months ago

Any update on this?

AnalogJ commented 8 months ago

If you're willing, this could be a good example of a custom collector - https://github.com/AnalogJ/scrutiny/tree/240178d742a5fe84b5b61952897a855f9425b790/collector/cmd

qianjunakasumi commented 7 months ago

I had the same problem and I attempted to do the following:

Edit collector.yaml:

commands:
  metrics_smartctl_bin: 'uname' # change to `uname` for testing
  metrics_scan_args: '-a --json' # --json required by scrutiny

Result:

[root@esxi:/tmp] scrutiny-collector run --config collector.yaml --debug
2024/02/14 08:33:34 No configuration file found at /opt/scrutiny/config/collector.yaml. Using Defaults.

 ___   ___  ____  __  __  ____  ____  _  _  _  _
/ __) / __)(  _ \(  )(  )(_  _)(_  _)( \( )( \/ )
\__ \( (__  )   / )(__)(   )(   _)(_  )  (  \  /
(___/ \___)(_)\_)(______) (__) (____)(_)\_) (__)
AnalogJ/scrutiny/metrics                        linux.amd64-0.7.2

2024/02/14 08:33:34 Loading configuration file: /tmp/collector.yaml
DEBU[0000] {
        "api": {
                "endpoint": "http://*****:8080"
        },
        "commands": {
                "metrics_info_args": "--info --json",
                "metrics_scan_args": "-a --json",
                "metrics_smart_args": "--xall --json",
                "metrics_smartctl_bin": "uname"
        },
        "devices": [
                /* Non-critical devices content is omitted */
                {
                        "device": "/vmfs/devices/disks/t10.ATA_____INTEL***__",
                        "type": "sat"
                },
        ],
        "host": {
                "id": "*******"
        },
        "log": {
                "file": "",
                "level": "DEBUG"
        },
        "version": 1
}<nil>  type=metrics
INFO[0000] Verifying required tools                      type=metrics
INFO[0000] Executing command: uname -a --json            type=metrics
ERRO[0000] Error scanning for devices: fork/exec /bin/uname: no space left on device  type=metrics
2024/02/14 08:33:34 ERROR: fork/exec /bin/uname: no space left on device
[root@esxi:/tmp] uname
VMkernel

Evidently, uname also fails with ERROR: fork/exec /bin/uname: no space left on device. This seems a bit strange, so let's take a look at the code:

https://github.com/AnalogJ/scrutiny/blob/a3dfce3561bcddcd8b70e4e7f483e22594c8af4d/collector/pkg/detect/detect.go#L29-L48

    detectedDeviceConnJson, err := d.Shell.Command(d.Logger, d.Config.GetString("commands.metrics_smartctl_bin"), args, "", os.Environ())
    if err != nil {
        d.Logger.Errorf("Error scanning for devices: %v", err)
        return nil, err
    }

Execution terminates here, so the following code is never reached (if it were, it would report that the output is not valid JSON):

    var detectedDeviceConns models.Scan
    err = json.Unmarshal([]byte(detectedDeviceConnJson), &detectedDeviceConns)
    if err != nil {
        d.Logger.Errorf("Error decoding detected devices: %v", err)
        return nil, err
    }
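The claim that the decode step would reject uname's output can be checked in isolation. The sketch below feeds the string VMkernel, which is what uname printed in the session above, to json.Unmarshal. The scanResult type is a hypothetical stand-in for Scrutiny's models.Scan with only two assumed fields, and decodeScan is an illustrative helper rather than a function from the codebase.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// scanResult is a hypothetical, reduced stand-in for models.Scan.
type scanResult struct {
	Devices []struct {
		Name string `json:"name"`
		Type string `json:"type"`
	} `json:"devices"`
}

// decodeScan tries to parse command output as scan JSON.
func decodeScan(output string) (scanResult, error) {
	var s scanResult
	err := json.Unmarshal([]byte(output), &s)
	return s, err
}

func main() {
	// Plain-text output is not JSON, so decoding returns an error.
	if _, err := decodeScan("VMkernel\n"); err != nil {
		fmt.Println("Error decoding detected devices:", err)
	}
	// A well-formed scan with an empty device list decodes cleanly.
	s, err := decodeScan(`{"devices": []}`)
	fmt.Println("devices:", len(s.Devices), "err:", err)
}
```

Since the log shows the scanning error and not a decoding error, the failure does appear to occur before any output is parsed.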

My guess is that ESXi's security features don't allow fork/exec operations. (To reproduce this minimally, we could write a Go program that simply executes a command and observe its behavior on ESXi.)
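A minimal reproduction along those lines might look like the sketch below. It only uses os/exec from the standard library and is independent of Scrutiny; runCommand is an illustrative helper name, and uname is used as the default command only because it is known to exist on the ESXi host in the session above.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// runCommand executes name with args and returns the combined
// stdout and stderr along with any error from starting or running it.
func runCommand(name string, args ...string) (string, error) {
	out, err := exec.Command(name, args...).CombinedOutput()
	return string(out), err
}

func main() {
	// Default to uname; a different command can be given as arguments.
	name, args := "uname", []string{}
	if len(os.Args) > 1 {
		name, args = os.Args[1], os.Args[2:]
	}
	out, err := runCommand(name, args...)
	if err != nil {
		// If the guess is right, a fork/exec error would surface here.
		fmt.Fprintf(os.Stderr, "exec failed: %v\n", err)
		os.Exit(1)
	}
	fmt.Print(out)
}
```

Building this for linux/amd64 and running it on the ESXi host would show whether the fork/exec failure also happens outside Scrutiny. If it does, the problem would lie in how Go programs spawn processes on VMkernel, not in the collector's configuration.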

Environment: ESXi 8.x