MrBr-github / lshca

GNU General Public License v3.0

Add error messaging on systems with bad health #56

Closed. MrBr-github closed this issue 2 years ago

MrBr-github commented 2 years ago

Currently it is hard to tell why information is missing from the output.

Possible error messages:
  • Sysfs data missing for <BDF>. Check for driver or fw issues. Start from dmesg
  • With -w cable: RDMA info missing for <BDF>, failed to read mlxcables and mlxlink info
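A minimal sketch of how the proposed messages could be generated. Everything here is an assumption made for illustration: the function name `health_warnings`, the `SYSFS_FIELDS` tuple, and the dictionary layout are hypothetical and not taken from the lshca source. Only the message wording comes from this issue.

```python
# Hypothetical helper: names and data layout are illustrative, not lshca's API.
SYSFS_FIELDS = ("RDMA", "Net", "LnkStat")


def health_warnings(bdf, sysfs_data, cable_data=None, cable_view=False):
    """Return warning strings for a PCI function whose data is incomplete."""
    warnings = []
    # All sysfs-derived columns empty: most likely a driver or firmware problem.
    if not any(sysfs_data.get(field) for field in SYSFS_FIELDS):
        warnings.append(
            "Sysfs data missing for %s. Check for driver or fw issues. "
            "Start from dmesg" % bdf
        )
    # Only relevant for the cable view: no cable field could be read.
    if cable_view and not any((cable_data or {}).values()):
        warnings.append(
            "RDMA info missing for %s, failed to read mlxcables and mlxlink info"
            % bdf
        )
    return warnings
```

For a device in the state this issue describes, `health_warnings("0000:02:00.0", {"RDMA": "", "Net": ""})` would return the sysfs message, while a function with a populated `RDMA` field would return an empty list.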

Examples: the HCA below has a FW issue

root@fat...25:~# lshca
-------------------------------------------------------------------------------------------------------
Dev #1
 Desc: Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller (rev 01)
 PN: MBF2H332A-AECOT  rev. B1
 PSID:
 SN: MT2125X11405
 FW:
 Tempr: =N/A=
-------------------------------------------------------------------------------------------------------
  PCI_addr   | RDMA | Net | Numa | LnkStat | IpStat | Link | Rate | LnkCapWidth | HCA_Type
-------------------------------------------------------------------------------------------------------
0000:02:00.0 |      |     |  0   |         |        |      |      |    x8 G4    |
0000:02:00.1 |      |     |  0   |         |        |      |      |    x8 G4    |
-------------------------------------------------------------------------------------------------------
root@fat..25:~# lshca -w cable
-------------------------------------------------------------------------------------------------------
Dev #1
 Desc: Mellanox Technologies MT42822 BlueField-2 integrated ConnectX-6 Dx network controller (rev 01)
 PN: MBF2H332A-AECOT  rev. B1
 PSID:
 SN: MT2125X11405
 FW:
-------------------------------------------------------------------------------------------------------
RDMA | Net |         MST_device          | CblPN | CblSN | CblLng | PhyLinkStat | PhyLnkSpd
-------------------------------------------------------------------------------------------------------
     |     |  /dev/mst/mt41686_pciconf0  |       |       |        |             |
     |     | /dev/mst/mt41686_pciconf0.1 |       |       |        |             |
-------------------------------------------------------------------------------------------------------
MrBr-github commented 2 years ago
  • Remove current debugging (commit 38f239e)
  • Use standard logging library

Done in 77b2fc3f

Failed command reporting done in 7617a00
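For readers unfamiliar with the approach, here is a rough sketch of failed command reporting built on the standard `logging` library. It does not reproduce the code in the commits above; the logger name `lshca_sketch` and the function `run_and_report` are hypothetical.

```python
import logging
import subprocess

# Hypothetical logger name; the real project may configure logging differently.
log = logging.getLogger("lshca_sketch")


def run_and_report(cmd):
    """Run cmd (a list of arguments); return stdout, or "" after logging a failure."""
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    except OSError as exc:
        # The binary is missing or not executable.
        log.warning("Failed command: %s (%s)", " ".join(cmd), exc)
        return ""
    if result.returncode != 0:
        log.warning(
            "Failed command: %s (exit code %d): %s",
            " ".join(cmd), result.returncode, result.stderr.strip(),
        )
        return ""
    return result.stdout
```

Returning an empty string keeps the table rendering unchanged, while the warning tells the user which external tool failed and why the column is blank.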