IBM / CAST

CAST can enhance the system management of cluster-wide resources. It consists of the open source tools: cluster system management (CSM) and burst buffer.
Eclipse Public License 1.0

HCDiag: `chk-ib-pcispeed` Test Fails to Detect Mellanox HCAs Speed / Width Values on RHEL 8.4 Node #1017

Closed nicolas-tallet closed 2 years ago

nicolas-tallet commented 2 years ago

**Describe the bug**
The HCDiag `chk-ib-pcispeed` test fails to detect the correct speed and width values of the Mellanox adapters on a RHEL 8.4 node.

**To Reproduce**
Steps to reproduce the behavior:

  1. Run the `chk-ib-pcispeed` HCDiag test on a RHEL 8.4 node:
    $ hcdiag_run.py --target "r311n09-adm" --test "chk-ib-pcispeed"
    INFO: xcat seems to be installed in /opt/xcat/bin. Running in Management mode
    Health Check Diagnostics version 1.8.3., running on Linux 4.14.0-115.14.1.el7a.ppc64le, p3xcatmn-adm machine.
    Using configuration file /data_local/sw/cast/1.8.3/hcdiag/etc/hcdiag.properties.
    Using tests configuration file /data_local/sw/cast/1.8.3/hcdiag/etc/test.properties.
    Health Check Diagnostics, run id 211213181602789110, initializing...
    Validating command argument test.
    Validating command argument target.
  2. The test fails with the standard error message:
    
    Preparing to run chk-ib-pcispeed.
    Executable: /data_local/sw/cast/1.8.3/hcdiag/tests/chk-ib-pcispeed/chk-ib-pcispeed.sh exists on remote node(s).
    chk-ib-pcispeed started on 1 node(s) at 2021-12-13 18:16:05.252119. It might take up to 10s.
    .
    chk-ib-pcispeed ended on 1 node(s) at 2021-12-13 18:16:09.409955, rc= 1, elapsed time: 0:00:04.157836
    chk-ib-pcispeed FAIL on node r311n09-adm, serial number: 78875BA, rc= 8. (details in /tmp/211213181602789110/chk-ib-pcispeed/r311n09-adm-2021-12-13-18_16_07.output)

    =============================== Results summary ===============================
    18:16:05 =======================================================================

    chk-ib-pcispeed FAIL on 1 node(s):
    r311n09-adm

    ================================================================================

    Health Check Diagnostics ended, exit code 100.

  3. The output details show that parsing the `lspci` command output produces unexpected values for both the speed and width settings:

    Running chk-ib-pcispeed.sh on r311n09, machine type 8335-GTX.
    Adapter: 0003:01:00.0, 16GT/, Widt.
    Error, expecting: 16GT/s, got: 16GT/
    Error, expecting: x8, got: Widt
    Adapter: 0003:01:00.1, 16GT/, Widt.
    Error, expecting: 16GT/s, got: 16GT/
    Error, expecting: x8, got: Widt
    Adapter: 0033:01:00.0, 16GT/, Widt.
    Error, expecting: 16GT/s, got: 16GT/
    Error, expecting: x8, got: Widt
    Adapter: 0033:01:00.1, 16GT/, Widt.
    Error, expecting: 16GT/s, got: 16GT/
    Error, expecting: x8, got: Widt
    Found 4 Mellanox adapters.

    chk-ib-pcispeed.sh test PASS, rc=8
    Remote_command_rc = 8


**Expected behavior**
Both speed and width settings should be properly parsed from the `lspci` command output.

**Environment (please complete the following information):**
 - RHEL 8.4 Environment
 - CAST 1.8.3

**Additional context**
Suggested fix for the issue:
* Original code:

    speed=`echo ${line} | awk '{print substr($3,1,length($3)-1)}'`
    width=`echo ${line} | awk '{print substr($5,1,length($5)-1)}'`

* Improved code:

    speed="$(echo "${line}" | awk 'match($0, /Speed\s([0-9]+GT\/s)/, a) {print a[1]}')"
    width="$(echo "${line}" | awk 'match($0, /Width\s(x[0-9]+)/, a) {print a[1]}')"
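
For illustration, the values reported in step 3 (`16GT/` and `Widt`) are what the positional parse would produce if `lspci` on RHEL 8.4 appended a status marker such as `(ok)` after each value, shifting the whitespace-separated fields. The sketch below works under that assumption: both sample `LnkSta` lines are assumed and not taken from this issue, and the helper names `parse_speed`, `parse_width`, `positional_speed` and `positional_width` are hypothetical. It uses POSIX `sed` instead of the three-argument `match()`, which is a gawk extension, so it does not depend on which `awk` is installed.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of chk-ib-pcispeed.sh.

# Positional parse, as in the original code: take field N and drop its last character.
positional_speed() { printf '%s\n' "$1" | awk '{print substr($3,1,length($3)-1)}'; }
positional_width() { printf '%s\n' "$1" | awk '{print substr($5,1,length($5)-1)}'; }

# Keyword-anchored parse: extract the value that follows "Speed" / "Width".
parse_speed() { printf '%s\n' "$1" | sed -n 's/.*Speed \([0-9.]*GT\/s\).*/\1/p'; }
parse_width() { printf '%s\n' "$1" | sed -n 's/.*Width \(x[0-9][0-9]*\).*/\1/p'; }

# Assumed sample lines (not copied from the affected node).
old_line='LnkSta: Speed 16GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-'
new_line='LnkSta: Speed 16GT/s (ok), Width x8 (ok)'

for line in "$old_line" "$new_line"; do
    echo "positional: speed=$(positional_speed "$line") width=$(positional_width "$line")"
    echo "anchored:   speed=$(parse_speed "$line") width=$(parse_width "$line")"
done
```

With these assumed inputs, the positional parse returns `16GT/s` and `x8` for the first line but `16GT/` and `Widt` for the second, while the keyword-anchored parse returns `16GT/s` and `x8` for both.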

besawn commented 2 years ago

@nicolas-tallet Thank you for the detailed issue description. This issue is addressed by PR #1022.