IBM / CAST

CAST can enhance the system management of cluster-wide resources. It consists of two open source tools: cluster system management (CSM) and burst buffer.
Eclipse Public License 1.0

Fix CSM log warning messages related to transitioning to DCGM 2.X from 1.X #954


besawn commented 4 years ago

This PR addresses the issues described in issue #953.

Library name change

The first change is related to the DCGM library name CSM uses when attempting to dynamically load libdcgm.so.

DCGM has used different library naming conventions in different releases. To maintain compatibility with as many versions of DCGM as possible, we try to load the library using the most current library name first and fall back to the oldest library name last. If the library loads under any of these names, everything should function normally. If it fails to load under every name, we continue to log a warning message and CSM GPU functions are disabled.
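
In rough terms, the fallback order matches the dlopen() messages in the logs below. The following is a minimal sketch of that logic, not the literal inv_dcgm_access.cc code; the function name, flags, and error handling are illustrative:

// Sketch of the fallback loading order, newest library name first.
// Names and structure are illustrative, not the actual CSM code.
#include <dlfcn.h>
#include <iostream>
#include <string>
#include <vector>

void* loadDcgmLibrary()
{
    // Try the most current DCGM library names first, oldest last.
    const std::vector<std::string> candidates = {
        "/usr/lib64/libdcgm.so.2",   // DCGM 2.X naming
        "/usr/lib64/libdcgm.so.1",   // DCGM 1.X naming
        "/usr/lib64/libdcgm.so"      // unversioned fallback
    };

    for (const auto& name : candidates)
    {
        void* handle = dlopen(name.c_str(), RTLD_LAZY);
        if (handle != nullptr)
        {
            std::cout << "dlopen() successfully loaded " << name << std::endl;
            return handle;
        }
        std::cout << "dlopen() " << name << " returned: " << dlerror() << std::endl;
    }

    // No candidate could be loaded; GPU functions stay disabled.
    std::cout << "Couldn't load libdcgm.so, CSM GPU functions are disabled." << std::endl;
    return nullptr;
}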

Test 1: confirm the updated library loading logic works as expected with the most recent DCGM version

# Check the installed version of DCGM
[root@c650f99p36 ~]# rpm -q datacenter-gpu-manager
datacenter-gpu-manager-2.0.10-1.ppc64le

# Observe the library naming conventions used by this version
[root@c650f99p36 ~]# ls -al /usr/lib64/libdcgm.so*
lrwxrwxrwx 1 root root      12 Jul 23 17:08 /usr/lib64/libdcgm.so -> libdcgm.so.2
lrwxrwxrwx 1 root root      17 Jul 23 17:08 /usr/lib64/libdcgm.so.2 -> libdcgm.so.2.0.10
-rwxr-xr-x 1 root root 9944320 Jul 23 17:08 /usr/lib64/libdcgm.so.2.0.10

# Check the library loading messages look like what we expect
[root@c650f99p36 ~]# grep libdcgm /var/log/ibm/csm/csm_compute.log
[COMPUTE]2020-08-20 14:08:57.050352       csmd::info     | dlopen() successfully loaded /usr/lib64/libdcgm.so.2

# Run a simple set of tests to make sure basic CSM/DCGM integration is working
[root@c650f99p36 ~]# /u/besawn/bash/dcgm_checks.sh 
Is nvidia-persistenced running:                                                                          PASS
Is DCGM running:                                                                                         PASS
Does nvidia-smi report GPUs:                                                                             PASS
Do all GPUs reported by nvidia-smi have persistence mode enabled:                                        PASS
Is CSM running:                                                                                          PASS
CSM log shows successful dlopen():                                                                       PASS
CSM log shows successful symbol loading:                                                                 PASS
CSM log shows DCGM watch settings:                                                                       PASS
Does the CSM GPU count match the nvidia-smi GPU count:                                                   PASS
CSM Inventory: Does each GPU have a unique gpu id:                                                       PASS
CSM Inventory: Does each GPU have a unique pci bus id:                                                   PASS
CSM Inventory: Does each GPU have a unique serial number:                                                PASS
CSM Inventory: Does each GPU have a unique uuid:                                                         PASS
CSM Inventory: Does each GPU share the same device name:                                                 PASS
CSM Inventory: Does each GPU share the same hbm memory value:                                            PASS
CSM Inventory: Does each GPU share the same inforom image version:                                       PASS
CSM Inventory: Does each GPU share the same vbios:                                                       PASS
Success!

Test 2: confirm the updated library loading logic fails as expected when DCGM is not installed

# Check that DCGM is not installed
[root@c650f99p36 ~]# rpm -q datacenter-gpu-manager
package datacenter-gpu-manager is not installed

# Observe that there are no libraries present
[root@c650f99p36 ~]# ls -al /usr/lib64/libdcgm.so*
ls: cannot access '/usr/lib64/libdcgm.so*': No such file or directory

# Check the library loading messages look like what we expect
[root@c650f99p36 ~]# grep libdcgm /var/log/ibm/csm/csm_compute.log
[COMPUTE]2020-08-20 14:20:35.072559       csmd::info     | dlopen() /usr/lib64/libdcgm.so.2 returned: /usr/lib64/libdcgm.so.2: cannot open shared object file: No such file or directory
[COMPUTE]2020-08-20 14:20:35.072593       csmd::info     | dlopen() /usr/lib64/libdcgm.so.1 returned: /usr/lib64/libdcgm.so.1: cannot open shared object file: No such file or directory
[COMPUTE]2020-08-20 14:20:35.072622       csmd::info     | dlopen() /usr/lib64/libdcgm.so returned: /usr/lib64/libdcgm.so: cannot open shared object file: No such file or directory
[COMPUTE]2020-08-20 14:20:35.072645       csmd::warning  | Couldn't load libdcgm.so, CSM GPU functions are disabled.

# Run the same simple set of tests; they fail as expected since DCGM is not installed
[root@c650f99p36 ~]# /u/besawn/bash/dcgm_checks.sh 
Is nvidia-persistenced running:                                                                          PASS
Is DCGM running:                                                                                       FAILED
On line 21: [systemctl -q is-active dcgm] returned 0

Fields not supported in DCGM 2.0

The second change in this PR addresses warning messages logged for GPU environmental data fields. These fields could be collected and sent to the Big Data Store with DCGM 1.X, but DCGM 2.X has removed access to them.

Prior to the changes contained in this PR, the GPU environmental data logging looked like this when running with DCGM 2.X:

[COMPUTE]2020-08-25 12:37:50.006451     csmenv::debug    | GPU 0 serial_number (STR), value: 0321918195569
[COMPUTE]2020-08-25 12:37:50.009250     csmenv::debug    | GPU 0 power_usage (FP64), value: 35.274
[COMPUTE]2020-08-25 12:37:50.009285     csmenv::debug    | GPU 0 gpu_temp (INT64), value: 23
[COMPUTE]2020-08-25 12:37:50.009310     csmenv::debug    | GPU 0 gpu_utilization (INT64), value: 0
[COMPUTE]2020-08-25 12:37:50.009332     csmenv::debug    | GPU 0 mem_copy_utilization (INT64), value: 0
[COMPUTE]2020-08-25 12:37:50.009352     csmenv::debug    | GPU 0 enc_utilization (INT64), value: 0
[COMPUTE]2020-08-25 12:37:50.009372     csmenv::debug    | GPU 0 dec_utilization (INT64), value: 0
[COMPUTE]2020-08-25 12:37:50.009392     csmenv::warning  | GPU 0 nvlink_bandwidth_l0 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009411     csmenv::warning  | GPU 0 nvlink_bandwidth_l0 version = 16781336 fieldId = 440 fieldType = 105 status = -6
[COMPUTE]2020-08-25 12:37:50.009431     csmenv::warning  | GPU 0 nvlink_bandwidth_l1 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009449     csmenv::warning  | GPU 0 nvlink_bandwidth_l1 version = 16781336 fieldId = 441 fieldType = 105 status = -6
[COMPUTE]2020-08-25 12:37:50.009467     csmenv::warning  | GPU 0 nvlink_bandwidth_l2 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009484     csmenv::warning  | GPU 0 nvlink_bandwidth_l2 version = 16781336 fieldId = 442 fieldType = 105 status = -6
[COMPUTE]2020-08-25 12:37:50.009502     csmenv::warning  | GPU 0 nvlink_bandwidth_l3 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009520     csmenv::warning  | GPU 0 nvlink_bandwidth_l3 version = 16781336 fieldId = 443 fieldType = 105 status = -6
[COMPUTE]2020-08-25 12:37:50.009538     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l0 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009556     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l0 version = 16781336 fieldId = 400 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.009574     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l1 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009591     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l1 version = 16781336 fieldId = 401 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.009609     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l2 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009627     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l2 version = 16781336 fieldId = 402 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.009645     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l3 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009662     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l3 version = 16781336 fieldId = 403 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.009684     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l0 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009702     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l0 version = 16781336 fieldId = 410 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.009720     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l1 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009737     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l1 version = 16781336 fieldId = 411 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.009755     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l2 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009772     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l2 version = 16781336 fieldId = 412 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.009790     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l3 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009807     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l3 version = 16781336 fieldId = 413 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.009825     csmenv::warning  | GPU 0 nvlink_replay_error_count_l0 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009842     csmenv::warning  | GPU 0 nvlink_replay_error_count_l0 version = 16781336 fieldId = 420 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.009860     csmenv::warning  | GPU 0 nvlink_replay_error_count_l1 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009877     csmenv::warning  | GPU 0 nvlink_replay_error_count_l1 version = 16781336 fieldId = 421 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.009895     csmenv::warning  | GPU 0 nvlink_replay_error_count_l2 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009913     csmenv::warning  | GPU 0 nvlink_replay_error_count_l2 version = 16781336 fieldId = 422 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.009930     csmenv::warning  | GPU 0 nvlink_replay_error_count_l3 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009947     csmenv::warning  | GPU 0 nvlink_replay_error_count_l3 version = 16781336 fieldId = 423 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.009965     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l0 unexpected case!
[COMPUTE]2020-08-25 12:37:50.009993     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l0 version = 16781336 fieldId = 430 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.010011     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l1 unexpected case!
[COMPUTE]2020-08-25 12:37:50.010028     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l1 version = 16781336 fieldId = 431 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.010045     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l2 unexpected case!
[COMPUTE]2020-08-25 12:37:50.010063     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l2 version = 16781336 fieldId = 432 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.010080     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l3 unexpected case!
[COMPUTE]2020-08-25 12:37:50.010097     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l3 version = 16781336 fieldId = 433 fieldType = 105 status = 0
[COMPUTE]2020-08-25 12:37:50.010115     csmenv::debug    | GPU 0 power_violation (INT64), value: 0
[COMPUTE]2020-08-25 12:37:50.010135     csmenv::debug    | GPU 0 thermal_violation (INT64), value: 0
[COMPUTE]2020-08-25 12:37:50.010155     csmenv::debug    | GPU 0 sync_boost_violation (INT64), value: 0

The unsupported fields fall into two types: fields that return a bad status when read, and fields that are always reported with a blank value. I updated the logging code in inv_dcgm_access.cc to distinguish these cases more clearly in the logs. After updating the logging code, the messages look like this:

[COMPUTE]2020-08-25 12:29:10.006664     csmenv::debug    | GPU 0 serial_number (STR), value: 0321918195569
[COMPUTE]2020-08-25 12:29:10.006700     csmenv::debug    | GPU 0 power_usage (FP64), value: 35.274
[COMPUTE]2020-08-25 12:29:10.006737     csmenv::debug    | GPU 0 gpu_temp (INT64), value: 23
[COMPUTE]2020-08-25 12:29:10.006763     csmenv::debug    | GPU 0 gpu_utilization (INT64), value: 0
[COMPUTE]2020-08-25 12:29:10.006784     csmenv::debug    | GPU 0 mem_copy_utilization (INT64), value: 0
[COMPUTE]2020-08-25 12:29:10.006805     csmenv::debug    | GPU 0 enc_utilization (INT64), value: 0
[COMPUTE]2020-08-25 12:29:10.006825     csmenv::debug    | GPU 0 dec_utilization (INT64), value: 0
[COMPUTE]2020-08-25 12:29:10.006845     csmenv::warning  | GPU 0 nvlink_bandwidth_l0 unexpected case!
[COMPUTE]2020-08-25 12:29:10.006863     csmenv::warning  | GPU 0 nvlink_bandwidth_l0 version = 16781336 fieldId = 440 fieldType = 105 status = -6
[COMPUTE]2020-08-25 12:29:10.006883     csmenv::warning  | GPU 0 nvlink_bandwidth_l1 unexpected case!
[COMPUTE]2020-08-25 12:29:10.006901     csmenv::warning  | GPU 0 nvlink_bandwidth_l1 version = 16781336 fieldId = 441 fieldType = 105 status = -6
[COMPUTE]2020-08-25 12:29:10.006919     csmenv::warning  | GPU 0 nvlink_bandwidth_l2 unexpected case!
[COMPUTE]2020-08-25 12:29:10.006936     csmenv::warning  | GPU 0 nvlink_bandwidth_l2 version = 16781336 fieldId = 442 fieldType = 105 status = -6
[COMPUTE]2020-08-25 12:29:10.006954     csmenv::warning  | GPU 0 nvlink_bandwidth_l3 unexpected case!
[COMPUTE]2020-08-25 12:29:10.006971     csmenv::warning  | GPU 0 nvlink_bandwidth_l3 version = 16781336 fieldId = 443 fieldType = 105 status = -6
[COMPUTE]2020-08-25 12:29:10.006989     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l0 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007007     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l1 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007024     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l2 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007042     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l3 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007059     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l0 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007076     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l1 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007093     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l2 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007110     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l3 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007127     csmenv::warning  | GPU 0 nvlink_replay_error_count_l0 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007151     csmenv::warning  | GPU 0 nvlink_replay_error_count_l1 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007168     csmenv::warning  | GPU 0 nvlink_replay_error_count_l2 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007185     csmenv::warning  | GPU 0 nvlink_replay_error_count_l3 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007202     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l0 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007219     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l1 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007236     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l2 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007253     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l3 (INT64) is blank, ignoring this field!
[COMPUTE]2020-08-25 12:29:10.007270     csmenv::debug    | GPU 0 power_violation (INT64), value: 0
[COMPUTE]2020-08-25 12:29:10.007290     csmenv::debug    | GPU 0 thermal_violation (INT64), value: 0
[COMPUTE]2020-08-25 12:29:10.007310     csmenv::debug    | GPU 0 sync_boost_violation (INT64), value: 0
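
Concretely, the per-field decision in the updated logging code boils down to something like the sketch below. This is illustrative rather than the exact inv_dcgm_access.cc code; it assumes the dcgmFieldValue_v1 struct and blank-value macros from the DCGM headers, and only the INT64 path is shown:

#include <dcgm_structs.h>   // dcgmFieldValue_v1, DCGM_ST_OK, DCGM_INT64_IS_BLANK
#include <dcgm_fields.h>    // DCGM_FT_INT64
#include <iostream>
#include <string>

// Sketch of the updated per-field handling (INT64 fields only).
void logGpuField(int gpuId, const std::string& fieldName, const dcgmFieldValue_v1& fv)
{
    if (fv.status != DCGM_ST_OK)
    {
        // Case 1: the read itself failed (e.g. status -6 for the nvlink_bandwidth
        // fields above); keep the existing "unexpected case!" warning with details.
        std::cout << "GPU " << gpuId << " " << fieldName << " unexpected case!" << std::endl;
        std::cout << "GPU " << gpuId << " " << fieldName
                  << " version = " << fv.version << " fieldId = " << fv.fieldId
                  << " fieldType = " << fv.fieldType << " status = " << fv.status << std::endl;
        return;
    }

    if (fv.fieldType == DCGM_FT_INT64 && DCGM_INT64_IS_BLANK(fv.value.i64))
    {
        // Case 2: the read succeeded, but DCGM returned its "blank" sentinel value.
        std::cout << "GPU " << gpuId << " " << fieldName
                  << " (INT64) is blank, ignoring this field!" << std::endl;
        return;
    }

    // Case 3: a normal, usable value.
    std::cout << "GPU " << gpuId << " " << fieldName
              << " (INT64), value: " << fv.value.i64 << std::endl;
}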

The final change removes this set of fields from collection entirely, since they are no longer supported (see the sketch after the log excerpt below). After removal, the log messages now look like this:

[COMPUTE]2020-08-25 13:02:30.006654     csmenv::debug    | GPU 0 serial_number (STR), value: 0321918195569
[COMPUTE]2020-08-25 13:02:30.006693     csmenv::debug    | GPU 0 power_usage (FP64), value: 35.274
[COMPUTE]2020-08-25 13:02:30.006727     csmenv::debug    | GPU 0 gpu_temp (INT64), value: 22
[COMPUTE]2020-08-25 13:02:30.006753     csmenv::debug    | GPU 0 gpu_utilization (INT64), value: 0
[COMPUTE]2020-08-25 13:02:30.006774     csmenv::debug    | GPU 0 mem_copy_utilization (INT64), value: 0
[COMPUTE]2020-08-25 13:02:30.006794     csmenv::debug    | GPU 0 enc_utilization (INT64), value: 0
[COMPUTE]2020-08-25 13:02:30.006815     csmenv::debug    | GPU 0 dec_utilization (INT64), value: 0
[COMPUTE]2020-08-25 13:02:30.006834     csmenv::debug    | GPU 0 power_violation (INT64), value: 0
[COMPUTE]2020-08-25 13:02:30.006854     csmenv::debug    | GPU 0 thermal_violation (INT64), value: 0
[COMPUTE]2020-08-25 13:02:30.006874     csmenv::debug    | GPU 0 sync_boost_violation (INT64), value: 0
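
The removal itself amounts to dropping the unsupported NVLink counters from the set of DCGM field IDs CSM samples for environmental data. Below is a hedged sketch of the resulting list; the constant names are taken from dcgm_fields.h as I understand them, and the actual list and its layout in inv_dcgm_access.cc may differ:

#include <dcgm_fields.h>
#include <vector>

// Fields still collected for GPU environmental data after the change
// (illustrative list; the real one lives in inv_dcgm_access.cc).
static const std::vector<unsigned short> gpu_env_field_ids = {
    DCGM_FI_DEV_SERIAL,               // serial_number
    DCGM_FI_DEV_POWER_USAGE,          // power_usage
    DCGM_FI_DEV_GPU_TEMP,             // gpu_temp
    DCGM_FI_DEV_GPU_UTIL,             // gpu_utilization
    DCGM_FI_DEV_MEM_COPY_UTIL,        // mem_copy_utilization
    DCGM_FI_DEV_ENC_UTIL,             // enc_utilization
    DCGM_FI_DEV_DEC_UTIL,             // dec_utilization
    DCGM_FI_DEV_POWER_VIOLATION,      // power_violation
    DCGM_FI_DEV_THERMAL_VIOLATION,    // thermal_violation
    DCGM_FI_DEV_SYNC_BOOST_VIOLATION  // sync_boost_violation
    // Removed: the NVLink bandwidth / CRC / replay / recovery counters
    // (fieldIds 400-443 in the warnings above), no longer supported by DCGM 2.X.
};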