CAST can enhance the system management of cluster-wide resources. It consists of the open source tools: cluster system management (CSM) and burst buffer.
Eclipse Public License 1.0
Fix CSM log warning messages related to transitioning to DCGM 2.X from 1.X #954
This PR addresses the issues described in issue #953.
Library name change
The first change is related to the DCGM library name CSM uses when attempting to dynamically load libdcgm.so.
DCGM has used different library naming conventions across releases. To stay compatible with as many versions of DCGM as possible, CSM tries to load the library using the most current library name first and falls back to the oldest library name last. If loading succeeds with any of the names, everything should function normally. If loading fails with every name, CSM continues to log a warning message.
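The fallback order described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `LoadResult`, `Loader`, `dlopen_loader`, and `load_first_available` are hypothetical, and the log text only approximates the messages shown in the tests below. The loader is passed in as a parameter so the loop can be exercised on a machine without DCGM installed.

```cpp
#include <dlfcn.h>

#include <functional>
#include <iostream>
#include <string>
#include <vector>

// Outcome of the search: a null handle means no candidate could be loaded.
struct LoadResult {
    void* handle;
    std::string loaded_name;
};

// A loader returns a library handle, or nullptr with `error` filled in.
// Passing it as a parameter keeps the fallback loop testable without DCGM.
using Loader = std::function<void*(const std::string& path, std::string& error)>;

// Default loader: a thin wrapper around dlopen() and dlerror().
// (On glibc older than 2.34 this needs -ldl at link time.)
inline void* dlopen_loader(const std::string& path, std::string& error) {
    dlerror();  // clear any stale error string
    void* handle = dlopen(path.c_str(), RTLD_LAZY);
    if (handle == nullptr) {
        const char* msg = dlerror();
        error = (msg != nullptr) ? msg : "unknown error";
    }
    return handle;
}

// Try each candidate in order (newest name first) and stop at the first success.
inline LoadResult load_first_available(const std::vector<std::string>& candidates,
                                       const Loader& loader = dlopen_loader) {
    for (const std::string& name : candidates) {
        std::string error;
        void* handle = loader(name, error);
        if (handle != nullptr) {
            std::cout << "info | dlopen() successfully loaded " << name << "\n";
            return LoadResult{handle, name};
        }
        // Per-candidate failures are informational; only the final miss is a warning.
        std::cout << "info | dlopen() " << name << " returned: " << error << "\n";
    }
    std::cout << "warning | Couldn't load libdcgm.so, CSM GPU functions are disabled.\n";
    return LoadResult{nullptr, std::string()};
}
```

Called with the candidates `/usr/lib64/libdcgm.so.2`, `/usr/lib64/libdcgm.so.1`, and `/usr/lib64/libdcgm.so` (the order visible in the Test 2 log), a sketch like this would stop at the first name that loads and warn only when all three fail.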
Test 1: confirm the updated library loading logic works as expected with the most recent DCGM version
# Check the installed version of DCGM
[root@c650f99p36 ~]# rpm -q datacenter-gpu-manager
datacenter-gpu-manager-2.0.10-1.ppc64le
# Observe the library naming conventions used by this version
[root@c650f99p36 ~]# ls -al /usr/lib64/libdcgm.so*
lrwxrwxrwx 1 root root 12 Jul 23 17:08 /usr/lib64/libdcgm.so -> libdcgm.so.2
lrwxrwxrwx 1 root root 17 Jul 23 17:08 /usr/lib64/libdcgm.so.2 -> libdcgm.so.2.0.10
-rwxr-xr-x 1 root root 9944320 Jul 23 17:08 /usr/lib64/libdcgm.so.2.0.10
# Check the library loading messages look like what we expect
[root@c650f99p36 ~]# grep libdcgm /var/log/ibm/csm/csm_compute.log
[COMPUTE]2020-08-20 14:08:57.050352 csmd::info | dlopen() successfully loaded /usr/lib64/libdcgm.so.2
# Run a simple set of tests to make sure basic CSM/DCGM integration is working
[root@c650f99p36 ~]# /u/besawn/bash/dcgm_checks.sh
Is nvidia-persistenced running: PASS
Is DCGM running: PASS
Does nvidia-smi report GPUs: PASS
Do all GPUs reported by nvidia-smi have persistence mode enabled: PASS
Is CSM running: PASS
CSM log shows successful dlopen(): PASS
CSM log shows successful symbol loading: PASS
CSM log shows DCGM watch settings: PASS
Does the CSM GPU count match the nvidia-smi GPU count: PASS
CSM Inventory: Does each GPU have a unique gpu id: PASS
CSM Inventory: Does each GPU have a unique pci bus id: PASS
CSM Inventory: Does each GPU have a unique serial number: PASS
CSM Inventory: Does each GPU have a unique uuid: PASS
CSM Inventory: Does each GPU share the same device name: PASS
CSM Inventory: Does each GPU share the same hbm memory value: PASS
CSM Inventory: Does each GPU share the same inforom image version: PASS
CSM Inventory: Does each GPU share the same vbios: PASS
Success!
Test 2: confirm the updated library loading logic fails as expected when DCGM is not installed
# Check that DCGM is not installed
[root@c650f99p36 ~]# rpm -q datacenter-gpu-manager
package datacenter-gpu-manager is not installed
# Observe that there are no libraries present
[root@c650f99p36 ~]# ls -al /usr/lib64/libdcgm.so*
ls: cannot access '/usr/lib64/libdcgm.so*': No such file or directory
# Check the library loading messages look like what we expect
[root@c650f99p36 ~]# grep libdcgm /var/log/ibm/csm/csm_compute.log
[COMPUTE]2020-08-20 14:20:35.072559 csmd::info | dlopen() /usr/lib64/libdcgm.so.2 returned: /usr/lib64/libdcgm.so.2: cannot open shared object file: No such file or directory
[COMPUTE]2020-08-20 14:20:35.072593 csmd::info | dlopen() /usr/lib64/libdcgm.so.1 returned: /usr/lib64/libdcgm.so.1: cannot open shared object file: No such file or directory
[COMPUTE]2020-08-20 14:20:35.072622 csmd::info | dlopen() /usr/lib64/libdcgm.so returned: /usr/lib64/libdcgm.so: cannot open shared object file: No such file or directory
[COMPUTE]2020-08-20 14:20:35.072645 csmd::warning | Couldn't load libdcgm.so, CSM GPU functions are disabled.
# Run a simple set of tests to make sure basic CSM/DCGM integration is working, which fails, as expected
[root@c650f99p36 ~]# /u/besawn/bash/dcgm_checks.sh
Is nvidia-persistenced running: PASS
Is DCGM running: FAILED
On line 21: [systemctl -q is-active dcgm] returned 0
Fields not supported in DCGM 2.0
The second change in this PR concerns warning messages logged for GPU environmental data fields. With DCGM 1.X these fields could be collected and sent to the Big Data Store, but DCGM 2.X has removed access to them.
Before the changes in this PR, the GPU environmental data logging looked like this when running with DCGM 2.X:
The unsupported fields come in two types: fields that return a bad status when read, and fields that are always reported with a blank value. I updated the logging code in inv_dcgm_access.cc to distinguish these cases more clearly in the logs. After updating the logging code, the messages look like this:
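To illustrate the two cases named above (this is a sketch, not the log output and not the code in inv_dcgm_access.cc), the distinction between a failed read and a blank value could be expressed along these lines. Everything here is an assumption made for illustration: `FieldSample`, `FieldOutcome`, `kBlankValue`, `classify`, and `describe` are hypothetical names, a status of 0 is assumed to mean success, and the sentinel is a stand-in for whatever blank marker the DCGM headers define.

```cpp
#include <cstdint>
#include <limits>
#include <string>

// The three outcomes a log line needs to tell apart.
enum class FieldOutcome { Ok, BadStatus, Blank };

// Hypothetical stand-in for one sampled GPU field; not a DCGM type.
struct FieldSample {
    std::string name;
    int status;          // assumption: 0 means the read succeeded
    std::int64_t value;
};

// Hypothetical sentinel meaning "no value available"; DCGM defines its own.
constexpr std::int64_t kBlankValue = std::numeric_limits<std::int64_t>::max();

// Check the status first: a value is only meaningful if the read succeeded.
inline FieldOutcome classify(const FieldSample& s) {
    if (s.status != 0) return FieldOutcome::BadStatus;       // the read itself failed
    if (s.value == kBlankValue) return FieldOutcome::Blank;  // read ok, nothing reported
    return FieldOutcome::Ok;
}

// Build a distinct message for each outcome so the two failure modes
// are not collapsed into one generic warning.
inline std::string describe(const FieldSample& s) {
    switch (classify(s)) {
        case FieldOutcome::BadStatus:
            return s.name + ": read failed with status " + std::to_string(s.status);
        case FieldOutcome::Blank:
            return s.name + ": read succeeded but value is blank";
        case FieldOutcome::Ok:
            return s.name + ": " + std::to_string(s.value);
    }
    return s.name + ": unknown outcome";
}
```

Checking the status before the value matters in a sketch like this, because a blank marker is only meaningful once the read itself is known to have succeeded.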
The final change is to remove this set of fields from collection, as they are no longer supported. After removal, the log messages now look like this: