IBM / CAST

CAST can enhance the system management of cluster-wide resources. It consists of two open source tools: Cluster System Management (CSM) and Burst Buffer.
Eclipse Public License 1.0

CSM RHEL 8: new warning messages in CSM log files when running with DCGM 2.X compared to DCGM 1.X #953

Closed: besawn closed this issue 4 years ago

besawn commented 4 years ago

Library name change

In DCGM 2.X the library libdcgm.so.1 has been renamed to libdcgm.so.2, so the first load attempt by CSM fails with the following warning:

[COMPUTE]2020-08-19 09:43:28.282258       csmd::warning  | dlopen() /usr/lib64/libdcgm.so.1 returned: /usr/lib64/libdcgm.so.1: cannot open shared object file: No such file or directory

However, CSM then falls back to the unversioned name libdcgm.so, and that load succeeds.
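
The fallback behaviour described above can be sketched as a loop over candidate library names. This is only an illustration, not CSM's actual loader: the names `load_first_available`, `open_fn`, `DCGM_CANDIDATES` and `stub_open_unversioned_only` are hypothetical, and trying libdcgm.so.2 first is an assumption about one way to avoid the warning. The opener is passed in as a function pointer so the search order can be exercised on a machine without DCGM installed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Signature-compatible with dlopen(const char *filename, int flags). */
typedef void *(*open_fn)(const char *filename, int flags);

/* Hypothetical search order: newest soname first, unversioned name last. */
const char *const DCGM_CANDIDATES[] = {
    "/usr/lib64/libdcgm.so.2",
    "/usr/lib64/libdcgm.so.1",
    "/usr/lib64/libdcgm.so",
};

/*
 * Try each candidate in order and return the first handle that opens.
 * On return, *loaded_name (if non-NULL) holds the name that worked,
 * or NULL when every candidate failed.
 */
void *load_first_available(const char *const names[], size_t count,
                           open_fn opener, int flags,
                           const char **loaded_name)
{
    assert(names != NULL && opener != NULL);

    for (size_t i = 0; i < count; ++i) {
        void *handle = opener(names[i], flags);
        if (handle != NULL) {
            if (loaded_name != NULL)
                *loaded_name = names[i];
            return handle;
        }
    }
    if (loaded_name != NULL)
        *loaded_name = NULL;
    return NULL;
}

/* Stand-in opener simulating a node where only libdcgm.so can be opened. */
void *stub_open_unversioned_only(const char *filename, int flags)
{
    static int fake_handle;
    (void)flags;
    return strcmp(filename, "/usr/lib64/libdcgm.so") == 0
               ? (void *)&fake_handle
               : NULL;
}
```

In a real daemon the opener would be `dlopen` from `<dlfcn.h>` with a flag such as `RTLD_LAZY`, and a warning would only need to be logged if every candidate failed.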

Fields not supported in DCGM 2.0 (type 1)

[COMPUTE]2020-08-19 09:50:00.005186     csmenv::warning  | GPU 0 nvlink_bandwidth_l0 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005288     csmenv::warning  | GPU 0 nvlink_bandwidth_l0 version = 16781336 fieldId = 440 fieldType = 105 status = -6
[COMPUTE]2020-08-19 09:50:00.005314     csmenv::warning  | GPU 0 nvlink_bandwidth_l1 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005334     csmenv::warning  | GPU 0 nvlink_bandwidth_l1 version = 16781336 fieldId = 441 fieldType = 105 status = -6
[COMPUTE]2020-08-19 09:50:00.005353     csmenv::warning  | GPU 0 nvlink_bandwidth_l2 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005372     csmenv::warning  | GPU 0 nvlink_bandwidth_l2 version = 16781336 fieldId = 442 fieldType = 105 status = -6
[COMPUTE]2020-08-19 09:50:00.005391     csmenv::warning  | GPU 0 nvlink_bandwidth_l3 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005410     csmenv::warning  | GPU 0 nvlink_bandwidth_l3 version = 16781336 fieldId = 443 fieldType = 105 status = -6
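
The numeric values in these warnings can be decoded by hand. The sketch below rests on three assumptions that are not stated in this issue: that the logged `version` follows DCGM's `MAKE_DCGM_VERSION` convention (struct size in the low 24 bits, struct version in the high 8 bits), that `fieldType` is the ASCII type code used in dcgm_fields.h, and that status -6 corresponds to `DCGM_ST_NOT_SUPPORTED`. The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed value of DCGM_ST_NOT_SUPPORTED; check dcgm_structs.h to confirm. */
#define ASSUMED_DCGM_ST_NOT_SUPPORTED (-6)

/* Assumed layout: high 8 bits carry the struct version number. */
unsigned int decoded_struct_version(uint32_t version)
{
    return version >> 24;
}

/* Assumed layout: low 24 bits carry sizeof(struct) in bytes. */
unsigned int decoded_struct_size(uint32_t version)
{
    return version & 0x00FFFFFFu;
}

/* fieldType is logged as an integer; read it back as an ASCII character. */
char decoded_field_type(int field_type)
{
    return (char)field_type;
}

/* True when the logged status matches the assumed "not supported" code. */
bool status_is_not_supported(int status)
{
    return status == ASSUMED_DCGM_ST_NOT_SUPPORTED;
}
```

Under those assumptions, version = 16781336 decodes to struct version 1 with a size of 4120 bytes, fieldType = 105 is the character 'i', and status = -6 indicates that the field is not supported.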

The fields above are described like this in /usr/include/dcgm_fields.h:

/*
 * NV Link Bandwidth Counter for Lane 0 - Not supported in DCGM 2.0
 */
#define DCGM_FI_DEV_NVLINK_BANDWIDTH_L0 440

/*
 * NV Link Bandwidth Counter for Lane 1 - Not supported in DCGM 2.0
 */
#define DCGM_FI_DEV_NVLINK_BANDWIDTH_L1 441

/*
 * NV Link Bandwidth Counter for Lane 2 - Not supported in DCGM 2.0
 */
#define DCGM_FI_DEV_NVLINK_BANDWIDTH_L2 442

/*
 * NV Link Bandwidth Counter for Lane 3 - Not supported in DCGM 2.0
 */
#define DCGM_FI_DEV_NVLINK_BANDWIDTH_L3 443

/*
 * NV Link Bandwidth Counter for Lane 4 - Not supported in DCGM 2.0
 */
#define DCGM_FI_DEV_NVLINK_BANDWIDTH_L4 444

/*
 * NV Link Bandwidth Counter for Lane 5 - Not supported in DCGM 2.0
 */
#define DCGM_FI_DEV_NVLINK_BANDWIDTH_L5 445
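
Given the header comments above, one way a collector could avoid these warnings is to drop the per-lane bandwidth field IDs from its request list when running against DCGM 2.X. This is a sketch of that idea only; it is not necessarily what CSM does, and the function names are hypothetical. The two macros repeat the values from dcgm_fields.h so the example compiles without the DCGM headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Values copied from dcgm_fields.h; guarded in case the header is included. */
#ifndef DCGM_FI_DEV_NVLINK_BANDWIDTH_L0
#define DCGM_FI_DEV_NVLINK_BANDWIDTH_L0 440
#endif
#ifndef DCGM_FI_DEV_NVLINK_BANDWIDTH_L5
#define DCGM_FI_DEV_NVLINK_BANDWIDTH_L5 445
#endif

/* True for the per-lane bandwidth counters marked "Not supported in DCGM 2.0". */
bool is_nvlink_bandwidth_lane_field(unsigned short field_id)
{
    return field_id >= DCGM_FI_DEV_NVLINK_BANDWIDTH_L0 &&
           field_id <= DCGM_FI_DEV_NVLINK_BANDWIDTH_L5;
}

/*
 * Compact the array in place, keeping only field IDs outside 440..445.
 * Returns the number of IDs that remain at the front of the array.
 */
size_t drop_unsupported_fields(unsigned short *field_ids, size_t count)
{
    assert(field_ids != NULL || count == 0);

    size_t kept = 0;
    for (size_t i = 0; i < count; ++i) {
        if (!is_nvlink_bandwidth_lane_field(field_ids[i]))
            field_ids[kept++] = field_ids[i];
    }
    return kept;
}
```

Filtering before the request keeps the relative order of the remaining fields, so any code that indexes results by position in the list would need to use the compacted list as well.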

Fields not supported in DCGM 2.0 (type 2)

[COMPUTE]2020-08-19 09:50:00.005428     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l0 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005447     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l0 version = 16781336 fieldId = 400 fieldType = 105 status = 0
[COMPUTE]2020-08-19 09:50:00.005466     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l1 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005484     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l1 version = 16781336 fieldId = 401 fieldType = 105 status = 0
[COMPUTE]2020-08-19 09:50:00.005503     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l2 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005521     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l2 version = 16781336 fieldId = 402 fieldType = 105 status = 0
[COMPUTE]2020-08-19 09:50:00.005540     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l3 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005558     csmenv::warning  | GPU 0 nvlink_flit_crc_error_count_l3 version = 16781336 fieldId = 403 fieldType = 105 status = 0
[COMPUTE]2020-08-19 09:50:00.005576     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l0 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005594     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l0 version = 16781336 fieldId = 410 fieldType = 105 status = 0
[COMPUTE]2020-08-19 09:50:00.005619     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l1 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005637     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l1 version = 16781336 fieldId = 411 fieldType = 105 status = 0
[COMPUTE]2020-08-19 09:50:00.005655     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l2 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005672     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l2 version = 16781336 fieldId = 412 fieldType = 105 status = 0
[COMPUTE]2020-08-19 09:50:00.005690     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l3 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005707     csmenv::warning  | GPU 0 nvlink_data_crc_error_count_l3 version = 16781336 fieldId = 413 fieldType = 105 status = 0
[COMPUTE]2020-08-19 09:50:00.005725     csmenv::warning  | GPU 0 nvlink_replay_error_count_l0 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005743     csmenv::warning  | GPU 0 nvlink_replay_error_count_l0 version = 16781336 fieldId = 420 fieldType = 105 status = 0
[COMPUTE]2020-08-19 09:50:00.005761     csmenv::warning  | GPU 0 nvlink_replay_error_count_l1 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005778     csmenv::warning  | GPU 0 nvlink_replay_error_count_l1 version = 16781336 fieldId = 421 fieldType = 105 status = 0
[COMPUTE]2020-08-19 09:50:00.005796     csmenv::warning  | GPU 0 nvlink_replay_error_count_l2 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005813     csmenv::warning  | GPU 0 nvlink_replay_error_count_l2 version = 16781336 fieldId = 422 fieldType = 105 status = 0
[COMPUTE]2020-08-19 09:50:00.005831     csmenv::warning  | GPU 0 nvlink_replay_error_count_l3 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005848     csmenv::warning  | GPU 0 nvlink_replay_error_count_l3 version = 16781336 fieldId = 423 fieldType = 105 status = 0
[COMPUTE]2020-08-19 09:50:00.005866     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l0 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005883     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l0 version = 16781336 fieldId = 430 fieldType = 105 status = 0
[COMPUTE]2020-08-19 09:50:00.005901     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l1 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005918     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l1 version = 16781336 fieldId = 431 fieldType = 105 status = 0
[COMPUTE]2020-08-19 09:50:00.005936     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l2 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005953     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l2 version = 16781336 fieldId = 432 fieldType = 105 status = 0
[COMPUTE]2020-08-19 09:50:00.005971     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l3 unexpected case!
[COMPUTE]2020-08-19 09:50:00.005988     csmenv::warning  | GPU 0 nvlink_recovery_error_count_l3 version = 16781336 fieldId = 433 fieldType = 105 status = 0
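
The two groups of warnings differ only in the reported status (-6 for the bandwidth fields, 0 for the error-count fields), so a log can be sorted into them mechanically. The following sketch parses the detail lines shown in this issue; the struct and function names are hypothetical, and the format string is derived solely from the sample lines, so it may not match other CSM log layouts.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Values extracted from one "version = ... status = ..." warning line. */
struct field_warning {
    int gpu;
    char name[64];
    unsigned int version;
    unsigned int field_id;
    int field_type;
    int status;
};

/*
 * Parse a csmenv warning detail line. Returns true only when all six
 * values are present; "unexpected case!" lines therefore return false.
 */
bool parse_field_warning(const char *line, struct field_warning *out)
{
    assert(line != NULL && out != NULL);

    const char *payload = strstr(line, "| GPU ");
    if (payload == NULL)
        return false;

    int matched = sscanf(payload,
                         "| GPU %d %63s version = %u fieldId = %u"
                         " fieldType = %d status = %d",
                         &out->gpu, out->name, &out->version,
                         &out->field_id, &out->field_type, &out->status);
    return matched == 6;
}

/* Group 1 lines carry a non-zero status; group 2 lines carry status 0. */
int warning_group(const struct field_warning *w)
{
    return w->status != 0 ? 1 : 2;
}
```

Feeding each log line through `parse_field_warning` and counting by `warning_group` gives a quick check of whether a node shows only the expected DCGM 2.X warnings.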

besawn commented 4 years ago

This issue was resolved via https://github.com/IBM/CAST/pull/954.