IBM / CAST

CAST can enhance the system management of cluster-wide resources. It consists of the open source tools: cluster system management (CSM) and burst buffer.
Eclipse Public License 1.0

CSM FVT updates related to dcgm setup with some other test case modifications #998

Closed williammorrison2 closed 2 years ago

williammorrison2 commented 3 years ago

Purpose

Add additional checks to existing test cases to reduce failures, along with some script modifications.

1. Enable the datacenter-gpu-manager daemon on all of the nodes where CSM is running in the cluster.

script: csmtest/setup/csm_install.sh

DCGM is running by default on the IST clusters. This daemon should be configured to start at boot on all of the nodes running CSM in the cluster and/or started via the process that is used to start and stop the CSM daemons. The fix ensures all the daemons are running on the appropriate CSM nodes. (Tweak of the original implementation.)

Manual check after /u/wcmorris/CAST/csmtest/setup/csm_install.sh was run:

[root@c650f99p06 CAST]# xdsh all "systemctl status dcgm" | grep Active
c650f99p06:    Active: active (running) since Wed 2021-03-03 16:55:42 EST; 4 days ago
c650f99p18:    Active: active (running) since Wed 2021-03-03 16:55:42 EST; 4 days ago
c650f99p26:    Active: active (running) since Wed 2021-03-03 16:55:42 EST; 4 days ago
c650f99p16:    Active: active (running) since Wed 2021-03-03 16:55:43 EST; 4 days ago
c650f99p28:    Active: active (running) since Wed 2021-03-03 16:55:42 EST; 4 days ago
c650f99p36:    Active: active (running) since Wed 2021-03-03 16:55:43 EST; 4 days ago
c650f99p30:    Active: active (running) since Wed 2021-03-03 16:07:59 EST; 4 days ago
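
The manual check above could be automated along these lines. This is a minimal sketch, not part of csm_install.sh: the function name `inactive_dcgm_nodes` is hypothetical, and it assumes the `<node>:    Active: <state> ...` line format shown in the output above.

```shell
#!/bin/bash
# Hypothetical helper (not taken from csm_install.sh).
# Reads lines of the form "<node>:    Active: <state> ..." on stdin and
# prints the name of every node whose dcgm unit is NOT "active (running)".
inactive_dcgm_nodes() {
    awk '/Active:/ {
        node = $1
        sub(/:$/, "", node)          # strip the trailing colon from the node name
        if ($0 !~ /Active: active \(running\)/) print node
    }'
}

# Example use on the management node (assumes xdsh and the node group "all",
# as in the manual check above):
#   xdsh all "systemctl status dcgm" | inactive_dcgm_nodes
```

An empty result would mean every node reported the daemon as running. Any node it prints could then be inspected by hand, for example with `systemctl enable --now dcgm` if the unit was simply not enabled.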

2. Enable the datacenter-gpu-manager daemon on all of the nodes where CSM is running in the cluster. (Similar to the previous update with some additional print output.)

3. Additional package removal in the csmtest/setup/csm_uninstall.sh

Removing the ibm-csm-tools package to ensure there are no collisions while reinstalling different packages.

@@ -100,6 +100,7 @@ curr_rpm_list+=`rpm -qa | grep ibm-flightlog`
 curr_rpm_list+=`rpm -qa | grep ibm-csm-db`
 curr_rpm_list+=`rpm -qa | grep ibm-csm-restd`
 curr_rpm_list+=`rpm -qa | grep ibm-csm-bds`
+curr_rpm_list+=`rpm -qa | grep ibm-csm-tools`
 curr_rpm_list=${curr_rpm_list//.ppc64le/.ppc64le }
 curr_rpm_list=${curr_rpm_list//.noarch/.noarch }
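
For readers unfamiliar with the two substitution lines at the end of the hunk, the following sketch shows what they appear to do. Command substitution strips the trailing newline, so the appended package names run together, and the `${var//pattern/replacement}` expansions insert a space after each architecture suffix. The package versions here are made up for illustration and `rpm` is not actually called.

```shell
#!/bin/bash
# Simulate the chained appends from csm_uninstall.sh with made-up
# package versions (no real rpm query is performed here).
curr_rpm_list=""
curr_rpm_list+="ibm-csm-bds-1.0.0-1.noarch"
curr_rpm_list+="ibm-csm-tools-1.0.0-1.ppc64le"

# At this point the two names are fused into a single word.
# Insert a space after each architecture suffix to separate them again.
curr_rpm_list=${curr_rpm_list//.ppc64le/.ppc64le }
curr_rpm_list=${curr_rpm_list//.noarch/.noarch }

# Word splitting now yields one entry per package.
for pkg in $curr_rpm_list; do
    echo "would remove: $pkg"
done
```

Under this reading, the added `ibm-csm-tools` line needs no separator of its own, because the suffix substitutions separate every entry in the list.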

4. Test case that requires some additional retry attempts as the node is not ready for updating.

/buckets/basic/fvt_node_attributes_query_and_update.py ${SINGLE_COMPUTE}

After running regression a few times, this test case failed on multiple attempts. Based on some of the evidence, it seems that the node is "not available". On retesting a few times, the test now seems to pass.

Logging details: python_libraries.log

------------------------------------------------------------
             Starting Python Libraries Bucket
------------------------------------------------------------
Fri Mar  5 15:25:47 EST 2021
------------------------------------------------------------
[2021-03-05 15:25:47.5514] Test Case 1: Inventory Library - fvt_node_attributes_query_and_update.py:                                        FAILED
[2021-03-05 15:25:47.6347] Test Case 2: Workload Manager Library - fvt_allocation_create_and_delete.py:                                     FAILED

Logging details: python_libraries_flags.log

Test Case 1: Inventory Library - fvt_node_attributes_query_and_update.py
No matching records found.

Logging details: python_libraries_temp.log:

[csmapi][warning]   /u/wcmorris/CAST/csmi/src/common/src/csmi_common_utils.c-147: the Error Flag Set
[csmapi][error] csmi_sendrecv_cmd failed: 46 - csm_allocation_create[220330596]; Allocation ID: 1; Primary Job Id: 1; Secondary Job Id: 0;Database Error Message: The following nodes were not available:  ;The following nodes were not found: c650f99p18 ; Message: Allocation is being reverted; Unable to reserve nodes; Message: Allocation was successfully reverted;
Create Failed
Test Case 2: Workload Manager Library - fvt_allocation_create_and_delete.py
Expected RC: 0
Actual RC: 46

After the test case revision:

[root@c650f99p06 basic]# cat python_libraries.log
------------------------------------------------------------
             Starting Python Libraries Bucket
------------------------------------------------------------
Fri Mar  5 16:09:46 EST 2021
------------------------------------------------------------
[2021-03-05 16:09:47.4484] Test Case 1: Inventory Library - fvt_node_attributes_query_and_update.py:                                          PASS
[2021-03-05 16:09:47.5393] Test Case 2: Workload Manager Library - fvt_allocation_create_and_delete.py:                                       PASS
------------------------------------------------------------
             Python Libraries Bucket COMPLETED
------------------------------------------------------------
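
The retry idea from item 4 can be sketched in shell roughly as follows. This is an illustration only, not the change made in this PR: the `retry` function, its argument order, and the attempt and delay values are all assumptions.

```shell
#!/bin/bash
# Hypothetical retry wrapper (name, arguments, and defaults are illustrative,
# not taken from the CSM FVT scripts).
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() {
    local max_attempts=$1 delay=$2 attempt=1
    shift 2
    # Re-run the command until it succeeds or the attempt budget is spent.
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        attempt=$((attempt + 1))
        sleep "$delay"
    done
}

# Example (assumed invocation; path and variable as quoted in item 4):
#   retry 3 5 /buckets/basic/fvt_node_attributes_query_and_update.py ${SINGLE_COMPUTE}
```

A fixed delay keeps the sketch short. If the node takes a variable amount of time to become ready, polling its state before the first attempt may be a better fit than blind retries.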

TODOs