CAST can enhance the system management of cluster-wide resources. It consists of the open source tools: cluster system management (CSM) and burst buffer.
Eclipse Public License 1.0
CSM FVT updates related to dcgm setup with some other test case modif… #998
Add additional checks to existing test cases to reduce failures, along with some script modifications.
1. Enable the datacenter-gpu-manager daemon on all of the nodes where CSM is running in the cluster.
script: csmtest/setup/csm_install.sh
DCGM is running by default on the IST clusters.
This daemon should be configured to start at boot on all of the nodes running CSM in the cluster and/or started via the process that is used to start and stop the CSM daemons.
The fix ensures all the daemons are running on the appropriate CSM nodes. (Tweak of the original implementation)
Manual check after /u/wcmorris/CAST/csmtest/setup/csm_install.sh was run:
[root@c650f99p06 CAST]# xdsh all "systemctl status dcgm" | grep Active
c650f99p06: Active: active (running) since Wed 2021-03-03 16:55:42 EST; 4 days ago
c650f99p18: Active: active (running) since Wed 2021-03-03 16:55:42 EST; 4 days ago
c650f99p26: Active: active (running) since Wed 2021-03-03 16:55:42 EST; 4 days ago
c650f99p16: Active: active (running) since Wed 2021-03-03 16:55:43 EST; 4 days ago
c650f99p28: Active: active (running) since Wed 2021-03-03 16:55:42 EST; 4 days ago
c650f99p36: Active: active (running) since Wed 2021-03-03 16:55:43 EST; 4 days ago
c650f99p30: Active: active (running) since Wed 2021-03-03 16:07:59 EST; 4 days ago
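The manual check above could also be scripted. The following is a minimal sketch, not part of this PR: it assumes the text captured from `xdsh all "systemctl status dcgm" | grep Active` is passed in as a string, and the function name `nodes_not_running` and the `expected_nodes` list are hypothetical.

```python
def nodes_not_running(xdsh_output, expected_nodes):
    """Return the expected nodes that did not report 'active (running)'.

    Assumes each line looks like the xdsh output shown above:
    '<node>: Active: active (running) since ...'
    """
    running = set()
    for line in xdsh_output.splitlines():
        # Split '<node>: Active: ...' at the first colon only.
        node, sep, status = line.strip().partition(":")
        if sep and status.strip().startswith("Active: active (running)"):
            running.add(node.strip())
    # Nodes that are missing from the output count as not running.
    return [node for node in expected_nodes if node not in running]
```

An empty result means every expected node reported the daemon as running; a node that is absent from the output is reported the same way as one whose daemon is stopped.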
2. Enable the datacenter-gpu-manager daemon on all of the nodes where CSM is running in the cluster. (Similar to the previous update with some additional print output.)
Remove old CSM RPMs on Master
File clean up
Start Nvidia daemons
3. Additional package removal in the csmtest/setup/csm_uninstall.sh
Removing the ibm-csm-tools package to ensure there are no collisions while reinstalling different packages.
4. Test case that requires some additional retry attempts as the node is not ready for updating.
/buckets/basic/fvt_node_attributes_query_and_update.py ${SINGLE_COMPUTE}
After running regression a few times, this test case failed on multiple attempts. Based on some of the evidence, it seems that the node is "not available".
After retesting a few times, the test now seems to pass.
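For illustration only, a retry of this kind might look like the sketch below. It is not the actual revision made to the test case: `retry_until_success`, its parameters, and the `action` callable (a stand-in for whatever performs the node update and returns a return code) are all hypothetical.

```python
import time


def retry_until_success(action, attempts=5, delay_seconds=2.0, sleep=time.sleep):
    """Call action() until it returns 0 or the attempts run out.

    Returns the last return code seen, so the caller can still
    compare it against the expected RC.
    """
    rc = None
    for attempt in range(1, attempts + 1):
        rc = action()
        if rc == 0:
            return rc
        # Wait before the next attempt, but not after the final one.
        if attempt < attempts:
            sleep(delay_seconds)
    return rc
```

Passing `sleep` as a parameter keeps the helper easy to exercise without real delays; the attempt count and delay are placeholder values, not ones taken from the test bucket.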
Logging details: python_libraries.log
------------------------------------------------------------
Starting Python Libraries Bucket
------------------------------------------------------------
Fri Mar 5 15:25:47 EST 2021
------------------------------------------------------------
[2021-03-05 15:25:47.5514] Test Case 1: Inventory Library - fvt_node_attributes_query_and_update.py: FAILED
[2021-03-05 15:25:47.6347] Test Case 2: Workload Manager Library - fvt_allocation_create_and_delete.py: FAILED
Logging details: python_libraries_flags.log
Test Case 1: Inventory Library - fvt_node_attributes_query_and_update.py
No matching records found.
Logging details: python_libraries_temp.log:
[csmapi][warning] /u/wcmorris/CAST/csmi/src/common/src/csmi_common_utils.c-147: the Error Flag Set
[csmapi][error] csmi_sendrecv_cmd failed: 46 - csm_allocation_create[220330596]; Allocation ID: 1; Primary Job Id: 1; Secondary Job Id: 0;Database Error Message: The following nodes were not available: ;The following nodes were not found: c650f99p18 ; Message: Allocation is being reverted; Unable to reserve nodes; Message: Allocation was successfully reverted;
Create Failed
Test Case 2: Workload Manager Library - fvt_allocation_create_and_delete.py
Expected RC: 0
Actual RC: 46
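When triaging failures like the one above, it can help to pull the node names out of the database error message. The sketch below is an assumption-laden helper, not part of CSM: the function name is hypothetical, and it assumes node names in each list are separated by whitespace or commas and that each list ends at the next semicolon, as in the logged message.

```python
import re


def nodes_from_allocation_error(message):
    """Extract the 'not available' and 'not found' node lists from an error message."""

    def names(label):
        # Capture everything between the label and the next semicolon.
        match = re.search(re.escape(label) + r"\s*([^;]*);", message)
        if not match:
            return []
        return [name for name in re.split(r"[,\s]+", match.group(1)) if name]

    return {
        "not_available": names("The following nodes were not available:"),
        "not_found": names("The following nodes were not found:"),
    }
```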
After the test case revision:
[root@c650f99p06 basic]# cat python_libraries.log
------------------------------------------------------------
Starting Python Libraries Bucket
------------------------------------------------------------
Fri Mar 5 16:09:46 EST 2021
------------------------------------------------------------
[2021-03-05 16:09:47.4484] Test Case 1: Inventory Library - fvt_node_attributes_query_and_update.py: PASS
[2021-03-05 16:09:47.5393] Test Case 2: Workload Manager Library - fvt_allocation_create_and_delete.py: PASS
------------------------------------------------------------
Python Libraries Bucket COMPLETED
------------------------------------------------------------
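A quick way to compare the two runs is to list the failed scripts from each bucket log. This is a minimal sketch that assumes result lines follow the format seen in python_libraries.log above; the names `RESULT_LINE` and `failed_test_cases` are hypothetical.

```python
import re

# Matches lines such as:
# [2021-03-05 15:25:47.5514] Test Case 1: Inventory Library - some_test.py: FAILED
RESULT_LINE = re.compile(
    r"^\[(?P<timestamp>[^\]]+)\]\s+Test Case (?P<number>\d+): "
    r"(?P<library>.+?) - (?P<script>\S+): (?P<status>PASS|FAILED)\s*$"
)


def failed_test_cases(log_text):
    """Return the script names of all test cases marked FAILED in a bucket log."""
    failed = []
    for line in log_text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match and match.group("status") == "FAILED":
            failed.append(match.group("script"))
    return failed
```

Separator lines, the date line, and the bucket start and completion lines do not match the pattern and are skipped.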
TODOs
[ ] @williammorrison2 to review (CSM FVT regression)