IBM / CAST

CAST can enhance the system management of cluster-wide resources. It consists of the open source tools: cluster system management (CSM) and burst buffer.
Eclipse Public License 1.0

CSM FVT updates related to hcdiag ppping and node recovering test cas… #996

Closed williammorrison2 closed 3 years ago

williammorrison2 commented 3 years ago

…es along with hcdiag environmental configs.

Purpose

Hcdiag ppping test case:

The ppping failure is most likely caused by the network interface needing more traffic to "prime the pump". A retry was added to the test.
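A minimal sketch of such a retry, assuming a generic bash helper. The function name `retry`, its arguments, and the wrapper `run_ppping_test` in the usage comment are hypothetical and are not taken from the FVT scripts:

```bash
#!/bin/bash
# Hypothetical helper (not from the CSM FVT scripts):
# run a command up to max_attempts times, sleeping delay seconds between tries.
# Returns 0 on the first success, otherwise the exit code of the last attempt.
retry() {
    local max_attempts=$1
    local delay=$2
    shift 2
    local attempt
    local rc=1
    for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
        "$@"
        rc=$?
        if (( rc == 0 )); then
            return 0
        fi
        echo "attempt ${attempt}/${max_attempts} failed (rc=${rc})" >&2
        if (( attempt < max_attempts )); then
            sleep "$delay"
        fi
    done
    return "$rc"
}

# Usage (run_ppping_test is a placeholder for the real hcdiag invocation):
#   retry 3 5 run_ppping_test
```

The earlier failed attempts also generate the extra network traffic that the first attempt was missing, which is why a plain retry can be enough here.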

Node.sh Node recovering test case:

From time to time during a regression run, the node bucket (run_bucket "basic" "node" in the complete FVT script) fails while RECOVERING the compute node(s). Example output from the node.log file in /opt/ibm/csm/test/results/buckets/basic:

#############################################

Finished. Cleaning up...
Test complete: rc=0
------------------------------------------------------------
Test Case 1: csm_node_resources_query_all:                                                               PASS
Test Case 1: check node_ready=n:                                                                         PASS
Test Case 2: Calling csm_node_resources_query on 1 node:                                                 PASS
Test Case 3: csm_node_resources_query on all nodes:                                                      PASS
Test Case 4: Calling csm_node_attributes_query on 1 node:                                                PASS
Test Case 5: Calling csm_node_attributes_query on all nodes:                                             PASS
Test Case 6: Calling csm_node_attributes_update to state=IN_SERVICE on all nodes:                        PASS
Test Case 7: Calling csm_node_attributes_query on all nodes:                                             PASS
Test Case 7: Checking for state=IN_SERVICE:                                                              PASS
Test Case 8: Calling csm_node_query_state_history on 1 node:                                             PASS
Test Case 8: Checking for state=IN_SERVICE and CSM_API:                                                  PASS
Test Case 9: Calling csm_node_attributes_query_details on 1 node:                                        PASS
Test Case 10: csm_node_attributes_query_details on all nodes (error expected):                           PASS
Test Case 11: Calling csm_node_attributes_query_history on 1 node:                                       PASS
Test Case 12: Calling csm_node_attributes_query_history on all nodes (error expected):                   PASS
Test Case 13: calling csm_node_delete:                                                                   PASS
RECOVERING c650f99p18...
FAILED
------------------------------------------------------------
                node Bucket COMPLETED
------------------------------------------------------------
Additional Flags:

------------------------------------------------------------

After adding multiple attempts and cleaning up the logic, the test case now checks each recovery step and raises a flag if an error occurs:

[2021-02-25 14:41:22.5177] Test Case 11:  Calling csm_node_attributes_query_history on 1 node:                                                PASS
[2021-02-25 14:41:22.5284] Test Case 12:  Calling csm_node_attributes_query_history on all nodes (error expected):                            PASS
[2021-02-25 14:41:22.5583] Test Case 13:  calling csm_node_delete:                                                                            PASS
[2021-02-25 14:41:26.0758] Test Case 14:  RECOVERING c650f99p18 stopping csmd-compute:                                                        PASS
[2021-02-25 14:41:27.0816] Test Case 15:  RECOVERING c650f99p18 start csmd-compute:                                                           PASS
[2021-02-25 14:41:27.0931] Test Case 16:  csm_node_resources_query_all c650f99p18:                                                            PASS
[2021-02-25 14:41:28.1299] Test Case 17:  csm_node_attributes_update c650f99p18 IN_SERVICE:                                                   PASS
------------------------------------------------------------
                node Bucket COMPLETED
------------------------------------------------------------
Additional Flags:
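The multiple-attempt recovery shown in the log above could be structured roughly as follows. This is a sketch under assumptions, not the code from node.sh: `run_step`, `recover_node`, the `FLAGS` array, and the four step functions (`stop_compute_daemon`, `start_compute_daemon`, `query_node_resources`, `set_node_in_service`) are hypothetical names, and the step functions are expected to be supplied by the caller.

```bash
#!/bin/bash
# Sketch only: all helper names below are assumptions, not the PR's code.
MAX_ATTEMPTS=${MAX_ATTEMPTS:-3}   # tries per recovery step
RETRY_DELAY=${RETRY_DELAY:-2}     # seconds between tries
FLAGS=()                          # collected error flags

# Run one recovery step up to MAX_ATTEMPTS times and log PASS or FAILED.
run_step() {
    local desc=$1
    shift
    local attempt
    for (( attempt = 1; attempt <= MAX_ATTEMPTS; attempt++ )); do
        if "$@" > /dev/null 2>&1; then
            printf '%-70s PASS\n' "${desc}:"
            return 0
        fi
        if (( attempt < MAX_ATTEMPTS )); then
            sleep "$RETRY_DELAY"
        fi
    done
    printf '%-70s FAILED\n' "${desc}:"
    FLAGS+=("${desc} failed after ${MAX_ATTEMPTS} attempts")
    return 1
}

# Recover one compute node; stops at the first step that cannot be completed.
# The four step functions are placeholders the caller must define.
recover_node() {
    local node=$1
    run_step "RECOVERING ${node} stopping csmd-compute" stop_compute_daemon "$node" || return 1
    run_step "RECOVERING ${node} start csmd-compute" start_compute_daemon "$node" || return 1
    run_step "csm_node_resources_query_all ${node}" query_node_resources "$node" || return 1
    run_step "csm_node_attributes_update ${node} IN_SERVICE" set_node_in_service "$node" || return 1
}
```

Reporting each step on its own line makes it clear which part of the recovery failed, instead of a single FAILED after the RECOVERING message.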

Hcdiag (automating the clustconf.yaml and test.properties scripts for both P8 and P9 environments)

I added a step to the /csmtest/buckets/basic/hcdiag.sh script that builds the appropriate environment details, so that certain test cases pass in either the P8 or the P9 environment.
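One way such a step could work is to detect the processor generation and copy matching files into place. The sketch below assumes per-architecture template files named `clustconf.yaml.p8`, `clustconf.yaml.p9`, `test.properties.p8`, and `test.properties.p9`. Those names, the directory arguments, and both function names are hypothetical, and the actual hcdiag.sh change may build the files differently.

```bash
#!/bin/bash
# Sketch only: template names and directory layout are hypothetical.

# Print "p8" or "p9" based on the first POWERn model string in a cpuinfo file.
detect_power_arch() {
    local cpuinfo=${1:-/proc/cpuinfo}
    local model
    model=$(grep -m1 -oE 'POWER[0-9]+' "$cpuinfo" || true)
    case "$model" in
        POWER8) echo "p8" ;;
        POWER9) echo "p9" ;;
        *)
            echo "unsupported CPU model: ${model:-unknown}" >&2
            return 1
            ;;
    esac
}

# Copy the templates for the given architecture into dest_dir
# as clustconf.yaml and test.properties.
build_hcdiag_config() {
    local arch=$1
    local template_dir=$2
    local dest_dir=$3
    local name
    for name in clustconf.yaml test.properties; do
        cp "${template_dir}/${name}.${arch}" "${dest_dir}/${name}" || return 1
    done
}

# Usage (both directories are placeholders):
#   arch=$(detect_power_arch) && build_hcdiag_config "$arch" ./templates ./config
```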

With the default scripts, the following test cases FAIL in either the P8 or the P9 environment:

[2021-02-26 10:23:57.2348] Test Case 15: chk-smt:                                                                                           FAILED
[2021-02-26 10:24:01.7011] Test Case 16: chk-temp:                                                                                          FAILED
[2021-02-26 10:24:22.5164] Test Case 21: chk-cpu:                                                                                           FAILED
[2021-02-26 10:24:26.8566] Test Case 22: chk-cpu-count:                                                                                     FAILED
[2021-02-26 10:24:32.5258] Test Case 23: chk-sys-firmware:                                                                                  FAILED
[2021-02-26 10:24:36.8624] Test Case 24: chk-memory:                                                                                        FAILED
[2021-02-26 10:24:45.2858] Test Case 26: chk-os:                                                                                            FAILED
[2021-02-26 10:24:49.7102] Test Case 27: chk-temp:                                                                                          FAILED

After the change, the hcdiag bucket was retested and all cases passed in both the P8 and P9 environments:

[2021-02-26 13:10:58.3516] Test Case 15: chk-smt:                                                                                             PASS
[2021-02-26 13:11:03.4401] Test Case 16: chk-temp:                                                                                            PASS
[2021-02-26 13:11:31.1820] Test Case 21: chk-cpu:                                                                                             PASS
[2021-02-26 13:11:36.3790] Test Case 22: chk-cpu-count:                                                                                       PASS
[2021-02-26 13:11:42.1085] Test Case 23: chk-sys-firmware:                                                                                    PASS
[2021-02-26 13:11:50.5538] Test Case 24: chk-memory:                                                                                          PASS
[2021-02-26 13:12:00.5975] Test Case 26: chk-os:                                                                                              PASS
[2021-02-26 13:12:05.7110] Test Case 27: chk-temp:                                                                                            PASS
[2021-02-26 13:12:05.7193] Cleanup     : clustconf.yaml file:                                                                                 PASS

TODOs