IBM / CAST

CAST can enhance the system management of cluster-wide resources. It consists of the open source tools: cluster system management (CSM) and burst buffer.
Eclipse Public License 1.0

Adding in a new test bucket in the advanced section (validate the CSM allocation metrics after a job has completed) #978

Closed williammorrison2 closed 3 years ago

williammorrison2 commented 3 years ago

Purpose

Validate the CSM allocation metrics after a job has completed.

CSM collects a variety of statistics during the life of a job and writes them to the allocation record after the job completes. The statistics can be viewed with the csm_allocation_query_details command. Checking them is a useful sanity check: for example, when the cgroup configuration is not set up correctly, cpu_usage is reported as 0 even if the job ran many compute-intensive tasks.

This testing catches a range of integration issues, because different statistics validate CSM integration with different components. In the example below, the gpfs_read and gpfs_write values of -1 indicate that integration with GPFS data monitoring was not configured or not working when the job ran. Values of 0 or -1 for gpu_energy, gpu_usage, or cpu_usage on long-lived GPU- or CPU-intensive jobs may also indicate an integration problem.

Example output:

[root@c650f99p06 c650mnp05]# /opt/ibm/csm/bin/csm_allocation_query_details -a 1
---
allocation_id:                  1
primary_job_id:                 1023
secondary_job_id:               0
num_nodes:                      1
compute_nodes:
  c650f99p36:
   ib_rx:                       30960
   ib_tx:                       6768
   gpfs_read:                   -1
   gpfs_write:                  -1
   energy_consumed:             162540
   power_cap:                   3050
   power_shifting_ratio:        100
   power_cap_hit:               0
   gpu_energy:                  126698
   gpu_usage:                   736998220
   cpu_usage:                   791151769374
   memory_usage_max:            864813056
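
A minimal sketch of how the values called out above could be screened automatically. It is an illustration, not the code from the test bucket: the function name flag_suspect_metrics is hypothetical, and it assumes the "key: value" layout of the example output arrives on standard input.

```shell
#!/bin/bash
# Hypothetical helper: reads csm_allocation_query_details output on stdin and
# prints one line per metric whose value suggests an integration problem.
flag_suspect_metrics() {
    awk '
        # Skip lines that carry no value (separators, section headers).
        NF < 2 { next }
        # Split "key:  value" into a bare key and its value.
        { key = $1; sub(/:$/, "", key); val = $2 }
        # -1 for the GPFS counters means the data was never collected.
        (key == "gpfs_read" || key == "gpfs_write") && val == -1 {
            print key " = " val " (GPFS monitoring not configured or not working?)"
        }
        # 0 or -1 is suspicious for a job that did real GPU or CPU work.
        (key == "gpu_energy" || key == "gpu_usage" || key == "cpu_usage") &&
        (val == 0 || val == -1) {
            print key " = " val " (possible GPU or cgroup integration problem)"
        }
    '
}
```

Piping the command into the function, as in `/opt/ibm/csm/bin/csm_allocation_query_details -a 1 | flag_suspect_metrics`, should report only gpfs_read and gpfs_write for the output shown above.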

How to Test

The script checks its preconditions before creating the allocation, then runs a workload and inspects the resulting metrics.

  1. Make sure the nodes are IN_SERVICE (if not, FAIL out of the script).
  2. Create an allocation (if this fails, FAIL out of the script).
  3. Do some GPU work on the node.
  4. Do some CPU work on the node.
  5. Check the expected values in the results of the allocation:
     a. gpfs_read
     b. gpfs_write
     c. energy_consumed
     d. power_cap_hit
     e. gpu_energy
     f. gpu_usage
     g. cpu_usage
     h. memory_usage_max
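
The "FAIL out of script" behaviour in steps 1 and 2 can be sketched as a small wrapper. This is only an assumption about how such a guard might look: run_step is a hypothetical name, and the commands passed to it stand in for whatever the real script calls.

```shell
#!/bin/bash
# Hypothetical helper: runs a command, reports PASS or FAILED for the step,
# and returns non-zero on failure so the caller can stop the script.
run_step() {
    local description="$1"
    shift
    if "$@"; then
        echo "$description: PASS"
    else
        echo "$description: FAILED"
        return 1
    fi
}
```

A precondition would then be written as `run_step "Set compute nodes to IN_SERVICE" some_command || exit 1`, where some_command is a placeholder for the step's actual command, so the later steps never run against nodes that are not ready.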

The expected results:

[root@c650f99p06 advanced]# ./allocation_metrics_test.sh
------------------------------------------------------------
Starting Advanced Allocation Metrics Bucket
------------------------------------------------------------
Creating an allocation
ALLOCATION_ID = 1
Running tests, please wait...
Allocation ID: 1 successfully deleted
------------------------------------------------------------
Completed Advanced Allocation Metrics Bucket
------------------------------------------------------------

If any field returns a value of -1, the corresponding test case reports a FAILED result.
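
The rule above amounts to a one-line comparison per field. The sketch below is illustrative only: check_metric is a hypothetical name, and the output format is a simplified assumption rather than the script's actual log format.

```shell
#!/bin/bash
# Hypothetical helper: applies the "-1 means FAILED" rule to one metric.
# Prints a result line and returns non-zero when the check fails.
check_metric() {
    local name="$1" value="$2"
    if [ "$value" = "-1" ]; then
        printf 'check %s: %s: FAILED\n' "$name" "$value"
        return 1
    fi
    printf 'check %s: %s: PASS\n' "$name" "$value"
}
```

For instance, `check_metric gpfs_read -1` prints `check gpfs_read: -1: FAILED`, while `check_metric power_cap_hit 0` prints `check power_cap_hit: 0: PASS`, since only -1 is treated as a failure here.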

Example log file result:

------------------------------------------------------------
       Starting Advanced Allocation Metrics Bucket
------------------------------------------------------------
Thu Oct  1 15:20:10 EDT 2020
------------------------------------------------------------
[2020-10-01 15:20:10.9878] Test Case 0: Set Compute Nodes to (IN_SERVICE): Calling update_computes_in_service:                                PASS
[2020-10-01 15:20:11.2635] Test Case 1: Calling csm_allocation_create with 2 isolated core(s):                                                PASS
[2020-10-01 15:21:19.8708] Test Case 2: csm_allocation_query_details on c650f99p18 allocation_id: 13 check gpfs_read: -1:                   FAILED
[2020-10-01 15:21:19.8739] Test Case 3: csm_allocation_query_details on c650f99p18 allocation_id: 13 check gpfs_write: -1:                  FAILED
[2020-10-01 15:21:19.8770] Test Case 4: csm_allocation_query_details on c650f99p18 allocation_id: 13 check energy_consumed: 89552:            PASS
[2020-10-01 15:21:19.8791] Test Case 5: csm_allocation_query_details on c650f99p18 allocation_id: 13 check power_cap_hit: 0:                  PASS
[2020-10-01 15:21:19.8812] Test Case 6: csm_allocation_query_details on c650f99p18 allocation_id: 13 check gpu_energy: 71194:                 PASS
[2020-10-01 15:21:19.8834] Test Case 7: csm_allocation_query_details on c650f99p18 allocation_id: 13 check gpu_usage: 262456471:              PASS
[2020-10-01 15:21:19.8855] Test Case 8: csm_allocation_query_details on c650f99p18 allocation_id: 13 check cpu_usage: 812087761152:           PASS
[2020-10-01 15:21:19.8875] Test Case 9: csm_allocation_query_details on c650f99p18 allocation_id: 13 check memory_usage_max: 892403712:       PASS
------------------------------------------------------------
        Advanced Allocation Metrics Bucket COMPLETED
------------------------------------------------------------
Additional Flags:

Test Case 2: csm_allocation_query_details on c650f99p18 allocation_id: 13 check gpfs_read: -1
Test Case 3: csm_allocation_query_details on c650f99p18 allocation_id: 13 check gpfs_write: -1
------------------------------------------------------------

This helps detect the integration issues described earlier in the Purpose section.
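
The Additional Flags summary in the log above can be reproduced from the test case lines themselves. A minimal sketch, assuming the log line layout shown in the example (bracketed timestamp, test case text, trailing status); list_failed_cases is a hypothetical name, not a function from the test suite.

```shell
#!/bin/bash
# Hypothetical helper: reads a test log on stdin and prints each FAILED test
# case without its timestamp prefix or trailing status column.
list_failed_cases() {
    sed -n 's/^\[[^]]*\] \(Test Case .*\):[[:space:]]*FAILED[[:space:]]*$/\1/p'
}
```

Run against the example log, for instance as `list_failed_cases < logfile` with logfile standing in for the real log path, this should print the Test Case 2 and Test Case 3 lines listed under Additional Flags.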

Open Questions and Pre-Merge TODOs