CAST can enhance the system management of cluster-wide resources. It consists of the open source tools: cluster system management (CSM) and burst buffer.
Eclipse Public License 1.0
Adding in a new test bucket in the advanced section (validate the CSM… #978
Validate the CSM allocation metrics after a job has completed.
CSM collects a variety of statistics during the life of a job and populates them into the allocation records after the job completes.
The stats can be viewed using the csm_allocation_query_details command after a job completes.
Checking these statistics is another good sanity check.
For example, when the cgroup configuration is not set up correctly,
these stats report a cpu_usage of 0, even if the job ran many compute-intensive tasks.
This testing helps catch numerous integration issues because different stats validate CSM integration with different components. In the example below,
the gpfs_read and gpfs_write values of -1 indicate that integration with GPFS data monitoring was not configured or not working when this job ran.
Values of 0 or -1 for gpu_energy, gpu_usage, and cpu_usage for long-lived GPU- or CPU-intensive jobs may also indicate an integration problem.
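The "0 or -1 may indicate a problem" heuristic for usage counters could be sketched as below. This is a minimal illustration, not code from the test bucket: the function name `suspicious_fields` and the plain-dict input are assumptions, and the values are taken from the example log in this description with cpu_usage zeroed to mimic a broken cgroup setup.

```python
# Hypothetical helper (not part of CSM): flags usage counters that look wrong
# for a long-lived GPU- or CPU-intensive job.
SUSPECT_IF_ZERO_OR_UNSET = ("gpu_energy", "gpu_usage", "cpu_usage")


def suspicious_fields(metrics):
    """Return the names of usage counters whose value is 0 or -1."""
    return [
        name
        for name in SUSPECT_IF_ZERO_OR_UNSET
        if metrics.get(name) in (0, -1)
    ]


# Assumed input shape: field name -> integer value from the allocation record.
broken_cgroup_metrics = {
    "gpu_energy": 71194,
    "gpu_usage": 262456471,
    "cpu_usage": 0,  # what a misconfigured cgroup setup would report
}
flagged = suspicious_fields(broken_cgroup_metrics)
print(flagged)
```

A zero is only a hint here: short or idle jobs can legitimately report 0, so the check is meaningful only when the job is known to have done real GPU or CPU work.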
1. Check preconditions before the allocation is created.
   Make sure the nodes are IN_SERVICE (if not, FAIL out of the script).
2. Create an allocation (if creation fails, FAIL out of the script).
3. Do some GPU work on the node.
4. Do some CPU work on the node.
5. Check the expected values from the results of the allocation:
a. gpfs_read
b. gpfs_write
c. energy_consumed
d. power_cap_hit
e. gpu_energy
f. gpu_usage
g. cpu_usage
h. memory_usage_max
If any of these fields returns a value of -1, the corresponding test case reports a FAILED result.
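The pass/fail rule above can be sketched in a few lines. This is an illustrative sketch only: the real bucket is not shown in this description, so `check_metrics`, the dict input, and the choice to treat a missing field as -1 are all assumptions. The sample values are copied from the example log below.

```python
# Fields checked by the bucket, in the order listed above.
METRIC_FIELDS = [
    "gpfs_read", "gpfs_write", "energy_consumed", "power_cap_hit",
    "gpu_energy", "gpu_usage", "cpu_usage", "memory_usage_max",
]


def check_metrics(metrics):
    """Return one (field, value, verdict) tuple per checked field.

    Hypothetical helper: a value of -1 yields FAILED, anything else PASS.
    Treating a missing field as -1 is an assumption of this sketch.
    """
    results = []
    for field in METRIC_FIELDS:
        value = metrics.get(field, -1)
        verdict = "FAILED" if value == -1 else "PASS"
        results.append((field, value, verdict))
    return results


# Values taken from the example log for allocation_id 13.
sample_metrics = {
    "gpfs_read": -1,
    "gpfs_write": -1,
    "energy_consumed": 89552,
    "power_cap_hit": 0,
    "gpu_energy": 71194,
    "gpu_usage": 262456471,
    "cpu_usage": 812087761152,
    "memory_usage_max": 892403712,
}
results = check_metrics(sample_metrics)
for field, value, verdict in results:
    print(f"check {field}: {value}: {verdict}")
```

Note that under this rule a value of 0 passes, which matches the power_cap_hit entry in the example log.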
Example log file result:
------------------------------------------------------------
Starting Advanced Allocation Metrics Bucket
------------------------------------------------------------
Thu Oct 1 15:20:10 EDT 2020
------------------------------------------------------------
[2020-10-01 15:20:10.9878] Test Case 0: Set Compute Nodes to (IN_SERVICE): Calling update_computes_in_service: PASS
[2020-10-01 15:20:11.2635] Test Case 1: Calling csm_allocation_create with 2 isolated core(s): PASS
[2020-10-01 15:21:19.8708] Test Case 2: csm_allocation_query_details on c650f99p18 allocation_id: 13 check gpfs_read: -1: FAILED
[2020-10-01 15:21:19.8739] Test Case 3: csm_allocation_query_details on c650f99p18 allocation_id: 13 check gpfs_write: -1: FAILED
[2020-10-01 15:21:19.8770] Test Case 4: csm_allocation_query_details on c650f99p18 allocation_id: 13 check energy_consumed: 89552: PASS
[2020-10-01 15:21:19.8791] Test Case 5: csm_allocation_query_details on c650f99p18 allocation_id: 13 check power_cap_hit: 0: PASS
[2020-10-01 15:21:19.8812] Test Case 6: csm_allocation_query_details on c650f99p18 allocation_id: 13 check gpu_energy: 71194: PASS
[2020-10-01 15:21:19.8834] Test Case 7: csm_allocation_query_details on c650f99p18 allocation_id: 13 check gpu_usage: 262456471: PASS
[2020-10-01 15:21:19.8855] Test Case 8: csm_allocation_query_details on c650f99p18 allocation_id: 13 check cpu_usage: 812087761152: PASS
[2020-10-01 15:21:19.8875] Test Case 9: csm_allocation_query_details on c650f99p18 allocation_id: 13 check memory_usage_max: 892403712: PASS
------------------------------------------------------------
Advanced Allocation Metrics Bucket COMPLETED
------------------------------------------------------------
Additional Flags:
Test Case 2: csm_allocation_query_details on c650f99p18 allocation_id: 13 check gpfs_read: -1
Test Case 3: csm_allocation_query_details on c650f99p18 allocation_id: 13 check gpfs_write: -1
------------------------------------------------------------
This helps detect the integration issues described in the overview above.
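For triage, the FAILED checks can be pulled out of a log like the one above with a small parser. This is a sketch under assumptions: the regular expression is inferred from the example lines in this description only, and `failed_checks` is a hypothetical helper, not part of the CAST test suite.

```python
import re

# Matches metric-check lines such as:
#   ... Test Case 2: ... allocation_id: 13 check gpfs_read: -1: FAILED
# The pattern is inferred from the example log; other buckets may differ.
CHECK_LINE = re.compile(
    r"Test Case (\d+): .*? check (\w+): (-?\d+): (PASS|FAILED)\s*$"
)


def failed_checks(log_text):
    """Return (test_case_number, field, value) for every FAILED metric check."""
    failures = []
    for line in log_text.splitlines():
        match = CHECK_LINE.search(line)
        if match and match.group(4) == "FAILED":
            failures.append(
                (int(match.group(1)), match.group(2), int(match.group(3)))
            )
    return failures


sample_log = """\
[2020-10-01 15:20:11.2635] Test Case 1: Calling csm_allocation_create with 2 isolated core(s): PASS
[2020-10-01 15:21:19.8708] Test Case 2: csm_allocation_query_details on c650f99p18 allocation_id: 13 check gpfs_read: -1: FAILED
[2020-10-01 15:21:19.8739] Test Case 3: csm_allocation_query_details on c650f99p18 allocation_id: 13 check gpfs_write: -1: FAILED
[2020-10-01 15:21:19.8770] Test Case 4: csm_allocation_query_details on c650f99p18 allocation_id: 13 check energy_consumed: 89552: PASS
"""
failures = failed_checks(sample_log)
for number, field, value in failures:
    print(f"Test Case {number}: {field} = {value}")
```

Lines without a `check <field>: <value>: <verdict>` suffix, such as the setup steps and the "Additional Flags" summary, are skipped.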
Open Questions and Pre-Merge TODOs