IBM / CAST

CAST can enhance the system management of cluster-wide resources. It consists of the open source tools: cluster system management (CSM) and burst buffer.
Eclipse Public License 1.0

Sort compute node output for csm_allocation_query and CSM/JSM launch. #987

Closed awlauria closed 2 years ago

awlauria commented 3 years ago

Currently, the jsmd on each compute node has to sort this list to make sure it is in a consistent order. Instead of sorting it on every jsmd node, sort it once in the SQL query, which is probably more efficient.

awlauria commented 3 years ago

To test this, I installed a small 4-node CSM cluster on our smpi P9 machines and ran with JSM + CSM. The JSM build I ran with has the code that sorts the csm_allocation_query output removed.

[awlauria@f8n01 src]$ sudo /opt/ibm/csm/bin/csm_allocation_create -l f8n01 -n "f8n06,f8n02,f8n04,f8n03"
[csmapi][warning]   Invalid 'primary_job_id supplied (<= 0), setting to 1.
---
allocation_id: 16
num_nodes: 4
- compute_nodes: f8n06
- compute_nodes: f8n02
- compute_nodes: f8n04
- compute_nodes: f8n03
user_name: root
user_id: 0
state: running
type: user-managed
job_submit_time: 2020-11-12 13:14:16
...
[awlauria@f8n01 src]$ export CSM_ALLOCATION_ID=16
[awlauria@f8n01 src]$ jsm &
[awlauria@f8n01 src]$ jsrun --rs_per_host 1 hostname
f8n03
f8n06
f8n02
f8n04
[awlauria@f8n01 src]$ /opt/ibm/csm/bin/csm_allocation_query -a $CSM_ALLOCATION_ID
---
allocation_id:                  16
primary_job_id:                 1
secondary_job_id:               0
num_nodes:                      4
compute_nodes:
 - f8n06
 - f8n04
 - f8n03
 - f8n02
ssd_file_system_name:           
launch_node_name:               f8n01
user_flags:                     
system_flags:                   
smt_mode:                       0
ssd_min:                        0
ssd_max:                        0
num_processors:                 0
num_gpus:                       0
projected_memory:               0
state:                          running
type:                           user-managed
job_type:                       batch
user_name:                      root
user_id:                        0
user_group_id:                  0
user_script:                    
begin_time:                     2020-11-12 13:14:16.108877
account:                        
comment:                        
job_name:                       
job_submit_time:                2020-11-12 13:14:16
queue:                          
requeue:                        
time_limit:                     0
wc_key:                         
isolated_cores:                 0
...
[awlauria@f8n01 src]$ pkill jsmd
[awlauria@f8n01 src]$ sudo /opt/ibm/csm/bin/csm_allocation_delete -a $CSM_ALLOCATION_ID
---
# Allocation ID: 16 successfully deleted
...
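For readers without a CSM cluster at hand, the ordering seen in the csm_allocation_query output above can be reproduced with a minimal sketch. It assumes an in-memory SQLite database as a stand-in for the CSM database; the table name `allocation_node` and the helper `query_nodes_sorted` are hypothetical and are not taken from the CSM schema or API.

```python
import sqlite3

def query_nodes_sorted(allocation_id, rows):
    """Return node names for one allocation, sorted by the database.

    Hypothetical stand-in: an in-memory SQLite table, not the real CSM schema.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE allocation_node (allocation_id INTEGER, node_name TEXT)"
    )
    conn.executemany("INSERT INTO allocation_node VALUES (?, ?)", rows)
    # Let the query do the sorting so every caller sees the same order.
    cur = conn.execute(
        "SELECT node_name FROM allocation_node "
        "WHERE allocation_id = ? ORDER BY node_name DESC",
        (allocation_id,),
    )
    names = [row[0] for row in cur.fetchall()]
    conn.close()
    return names

# Same node list, in the same order, as passed to csm_allocation_create above.
rows = [(16, name) for name in ["f8n06", "f8n02", "f8n04", "f8n03"]]
print(query_nodes_sorted(16, rows))  # ['f8n06', 'f8n04', 'f8n03', 'f8n02']
```

Sorting once in the query means each consumer receives an identical list without repeating the sort locally.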
tgooding commented 3 years ago

I thought JSM did an alphanumeric sort due to a client's preference on hostname sorting. I'm not 100% sure, but I'd think "ORDER BY node_name DESC" would do an alphabetic sort instead, so it might not be the same thing.
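The difference between the two orderings can be illustrated with a short sketch. This is one common way to do an alphanumeric ("natural") sort and is not the actual JSM or CSM code; the function name `natural_key` and the sample hostnames are made up for illustration.

```python
import re

def natural_key(hostname):
    """Build a sort key that compares digit runs as integers.

    re.split with a capturing group returns text at even indexes and the
    captured digit runs at odd indexes, so the key elements line up by type.
    """
    parts = re.split(r"([0-9]+)", hostname)
    return [int(p) if i % 2 else p for i, p in enumerate(parts)]

# Hypothetical hostnames with unpadded node numbers.
hosts = ["f8n10", "f8n2", "f8n1", "f8n03"]

# Plain string sort compares character by character, so "f8n10" < "f8n2".
print(sorted(hosts))                   # ['f8n03', 'f8n1', 'f8n10', 'f8n2']

# Natural sort compares the numeric parts as numbers.
print(sorted(hosts, key=natural_key))  # ['f8n1', 'f8n2', 'f8n03', 'f8n10']
```

With zero-padded names of equal length, such as the f8n02 to f8n06 nodes in the test above, both orderings agree, so the difference only shows up once the numeric parts differ in width.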

awlauria commented 3 years ago

@tgooding you are correct. From what I found while googling, sorting this properly via an ORDER BY query is tricky and error-prone. So instead I moved the sorting logic that JSM uses into CSM.

This change was done in two places:

awlauria commented 2 years ago

Closing. This is a minor cleanup issue that isn't needed.