Closed by awlauria 2 years ago
To test this, I installed a small 4-node csm cluster on our smpi P9 machines and ran with jsm + csm. The jsm build I ran with has the code that sorts the csm_allocation_query output removed.
[awlauria@f8n01 src]$ sudo /opt/ibm/csm/bin/csm_allocation_create -l f8n01 -n "f8n06,f8n02,f8n04,f8n03"
[csmapi][warning] Invalid 'primary_job_id supplied (<= 0), setting to 1.
---
allocation_id: 16
num_nodes: 4
- compute_nodes: f8n06
- compute_nodes: f8n02
- compute_nodes: f8n04
- compute_nodes: f8n03
user_name: root
user_id: 0
state: running
type: user-managed
job_submit_time: 2020-11-12 13:14:16
...
[awlauria@f8n01 src]$ export CSM_ALLOCATION_ID=16
[awlauria@f8n01 src]$ jsm &
[awlauria@f8n01 src]$ jsrun --rs_per_host 1 hostname
f8n03
f8n06
f8n02
f8n04
[awlauria@f8n01 src]$ /opt/ibm/csm/bin/csm_allocation_query -a $CSM_ALLOCATION_ID
---
allocation_id: 16
primary_job_id: 1
secondary_job_id: 0
num_nodes: 4
compute_nodes:
- f8n06
- f8n04
- f8n03
- f8n02
ssd_file_system_name:
launch_node_name: f8n01
user_flags:
system_flags:
smt_mode: 0
ssd_min: 0
ssd_max: 0
num_processors: 0
num_gpus: 0
projected_memory: 0
state: running
type: user-managed
job_type: batch
user_name: root
user_id: 0
user_group_id: 0
user_script:
begin_time: 2020-11-12 13:14:16.108877
account:
comment:
job_name:
job_submit_time: 2020-11-12 13:14:16
queue:
requeue:
time_limit: 0
wc_key:
isolated_cores: 0
...
[awlauria@f8n01 src]$ pkill jsmd
[awlauria@f8n01 src]$ sudo /opt/ibm/csm/bin/csm_allocation_delete -a $CSM_ALLOCATION_ID
---
# Allocation ID: 16 successfully deleted
...
I thought JSM did an alphanumeric sort due to a client's preference on hostname sorting. I'm not 100% sure, but I'd think "ORDER BY node_name DESC" would do an alphabetic sort instead, so it might not be the same thing.
@tgooding you are correct. From what I found while searching, getting a proper alphanumeric sort out of an ORDER BY query is tricky and error-prone. So instead I pushed the sorting logic that jsm uses up into CSM.
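To make the alphabetic vs. alphanumeric distinction concrete, here is a minimal Python sketch. It is not the actual jsm or CSM sorting code; `natural_key` is a hypothetical helper, and the unpadded hostnames are made up to expose the difference. It assumes "alphanumeric" means that runs of digits compare as numbers.

```python
import re

def natural_key(hostname):
    """Split a hostname into text and integer chunks so digit runs compare numerically."""
    # re.split with a capturing group alternates text, digits, text, ...
    # so chunks at the same position always have the same type.
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", hostname)]

# Hypothetical hostnames without zero padding.
nodes = ["f8n10", "f8n2", "f8n1"]

alphabetic = sorted(nodes)                    # character-by-character comparison
alphanumeric = sorted(nodes, key=natural_key)  # digit runs compared as integers

print(alphabetic)    # ['f8n1', 'f8n10', 'f8n2']
print(alphanumeric)  # ['f8n1', 'f8n2', 'f8n10']
```

With zero-padded names such as f8n02 ... f8n06, as on the test cluster, both orderings coincide, so the difference only shows up on clusters whose hostnames are not padded to a fixed width.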
This change was done in two places:
1. Sort the compute_nodes before returning from csm_allocation_query(). This provides no real benefit to csm/jsm launching: the root jsmd previously sorted this list itself, so the compute nodes already received it sorted. This is a straight logic push from jsm to csm.
2. Sort the compute nodes before the multicast in the CSM launch of the jsmds. This does provide some benefit: the hostnames are now sorted before the multicast, so each compute jsmd no longer needs to sort the list of compute nodes itself.
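The second point can be illustrated with a small Python sketch: sort once at the sender, then deliver the same ordered list to every receiver. This is only an illustration under stated assumptions. `sort_then_multicast` and `rank_of` are hypothetical names, the multicast is modeled as a plain dictionary, and deriving an index from the list is just one example of why a consistent ordering matters.

```python
def sort_then_multicast(compute_nodes):
    """Sort once at the sender and hand every receiver the same ordered list."""
    ordered = sorted(compute_nodes)  # plain sort suffices for zero-padded names
    # Stand-in for the multicast: each node gets its own copy of the sorted list.
    return {node: list(ordered) for node in ordered}

def rank_of(node, received):
    """A receiver derives its index from the list as delivered, with no local sort."""
    return received.index(node)

# Node order as returned by csm_allocation_query in the transcript above.
queried = ["f8n06", "f8n04", "f8n03", "f8n02"]
delivered = sort_then_multicast(queried)
ranks = {node: rank_of(node, received) for node, received in delivered.items()}

print(ranks)  # {'f8n02': 0, 'f8n03': 1, 'f8n04': 2, 'f8n06': 3}
```

The sort runs once instead of once per compute node, and every receiver sees an identical ordering by construction.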
Closing. This is a minor cleanup issue that isn't needed.
Currently, each compute node jsmd has to sort this list to make sure it is in a consistent order. Instead of sorting it on each jsmd node, sort it in the SQL query, which is probably more efficient.