ryanday36 opened 17 hours ago
-N might be flipping which node gets the unused core. Example: 2 nodes with 4 cores each:
ƒ(s=2,d=1) garlick@picl3:~$ flux run -n7 hostname
picl4
picl4
picl4
picl4
picl3
picl3
picl3
ƒ(s=2,d=1) garlick@picl3:~$ flux run -N2 -n7 hostname
picl3
picl3
picl3
picl3
picl4
picl4
picl4
It might be worth doing a little audit here to see if anything stands out with these layouts in mind.
I think @garlick meant to put this comment here:
Have to leave right now, but one thing that seems wrong is that flux job taskmap shows the same map for both of those cases:
[[0,1,4,1],[1,1,3,1]]
That may be a flux-core bug. Will circle back to this later!
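For reference, a taskmap is (as I read RFC 34) a list of [nodeid, nnodes, ppn, repeat] blocks. A quick decoder sketch shows that [[0,1,4,1],[1,1,3,1]] expands to 4 tasks on node 0 and 3 on node 1, so the map itself says nothing about which hostname is node 0:

```python
def decode_taskmap(taskmap):
    """Expand an RFC 34-style task map into a per-task node-id list.

    Each block is [nodeid, nnodes, ppn, repeat]: starting at nodeid,
    assign ppn consecutive tasks to each of nnodes nodes, repeating
    the block repeat times.  (Sketch based on my reading of RFC 34.)
    """
    nodes = []
    for nodeid, nnodes, ppn, repeat in taskmap:
        for _ in range(repeat):
            for n in range(nodeid, nodeid + nnodes):
                nodes.extend([n] * ppn)
    return nodes

print(decode_taskmap([[0, 1, 4, 1], [1, 1, 3, 1]]))
# [0, 0, 0, 0, 1, 1, 1]: tasks 0-3 on node 0, tasks 4-6 on node 1
```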
I wonder if the two jobs have the same R? I'll try to reproduce this.
yes sorry!
Hm, this is interesting (did we know this and just forgot?)
$ flux run -N2 -n 7 /bin/true
$ flux job info $(flux job last) R
{"version": 1, "execution": {"R_lite": [{"rank": "0-1", "children": {"core": "0-3"}}], "starttime": 1727383096.7284338, "expiration": 0.0, "nodelist": ["corona[82,82]"]}}
The -N2 -n7 case allocates all 4 cores on both ranks, while -n7 alone allocates just the 7 requested cores:
$ flux run -n 7 /bin/true
$ flux job info $(flux job last) R
{"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0-3"}}, {"rank": "1", "children": {"core": "0-2"}}], "starttime": 1727383280.7969263, "expiration": 0.0, "nodelist": ["corona[82,82]"]}}
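Tallying cores from those two R_lite fragments makes the difference concrete: the -N2 -n7 R covers 8 cores, the -n7 R only 7. A small sketch (with a toy idset expander that handles only the simple "N" / "N-M" forms seen above, not full RFC 22):

```python
def expand(idset):
    """Expand a simple idset string like "0-3" or "5" into a list of ints.

    Toy version: handles only comma-separated plain ids and ranges.
    """
    ids = []
    for part in idset.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            ids.extend(range(int(lo), int(hi) + 1))
        else:
            ids.append(int(part))
    return ids

def cores_per_rank(R):
    """Map rank -> number of allocated cores from an R_lite list."""
    counts = {}
    for entry in R["execution"]["R_lite"]:
        ncores = len(expand(entry["children"]["core"]))
        for rank in expand(entry["rank"]):
            counts[rank] = ncores
    return counts

R_N2_n7 = {"execution": {"R_lite": [
    {"rank": "0-1", "children": {"core": "0-3"}}]}}
R_n7 = {"execution": {"R_lite": [
    {"rank": "0", "children": {"core": "0-3"}},
    {"rank": "1", "children": {"core": "0-2"}}]}}

print(cores_per_rank(R_N2_n7))  # {0: 4, 1: 4} -> 8 cores for 7 tasks
print(cores_per_rank(R_n7))     # {0: 4, 1: 3} -> exactly 7 cores
```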
This seems to be explicit in the jobspec created by the first case:
$ flux run -N2 -n7 --dry-run hostname | jq .resources
[
{
"type": "node",
"count": 2,
"with": [
{
"type": "slot",
"count": 4,
"with": [
{
"type": "core",
"count": 1
}
],
"label": "task"
}
]
}
]
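Multiplying the counts down that resource tree shows the over-request explicitly: 2 nodes x 4 slots x 1 core = 8 cores for a 7-task job. A sketch of that walk (hypothetical helper, not a flux-core API):

```python
def total_cores(resources):
    """Walk a jobspec resource list, multiplying counts down to cores."""
    total = 0
    for r in resources:
        if r["type"] == "core":
            total += r["count"]
        else:
            total += r["count"] * total_cores(r.get("with", []))
    return total

jobspec_resources = [
    {"type": "node", "count": 2, "with": [
        {"type": "slot", "count": 4, "label": "task", "with": [
            {"type": "core", "count": 1}]}]}]

print(total_cores(jobspec_resources))  # 8
```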
There is even a comment in the code:
if num_nodes is not None:
    num_slots = int(math.ceil(num_tasks / float(num_nodes)))
    if num_tasks % num_nodes != 0:
        # N.B. uneven distribution results in wasted task slots
        task_count_dict = {"total": num_tasks}
    else:
        task_count_dict = {"per_slot": 1}
    slot = cls._create_slot("task", num_slots, children)
    resource_section = cls._create_resource(
        "node", num_nodes, [slot], exclusive
    )
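Plugging the -N2 -n7 case into that arithmetic shows where the wasted slot comes from:

```python
import math

num_tasks, num_nodes = 7, 2
num_slots = math.ceil(num_tasks / num_nodes)  # ceil(7/2) = 4 slots per node
total_slots = num_slots * num_nodes           # 2 * 4 = 8 slots
wasted = total_slots - num_tasks              # 1 slot (core) goes unused

print(num_slots, total_slots, wasted)  # 4 8 1
```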
Anyway, maybe the extra task slot is confusing the taskmap stuff into running the wrong number of tasks on one of the nodes?
I think the taskmaps are actually correct and I was confused. Fluxion is packing 4 ranks onto the first node in both cases, and 3 on the second, but for some reason when -N is specified, the order of nodes is reversed.
as described in https://rzlc.llnl.gov/jira/browse/ELCAP-705:

(these are all run with -o mpibind=off, fwiw)

In a two node allocation (flux alloc -N2), running flux run -n190 ... puts 96 tasks on one node and 94 on the other and hangs until I ctrl-c. If I run with flux run -N2 -n190 ..., flux puts 95 tasks on each node and things run fine (if slowly). If I use flux's pmi2 (-o pmi=pmi2) instead of whatever cray mpi is using by default, the original case runs fine.

I did some good old fashioned printf debugging, and it looks like the hang is in MPI_Init, but I haven't gotten any deeper than that. I suspect that this is an HPE issue, but I'm opening it here too in case you all have any insight. The bit that seems extra confusing is that flux run -n191 ... hangs, but flux run -N2 -n191 ... doesn't. Both of those should have 96 tasks on one node and 95 on the other, so that doesn't fit super well with my characterization of this as an issue with unbalanced ranks / node.
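Assuming 96-core nodes (implied by the 96/94 split quoted above), the two distribution rules can be compared directly. The -n191 case is the puzzle: packing and even spreading produce the same 96/95 split, yet only the packed (-n alone) variant hangs:

```python
import math

CORES_PER_NODE = 96  # assumption inferred from the 96/94 split in the report

def pack(ntasks):
    # fill the first node completely, remainder on the second
    # (what -n alone appears to do here)
    first = min(ntasks, CORES_PER_NODE)
    return [first, ntasks - first]

def spread(ntasks, nnodes=2):
    # even distribution (what -N2 requests): ceil(n/nnodes) on the first node
    first = math.ceil(ntasks / nnodes)
    return [first, ntasks - first]

print(pack(190), spread(190))  # [96, 94] [95, 95] -> splits differ
print(pack(191), spread(191))  # [96, 95] [96, 95] -> same split either way
```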