flux-framework / flux-coral2

Plugins and services for Flux on CORAL2 systems
GNU Lesser General Public License v3.0

hang in MPI_Init with unbalanced ranks #222

Open ryanday36 opened 17 hours ago

ryanday36 commented 17 hours ago

as described in https://rzlc.llnl.gov/jira/browse/ELCAP-705:

(these are all run with -o mpibind=off, fwiw)

In a two-node allocation (flux alloc -N2), running flux run -n190 ... puts 96 tasks on one node and 94 on the other, and the job hangs until I ctrl-c.

If I run with flux run -N2 -n190 ..., flux puts 95 tasks on each node and things run fine (if slowly).

If I use flux's pmi2 (-o pmi=pmi2) instead of whatever cray mpi is using by default, the original case runs fine.

I did some good old fashioned printf debugging, and it looks like the hang is in MPI_Init, but I haven't gotten any deeper than that. I suspect that this is an HPE issue, but I'm opening it here too in case you all have any insight. The bit that seems extra confusing is that flux run -n191 ... hangs, but flux run -N2 -n191 ... doesn't. Both of those should put 96 tasks on one node and 95 on the other, so that doesn't fit super well with my characterization of this as an issue with unbalanced ranks per node.
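The task counts above can be reproduced with a little arithmetic. This is just a sketch of the two placement behaviors, not Fluxion's actual code; it assumes 96 usable cores per node, which is what the 96/94 split suggests. Note that for 191 tasks both strategies give 96/95, matching the observation that the imbalance alone can't be the whole story:

```python
import math

def block_distribute(ntasks, cores_per_node):
    """Without -N: fill each node up to its core count before
    moving to the next (sketch of the observed behavior)."""
    counts = []
    remaining = ntasks
    for cores in cores_per_node:
        n = min(cores, remaining)
        counts.append(n)
        remaining -= n
    return counts

def balanced_distribute(ntasks, nnodes):
    """With -N: request ceil(ntasks/nnodes) slots per node, so
    tasks come out (nearly) balanced."""
    per_node = math.ceil(ntasks / nnodes)
    counts = []
    remaining = ntasks
    for _ in range(nnodes):
        n = min(per_node, remaining)
        counts.append(n)
        remaining -= n
    return counts

print(block_distribute(190, [96, 96]))  # [96, 94] -- flux run -n190
print(balanced_distribute(190, 2))      # [95, 95] -- flux run -N2 -n190
print(block_distribute(191, [96, 96]))  # [96, 95] -- both -n191 cases
print(balanced_distribute(191, 2))      # [96, 95]
```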

garlick commented 16 hours ago

-N might be flipping where the unused core is located. Example: 2 nodes with 4 cores each

ƒ(s=2,d=1) garlick@picl3:~$ flux run -n7 hostname
picl4
picl4
picl4
picl4
picl3
picl3
picl3
ƒ(s=2,d=1) garlick@picl3:~$ flux run -N2 -n7 hostname
picl3
picl3
picl3
picl3
picl4
picl4
picl4

It might be worth doing a little audit here to see if anything stands out with these layouts in mind.

grondo commented 14 hours ago

I think @garlick meant to put this comment here:

Have to leave right now but one thing that seems wrong is flux job taskmap shows the same map for both of those cases.

[[0,1,4,1],[1,1,3,1]]

That may be a flux-core bug. Will circle back to this later!

I wonder if the two jobs have the same R? I'll try to reproduce this.

garlick commented 14 hours ago

yes sorry!

grondo commented 14 hours ago

Hm, this is interesting (did we know this and just forgot?)

$ flux run -N2 -n 7 /bin/true
$ flux job info $(flux job last) R
{"version": 1, "execution": {"R_lite": [{"rank": "0-1", "children": {"core": "0-3"}}], "starttime": 1727383096.7284338, "expiration": 0.0, "nodelist": ["corona[82,82]"]}}

The -N2 -n7 case allocates all 4 cores on both ranks, while -n7 alone allocates just the 7 requested cores:

$ flux run -n 7 /bin/true
$ flux job info $(flux job last) R
{"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0-3"}}, {"rank": "1", "children": {"core": "0-2"}}], "starttime": 1727383280.7969263, "expiration": 0.0, "nodelist": ["corona[82,82]"]}}

This seems to be explicit in the jobspec created by the first case:

$ flux run -N2 -n7 --dry-run hostname | jq .resources
[
  {
    "type": "node",
    "count": 2,
    "with": [
      {
        "type": "slot",
        "count": 4,
        "with": [
          {
            "type": "core",
            "count": 1
          }
        ],
        "label": "task"
      }
    ]
  }
]

There is even a comment in the code:

https://github.com/flux-framework/flux-core/blob/e3f293b4bb8f34da55b7e9253da1de18e2f93aef/src/bindings/python/flux/job/Jobspec.py#L886-L891

        if num_nodes is not None:
            num_slots = int(math.ceil(num_tasks / float(num_nodes)))
            if num_tasks % num_nodes != 0:
                # N.B. uneven distribution results in wasted task slots
                task_count_dict = {"total": num_tasks}
            else:
                task_count_dict = {"per_slot": 1}
            slot = cls._create_slot("task", num_slots, children)
            resource_section = cls._create_resource(
                "node", num_nodes, [slot], exclusive
            )
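The effect of that branch can be sketched in isolation. This mirrors the arithmetic in the quoted snippet (the helper name and return shape are mine, for illustration): when ntasks doesn't divide evenly by nnodes, the jobspec asks for ceil(ntasks/nnodes) slots on every node and falls back to a "total" task count, leaving idle slots:

```python
import math

def slots_for(num_tasks, num_nodes):
    """Sketch of the quoted Jobspec logic: slots per node, the task
    count dict the jobspec would carry, and how many slots go unused."""
    num_slots = math.ceil(num_tasks / num_nodes)
    if num_tasks % num_nodes != 0:
        count = {"total": num_tasks}  # uneven: wasted task slots
    else:
        count = {"per_slot": 1}
    wasted = num_slots * num_nodes - num_tasks
    return num_slots, count, wasted

print(slots_for(7, 2))    # (4, {'total': 7}, 1) -- one idle slot
print(slots_for(190, 2))  # (95, {'per_slot': 1}, 0)
```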
grondo commented 14 hours ago

Anyway, maybe the extra task slot is confusing the taskmap stuff into running the wrong number of tasks on one of the nodes?

garlick commented 13 hours ago

I think the taskmaps are actually correct and I was confused. Fluxion is packing 4 ranks onto the first node in both cases, and 3 on the second, but for some reason when -N is specified, the order of nodes is reversed.