flux-framework / flux-coral2

Plugins and services for Flux on CORAL2 systems
GNU Lesser General Public License v3.0

hang in MPI_Init with unbalanced ranks #222

Open ryanday36 opened 17 hours ago

ryanday36 commented 17 hours ago

as described in https://rzlc.llnl.gov/jira/browse/ELCAP-705:

(these are all run with -o mpibind=off, fwiw)

In a two-node allocation (flux alloc -N2), running flux run -n190 ... puts 96 tasks on one node and 94 on the other, and the job hangs until I ctrl-c.

If I run with flux run -N2 -n190 ..., flux puts 95 tasks on each node and things run fine (if slowly).

If I use flux's pmi2 (-o pmi=pmi2) instead of whatever cray mpi is using by default, the original case runs fine.

I did some good old fashioned printf debugging, and it looks like the hang is in MPI_Init, but I haven't gotten any deeper than that. I suspect that this is an HPE issue, but I'm opening it here too in case you all have any insight. The bit that seems extra confusing is that flux run -n191 ... hangs, but flux run -N2 -n191 ... doesn't. Both of those should put 96 tasks on one node and 95 on the other, so that doesn't fit super well with my characterization of this as an issue with unbalanced ranks per node.
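The task counts above can be reproduced with a little arithmetic. This is just a sketch of the two placement behaviors, not Fluxion's actual code; it assumes 96 usable cores per node, which is what the 96/94 split suggests. Note that for 191 tasks both strategies give 96/95, matching the observation that the imbalance alone can't be the whole story:

```python
import math

def block_distribute(ntasks, cores_per_node):
    """Without -N: fill each node up to its core count before
    moving to the next (sketch of the observed behavior)."""
    counts = []
    remaining = ntasks
    for cores in cores_per_node:
        n = min(cores, remaining)
        counts.append(n)
        remaining -= n
    return counts

def balanced_distribute(ntasks, nnodes):
    """With -N: request ceil(ntasks/nnodes) slots per node, so
    tasks come out (nearly) balanced."""
    per_node = math.ceil(ntasks / nnodes)
    counts = []
    remaining = ntasks
    for _ in range(nnodes):
        n = min(per_node, remaining)
        counts.append(n)
        remaining -= n
    return counts

print(block_distribute(190, [96, 96]))  # [96, 94] -- flux run -n190
print(balanced_distribute(190, 2))      # [95, 95] -- flux run -N2 -n190
print(block_distribute(191, [96, 96]))  # [96, 95] -- both -n191 cases
print(balanced_distribute(191, 2))      # [96, 95]
```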

garlick commented 16 hours ago

-N might be flipping where the unused core is located. Example: 2 nodes with 4 cores each

ƒ(s=2,d=1) garlick@picl3:~$ flux run -n7 hostname
picl4
picl4
picl4
picl4
picl3
picl3
picl3
ƒ(s=2,d=1) garlick@picl3:~$ flux run -N2 -n7 hostname
picl3
picl3
picl3
picl3
picl4
picl4
picl4

It might be worth doing a little audit here to see if anything stands out with these layouts in mind.

grondo commented 14 hours ago

I think @garlick meant to put this comment here:

Have to leave right now but one thing that seems wrong is flux job taskmap shows the same map for both of those cases.

[[0,1,4,1],[1,1,3,1]]

That may be a flux-core bug. Will circle back to this later!

I wonder if the two jobs have the same R? I'll try to reproduce this.

garlick commented 14 hours ago

yes sorry!

grondo commented 14 hours ago

Hm, this is interesting (did we know this and just forgot?)

$ flux run -N2 -n 7 /bin/true
$ flux job info $(flux job last) R
{"version": 1, "execution": {"R_lite": [{"rank": "0-1", "children": {"core": "0-3"}}], "starttime": 1727383096.7284338, "expiration": 0.0, "nodelist": ["corona[82,82]"]}}

The -N2 -n7 case allocates all 4 cores on both ranks, while -n7 alone allocates just the 7 requested cores:

$ flux run -n 7 /bin/true
$ flux job info $(flux job last) R
{"version": 1, "execution": {"R_lite": [{"rank": "0", "children": {"core": "0-3"}}, {"rank": "1", "children": {"core": "0-2"}}], "starttime": 1727383280.7969263, "expiration": 0.0, "nodelist": ["corona[82,82]"]}}

This seems to be explicit in the jobspec created by the first case:

$ flux run -N2 -n7 --dry-run hostname | jq .resources
[
  {
    "type": "node",
    "count": 2,
    "with": [
      {
        "type": "slot",
        "count": 4,
        "with": [
          {
            "type": "core",
            "count": 1
          }
        ],
        "label": "task"
      }
    ]
  }
]

There is even a comment in the code:

https://github.com/flux-framework/flux-core/blob/e3f293b4bb8f34da55b7e9253da1de18e2f93aef/src/bindings/python/flux/job/Jobspec.py#L886-L891

        if num_nodes is not None:
            num_slots = int(math.ceil(num_tasks / float(num_nodes)))
            if num_tasks % num_nodes != 0:
                # N.B. uneven distribution results in wasted task slots
                task_count_dict = {"total": num_tasks}
            else:
                task_count_dict = {"per_slot": 1}
            slot = cls._create_slot("task", num_slots, children)
            resource_section = cls._create_resource(
                "node", num_nodes, [slot], exclusive
            )
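The effect of that branch can be sketched in isolation. This mirrors the arithmetic in the quoted snippet (the helper name and return shape are mine, for illustration): when ntasks doesn't divide evenly by nnodes, the jobspec asks for ceil(ntasks/nnodes) slots on every node and falls back to a "total" task count, leaving idle slots:

```python
import math

def slots_for(num_tasks, num_nodes):
    """Sketch of the quoted Jobspec logic: slots per node, the task
    count dict the jobspec would carry, and how many slots go unused."""
    num_slots = math.ceil(num_tasks / num_nodes)
    if num_tasks % num_nodes != 0:
        count = {"total": num_tasks}  # uneven: wasted task slots
    else:
        count = {"per_slot": 1}
    wasted = num_slots * num_nodes - num_tasks
    return num_slots, count, wasted

print(slots_for(7, 2))    # (4, {'total': 7}, 1) -- one idle slot
print(slots_for(190, 2))  # (95, {'per_slot': 1}, 0)
```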
grondo commented 14 hours ago

Anyway, maybe the extra task slot is confusing the taskmap stuff into running the wrong number of tasks on one of the nodes?

garlick commented 13 hours ago

I think the taskmaps are actually correct and I was confused. Fluxion is packing 4 ranks onto the first node in both cases, and 3 on the second, but for some reason when -N is specified, the order of nodes is reversed.