GlobalArrays / ga

Partitioned Global Address Space (PGAS) library for distributed arrays
http://hpc.pnl.gov/globalarrays/

Bad Spack builds if using ARMCI_NETWORK=MPI-PR #183

Closed mattchan-tencent closed 3 months ago

mattchan-tencent commented 3 years ago

Moving https://github.com/nwchemgit/nwchem/issues/269 here.

In summary, specifying the env ARMCI_NETWORK=MPI-PR when building makes multi-node runs fail. Single node multiprocessor runs are fine.

edoapra commented 3 years ago

You are using the argument -np 13 for mpirun. It is a bit strange to use an odd prime number when multiple nodes are involved. Could you explain this? How many nodes are you using in your run, and how many processors per node?

mattchan-tencent commented 3 years ago

I'm using 2 nodes, 8 cores each. I read that MPI-PR needs n+1 processes, so I just took out a few. I just confirmed it also happens with -np 14.

edoapra commented 3 years ago

MPI-PR subtracts one process on each node. Therefore, if you use -np 16 on two nodes with 8 processors/node, you end up with 7 compute processes on each node. Could you try adding the -N 7 option so that you are using only 7 processors per node? I am also not quite sure what the effect of the --allow-run-as-root option is, since it is not recommended by the OpenMPI developers. Could you please try the following mpirun command and see if it makes any difference?

mpirun -np 14 -N 7 --hostfile hostfile nwchem c240_631gs.nw

By the way, are you sure that the same nwchem executable is used on both nodes?

mattchan-tencent commented 3 years ago

Sorry for the delay. I spent the day rebuilding the packages from scratch on a non-privileged account. I'm running on some cloud instances, so the security implications of running as root aren't a high priority atm. It gives the same error whether running as root or not.

Ah. So it handles the N-1 automatically. Thanks, that's good to know.

Yes, the same executable is used assuming Spack is not broken. The filesystem is not shared so I issue identical builds on each node and then verify that the Spack hashes match (although the sha256sum hashes don't match). In theory this should ensure NWChem and all its dependencies are identical. I can run this down with the Spack devs if you think it's potentially an issue.

Edit: Also I forgot to report that the updated command with -N 7 and -np 14 doesn't change the error message.

edoapra commented 3 years ago

> Yes, the same executable is used assuming Spack is not broken. The filesystem is not shared so I issue identical builds on each node and then verify that the Spack hashes match (although the sha256sum hashes don't match). In theory this should ensure NWChem and all its dependencies are identical. I can run this down with the Spack devs if you think it's potentially an issue.

I have two dumb questions:

  1. Are the two nodes homogeneous from both the HW and SW points of view? What I am trying to ask is: do they have the same hardware, operating system, etc.?
  2. Since the nwchem binaries do differ, could you copy the node 1 binary to node 2? Before running NWChem, I would check the ldd output to see if anything suspicious shows up.

mattchan-tencent commented 3 years ago

Yep, the nodes are identical in hardware and software.

I copied the binary over and the sha matches now, but I still got the same error.

edoapra commented 3 years ago

I have no clue about what is going on.

The only thing you could try is to build just Global Arrays on its own and check whether the GA test codes work or not.

mattchan-tencent commented 3 years ago

Sounds good. I'm working on it now. I'm not sure why the Spack NWChem package builds GA in-tree as opposed to as a dependency (the NWChem code is complicated enough as is!). Is there anything to be aware of when building it separately? e.g. are there NWChem-specific patches? Also, are there any tests worth focusing on first, or tests that aren't worth running? I'm just using make checkprogs and make check MPIEXEC='mpiexec -np 4' right now.

I'll report back with results when I get them. The tests are quite slow and I'm fairly new to writing Spack recipes.

mattchan-tencent commented 3 years ago

So I've got.... some results.

I built GA with MPI3 (as per Jeff's suggestion) and MPICH 3.3.2, and a few of the tests failed. There's a hang on the testmult.x test, so I never got further than that. I haven't tried other permutations (e.g. MPI-PR and OpenMPI) yet.

The hung test is stuck with 3 of the 4 processes polling (according to strace). The stack trace of the 4th process, which is holding the lock, is as follows:

$ ptrace 24539

#0  0x00007faf88809063 in MPIDI_CH3I_SendNoncontig () from /home/cloud/spack/opt/spack/linux-tlinux2-x86_64/gcc-10.2.0/mpich-3.3.2-skhpv7jel3lmvjflvyq5gldiknkljsac/lib/libmpi.so.12
#1  0x00007faf887cfe2c in issue_from_origin_buffer () from /home/cloud/spack/opt/spack/linux-tlinux2-x86_64/gcc-10.2.0/mpich-3.3.2-skhpv7jel3lmvjflvyq5gldiknkljsac/lib/libmpi.so.12
#2  0x00007faf887d0f19 in issue_ops_target () from /home/cloud/spack/opt/spack/linux-tlinux2-x86_64/gcc-10.2.0/mpich-3.3.2-skhpv7jel3lmvjflvyq5gldiknkljsac/lib/libmpi.so.12
#3  0x00007faf887d4adc in MPIDI_CH3I_RMA_Make_progress_global () from /home/cloud/spack/opt/spack/linux-tlinux2-x86_64/gcc-10.2.0/mpich-3.3.2-skhpv7jel3lmvjflvyq5gldiknkljsac/lib/libmpi.so.12
#4  0x00007faf88804ce1 in MPIDI_CH3I_Progress () from /home/cloud/spack/opt/spack/linux-tlinux2-x86_64/gcc-10.2.0/mpich-3.3.2-skhpv7jel3lmvjflvyq5gldiknkljsac/lib/libmpi.so.12
#5  0x00007faf8870a723 in MPIR_Wait_impl () from /home/cloud/spack/opt/spack/linux-tlinux2-x86_64/gcc-10.2.0/mpich-3.3.2-skhpv7jel3lmvjflvyq5gldiknkljsac/lib/libmpi.so.12
#6  0x00007faf8870a88e in MPIR_Wait () from /home/cloud/spack/opt/spack/linux-tlinux2-x86_64/gcc-10.2.0/mpich-3.3.2-skhpv7jel3lmvjflvyq5gldiknkljsac/lib/libmpi.so.12
#7  0x00007faf8870aef2 in PMPI_Wait () from /home/cloud/spack/opt/spack/linux-tlinux2-x86_64/gcc-10.2.0/mpich-3.3.2-skhpv7jel3lmvjflvyq5gldiknkljsac/lib/libmpi.so.12
#8  0x00000000004f9442 in comex_putv (iov=iov@entry=0x3004d80, iov_len=iov_len@entry=1, proc=982937388, proc@entry=2, group=group@entry=0) at src-mpi3/comex.c:1392
#9  0x00000000004f5747 in PARMCI_PutV (darr=0x7ffc3a967190, len=1, proc=2) at src-armci/armci.c:626
#10 0x00000000004998a8 in gai_gatscat_new ()
#11 0x0000000000499d98 in wnga_scatter ()
#12 0x000000000045a118 in wnga_copy_patch ()
#13 0x0000000000413fc5 in NGA_Copy_patch ()
#14 0x000000000040411d in test ()
#15 0x00000000004044ca in do_work ()
#16 0x00000000004045f9 in main ()

It seems like some MPI problems are cropping up...

I've included their logs here:

ma/testf: testf.log

global/testing/patch: patch.log

global/testing/pg2test: pg2test.log

global/examples/lennard-jones/lennard: lennard.log

global/testing/testmult : testmult.log-t.log

Env: spack-build-env.txt

stdout: spack-build-out.txt

edoapra commented 3 years ago

Have you tried running the MPI tests?

mattchan-tencent commented 3 years ago

Do you mean the ARMCI-MPI tests?

edoapra commented 3 years ago

No, the tests that come with MPICH or OpenMPI.

fancer commented 2 years ago

In case anybody else runs into this problem: in my case it was caused by the gethostid() function returning the same value on different cluster nodes.

Node n02p109

-sh-4.2$ hostname
n02p109
-sh-4.2$ hostid
61303235
-sh-4.2$ cat /etc/hostid
520a2d01

Node n02p110

-sh-4.2$ hostname
n02p110
-sh-4.2$ hostid
61303235
-sh-4.2$ cat /etc/hostid
520a2e01

As you can see, even though /etc/hostid contains different data on the two nodes, the number returned by gethostid() is the same. That happens due to the specifics of the function's implementation: https://code.woboq.org/userspace/glibc/sysdeps/unix/sysv/linux/gethostid.c.html gethostid() reads only the first four bytes of the data saved in that file and converts them to a uint32_t. Those four bytes match on all of the cluster nodes.

To sum up, the root cause of the problem is on the system side. If we want to get rid of that cause, we could port the IP/MAC-address-based hostid calculation from the gethostid() implementation into the GA source code.
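
For illustration, a minimal sketch (not glibc's actual code) of the behaviour described above: only the first four bytes of /etc/hostid are read, so the two files shown, which both start with the characters "520a", yield the same hostid.

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
    int32_t id = 0;
    int fd = open("/etc/hostid", O_RDONLY);
    if (fd >= 0) {
        /* Like glibc's gethostid(), read only sizeof(int32_t) bytes. */
        if (read(fd, &id, sizeof(id)) != sizeof(id))
            id = 0;
        close(fd);
    }
    /* On a little-endian node this matches the 0x61303235 shown above:
     * the ASCII bytes '5', '2', '0', 'a'. */
    printf("hostid from first 4 bytes: 0x%08x\n", (unsigned int)id);
    return 0;
}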

jeffhammond commented 2 years ago

Based on the Blue Gene and Cray implementations below, I think it's reasonable to come up with pretty much any implementation that generates unique integers. We could even use MPI.

static long xgethostid()
{
#if defined(__bgp__)
#warning BGP
    long nodeid;
    int matched,midplane,nodecard,computecard;
    char rack_row,rack_col;
    char location[128];
    char location_clean[128];
    (void) memset(location, '\0', 128);
    (void) memset(location_clean, '\0', 128);
    _BGP_Personality_t personality;
    Kernel_GetPersonality(&personality, sizeof(personality));
    BGP_Personality_getLocationString(&personality, location);
    matched = sscanf(location, "R%c%c-M%1d-N%2d-J%2d",
            &rack_row, &rack_col, &midplane, &nodecard, &computecard);
    assert(matched == 5);
    sprintf(location_clean, "%2d%02d%1d%02d%02d",
            (int)rack_row, (int)rack_col, midplane, nodecard, computecard);
    nodeid = atol(location_clean);
#elif defined(__bgq__)
#warning BGQ
    int nodeid;
    MPIX_Hardware_t hw;
    MPIX_Hardware(&hw);

    nodeid = hw.Coords[0] * hw.Size[1] * hw.Size[2] * hw.Size[3] * hw.Size[4]
        + hw.Coords[1] * hw.Size[2] * hw.Size[3] * hw.Size[4]
        + hw.Coords[2] * hw.Size[3] * hw.Size[4]
        + hw.Coords[3] * hw.Size[4]
        + hw.Coords[4];
#elif defined(__CRAYXT) || defined(__CRAYXE)
#warning CRAY
    int nodeid;
#  if defined(__CRAYXT)
    PMI_Portals_get_nid(g_state.rank, &nodeid);
#  elif defined(__CRAYXE)
    PMI_Get_nid(g_state.rank, &nodeid);
#  endif
#else
    long nodeid = gethostid();
#endif

    return nodeid;
}
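
As a hedged illustration of the "we could even use MPI" option (a sketch only, not code that exists in GA): MPI-3's MPI_Comm_split_type can group ranks by shared-memory node, and the lowest world rank in each group then serves as a node-unique integer.

#include <mpi.h>

/* Sketch only: derive a node-unique integer without gethostid().
 * Every rank on the same shared-memory node ends up with the same value. */
static long mpi_node_id(MPI_Comm world)
{
    int world_rank, node_leader;
    MPI_Comm node_comm;

    MPI_Comm_rank(world, &world_rank);
    MPI_Comm_split_type(world, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node_comm);

    /* The smallest world rank present on this node identifies the node. */
    MPI_Allreduce(&world_rank, &node_leader, 1, MPI_INT, MPI_MIN, node_comm);
    MPI_Comm_free(&node_comm);

    return (long)node_leader;
}
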
fancer commented 2 years ago

@jeffhammond you are basically right, but afaics these are BG and Cray-specific solutions. Aren't they?

Anyway, the patch below fixed the problem in my case (it needs to be applied to the MPI-PR, PT and MT comex implementations).

--- a/comex/src-mpi-mt/groups.c 2020-11-02 21:36:27.000000000 +0300
+++ b/comex/src-mpi-mt/groups.c 2022-04-10 15:31:10.274155371 +0300
@@ -2,6 +2,7 @@
 #   include "config.h"
 #endif

+#include <fcntl.h>
 #include <stdlib.h>
 #include <string.h>
 #include <stdio.h>
@@ -436,6 +437,28 @@
     COMEX_ASSERT(MPI_SUCCESS == status);
 }

+#if !defined(__bgp__) && !defined(__bgq__) && !defined(__CRAYXT) && !defined(__CRAYXE)
+static long int _gethostid(void)
+{
+    long int nodeid = 0;
+    ssize_t n;
+    int fd;
+
+    fd = open("/etc/hostid", O_RDONLY | O_LARGEFILE, 0);
+    if (fd < 0)
+        return gethostid();
+
+    n = read(fd, &nodeid, sizeof(nodeid));
+    close(fd);
+
+    if (n < 0) {
+        perror("_gethostid: read(id)");
+        comex_error("_gethostid: Failed on hostid reading", n);
+    }
+
+    return nodeid;
+}
+#endif

 static long xgethostid()
 {
@@ -477,7 +500,7 @@
     PMI_Get_nid(g_state.rank, &nodeid);
 #  endif
 #else
-    long nodeid = gethostid();
+    long nodeid = _gethostid();
 #endif

It just makes sure that, if /etc/hostid exists, its content is read into a long int variable, which on modern Unix systems is 64 bits long (it only works if the hostid file is initialized with ASCII/UTF-8 characters, though). If the file is not found, the POSIX implementation of gethostid() is used.

Note we could also read the /etc/machine-id file to get a unique host ID. But it contains a UUID hash, which needs to be parsed a bit more cleverly than just reading it into a 4/8-byte variable. For instance, XOR-ing or summing 4/8-byte chunks would give better host uniqueness than taking only part of the hash value, as sketched below.
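
A rough, hypothetical sketch of that chunk-XOR idea (not necessarily what the eventual PR does): parse the 32 hex characters of /etc/machine-id as two 16-character halves and XOR them, so the whole UUID contributes to the host ID instead of only its first few bytes.

#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: fold /etc/machine-id into a single long. */
static long machine_id_hash(void)
{
    char buf[64] = {0};
    unsigned long hi = 0, lo = 0;
    FILE *f = fopen("/etc/machine-id", "r");

    if (f == NULL || fgets(buf, sizeof(buf), f) == NULL) {
        if (f != NULL) fclose(f);
        return 0; /* caller would fall back to gethostid() */
    }
    fclose(f);

    /* machine-id is 32 hex characters: XOR the two 16-character halves. */
    if (strlen(buf) >= 32) {
        sscanf(buf, "%16lx", &hi);
        sscanf(buf + 16, "%16lx", &lo);
    }
    return (long)(hi ^ lo);
}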

jeffhammond commented 2 years ago

Yeah, those BG/Cray hacks are for those machines alone. Can you create a pull request for your contribution?

fancer commented 2 years ago

@jeffhammond done, see PR #258. I've changed the commit a bit with respect to what I mentioned earlier, to optionally use the /etc/machine-id content as the host ID data.