Sandia-OpenSHMEM / SOS

Sandia OpenSHMEM is an implementation of the OpenSHMEM specification over multiple Networking APIs, including Portals 4, the Open Fabric Interface (OFI), and UCX. Please click on the Wiki tab for help with building and using SOS.

No path to peer error #1090

Closed WrongWizzli closed 7 months ago

WrongWizzli commented 1 year ago

I am trying to build the Sandia OpenSHMEM package on Linux Mint 20.3 with the following commands:

./autogen.sh
./configure --enable-pmi-simple
make
make check

However, at the make check step, 66 tests fail for the same reason: No path to peer. Here is the full log of one failing test:

FAIL: mt_membar
===============

Starting multi-threaded test on 1 PEs, 2 threads/PE
[0000] ERROR: transport_none.h:214: shmem_transport_atomic
[0000]        No path to peer
Sandia OpenSHMEM exited in error
[0000] WARN:  init.c:137: shmem_internal_shutdown_atexit
[0000]        shutting down without a call to shmem_finalize
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
[0000] ERROR: transport_none.h:214: shmem_transport_atomic
[0000]        No path to peer
Sandia OpenSHMEM exited in error
[0000] WARN:  init.c:137: shmem_internal_shutdown_atexit
[0000]        shutting down without a call to shmem_finalize
Starting multi-threaded test on 1 PEs, 2 threads/PE
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[36188,1],0]
  Exit code:    1
--------------------------------------------------------------------------
FAIL mt_membar (exit status: 1)

Could you please suggest what might be causing these failures?

WrongWizzli commented 1 year ago

While running make I also get this warning: WARNING: No transport found, resulting library will be unable to exchange messages
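A quick way to confirm that a build hit this warning is to search a saved build log for it. The sketch below assumes the build output was captured to a file first (for example with make 2>&1 | tee make.log); the function name and the make.log file name are hypothetical.

```shell
#!/bin/sh
# Return 0 if the given build log contains the "No transport found" warning
# quoted above. The function name is hypothetical.
has_no_transport_warning() {
    grep -q "No transport found" "$1"
}

# Assumption: the build output was saved beforehand, e.g.
#   make 2>&1 | tee make.log
log="${1:-make.log}"
if [ -f "$log" ] && has_no_transport_warning "$log"; then
    echo "SOS was built without a transport; 'No path to peer' errors are likely in make check"
fi
```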

davidozog commented 1 year ago

Currently, to run with multiple processes you need to enable a transport. I'd recommend libfabric (https://ofiwg.github.io/libfabric) and passing --with-ofi to the SOS configure script. If you are only running on a single node, your shared-memory transport options are XPMEM (--with-xpmem), CMA (--with-cma), or OFI with the sockets provider (--with-ofi); however, these can sometimes be tricky to build and install, and performance varies widely. May I ask what your use case is?
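The transport options listed in the reply above can be summarized as configure lines. This is only a sketch: the helper names and the short transport names (ofi, xpmem, cma) are hypothetical, and --enable-pmi-simple is carried over from the original build commands on the assumption that it is still wanted. The script prints the command rather than running it.

```shell
#!/bin/sh
# Map a transport name to the SOS configure flag mentioned in the reply above.
# The function name and the short transport names are hypothetical.
sos_transport_flag() {
    case "$1" in
        ofi)   echo "--with-ofi" ;;
        xpmem) echo "--with-xpmem" ;;
        cma)   echo "--with-cma" ;;
        *)     echo "unknown transport: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) the configure line for the chosen transport.
# Assumption: --enable-pmi-simple is kept from the original build commands.
sos_configure_cmd() {
    flag=$(sos_transport_flag "$1") || return 1
    echo "./configure --enable-pmi-simple $flag"
}

# Default to OFI, the transport recommended in the reply.
sos_configure_cmd "${1:-ofi}"
```

The chosen transport's library has to be installed before configure can find it; if libfabric lives in a non-default prefix, configure will probably need to be told where (I believe --with-ofi accepts an install path, but check ./configure --help to be sure).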

If you know how to check out and build a different git branch, you could also try our experimental mmap shared-memory transport at https://github.com/davidozog/sandia-shmem/tree/wip/mmap_xpmem_heap, passing --enable-mmap to configure. The advantage of mmap is that you don't have to build or install anything extra on Linux Mint. Let me know if you would like help trying that.
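For reference, trying the experimental branch might look roughly like the sketch below. It assumes the branch name is wip/mmap_xpmem_heap (taken from the URL above), that the fork is cloned into a directory named sandia-shmem, and that --enable-pmi-simple from the original build is still wanted alongside --enable-mmap. The run helper and DRY_RUN variable are hypothetical; by default the script only prints the commands.

```shell
#!/bin/sh
# DRY_RUN=1 (the default here) only prints the commands; set DRY_RUN=0 to run them.
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: echo the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Clone the fork at the experimental branch, then build and test it.
# Assumptions: branch name from the URL above; --enable-pmi-simple kept
# from the original build commands.
build_mmap_branch() {
    run git clone --branch wip/mmap_xpmem_heap https://github.com/davidozog/sandia-shmem.git &&
    run cd sandia-shmem &&
    run ./autogen.sh &&
    run ./configure --enable-pmi-simple --enable-mmap &&
    run make &&
    run make check
}

build_mmap_branch
```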