Sandia OpenSHMEM (SOS) is an implementation of the OpenSHMEM specification over multiple networking APIs, including Portals 4, the OpenFabrics Interfaces (OFI), and UCX. See the Wiki for help with building and using SOS.
When running a ULT job with 1024 PEs on 16 nodes with 8 ABT threads, SOS fails to initialize the transport endpoint.
Example application: isx_micro
It prints this warning:
[0132] WARN: transport_ofi.c:621: bind_enable_ep_resources
[0132] fi_enable on endpoint failed
[0132] WARN: transport_ofi.c:1430: shmem_transport_ofi_ctx_init
[0132] context bind/enable endpoint failed (No space left on device)
The job hangs afterwards.
Parameters:
PMI_MAX_KVS_ENTRIES=10000000
SHMEM_SYMMETRIC_SIZE=6G
SHMEM_ADAPTIVE_THREAD_SCHEDULE=1
FI_PROVIDER=cxi
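
The "No space left on device" text in the warning is the standard message for ENOSPC, which may point to a per-node resource limit being hit when endpoints are enabled. As a rough sanity check, the sketch below estimates how many PEs and threads land on each node for the reported job. It assumes that PEs are spread evenly across the nodes and that the 8 ABT threads are per PE; neither is stated in the report, and the helper name `per_node_load` is hypothetical.

```python
# Rough per-node load estimate for the reported job.
# Assumptions (not stated in the report): PEs are evenly distributed
# across nodes, and the 8 ABT threads are per PE.

def per_node_load(total_pes, nodes, threads_per_pe):
    """Return (PEs per node, threads per node), rounding PEs per node up."""
    pes_per_node = -(-total_pes // nodes)  # ceiling division
    return pes_per_node, pes_per_node * threads_per_pe


pes, threads = per_node_load(total_pes=1024, nodes=16, threads_per_pe=8)
print(f"PEs per node: {pes}, threads per node: {threads}")
```

If each thread ends up with its own transport context, this would put several hundred contexts on every node. That is a hypothesis to check against the provider's limits, not a diagnosis.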