Sandia-OpenSHMEM / SOS

Sandia OpenSHMEM is an implementation of the OpenSHMEM specification over multiple networking APIs, including Portals 4, the OpenFabrics Interfaces (OFI), and UCX. Please click on the Wiki tab for help with building and using SOS.

Endpoint Resource Exhaustion in ULT Mode #1127

Open markbrown314 opened 3 months ago

markbrown314 commented 3 months ago

When running a ULT job with 1024 PEs on 16 nodes with 8 ABT threads, SOS fails to initialize the transport endpoint.

For example, with isx_micro, this is the warning:

[0132] WARN: transport_ofi.c:621: bind_enable_ep_resources
[0132] fi_enable on endpoint failed
[0132] WARN: transport_ofi.c:1430: shmem_transport_ofi_ctx_init
[0132] context bind/enable endpoint failed (No space left on device)

The job hangs afterwards.

Parameters:
PMI_MAX_KVS_ENTRIES=10000000
SHMEM_SYMMETRIC_SIZE=6G
SHMEM_ADAPTIVE_THREAD_SCHEDULE=1
FI_PROVIDER=cxi
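
For a rough sense of scale, here is a back-of-the-envelope sketch of how many endpoints a single node might be asked for in this configuration. It rests on assumptions that are not confirmed above: that the PEs are spread evenly across the nodes, that the 8 ABT threads are per PE, and that each thread ends up with its own context and endpoint on top of one default endpoint per PE. The function names are hypothetical and are not part of SOS or libfabric.

```python
def pes_per_node(total_pes: int, nodes: int) -> int:
    """PEs hosted on the busiest node, assuming an even spread (ceiling division)."""
    return -(-total_pes // nodes)


def endpoints_per_node(total_pes: int, nodes: int, threads_per_pe: int,
                       default_eps_per_pe: int = 1) -> int:
    """Estimated endpoints requested on one node.

    Assumption: every thread gets its own context/endpoint, in addition to
    `default_eps_per_pe` endpoints that each PE opens regardless of threads.
    """
    return pes_per_node(total_pes, nodes) * (default_eps_per_pe + threads_per_pe)


if __name__ == "__main__":
    # Values taken from the report: 1024 PEs, 16 nodes, 8 ABT threads.
    print("PEs per node:", pes_per_node(1024, 16))
    print("Estimated endpoints per node:", endpoints_per_node(1024, 16, 8))
```

If the estimate is in the right ballpark, comparing it with the number of endpoints the provider can actually allocate per node would show whether "No space left on device" is a plain resource limit or something else.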