Sandia-OpenSHMEM / SOS

Sandia OpenSHMEM is an implementation of the OpenSHMEM specification over multiple networking APIs, including Portals 4, the OpenFabrics Interfaces (OFI), and UCX. Please click on the Wiki tab for help with building and using SOS.

Seg fault with more threads than STXs on Aries #639

Open davidozog opened 6 years ago

davidozog commented 6 years ago

On Cray Aries (Cori@NERSC Haswell nodes), I see a seg fault when running the blocking put bandwidth test with 16 threads between 2 PEs on 2 nodes when SHMEM_OFI_STX_MAX is set between 1 and 8. Setting the maximum to anything from 9 through 17 or higher does not exhibit the seg fault. The seg fault does not appear with 8 threads.

More details to come; this issue is a placeholder so I don't forget about it.

davidozog commented 6 years ago

Setting SHMEM_OFI_STX_DISABLE_PRIVATE seems to alleviate this problem and allow scaling to higher thread counts. Increasing SHMEM_OFI_STX_MAX at a given thread count also seems to alleviate the problem. That suggests to me that this could be an oversubscription issue that occurs with ~10-15 threads on an STX...?

davidozog commented 6 years ago

Note: I haven't yet narrowed down the cause of this problem, but it is easy to avoid by setting a reasonable value for SHMEM_OFI_STX_MAX. Also, once we merge PR #653, this issue will be fixed on Cori when setting SHMEM_OFI_STX_AUTO.

jdinan commented 6 years ago

Is this issue still present, and is it something we want to fix in 1.4.2?

davidozog commented 6 years ago

This issue still exists with SOS + libfabric v1.6.0. It's a non-deterministic seg fault that fails roughly 50% of the time with STX_MAX=8 and 16 threads.
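Since the fault is non-deterministic, a small harness that repeats the run and counts crashes can help compare settings. The following is only a sketch under stated assumptions: `segfault_rate` and `sweep_stx_max` are hypothetical helper names, the benchmark command is supplied by the caller, and the check assumes the launched process itself is killed by SIGSEGV. A job launcher may report a crashed rank through its own exit status instead, in which case the check would need adapting.

```python
import os
import signal
import subprocess


def segfault_rate(cmd, env_overrides, trials=20):
    """Run cmd `trials` times; return the fraction of runs killed by SIGSEGV."""
    # Start from the current environment and apply the overrides on top.
    env = dict(os.environ, **env_overrides)
    crashes = 0
    for _ in range(trials):
        proc = subprocess.run(
            cmd,
            env=env,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # On POSIX, a negative return code means the child died from that signal.
        if proc.returncode == -signal.SIGSEGV:
            crashes += 1
    return crashes / trials


def sweep_stx_max(cmd, values, trials=20):
    """Measure the crash rate for each SHMEM_OFI_STX_MAX value in `values`."""
    return {
        v: segfault_rate(cmd, {"SHMEM_OFI_STX_MAX": str(v)}, trials)
        for v in values
    }
```

For example, `sweep_stx_max(cmd, range(1, 18))` would return a dictionary mapping each STX limit to its observed crash rate, where `cmd` is the argument list for whatever benchmark invocation is being tested.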

I haven't had much luck debugging it. It's a strange use case that oversubscribes the 0th STX:

[0000] STX[8] = [ 10S 1P 1P 1P 1P 1P 1P 1P ]
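To compare allocations across runs, the STX line above can be parsed mechanically. This is a rough sketch, not part of SOS: `parse_stx_line` is a hypothetical helper, and it assumes each entry is a count of contexts attached to that STX followed by `S` (shared) or `P` (private), which is an interpretation of the output rather than a documented format.

```python
import re


def parse_stx_line(line):
    """Parse an STX allocation line into a list of (count, kind) tuples.

    Assumed format (an interpretation, not a documented one):
        [PE] STX[N] = [ <count><S|P> ... ]
    """
    match = re.search(r"STX\[(\d+)\]\s*=\s*\[(.*?)\]", line)
    if match is None:
        raise ValueError("no STX allocation found in line")
    expected = int(match.group(1))
    slots = []
    for token in match.group(2).split():
        entry = re.fullmatch(r"(\d+)([SP])", token)
        if entry is None:
            raise ValueError(f"unrecognized STX entry: {token!r}")
        slots.append((int(entry.group(1)), entry.group(2)))
    # The number of entries should match the N in STX[N].
    if len(slots) != expected:
        raise ValueError(f"expected {expected} entries, found {len(slots)}")
    return slots


line = "[0000] STX[8] = [ 10S 1P 1P 1P 1P 1P 1P 1P ]"
slots = parse_stx_line(line)
shared_load = sum(count for count, kind in slots if kind == "S")
private_stxs = sum(1 for _, kind in slots if kind == "P")
```

Under that reading, the line shows 10 contexts sharing STX 0 and 7 STXs each holding a single private context.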

There are a few simple alternatives that avoid the problem: disable private contexts, use more shared contexts, or increase SHMEM_OFI_STX_MAX. However, the default multi-threaded use case of having many threads share 1 STX might be relatively common at the moment. PR #653 might avoid that.

Unfortunately, this benchmark always seg faults with libfabric v1.6.x with more than 1 thread regardless of the STX_MAX limit... This might be a separate issue, and I'm looking into it.

davidozog commented 6 years ago

Hmm... this bug does not appear to exist with libfabric v1.6.1 or v1.6.x. I may have confused the v1.6.x seg fault with #763, which was recently fixed. But I still see the fault about half the time using libfabric v1.6.0.

jdinan commented 6 years ago

Ok, pushing to 1.4.x so we re-test at the next release.