Open davidozog opened 6 years ago
Setting SHMEM_OFI_STX_DISABLE_PRIVATE seems to alleviate this problem and allow scaling to higher thread counts. Increasing SHMEM_OFI_STX_MAX at a given thread count also seems to alleviate the problem. That suggests to me that this could be an oversubscription issue that occurs with ~10-15 threads on an STX...?
Note: I haven't yet narrowed down the cause of this problem, but avoiding it is quite easy by setting a reasonable value for SHMEM_OFI_STX_MAX. Also, once we merge PR #653, this issue is fixed on Cori when setting SHMEM_OFI_STX_AUTO.
Is this issue still present, and is it something we want to fix in 1.4.2?
This issue still exists with SOS + libfabric v1.6.0. It's a non-deterministic seg fault that occurs in roughly 50% of runs with STX_MAX=8 and 16 threads.
I haven't had much luck debugging it. It's a strange use-case that oversubscribes the 0th STX:
[0000] STX[8] = [ 10S 1P 1P 1P 1P 1P 1P 1P ]
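To make the oversubscription concrete, here is a toy model of how 16 threads could end up with the layout in the line above. The allocation policy is an assumption chosen to match that log line, not code taken from SOS: the default context holds a shared reference on STX 0, each thread's context takes an unused STX privately while one is available, and later contexts fall back to sharing. The names `assign_stxs` and `format_layout` are hypothetical.

```python
def assign_stxs(num_threads, stx_max):
    """Toy model of STX assignment (assumed policy, not the SOS source).

    Returns two lists of length stx_max: the number of shared users and
    the number of private users on each STX.
    """
    shared = [0] * stx_max
    private = [0] * stx_max

    # Assumption: the default context always holds a shared reference on STX 0.
    shared[0] = 1

    for _ in range(num_threads):
        # Assumption: each thread's context first tries to claim an unused STX.
        free = next(
            (i for i in range(stx_max) if shared[i] == 0 and private[i] == 0),
            None,
        )
        if free is not None:
            private[free] = 1
        else:
            # No unused STX left: share the least-loaded non-private STX.
            candidates = [i for i in range(stx_max) if private[i] == 0]
            target = min(candidates, key=lambda i: shared[i])
            shared[target] += 1

    return shared, private


def format_layout(shared, private):
    """Render the model in the same style as the log line."""
    cells = []
    for s, p in zip(shared, private):
        if s:
            cells.append(f"{s}S")
        elif p:
            cells.append(f"{p}P")
        else:
            cells.append("-")  # unused STX
    return f"STX[{len(cells)}] = [ " + " ".join(cells) + " ]"


print(format_layout(*assign_stxs(num_threads=16, stx_max=8)))
# STX[8] = [ 10S 1P 1P 1P 1P 1P 1P 1P ]
```

Under this assumed policy, 16 threads with a limit of 9 leave 9 shared users on STX 0, and 8 threads with a limit of 1 also leave 9. Both are configurations reported in this thread as not faulting, which would fit the guess that trouble starts at around 10 users on one STX, but the model says nothing about the actual cause.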
There are a few simple alternatives that avoid the problem: disable private contexts, use more shared contexts, or increase SHMEM_OFI_STX_MAX. However, the default multi-threaded use-case of having many threads share 1 STX might be relatively common at the moment. PR #653 might avoid that.
Unfortunately, this benchmark always seg faults with libfabric v1.6.x with more than 1 thread regardless of the STX_MAX limit... This might be a separate issue, and I'm looking into it.
Hmm... this bug does not appear to exist with libfabric v1.6.1 or v1.6.x. I may have confused the v1.6.x seg fault with #763, which was recently fixed. But I still see the fault about half the time using libfabric v1.6.0.
Ok, pushing to 1.4.x so we re-test at the next release.
On Cray Aries (Cori@NERSC Haswell nodes), I see a seg fault when running 16 threads in the blocking put bandwidth test between 2 PEs on 2 nodes when setting SHMEM_OFI_STX_MAX between 1 and 8. With a maximum of 9-17+ STXs, the seg fault does not occur. The seg fault also does not appear with 8 threads. More details to come; this issue is a placeholder so I don't forget about it.