Open abouteiller opened 4 years ago
Original comment by Thomas Herault (Bitbucket: herault, GitHub: therault).
The ESSL documentation https://www.ibm.com/support/knowledgecenter/SSFHY8_6.1/reference/essl_reference_pdf.pdf seems to point to the compiler rather than the library: when calling ESSL from multiple threads, we must be careful to use the reentrant version of the compiler (xlc_r instead of xlc).
If xlc_r defines some constants that make essl.h select a different version of the kernels, this might be the solution to this issue. However, we currently call xlc through mpicc on Summit. We need to check whether an mpicc_r exists or whether there is another way to do this.
xlc_r has been deprecated. The issue may be that we do not use the proper essl.h in some parts of dplasma or, worse, that we did not use it when we compiled our own lapacke library.
Original report by Thomas Herault (Bitbucket: herault, GitHub: therault).
testing_Xgeqrf_hqr fails (among others, but testing_sgeqrf_hqr is the first test that fails in shared memory) because of a double-free when using libessl on Summit.
A backtrace shows the double-free is detected within the management of the workspace for some kernels, deep inside ESSL.
I suspect that this is because libessl is not designed to be called by multiple threads simultaneously and shares workspaces between threads. Based on this hypothesis, I tried using libesslsmp.so instead of libessl.so, but without any success.
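One way to test the shared-workspace hypothesis would be to serialize every kernel call behind a single lock: if the double-free disappears, the explanation becomes more likely. The following is only a minimal sketch, assuming pthreads. The names `fake_kernel`, `shared_ws`, `call_kernel_serialized` and `run_serialized` are hypothetical and are not part of ESSL or dplasma; the stand-in kernel only mimics a routine that allocates and frees its workspace through a pointer shared by all callers.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a kernel that is NOT thread-safe: it keeps its
 * workspace in a pointer shared by all callers. Two unsynchronized callers
 * could free the same pointer twice. This is not an ESSL routine. */
static double *shared_ws = NULL;
static long completed = 0;

static void fake_kernel(void *arg)
{
    size_t n = *(size_t *)arg;
    shared_ws = malloc(n * sizeof *shared_ws);
    if (shared_ws == NULL)
        return;
    memset(shared_ws, 0, n * sizeof *shared_ws);
    free(shared_ws);
    shared_ws = NULL;
    completed++;
}

/* A single global lock: only one thread at a time may be inside a kernel. */
static pthread_mutex_t kernel_lock = PTHREAD_MUTEX_INITIALIZER;

typedef void (*kernel_fn)(void *arg);

static void call_kernel_serialized(kernel_fn fn, void *arg)
{
    pthread_mutex_lock(&kernel_lock);
    fn(arg);
    pthread_mutex_unlock(&kernel_lock);
}

struct worker_args {
    int iters;
    size_t n;
};

static void *worker(void *p)
{
    struct worker_args *w = p;
    for (int i = 0; i < w->iters; i++)
        call_kernel_serialized(fake_kernel, &w->n);
    return NULL;
}

/* Runs nthreads threads, each calling the kernel iters times under the lock.
 * Returns the number of completed kernel calls, or -1 on allocation failure. */
long run_serialized(int nthreads, int iters)
{
    assert(nthreads > 0 && iters >= 0);

    pthread_t *tids = malloc((size_t)nthreads * sizeof *tids);
    if (tids == NULL)
        return -1;

    struct worker_args w = { iters, 256 };
    completed = 0;

    int started = 0;
    for (; started < nthreads; started++)
        if (pthread_create(&tids[started], NULL, worker, &w) != 0)
            break;
    for (int i = 0; i < started; i++)
        pthread_join(tids[i], NULL);

    free(tids);
    return completed;
}
```

Calling `run_serialized(4, 1000)` exercises the stand-in kernel from four threads without any overlap. In the real code the same wrapper would go around the actual kernel calls, at the cost of all kernel-level parallelism, so it is only useful as a diagnostic.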