ICLDisco / dplasma

DPLASMA is a highly optimized, accelerator-aware implementation of a dense linear algebra package for distributed heterogeneous systems. It is designed to deliver sustained performance on distributed systems where each node features multiple sockets of multicore processors and, if available, accelerators, using the PaRSEC runtime as a backend.

ESSL and workspaces #3

Open abouteiller opened 4 years ago

abouteiller commented 4 years ago

Original report by Thomas Herault (Bitbucket: herault, GitHub: therault).


testing_Xgeqrf_hqr fails, among others (testing_sgeqrf_hqr is the first test to fail in shared memory), because of a double-free when using libessl on Summit.

A backtrace shows the double-free is detected within the management of the workspace for some kernels, deep inside ESSL.

I suspect that this is because libessl is not designed to be called by multiple threads simultaneously and shares workspaces between threads. Based on this hypothesis, I tried using libesslsmp.so instead of libessl.so, without success.
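
One way to probe the shared-workspace hypothesis above is to serialize every kernel call behind a single mutex and see whether the double-free goes away. The following is only a minimal sketch of that idea: `fake_kernel`, `shared_workspace`, `serialized_kernel`, and `run_serialized_test` are hypothetical names, and `fake_kernel` is a stand-in that mimics a routine using one static workspace. It is not an ESSL call, and nothing here is taken from DPLASMA.

```c
#include <pthread.h>
#include <stddef.h>

#define WS_SIZE  64
#define NTHREADS 4
#define NCALLS   1000

/* Hypothetical stand-in for a non-thread-safe kernel: every caller goes
 * through the same static workspace (the behavior suspected of libessl). */
static double shared_workspace[WS_SIZE];

static double fake_kernel(const double *x, size_t n)
{
    double sum = 0.0;
    if (n > WS_SIZE) n = WS_SIZE;
    for (size_t i = 0; i < n; i++) shared_workspace[i] = 2.0 * x[i];
    for (size_t i = 0; i < n; i++) sum += shared_workspace[i];
    return sum;
}

/* One global lock: at most one thread is inside the kernel at a time. */
static pthread_mutex_t kernel_lock = PTHREAD_MUTEX_INITIALIZER;

static double serialized_kernel(const double *x, size_t n)
{
    pthread_mutex_lock(&kernel_lock);
    double r = fake_kernel(x, n);
    pthread_mutex_unlock(&kernel_lock);
    return r;
}

typedef struct { double fill; int errors; } worker_arg_t;

/* Each thread calls the kernel repeatedly on its own input and counts
 * results that do not match the value expected for that input. */
static void *worker(void *p)
{
    worker_arg_t *arg = p;
    double x[WS_SIZE];
    for (int i = 0; i < WS_SIZE; i++) x[i] = arg->fill;
    for (int c = 0; c < NCALLS; c++) {
        if (serialized_kernel(x, WS_SIZE) != 2.0 * WS_SIZE * arg->fill)
            arg->errors++;
    }
    return NULL;
}

/* Returns the number of wrong results seen, or -1 if a thread could not
 * be started. */
int run_serialized_test(void)
{
    pthread_t tid[NTHREADS];
    worker_arg_t args[NTHREADS];
    int started = 0, errors = 0;

    for (int t = 0; t < NTHREADS; t++) {
        args[t].fill = (double)(t + 1);
        args[t].errors = 0;
        if (pthread_create(&tid[t], NULL, worker, &args[t]) != 0) break;
        started++;
    }
    for (int t = 0; t < started; t++) {
        pthread_join(tid[t], NULL);
        errors += args[t].errors;
    }
    return started == NTHREADS ? errors : -1;
}
```

If the failure disappears once the real kernel calls are serialized this way, that would support the hypothesis; if it persists, the cause is more likely elsewhere, such as the headers or the build.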

abouteiller commented 4 years ago

Original comment by Thomas Herault (Bitbucket: herault, GitHub: therault).


The ESSL documentation (https://www.ibm.com/support/knowledgecenter/SSFHY8_6.1/reference/essl_reference_pdf.pdf) seems to point to the compiler instead: when making ESSL calls from multiple threads, we must be careful to use the reentrant version of the compiler (xlc_r instead of xlc).

If this defines some constants for essl.h that change which version of the kernel is used, this might be the solution to this issue. However, we currently call xlc through mpicc on Summit. We need to check whether an mpicc_r exists or whether there is another way.
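
To see whether a given compiler invocation predefines any threading-related macros, a small probe can be built with exactly the same wrapper and flags as the rest of the code. This is a sketch under assumptions: `_REENTRANT` and `_THREAD_SAFE` are macros that reentrant or threaded invocations commonly define, but whether essl.h inspects either of them is not confirmed here. The function names are hypothetical.

```c
#include <stdio.h>

/* Assumption: these are the macros a reentrant/threaded compiler
 * invocation might predefine. Which macros essl.h actually tests,
 * if any, is not established in this thread. */
#ifdef _REENTRANT
#define SEEN_REENTRANT 1
#else
#define SEEN_REENTRANT 0
#endif

#ifdef _THREAD_SAFE
#define SEEN_THREAD_SAFE 1
#else
#define SEEN_THREAD_SAFE 0
#endif

/* 1 if _REENTRANT was defined when this file was compiled, else 0. */
int seen_reentrant(void)   { return SEEN_REENTRANT; }

/* 1 if _THREAD_SAFE was defined when this file was compiled, else 0. */
int seen_thread_safe(void) { return SEEN_THREAD_SAFE; }

/* Prints one line per macro; returns the number of characters written
 * (negative on output error), as fprintf does. */
int report_threading_macros(FILE *out)
{
    return fprintf(out, "_REENTRANT:   %s\n_THREAD_SAFE: %s\n",
                   seen_reentrant()   ? "defined" : "not defined",
                   seen_thread_safe() ? "defined" : "not defined");
}
```

Calling `report_threading_macros(stdout)` from a test program compiled once through mpicc and once with the candidate alternative would show whether the two builds differ in these definitions.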

abouteiller commented 4 years ago

xlc_r has been deprecated. The issue may be that we do not use the proper essl.h in some parts of DPLASMA or, worse, that we did not use it when we compiled our own lapacke library.