Open therault opened 2 years ago
Maybe the termdet shouldn't call parsec_ce if there is only one process?
It's the same problem with distributed runs: parsec_ce is initialized as late as possible, and when we create these taskpool that load up the termination detector, no comm engine is initialized, in distributed or in single node. In fact the context is not even associated yet with the taskpool, so it's impossible to initialize the comm engine on demand.
I guess the only solution is to have every function of the termination detector to check if the handler for this module is registered or not, and if not call the registration. That's not thread-safe, but I guess we can just atomic cas a variable before we do the call.
parsec_remote_dep_ini
is called inside parsec_init
; however, portions of initialization are indeed delayed until the first time that parsec_remote_dep_on
is called. I think the proper solution is putting a call to remote_dep_ce_init
inside remote_dep_dequeue_main
before the while( -1 != whatsup )
loop. You still need to call remote_dep_ce_init
inside remote_dep_mpi_on
in case something reinits the comm engine e.g. remote_dep_set_ctx
, but I don't think there's a cost to initing the comm engine early—actually, since it's on a dedicated thread anyway, it's done in parallel with the rest of the PaRSEC init, so in theory should be faster.
We suspect this is already fixed @therault will assess and we will close if issue resolved, push to 4.1 otherwise.
Is it resolved?
Describe the bug
parsec_ce is not initialized by default when we create the taskpool. It is only initialized lazily after the context is started.
When enabling dynamic termination detection in PTG (either via the
%option dynamic
option in the JDF, or via the--dynamic-termdet
flag toparsec_ptgpp
), the generated code intaskpool_new()
installs a dynamic termination detector (fourcounter today).This calls
mca_component_query()
on the fourcounter termination detector module, which callsparsec_ce.tag_register()
and segfaults.To Reproduce
-DPARSEC_PTGPP_FLAGS=--dynamic-termdet
at configure time./tests/testing_dgemm -N 1000
for example.Expected behavior
Should run to completion. Doesn't.
Environment (please complete the following information):
Additional context
Adding
remote_dep_mpi_on(ctx);
at the end ofparsec_context_t* setup_parsec(int argc, char **argv, int *iparam)
in DPLASMA solves the problem, but I assume this is not what we want to do.