ICLDisco / parsec

PaRSEC is a generic framework for architecture aware scheduling and management of micro-tasks on distributed, GPU accelerated, many-core heterogeneous architectures. PaRSEC assigns computation threads to the cores, GPU accelerators, overlaps communications and computations and uses a dynamic, fully-distributed scheduler based on architectural features such as NUMA nodes and algorithmic features such as data reuse.
Other
50 stars 17 forks source link

parsec_ce initialization: bug in PTG with dynamic termination detection #444

Open therault opened 2 years ago

therault commented 2 years ago

Describe the bug

parsec_ce is not initialized by default when we create the taskpool. It is only initialized lazily after the context is started.

When enabling dynamic termination detection in PTG (either via the %option dynamic option in the JDF, or via the --dynamic-termdet flag to parsec_ptgpp), the generated code in taskpool_new() installs a dynamic termination detector (fourcounter today).

This calls mca_component_query() on the fourcounter termination detector module, which calls parsec_ce.tag_register() and segfaults.

To Reproduce

  1. Checkout dplasma master with parsec master
  2. Enable dynamic termination detection with the PTGs of DPLASMA by passing -DPARSEC_PTGPP_FLAGS=--dynamic-termdet at configure time
  3. Run any (single process or multiprocess) dplasma test. ./tests/testing_dgemm -N 1000 for example.

Expected behavior

Should run to completion. Doesn't.

Environment (please complete the following information):

Additional context

Adding remote_dep_mpi_on(ctx); at the end of parsec_context_t* setup_parsec(int argc, char **argv, int *iparam) in DPLASMA solves the problem, but I assume this is not what we want to do.

devreal commented 2 years ago

Maybe the termdet shouldn't call parsec_ce if there is only one process?

therault commented 2 years ago

It's the same problem with distributed runs: parsec_ce is initialized as late as possible, and when we create these taskpool that load up the termination detector, no comm engine is initialized, in distributed or in single node. In fact the context is not even associated yet with the taskpool, so it's impossible to initialize the comm engine on demand.

I guess the only solution is to have every function of the termination detector to check if the handler for this module is registered or not, and if not call the registration. That's not thread-safe, but I guess we can just atomic cas a variable before we do the call.

omor1 commented 2 years ago

parsec_remote_dep_ini is called inside parsec_init; however, portions of initialization are indeed delayed until the first time that parsec_remote_dep_on is called. I think the proper solution is putting a call to remote_dep_ce_init inside remote_dep_dequeue_main before the while( -1 != whatsup ) loop. You still need to call remote_dep_ce_init inside remote_dep_mpi_on in case something reinits the comm engine e.g. remote_dep_set_ctx, but I don't think there's a cost to initing the comm engine early—actually, since it's on a dedicated thread anyway, it's done in parallel with the rest of the PaRSEC init, so in theory should be faster.

abouteiller commented 8 months ago

We suspect this is already fixed @therault will assess and we will close if issue resolved, push to 4.1 otherwise.

abouteiller commented 8 months ago

Is it resolved?