flux-framework / dyad

DYAD: DYnamic and Asynchronous Data Streamliner
GNU Lesser General Public License v3.0

Handling forking of process and thread within DYAD. #60

Open hariharan-devarajan opened 7 months ago

hariharan-devarajan commented 7 months ago

Currently, when a process forks, the DYAD context is not reinitialized in the child, which can cause UCX endpoint creation errors. We need to investigate what must be reinitialized.

Current thoughts

  1. Reinitialize the DYAD CTX.
  2. Check whether UCX can be reinitialized from the forked process.

ilumsden commented 7 months ago

I'm pretty sure we will have to reinitialize everything. At bare minimum, we will need to reinitialize the DTL because the UCX context and worker cannot be shared across processes. I am also pretty sure that anything Flux related (e.g., the flux_t handle) will need to be reinitialized.
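
To illustrate the "reinitialize everything" idea, here is a minimal Python sketch that rebuilds per-process resources whenever the owning PID changes. It is not DYAD's API: `ProcessLocalContext` and `make_resources` are hypothetical names, and the dictionary merely stands in for the UCX context, the UCX worker, and the `flux_t` handle.

```python
import os


class ProcessLocalContext:
    """Hypothetical stand-in for a DYAD context; not DYAD's actual API."""

    def __init__(self, factory):
        self._factory = factory   # callable that builds per-process resources
        self._owner_pid = None    # PID of the process that built the resources
        self._resources = None
        self.init_count = 0       # how many times the resources were (re)built

    def get(self):
        """Return the resources, rebuilding them if this process does not own them."""
        pid = os.getpid()
        if self._resources is None or self._owner_pid != pid:
            # Either first use, or we are in a forked child holding
            # handles that were created by the parent process.
            self._resources = self._factory()
            self._owner_pid = pid
            self.init_count += 1
        return self._resources


def make_resources():
    # Placeholder for "UCX context + UCX worker + flux_t handle" (assumption).
    return {"pid": os.getpid()}


ctx = ProcessLocalContext(make_resources)
```

With this pattern a forked child that calls `ctx.get()` would see a PID mismatch and build fresh resources instead of reusing the parent's. The cost is one `os.getpid()` check on every access.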

JaeseungYeom commented 7 months ago

We will support two modes of child process creation: forking and spawning. We will not support threading for now, until we are confident that multi-process support is robust.

For process creation, it is important to understand the various mechanisms by which a new process is created, so that we can identify ways to trigger initialization upon creation. Python multiprocessing's fork start method seems to rely on the system fork, while its spawn start method does not. Python multiprocessing supports a custom at-fork callback. According to Hari, PyTorch offers a similar capability. In some cases, we may need to intercept the process creation calls and add DYAD initialization.

I will at least add a call to reinitialize, and define an environment variable to select the reinitialization behavior instead of the default, in which initialization is skipped if the DYAD context object exists. PR #63