DragonHPC / dragon

Dragon distributed runtime for HPC and AI applications and workflows
http://dragonhpc.org
MIT License

zmq related problem in reduce #11

Closed by andre-merzky 6 months ago

andre-merzky commented 7 months ago

Attached are two files to reproduce the problem. They should be used like this:

$ . dragon/_env/bin/activate
$ pip install radical.pilot
[...]
$ dragon radical-pilot-dragon-executor.py
/home/merzky/radical/radical.pilot.2/dragon/_env/lib:/usr/local/lib/started
ZMQ_ENDPOINTS tcp://0.0.0.0:33141 tcp://0.0.0.0:35079

In a second shell, run:

$ . dragon/_env/bin/activate
$ ./dragon_client.py

The first shell (the dragon service stub) shows the following stacktrace:

/home/merzky/radical/radical.pilot.2/dragon/_env/lib:/usr/local/lib/started
ZMQ_ENDPOINTS tcp://0.0.0.0:33141 tcp://0.0.0.0:35079
Traceback (most recent call last):
  File "/home/merzky/radical/radical.pilot.2/bin/radical-pilot-dragon-executor.py", line 274, in <module>
    s.serve()
  File "/home/merzky/radical/radical.pilot.2/bin/radical-pilot-dragon-executor.py", line 75, in serve
    self._handle_requests()
  File "/home/merzky/radical/radical.pilot.2/bin/radical-pilot-dragon-executor.py", line 129, in _handle_requests
    task['proc'].start()
  File "/usr/lib/python3.10/multiprocessing/process.py", line 121, in start
    self._popen = self._Popen(self)
  File "/home/merzky/radical/radical.pilot.2/dragon/src/dragon/mpbridge/process.py", line 114, in _Popen
    return DragonPopen(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/usr/lib/python3.10/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/merzky/radical/radical.pilot.2/dragon/src/dragon/mpbridge/process.py", line 58, in _launch
    multiprocessing.context.reduction.dump(process_obj, fp)
  File "/usr/lib/python3.10/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "<stringsource>", line 2, in zmq.backend.cython.context.Context.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
+++ head proc exited, code 1

The code works as expected when USE_DRAGON is set to False before running.

Do you have any idea how to address this, or advice on how to further debug? Thank you!

rct_dragon.zip

andre-merzky commented 7 months ago

Hi dragoneers - do you have any advice on this? Thanks!

mendygral commented 7 months ago

Apologies for the delay and thank you for the reproducer. My hunch given the stack trace is some bad interaction between how we patch the multiprocessing reducer and that being used by ZMQ. We'll take a close look and get back very soon. Thank you!

mendygral commented 7 months ago

Actually, on closer inspection, it could also be that some ZMQ object being passed through multiprocessing is not fully reducible (hence the complaint about the missing __reduce__ method).

applio commented 7 months ago

I was not initially able to reproduce the behavior you described and instead encountered a different error with a traceback ending with:

AttributeError: Can't pickle local object 'Logger._ensure_handler.<locals>.<lambda>'

This was encountered against radical.utils version 1.47.0. It seemed to be having trouble with the multiple lambdas referred to inside radical.utils.logger.Logger instances so I temporarily modified that package to instead use a function defined at the module level which performed the same task as all those lambdas. Having made this change, I was able to successfully reproduce the behavior you described.

I suggest changing the _fork_task() method in Server to a staticmethod. That is, in your radical-pilot-dragon-executor.py file, modify this:

    def _fork_task(self, task):

to instead read as:

    @staticmethod
    def _fork_task(task):

I believe the error from zmq.backend.cython.context.Context.__reduce_cython__ arises because passing a bound method as the target to multiprocessing.Process forces the pickling and transmission of the entire Server instance (the object _fork_task is bound to), and those Server instances hold references to zmq.Pipe and zmq.Context objects which, according to the ZeroMQ docs, are not supposed to be serializable. At the very least, an unpickled zmq.Socket handed to another process should be unusable. I do not understand the details of why ZeroMQ did not similarly object when not using dragon and only using standard multiprocessing; it looks as though ZeroMQ may have some tricks that sometimes permit pickling of such objects when multiprocessing is involved, in the hope they are not used after unpickling. I am uncertain about this and my guess could be quite incorrect.

In any event, triggering the serialization of the full instance when you do not really need it is probably worth avoiding. Using a target that is a staticmethod or otherwise a simple function should be more performant and simpler.

This feels like it might not be an issue for dragon itself but a nuance of how/when zmq objects complain about not liking pickles (er, pickling).
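For illustration (not from the thread), here is a minimal standalone sketch of why a bound-method target drags the whole instance through the pickler. The class and method names are hypothetical, and threading.Lock stands in for an unpicklable resource such as a zmq.Context:

```python
import pickle
import threading

class Server:
    def __init__(self):
        # stand-in for an unpicklable resource (e.g. a zmq.Context)
        self._ctx = threading.Lock()

    def fork_task_bound(self, task):
        pass

    @staticmethod
    def fork_task_static(task):
        pass

s = Server()

# Pickling the bound method pickles its __self__, i.e. the whole Server
# instance, including the unpicklable lock -> TypeError
try:
    pickle.dumps(s.fork_task_bound)
    bound_picklable = True
except TypeError:
    bound_picklable = False

# The staticmethod pickles by reference (module + qualname); no instance
# state travels with it
static_picklable = pickle.dumps(Server.fork_task_static) is not None

print(bound_picklable, static_picklable)  # False True
```

The same mechanism applies when multiprocessing's reducer serializes the Process target, which is why switching _fork_task to a staticmethod sidesteps the zmq complaint.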

andre-merzky commented 7 months ago

Thanks for your reply! And sorry for being sloppy with the Logger serialization - I had that fixed in a local branch along the lines you suggested, and that indeed resolved that specific issue.

The ZMQ channels are indeed not used after the fork (it is immediately followed by an exec), but I also can't close the channels as the parent process continues to use them. I'll read a bit into how ZMQ handles the serialization on fork...

applio commented 7 months ago

I might be misunderstanding what you have in mind but it should be just fine for the parent to continue using the channels after the fork. It's pickling the channels/sockets/contexts to pass them to the child process that is proving problematic.

I was hoping that whatever info you need to provide to _fork_task could be passed as arguments to that function rather than needing to pull in the whole of the Server instance (in other words, get rid of the self but do pass things you need and will be usable by the child process).
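A hypothetical sketch of that shape (the function name, task fields, and queue are illustrative, not taken from the executor): only plain, picklable arguments cross the process boundary, while the parent keeps its zmq context and sockets to itself:

```python
import multiprocessing as mp

def _fork_task(task, result_q):
    # child side: receives only plain, picklable data, no Server instance
    result_q.put(f"forked {task['uid']}")

def launch(task):
    # parent side: zmq sockets stay here; none are handed to the child
    q = mp.Queue()
    p = mp.Process(target=_fork_task, args=(task, q))
    p.start()
    msg = q.get()
    p.join()
    return msg

if __name__ == "__main__":
    print(launch({"uid": "task.0001", "exe": "/bin/date"}))
```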

andre-merzky commented 7 months ago

Thanks, understood, I'll give that a try!

andre-merzky commented 6 months ago

The suggested workaround did the trick - thanks!