Closed andre-merzky closed 6 months ago
Hi dragoneers - do you have any advice on this? Thanks!
Apologies for the delay and thank you for the reproducer. My hunch given the stack trace is some bad interaction between how we patch the multiprocessing reducer and that being used by ZMQ. We'll take a close look and get back very soon. Thank you!
Actually, on another inspection, it could also be that some ZMQ object that is being passed through multiprocessing is not fully reducible (hence the mention of no reduce method).
I was not initially able to reproduce the behavior you described and instead encountered a different error, with a traceback ending in:

```
AttributeError: Can't pickle local object 'Logger._ensure_handler.<locals>.<lambda>'
```

This was encountered against `radical.utils` version 1.47.0. It seemed to be having trouble with the multiple lambdas referenced inside `radical.utils.logger.Logger` instances, so I temporarily modified that package to instead use a function defined at the module level which performed the same task as those lambdas. Having made this change, I was able to successfully reproduce the behavior you described.
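For context, the underlying limitation is that pickle can only serialize a function it can look up by qualified name; a lambda created inside a method is a local object and fails exactly as in the traceback above. A minimal sketch of the problem and of the module-level workaround (class and function names here are illustrative, not the actual `radical.utils` code):

```python
import pickle

class BadLogger:
    def __init__(self):
        # lambda defined inside a method: a local object that
        # pickle cannot look up by qualified name
        self.formatter = lambda record: record

def _passthrough(record):
    # module-level function: picklable via its qualified name
    return record

class GoodLogger:
    def __init__(self):
        self.formatter = _passthrough

try:
    pickle.dumps(BadLogger())
except (AttributeError, pickle.PicklingError) as exc:
    print("BadLogger:", exc)

restored = pickle.loads(pickle.dumps(GoodLogger()))
print("GoodLogger round-trips:", type(restored).__name__)
```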
I suggest changing the `_fork_task()` method in `Server` to a `staticmethod`. That is, in your `radical-pilot-dragon-executor.py` file, modify this:

```python
def _fork_task(self, task):
```

to instead read as:

```python
@staticmethod
def _fork_task(task):
```
I believe the error and complaint from `zmq.backend.cython.context.Context.__reduce_cython__` was because passing a bound method as a target to `multiprocessing.Process` forces the pickling and transmission of the entire `Server` instance (what `_fork_task` is bound to), and those `Server` instances hold references to `zmq.Pipe` and `zmq.Context` objects which, according to the ZeroMQ docs, are not supposed to be serializable. At the least, an unpickled `zmq.Socket` handed to another process should be unusable. I do not understand the details of why ZeroMQ did not similarly object when not using dragon and only using standard `multiprocessing`, but it looks like ZeroMQ may have some tricks to sometimes permit pickling of such objects when `multiprocessing` is involved, in the hope they aren't used after unpickling? I am uncertain about this and my guess could be quite incorrect.
In any event, triggering the serialization of the full instance when you do not really need it is probably worth avoiding. Using a `staticmethod` or otherwise a simple function as the target should be more performant and simpler.
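To illustrate the mechanism with stand-in classes (not the actual radical.pilot or ZMQ code): pickling a bound method serializes its `__self__`, so every attribute of the instance must itself be picklable, whereas a `staticmethod` is just a plain function referenced by name:

```python
import pickle

class FakeContext:
    # stand-in for zmq.Context: refuses serialization outright
    def __reduce__(self):
        raise TypeError("FakeContext objects cannot be pickled")

class Server:
    def __init__(self):
        self.ctx = FakeContext()  # the server holds an unpicklable resource

    def _fork_task(self, task):          # bound method: drags self along
        print("task:", task)

    @staticmethod
    def _fork_task_static(task):         # plain function: no instance needed
        print("task:", task)

srv = Server()

# pickling the bound method pickles the whole Server, which fails on ctx
try:
    pickle.dumps(srv._fork_task)
except TypeError as exc:
    print("bound method:", exc)

# the staticmethod pickles fine: only a module/qualname reference is stored
pickle.dumps(Server._fork_task_static)
print("staticmethod pickles fine")
```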
This feels like it might not be an issue for dragon itself but a nuance of how/when `zmq` objects complain about not liking pickles (er, pickling).
Thanks for your reply! And sorry for being sloppy with the `Logger` serialization - I had that fixed in a local branch along the lines you suggested, and that indeed resolved that specific issue.
The ZMQ channels are indeed not used after the `fork` (it is immediately followed by an `exec`), but I also can't close the channels, as the parent process continues to use them. I'll read a bit into how ZMQ handles the serialization on fork...
I might be misunderstanding what you have in mind but it should be just fine for the parent to continue using the channels after the fork. It's pickling the channels/sockets/contexts to pass them to the child process that is proving problematic.
I was hoping that whatever info you need to provide to `_fork_task` could be passed as arguments to that function rather than needing to pull in the whole of the `Server` instance (in other words, get rid of the `self` but do pass the things you need that will be usable by the child process).
Thanks, understood, I'll give that a try!
The suggested workaround did the trick - thanks!
Attached are two files to reproduce the problem. They should be used like this:
In a second shell, run:
The first shell (the dragon service stub) shows the following stacktrace:
The code works as expected when `USE_DRAGON` is set to `False` before running. Do you have any idea how to address this, or advice on how to further debug? Thank you!
rct_dragon.zip