Closed jeappen closed 3 years ago
Hi @yokian. I'll look at this more carefully. A framework limitation: I see a ZMQ "address in use" error. CSAF cannot support multiple environments on the same machine unless they are managed by a parallel runner, because the configuration specifies which port each component uses, so multiple instances of systems with overlapping ports will fail. Is this relevant to your setup?
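To make the failure mode concrete, here is a minimal stdlib sketch (plain TCP sockets standing in for the ZMQ sockets; the port is hypothetical) of two environment instances binding the same configured port:

```python
import socket

# first "environment" binds a port; the OS picks a free one here,
# but in CSAF this number would come from the hard-coded system config
s1 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s1.bind(("127.0.0.1", 0))
port = s1.getsockname()[1]

# second "environment" reads the same config and tries the same port
s2 = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    s2.bind(("127.0.0.1", port))
    conflict = False
except OSError:  # EADDRINUSE -- same failure mode as the ZMQ "address in use" error
    conflict = True

print(conflict)  # True
s1.close()
s2.close()
```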
Reproducing the errors, I think we are indeed seeing the architectural limitation of hard-coding ports in the system config. A path forward could be to rewrite the system config objects so that there are no port overlaps. Does this sound reasonable / correct, @yokian?
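As a rough illustration of that path (the function and key names are hypothetical, not CSAF's actual config API), one could offset every port in a system config by a per-instance stride so parallel environments get disjoint port ranges:

```python
def offset_ports(config: dict, instance_index: int, stride: int = 100) -> dict:
    """Return a copy of the config with every *_port entry shifted by
    instance_index * stride, so parallel instances never share a port."""
    out = dict(config)
    for key, value in config.items():
        if key.endswith("_port"):
            out[key] = value + instance_index * stride
    return out

base = {"pub_port": 5501, "sub_port": 5502, "log_level": "info"}
print(offset_ports(base, instance_index=2))
# {'pub_port': 5701, 'sub_port': 5702, 'log_level': 'info'}
```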
For race_condition_error2.txt (the ZMQ error), yes, that does sound like a fix. However, I'm still not sure why it sometimes works (with reasonable results), which suggests that multiple environments can occasionally run in parallel without any apparent port conflict.
Also, w.r.t. race_condition_error1.txt, there seem to be conflicts between multiple environments over the automatically generated *.py files (I suspect the pyros-genpy library). This pops up fairly often as well, and I'm not sure fixing the ports alone would rectify it.
In the same way that one can edit ports in the config object, one can edit the codec directory here so that the files don't conflict.
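For example (a hypothetical sketch, not CSAF's actual API), each worker could derive its own codec directory from its process ID so the genpy-generated *.py files of parallel environments never collide:

```python
import os
import tempfile

def worker_codec_dir(base_dir: str) -> str:
    """Create and return a per-worker codec directory keyed by PID,
    so parallel workers write their generated files to disjoint paths."""
    path = os.path.join(base_dir, f"codec_{os.getpid()}")
    os.makedirs(path, exist_ok=True)
    return path

d = worker_codec_dir(tempfile.gettempdir())
print(os.path.isdir(d))  # True
```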
Another way to avoid needing a separate codec directory for each worker is to assume that generated files with the same name can be reused. However, one would then have to manually empty the codec folder whenever the .msg file changes, to force genpy to regenerate the files. This minor edit to generate_serializer() in rosmsg.py seems to fix race_condition_error1.txt for me:
import importlib.util
import os
import pathlib

import genpy.generator


def generate_serializer(msg_filepath: str, output_dir: str, package_name="csaf"):
    """generate a rosmsg class serializer/deserializer given a rosmsg .msg file

    :param msg_filepath: path to .msg file
    :param output_dir: path to place serializer/deserializer
    :param package_name: name of ros package
    :return: generated message class -- asserts that return code of message generator is error free
    """
    output_python_file = os.path.join(output_dir, f"_{pathlib.Path(msg_filepath).stem}.py")
    # only regenerate when the file is missing, so parallel workers reuse the
    # existing serializer (empty the codec folder manually when the .msg changes)
    if not os.path.exists(output_python_file):
        # check arguments
        assert os.path.exists(msg_filepath)
        assert os.path.exists(output_dir)
        # see https://github.com/ros/genpy/blob/kinetic-devel/src/genpy/genpy_main.py
        gen = genpy.generator.MsgGenerator()
        retcode = gen.generate_messages(package_name, [msg_filepath], output_dir, {})
        # assert that return from generator is good
        assert retcode == 0
    # import the generated code
    assert os.path.exists(output_python_file)
    spec = importlib.util.spec_from_file_location(pathlib.Path(output_python_file).stem,
                                                  output_python_file)
    python_module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(python_module)
    # load the message class and return it
    class_name = pathlib.Path(msg_filepath).stem
    class_ = getattr(python_module, class_name)
    return class_
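One caveat (my assumption, not something observed in the tracebacks): the exists-check above still leaves a small window in which two workers can both see the file missing and invoke genpy concurrently. A stdlib sketch of an exclusive-create lock file that would close that window:

```python
import errno
import os
import tempfile
import time

def acquire_lock(lock_path: str, timeout: float = 5.0) -> bool:
    """Try to exclusively create lock_path; return True once we own it,
    False if another process holds it past the timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # O_CREAT | O_EXCL makes creation atomic: exactly one process wins
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            os.close(fd)
            return True
        except OSError as e:
            if e.errno != errno.EEXIST:
                raise
            time.sleep(0.05)  # lock held by another worker; retry
    return False

lock = os.path.join(tempfile.gettempdir(), "csaf_codec_example.lock")
if acquire_lock(lock):
    try:
        pass  # run gen.generate_messages(...) here, guarded across workers
    finally:
        os.remove(lock)
```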
That's a good point. We will keep this in mind as we think about the CSAF architecture moving forward.
In the meantime, has this issue been addressed well enough for you to continue your development?
Yes, after the edit to rosmsg.py I believe so. The ZMQ error (race_condition_error2.txt) does not pop up as frequently as the genpy error (race_condition_error1.txt), which seems to be fixed. Thank you for your help!
Perfect! Closing this for now; feel free to re-open when appropriate.
@yokian It seems this bug has been fixed. Are you still running into it with the updated version?
I've been using Ray's RLlib to train a multi-agent version of the Dubins rejoin environment. Please find attached two simple scripts that demonstrate the two errors I've frequently run into.
Prerequisites:
After downloading these files (error tracebacks included), run the learning algorithm on a PC with 5+ cores in the Docker environment:
python3 train_MAEnv.py --num_cpus 5 --algorithm PPO --env navenv_inlineDS --num_agents 4 --num_workers_per_device 2 --horizon 100 --train_batch_size 1600
Increasing the num_cpus parameter seems to correlate with the occurrence of the error(s).