Error Randomly thrown in Multiple Simultaneous Environment runs (Possible Race Condition) - Githubissues

GaloisInc / csaf

Control Systems Analysis Framework - a framework to minimize the effort required to evaluate, implement, and verify controller design (classical and learning enabled) with respect to the system dynamics.

BSD 3-Clause "New" or "Revised" License

11 stars 4 forks source link

Error Randomly thrown in Multiple Simultaneous Environment runs (Possible Race Condition) #110

Closed jeappen closed 3 years ago

jeappen commented 3 years ago

I've been using Ray's RLLib to train a multi agent version of the Dubin's rejoin environment. PFA two simple scripts to demonstrate the 2 errors I've frequently run into.

Prerequisites:

Either add ray[rllib] (ray==1.5.1) and torch==1.9.0 into the requirements.txt 
or pip install these libraries via `docker exec -it <your docker container id from docker stats> /bin/bash`

After downloading these files (error traceback included) , run the learning algorithm on a PC with 5+ cores in the docker environment

python3 train_MAEnv.py --num_cpus 5 --algorithm PPO --env navenv_inlineDS --num_agents 4 --num_workers_per_device 2 --horizon 100 --train_batch_size 1600

Increasing the num_cpus parameter seems correlate with the occurrence of the error(s).

EthanJamesLew commented 3 years ago

Hi @yokian . I'll look at this more carefully. A framework limitation: I see a ZMQ address in use error. CSAF cannot support multiple environments on the same machine unless they are managed by a parallel runner. This is because the configuration specifies which port to use, and so multiple instances of systems that have overlapping ports will fail. Is this relevant to your setup?

EthanJamesLew commented 3 years ago

Reproducing the errors, I think that we are indeed seeing the architectural limitation of hard coding ports in the system config. A path forward could be to rewrite the system config objects so that there are no port overlaps. Does this sound reasonable / correct, @yokian ?

jeappen commented 3 years ago

For race_condition_error2.txt (the ZMQ error), yes that does sound like a fix. However, I'm still not sure how it seems to work sometimes (with reasonable results) showing that the multiple environments can occasionally work in parallel without any apparent port conflict.

Also, w.r.t. race_conditionerror1.txt, it seems that there might be conflicts between multiple environments with the automatically generated *.py files (I suspect the pyros-genpy library). This pops up fairly often as well and I'm not sure fixing the ports alone would rectify it.

EthanJamesLew commented 3 years ago

Also, w.r.t. race_conditionerror1.txt, it seems that there might be conflicts between multiple environments with the automatically generated *.py files (I suspect the pyros-genpy library). This pops up fairly often as well and I'm not sure fixing the ports alone would rectify it.

In the same way that one can edit ports in the config object, one can edit the codec directory here so that the files don't conflict.

jeappen commented 3 years ago

Another way to prevent needing separate codec directories for each worker is to assume the generated files with the same name can be reused. One would then however have to manually empty the codec folder every time the .msg file changes to force genpy to update the generated files. This minor edit to generate_serializer() in rosmsg.py seems to fix race_condition_error1.txt for me.

def generate_serializer(msg_filepath: str, output_dir: str, package_name="csaf"):
"""generate a rosmsg class serializer/deserializer given a rosmsg .msg file
:param msg_filepath: path to .msg file
:param output_dir: path to place serializer/deserializer
:param package_name: name of ros package
:return: None -- asserts that return code of message generator is error free
"""
output_python_file = os.path.join(output_dir, f"_{pathlib.Path(msg_filepath).stem}.py")
if not os.path.exists(output_python_file):
    # check arguments
    assert os.path.exists(msg_filepath)
    assert os.path.exists(output_dir)

    # see https://github.com/ros/genpy/blob/kinetic-devel/src/genpy/genpy_main.py
    gen = genpy.generator.MsgGenerator()
    retcode = gen.generate_messages(package_name, [msg_filepath], output_dir, {})

    # assert that return from generator is good
    assert retcode == 0

    # import the generated code
    assert os.path.exists(output_python_file)
spec = importlib.util.spec_from_file_location(pathlib.Path(output_python_file).stem,
                                              output_python_file)
python_module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(python_module)

# load an instance and return it
class_name = pathlib.Path(msg_filepath).stem
class_ = getattr(python_module, class_name)
return class_

EthanJamesLew commented 3 years ago

That's a good point. We will keep this in mind as we think about the CSAF architecture moving on.

Currently, has this issue been addressed for you to continue your development?

jeappen commented 3 years ago

Yes, after the edit to rosmsg.py I believe so . The ZMQ Error (race_condition_error2.txt) does not pop up as frequently as the genpy error (race_condition_error1.txt) which seems to be fixed. Thank you for your help!

EthanJamesLew commented 3 years ago

Perfect! Closing this for now--feel free to re-open when appropriate.

zutshi commented 3 years ago

@yokian It seems this bug has been fixed. Are you still running into it with the updated version?