dkkim93 / meta-mapg

Source code for "A Policy Gradient Algorithm for Learning to Learn in Multiagent Reinforcement Learning" (ICML 2021)
MIT License

TypeError: can't pickle _thread.RLock objects #2

Closed Waiting-TT closed 10 months ago

Waiting-TT commented 1 year ago

When training the model, I encountered the following error.

Traceback (most recent call last):
  File "/home/xxx/ww/meta-mapg-main/main.py", line 163, in <module>
    main(args=args)
  File "/home/xxx/ww/meta-mapg-main/main.py", line 60, in main
    p.start() # TypeError: can't pickle _thread.RLock objects
  File "/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/process.py", line 105, in start
    self._popen = self._Popen(self)
  File "/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/context.py", line 284, in _Popen
    return Popen(process_obj)
  File "/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 32, in __init__
    super().__init__(process_obj)
  File "/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
    self._launch(process_obj)
  File "/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/popen_spawn_posix.py", line 47, in _launch
    reduction.dump(process_obj, fp)
  File "/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/reduction.py", line 60, in dump
    ForkingPickler(file, protocol).dump(obj)
TypeError: can't pickle _thread.RLock objects

Could you please tell me how to solve this problem? Thanks a lot!

dkkim93 commented 1 year ago

Hello! I appreciate your interest in our paper. :-) I just ran the code on a GCP server with virtualenv (python version 3.6) and did not experience the above issue.

Instead of conda, would it be possible to try again with virtualenv (python version 3.6)? Additionally, could you share the requirements.txt file from your virtual environment with me and check whether the versions match the ones in this file?

The issue might be caused by a version mismatch between libraries. Thanks!

Waiting-TT commented 1 year ago

Thank you so much! This is my environment: requirements.txt

I found that "shared_meta_agent" and "log" in main.py line 48 can't be pickled. I wonder whether this is the cause, and how to solve it. Thanks again!

absl-py==1.4.0
cachetools==4.2.4
certifi==2021.5.30
cffi==1.15.1
charset-normalizer==2.0.12
Cython==0.29.34
dataclasses==0.8
distlib==0.3.6
fasteners==0.18
filelock==3.4.1
gitdb==4.0.9
gitdb2==4.0.2
GitPython==3.0.8
glfw==2.5.9
google-auth==2.17.3
google-auth-oauthlib==0.4.6
grpcio==1.48.2
gym==0.12.5
idna==3.4
imageio==2.15.0
importlib-metadata==4.8.3
importlib-resources==5.4.0
Markdown==3.3.7
mujoco-py==2.1.2.14
numpy==1.19.5
oauthlib==3.2.2
Pillow==8.4.0
platformdirs==2.4.0
protobuf==3.19.6
pyasn1==0.5.0
pyasn1-modules==0.3.0
pycparser==2.21
pyglet==2.0.5
PyYAML==3.12
requests==2.27.1
requests-oauthlib==1.3.1
rsa==4.9
scipy==1.5.4
six==1.16.0
smmap==5.0.0
tensorboard==2.10.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorboardX==1.2
torch==1.4.0
typing_extensions==4.1.1
urllib3==1.26.15
virtualenv==20.17.1
Werkzeug==2.0.3
zipp==3.6.0
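
For anyone hitting the same error, the observation that "shared_meta_agent" and "log" fail to pickle can be checked with a small helper like the one below. This is only a sketch using the standard library: find_unpicklable and FakeLogger are hypothetical names, not part of meta-mapg, and FakeLogger merely imitates an object that keeps a threading.RLock internally (which is one plausible reason a logger-like object would fail to pickle).

```python
import pickle
import threading


def find_unpicklable(named_objects):
    """Return the names of the objects that pickle cannot serialize."""
    failed = []
    for name, obj in named_objects.items():
        try:
            pickle.dumps(obj)
        except (TypeError, AttributeError, pickle.PicklingError):
            failed.append(name)
    return failed


class FakeLogger:
    """Hypothetical stand-in for an object that holds a lock internally."""

    def __init__(self):
        # Locks cannot be pickled, so any object that carries one
        # in its attributes cannot be pickled either.
        self.lock = threading.RLock()


if __name__ == "__main__":
    # Stand-ins for the arguments passed to a worker process.
    candidates = {"rank": 0, "log": FakeLogger()}
    print(find_unpicklable(candidates))
```

In main.py, the same helper could be pointed at the real arguments given to the process (for example a dict holding "shared_meta_agent" and "log") to see which of them triggers the TypeError.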

Waiting-TT commented 1 year ago

My Python version is 3.6.5.

dkkim93 commented 1 year ago

In our paper, we use distributed training to speed up the meta-optimization: the shared_meta_agent is shared between multiple processes and updated asynchronously. The issue you encountered is related to this multiprocessing part.

I would like to ask the following questions to better understand why our code is not working in your environment:

  1. We tested our code on a Linux server (Ubuntu 20.04). Which OS are you using?
  2. Judging from the error message above, the pickle issue arises directly from Python's multiprocessing library (/home/xxx/anaconda3/envs/mapg/lib/python3.6/multiprocessing/process.py) and does not go through PyTorch's multiprocessing library. In main.py, could you double-check that you are using import torch.multiprocessing as mp rather than import multiprocessing as mp? Because we share the PyTorch model across processes, we need torch.multiprocessing.
  3. Lastly, the distributed training part is based on the popular A3C code (repository). Could you double-check whether your environment can run that A3C code? Like our code, it uses torch.multiprocessing (link) and share_memory (link) to enable distributed training.

Thanks!
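
A quick way to answer the first two questions is to print which module is bound to mp and which start method it uses. The traceback passes through popen_spawn_posix.py, which suggests the "spawn" start method is active; under "spawn" the process object and its arguments are pickled when p.start() is called, which would explain why an unpicklable argument fails there. The sketch below uses only the standard library, and describe_mp is a hypothetical helper. It assumes that the module passed in exposes get_start_method and get_all_start_methods the way the standard multiprocessing module does.

```python
import multiprocessing


def describe_mp(mp_module):
    """Report which module is bound to `mp` and which start method it uses."""
    return {
        "module": mp_module.__name__,
        # Note: calling get_start_method() fixes the default if none was set yet.
        "start_method": mp_module.get_start_method(),
        "available": mp_module.get_all_start_methods(),
    }


if __name__ == "__main__":
    # In main.py, one would pass the `mp` alias used there instead.
    info = describe_mp(multiprocessing)
    for key, value in info.items():
        print(f"{key}: {value}")
```

If the reported start method is "spawn" on Linux, comparing against a run with "fork" may help narrow down whether pickling of the process arguments is the cause.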

dkkim93 commented 10 months ago

I will close this issue :) If this issue remains, please feel free to re-open. Thank you.