PaddlePaddle / PARL

A high-performance distributed training framework for Reinforcement Learning
https://parl.readthedocs.io/
Apache License 2.0

Segmentation fault (core dumped) #423

Closed Gaoee closed 3 years ago

Gaoee commented 3 years ago

I get the following error when running the torch version of the algorithm with multiple processes:

/home/exp/anaconda3/envs/universal_picking_torch/lib/python3.7/importlib/_bootstrap.py:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
  return f(*args, **kwds)
/home/exp/envs/robosuite/robosuite/models/tasks/placement_sampler.py:93: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
  elif isinstance(self.z_rotation, collections.Iterable):
Found 1 GPUs for rendering. Using device 0.
pybullet build time: Jun  2 2020 06:48:23
/home/exp/anaconda3/envs/universal_picking_torch/lib/python3.7/site-packages/gym/logger.py:30: UserWarning: WARN: Box bound precision lowered by casting to float32
  warnings.warn(colorize('%s: %s'%('WARN', msg % args), 'yellow'))
[10-08 11:11:18 MainThread @machine_info.py:86] nvidia-smi -L found gpu count: 1
[10-08 11:11:18 MainThread @machine_info.py:86] nvidia-smi -L found gpu count: 1
[10-08 11:11:18 MainThread @client.py:410] Remote actors log url: http://192.168.16.135:55043/logs?client_id=192.168.16.135_40859_1602126678
[10-08 11:11:18 MainThread @train.py:155] Waiting for 3 remote actors to connect.
[10-08 11:11:18 MainThread @train.py:165] Remote actor count: 1
[10-08 11:11:18 MainThread @train.py:165] Remote actor count: 2
[10-08 11:11:18 MainThread @train.py:165] Remote actor count: 3
[10-08 11:11:18 MainThread @train.py:172] All remote actors are ready, begin to learn.
[10-08 11:13:52 MainThread @train.py:267] Epoch 0, Evaluate reward: -200.0, success: 0.0
[10-08 11:13:52 MainThread @visualdl.py:34] WRN [VisualDL] logdir is None, will save VisualDL files to train_log/train
View the data using: visualdl --logdir=./train_log/train --host=192.168.16.135
[10-08 11:13:52 MainThread @train.py:350] {'epoch': 0, 'stats_g_mean': 0.47275138, 'stats_g_std': 0.038607508, 'stats_o_mean': 0.08620722, 'stats_o_std': 0.24926318, 'tests_mean_ep_rew': -200.0, 'tests_success_rate': 0.0, 'train_mean_ep_rew': -199.98533529142853, 'train_success_rate': 0.0}
Segmentation fault (core dumped)

The Python version is 3.7.7.

The environment is as follows:

Package              Version         Location                                                   
-------------------- --------------- -----------------------------------------------------------
absl-py              0.9.0           
appdirs              1.4.4           
astor                0.8.1           
atari-py             0.2.6           
Babel                2.8.0                                       
certifi              2020.4.5.1      
cffi                 1.14.0          
cfgv                 3.1.0           
chardet              3.0.4           
click                7.1.2           
cloudpickle          1.2.1           
cycler               0.10.0          
Cython               0.29.16         
distlib              0.3.1           
fasteners            0.15            
filelock             3.0.12          
flake8               3.8.3           
Flask                1.1.2           
Flask-Babel          1.0.0           
Flask-Cors           3.0.8           
future               0.18.2          
gast                 0.3.3           
glfw                 1.11.0          
google-pasta         0.2.0           
grpcio               1.29.0          
gym                  0.15.7          
h5py                 2.10.0          
identify             1.4.25          
idna                 2.10            
imageio              2.8.0           
importlib-metadata   1.6.1           
importlib-resources  3.0.0           
itsdangerous         1.1.0           
Jinja2               2.11.2          
joblib               0.15.1          
Keras-Applications   1.0.8           
Keras-Preprocessing  1.1.2           
kiwisolver           1.2.0           
lockfile             0.12.2          
Markdown             3.2.2           
MarkupSafe           1.1.1           
matplotlib           3.2.2           
mccabe               0.6.1           
monotonic            1.5             
mpi4py               3.0.3           
mujoco-py            2.0.2.2         
nodeenv              1.4.0           
numpy                1.18.5          
opencv-python        4.2.0.34        
pandas               1.0.5           
parl                 1.3.2           /home/exp/Codes/PARL                                       
Pillow               7.1.2           
pip                  20.0.2          
pre-commit           2.6.0           
protobuf             3.12.2          
psutil               5.7.2           
pyarrow              0.13.0          
pybullet             2.8.1           
pycodestyle          2.6.0           
pycparser            2.20            
pyflakes             2.2.0           
pyglet               1.5.0           
pyparsing            2.4.7           
python-dateutil      2.8.1           
pytz                 2020.1          
PyYAML               5.3.1           
pyzmq                18.0.1          
requests             2.24.0                                    
scipy                1.4.1           
setuptools           47.1.1          
six                  1.15.0          
tb-nightly           1.15.0a20190801 
tensorboard          1.14.0          
tensorboardX         1.8             
tensorflow-estimator 1.14.0          
tensorflow-gpu       1.14.0          
termcolor            1.1.0           
toml                 0.10.1          
torch                1.6.0           
torchvision          0.7.0           
tqdm                 4.46.1          
urllib3              1.25.10         
virtualenv           20.0.28         
visualdl             2.0.0b8         
Werkzeug             1.0.1           
wheel                0.29.0          
wrapt                1.12.1          
zipp                 3.1.0   

I'm not sure whether this is a Python version issue; running with Python 3.6.2 does not produce the error.

Gaoee commented 3 years ago

It does seem to be a Python version issue: after switching to 3.6.2 the problem no longer occurs, but I don't know the cause.

TomorrowIsAnOtherDay commented 3 years ago

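Thanks for the feedback :) Could you share your OS version information? We'll try to track down the problem. Our guess is that some third-party dependency has poor compatibility with Python 3.7.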

Gaoee commented 3 years ago

No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 16.04.6 LTS
Release:        16.04
Codename:       xenial

TomorrowIsAnOtherDay commented 3 years ago

Got it :) We'll try to reproduce it locally.

rical730 commented 3 years ago

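We ran the example code under PARL/benchmark/torch/a2c on an Ubuntu machine with Python 3.7.7, and it runs normally. You can refer to our A2C example when writing your multi-threaded training program. Alternatively, if it is convenient, you could tidy up your code and send it to us so that we can reproduce and locate the problem.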

Gaoee commented 3 years ago

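I reran it today, and the Python 3.6 environment now has the problem too, although Python 3.6 works fine on another machine. The code is not easy to share, so below is the error I printed with faulthandler. I'm not sure whether it helps locate the problem: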

[10-12 14:07:33 MainThread @train.py:270] Epoch 2, Evaluate reward: -200.0, success: 0.0
[10-12 14:07:33 MainThread @train.py:353] {'epoch': 2, 'stats_g_mean': 0.46897718, 'stats_g_std': 0.04633944, 'stats_o_mean': 0.08383601, 'stats_o_std': 0.25559673, 'tests_mean_ep_rew': -200.0, 'tests_success_rate': 0.0, 'train_mean_ep_rew': -199.80116071428571, 'train_success_rate': 0.0}
Fatal Python error: Segmentation fault

Thread 0x00007fb1cf73a700 (most recent call first):
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 299 in wait
  File "/home/exp/anaconda3/envs/**/lib/python3.6/queue.py", line 173 in get
  File "/home/exp/anaconda3/envs/**/lib/python3.6/site-packages/visualdl/writer/record_writer.py", line 171 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fb1edffb700 (most recent call first):
  File "/home/exp/Codes/PARL/parl/remote/client.py", line 303 in _create_job_monitor
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 864 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fb1ee7fc700 (most recent call first):
  File "/home/exp/Codes/PARL/parl/remote/client.py", line 303 in _create_job_monitor
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 864 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fb1eeffd700 (most recent call first):
  File "/home/exp/Codes/PARL/parl/remote/client.py", line 303 in _create_job_monitor
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 864 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fb1ef7fe700 (most recent call first):
  File "/home/exp/Codes/PARL/parl/remote/client.py", line 303 in _create_job_monitor
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 864 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fb1effff700 (most recent call first):
  File "/home/exp/Codes/PARL/parl/remote/client.py", line 303 in _create_job_monitor
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 864 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fb1feffd700 (most recent call first):
  File "/home/exp/Codes/PARL/parl/remote/client.py", line 303 in _create_job_monitor
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 864 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fb1ff7fe700 (most recent call first):
  File "/home/exp/Codes/PARL/parl/utils/communication.py", line 60 in dumps_argument
  File "/home/exp/Codes/PARL/parl/remote/remote_decorator.py", line 230 in wrapper
  File "train.py", line 190 in get_remote_gradient
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 864 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fb1fffff700 (most recent call first):
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 295 in wait
  File "/home/exp/anaconda3/envs/**/lib/python3.6/queue.py", line 164 in get
  File "train.py", line 189 in get_remote_gradient
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 864 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fb230981700 (most recent call first):
  File "/home/exp/Codes/PARL/parl/remote/client.py", line 303 in _create_job_monitor
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 864 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fb231182700 (most recent call first):
  File "/home/exp/Codes/PARL/parl/utils/communication.py", line 60 in dumps_argument
  File "/home/exp/Codes/PARL/parl/remote/remote_decorator.py", line 230 in wrapper
  File "train.py", line 190 in get_remote_gradient
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 864 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fb231983700 (most recent call first):
  File "/home/exp/Codes/PARL/parl/utils/communication.py", line 60 in dumps_argument
  File "/home/exp/Codes/PARL/parl/remote/remote_decorator.py", line 230 in wrapper
  File "train.py", line 190 in get_remote_gradient
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 864 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fb276545700 (most recent call first):
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 295 in wait
  File "/home/exp/anaconda3/envs/**/lib/python3.6/queue.py", line 164 in get
  File "train.py", line 189 in get_remote_gradient
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 864 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Current thread 0x00007fb278d46700 (most recent call first):
  File "/home/exp/anaconda3/envs/**/lib/python3.6/site-packages/pyarrow/serialization.py", line 265 in _serialize_ordered_dict
  File "/home/exp/Codes/PARL/parl/utils/communication.py", line 60 in dumps_argument
  File "/home/exp/Codes/PARL/parl/remote/remote_decorator.py", line 230 in wrapper
  File "train.py", line 190 in get_remote_gradient
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 864 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fb1db1c1700 (most recent call first):
  File "/home/exp/Codes/PARL/parl/utils/communication.py", line 60 in dumps_argument
  File "/home/exp/Codes/PARL/parl/remote/remote_decorator.py", line 230 in wrapper
  File "train.py", line 190 in get_remote_gradient
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 864 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fb1da9c0700 (most recent call first):
  File "/home/exp/anaconda3/envs/**/lib/python3.6/site-packages/zmq/sugar/socket.py", line 470 in recv_multipart
  File "/home/exp/Codes/PARL/parl/remote/client.py", line 226 in _reply_heartbeat
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 864 in run
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 916 in _bootstrap_inner
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 884 in _bootstrap

Thread 0x00007fb30dc82700 (most recent call first):
  File "/home/exp/anaconda3/envs/**/lib/python3.6/threading.py", line 295 in wait
  File "/home/exp/anaconda3/envs/**/lib/python3.6/queue.py", line 164 in get
  File "train.py", line 220 in step
  File "train.py", line 422 in <module>
Segmentation fault (core dumped)
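
A per-thread dump like the one above can be captured with Python's standard faulthandler module. The following is a minimal sketch, not the reporter's actual setup: the helper name enable_crash_dumps and the log file name are hypothetical, and it writes to a file rather than the console so that the dump survives when stderr is redirected.

```python
import faulthandler


def enable_crash_dumps(log_path="segfault_traceback.log"):
    """Dump the Python stack of every thread to log_path on a fatal signal.

    Hypothetical helper for illustration. faulthandler keeps using the file
    descriptor, so the file must stay open while the handler is enabled;
    the handle is returned so the caller can hold on to it.
    """
    log_file = open(log_path, "w")
    # all_threads=True produces one "Thread 0x..." section per thread.
    faulthandler.enable(file=log_file, all_threads=True)
    return log_file
```

Calling enable_crash_dumps() once near the top of the training script should be enough. Calling faulthandler.enable() with no arguments, or starting the interpreter with python -X faulthandler, sends the same dump to stderr instead.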

TomorrowIsAnOtherDay commented 3 years ago

/home/exp/anaconda3/envs/**/lib/python3.6/site-packages/pyarrow/serialization.py

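Sorry, I was busy with a project deadline these past two days and forgot to check my email. This looks like it is caused by serialization. When running in parallel, xparl sends the inputs and outputs of functions over the network to different nodes, which requires serialization. However, some types cannot be serialized. For example, a protobuf message cannot be serialized a second time through pyarrow. Please check whether the inputs and outputs of your functions are all common types.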
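
One way to carry out the check suggested above is to walk the arguments and list every value that is not a plain Python or NumPy type. This is a rough sketch under assumptions: find_uncommon_leaves is a hypothetical helper, not part of PARL or pyarrow, and the allowlist is only a guess at what "common types" covers here. NumPy objects are recognized by their module name so that the snippet has no dependencies.

```python
def find_uncommon_leaves(obj, path="args"):
    """Return (path, type name) pairs for leaves outside a small allowlist.

    Hypothetical helper: the allowlist below is an assumption, not the set
    of types that pyarrow or PARL actually supports.
    """
    plain = (type(None), bool, int, float, complex, str, bytes)
    # NumPy arrays and scalars are detected by the module of their type.
    if isinstance(obj, plain) or type(obj).__module__ == "numpy":
        return []
    if isinstance(obj, dict):
        found = []
        for key, value in obj.items():
            found += find_uncommon_leaves(value, f"{path}[{key!r}]")
        return found
    if isinstance(obj, (list, tuple)):
        found = []
        for index, value in enumerate(obj):
            found += find_uncommon_leaves(value, f"{path}[{index}]")
        return found
    # Anything else (custom classes, protobuf messages, GPU tensors, ...).
    return [(path, type(obj).__name__)]
```

An empty result only means that nothing outside the allowlist was found; it does not prove that serialization will succeed.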

TomorrowIsAnOtherDay commented 3 years ago

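Also, this does not mean that you cannot parallelize with xparl. For example, the data of a GPU tensor in torch lives in GPU memory and cannot be serialized and distributed directly, but you can first convert it to a common data type with tensor.cpu().numpy() and then send it out.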
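
The conversion described above can be applied to a whole nested structure, such as a dictionary of gradients, before it is returned from a remote function. This is a minimal sketch under assumptions: to_numpy_tree is a hypothetical helper, not a PARL API. Tensors are detected by duck typing (objects that have detach, cpu and numpy methods) so the snippet does not import torch, and detach() is added because numpy() fails on tensors that require grad.

```python
from collections import OrderedDict


def to_numpy_tree(obj):
    """Recursively replace tensor-like leaves with the result of .numpy().

    Hypothetical helper for illustration; containers other than dict,
    list and tuple are returned unchanged.
    """
    if isinstance(obj, dict):
        converted = [(key, to_numpy_tree(value)) for key, value in obj.items()]
        # Preserve key order and the OrderedDict type if that was the input.
        return OrderedDict(converted) if isinstance(obj, OrderedDict) else dict(converted)
    if isinstance(obj, list):
        return [to_numpy_tree(value) for value in obj]
    if isinstance(obj, tuple):
        return tuple(to_numpy_tree(value) for value in obj)
    # Duck-typed check that matches torch.Tensor without importing torch.
    if all(hasattr(obj, name) for name in ("detach", "cpu", "numpy")):
        return obj.detach().cpu().numpy()
    return obj
```

With PyTorch installed, passing the state_dict() of a model through this helper should give an OrderedDict of NumPy arrays that can be sent between processes.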

Gaoee commented 3 years ago

The faulthandler log above was produced in a Python 3.6 environment. The full environment is as follows:

Package             Version             Location
------------------- ------------------- -------------------------------------------------------------
absl-py             0.10.0
appdirs             1.4.4
Babel               2.8.0
bce-python-sdk      0.8.46
certifi             2020.6.20
cffi                1.14.3
cfgv                3.2.0
chardet             3.0.4
click               7.1.2
cloudpickle         1.2.1
Cython              0.29.21
distlib             0.3.1
filelock            3.0.12
flake8              3.8.4
Flask               1.1.2
Flask-Babel         2.0.0
Flask-Cors          3.0.9
future              0.18.2
glfw                2.0.0
grpcio              1.32.0
gym                 0.17.3
h5py                2.10.0
identify            1.5.5
idna                2.10
imageio             2.9.0
importlib-metadata  2.0.0
importlib-resources 3.0.0
itsdangerous        1.1.0
Jinja2              2.11.2
lockfile            0.12.2
Markdown            3.3
MarkupSafe          1.1.1
mccabe              0.6.1
mujoco-py           2.0.2.2
nodeenv             1.5.0
numpy               1.19.2
opencv-python       4.4.0.44
parl                1.3.2               /home/exp/Codes/PARL
Pillow              7.2.0
pip                 20.2.3
pre-commit          2.7.1
protobuf            3.13.0
psutil              5.7.2
pyarrow             0.17.1
pybullet            2.8.1
pycodestyle         2.6.0
pycparser           2.20
pycryptodome        3.9.8
pyflakes            2.2.0
pyglet              1.5.0
pytz                2020.1
PyYAML              5.3.1
pyzmq               18.0.1
requests            2.24.0
scipy               1.5.2
setuptools          50.3.0.post20201006
six                 1.15.0
tb-nightly          1.15.0a20190801
tensorboardX        1.8
termcolor           1.1.0
toml                0.10.1
torch               1.6.0
tqdm                4.50.1
urllib3             1.25.10
virtualenv          20.0.33
visualdl            2.0.0b8
Werkzeug            1.0.1
wheel               0.35.1
zipp                3.3.0

After I changed pyarrow from 0.17.1 to 0.13.0, the problem seems to have gone away (no failures so far), but I don't know the exact cause.

Gaoee commented 3 years ago

I checked my code. All the data being passed is NumPy data, so it should be fine.

TomorrowIsAnOtherDay commented 3 years ago

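Thanks for the feedback. pyarrow really is a big problem. Although it is managed by the Apache open-source foundation, its pursuit of cross-platform compatibility means the Python side often runs into inexplicable problems. We will replace it in a later release.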

TomorrowIsAnOtherDay commented 3 years ago

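We will fix the compatibility and robustness issues together at the end of the year. Thank you for using PARL and for the feedback; it helps us a lot :) Best wishes.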

Gaoee commented 3 years ago

Thank you very much for your answers. I hope this platform keeps getting better.

Gaoee commented 3 years ago

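This problem seems to be caused by the versions of some libraries. After updating to 1.4, the Segmentation fault (core dumped) problem still occurs. With the environment below, the problem no longer appears: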

Package              Version         Location                               
-------------------- --------------- ---------------------------------------
absl-py              0.9.0           
appdirs              1.4.4           
astor                0.8.1           
atari-py             0.2.6           
Babel                2.8.0                  
certifi              2020.6.20       
cffi                 1.14.0          
cfgv                 3.1.0           
chardet              3.0.4           
click                7.1.2           
cloudpickle          1.6.0           
cycler               0.10.0          
Cython               0.29.19         
distlib              0.3.1           
fasteners            0.15            
filelock             3.0.12          
flake8               3.8.3           
Flask                1.1.2           
Flask-Babel          1.0.0           
Flask-Cors           3.0.8           
future               0.18.2          
gast                 0.3.3           
glfw                 1.11.2          
google-pasta         0.2.0           
grpcio               1.29.0          
gym                  0.15.7          
h5py                 2.10.0          
identify             1.4.25          
idna                 2.10            
imageio              2.8.0           
importlib-metadata   1.6.1           
importlib-resources  3.0.0           
itsdangerous         1.1.0           
Jinja2               2.11.2          
joblib               0.15.1          
Keras-Applications   1.0.8           
Keras-Preprocessing  1.1.2           
kiwisolver           1.2.0           
lockfile             0.12.2          
Markdown             3.2.2           
MarkupSafe           1.1.1           
matplotlib           3.2.2           
mccabe               0.6.1           
monotonic            1.5             
mpi4py               3.0.3           
mujoco-py            2.0.2.10        
nodeenv              1.4.0           
numpy                1.18.5          
opencv-python        4.2.0.34        
pandas               1.0.5           
parl                 1.4             /home/exp/Codes/PARL1.4                
Pillow               7.1.2           
pip                  20.2.3          
pre-commit           2.6.0           
protobuf             3.14.0          
psutil               5.7.2           
pyarrow              0.17.1          
pybullet             2.8.1           
pycodestyle          2.6.0           
pycparser            2.20            
pyflakes             2.2.0           
pyglet               1.5.0           
pyparsing            2.4.7           
python-dateutil      2.8.1           
pytz                 2020.1          
PyYAML               5.3.1           
pyzmq                18.1.1          
requests             2.24.0                     
scipy                1.4.1           
setuptools           47.1.1          
six                  1.15.0          
tb-nightly           1.15.0a20190801 
tensorboard          1.14.0          
tensorboardX         1.8             
tensorflow-estimator 1.14.0          
tensorflow-gpu       1.14.0          
termcolor            1.1.0           
toml                 0.10.1          
torch                1.6.0           
torchvision          0.7.0           
tqdm                 4.46.1          
urllib3              1.25.10         
virtualenv           20.0.28         
visualdl             2.0.0b8         
Werkzeug             1.0.1           
wheel                0.29.0          
wrapt                1.12.1          
zipp                 3.1.0