PaddlePaddle / PaddleDetection

Object Detection toolkit based on PaddlePaddle. It supports object detection, instance segmentation, multiple object tracking and real-time multi-person keypoint detection.
Apache License 2.0

std::runtime_error raised during multi-GPU ppyolo training #3167

Open PeterJaq opened 3 years ago

PeterJaq commented 3 years ago

I hit the following problem while training ppyolo on multiple GPUs. I tried three times, and each time the error appeared after several hundred epochs of training. It reproduces 100% of the time.

Traceback (most recent call last):
  File "tools/train.py", line 140, in <module>
    main()
  File "tools/train.py", line 136, in main
    run(FLAGS, cfg)
  File "tools/train.py", line 111, in run
    trainer.train(FLAGS.eval)
  File "/usr/src/app/pd_detection/ppdet/engine/trainer.py", line 307, in train
    outputs = model(data)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 898, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/parallel.py", line 578, in forward
    outputs = self._layers(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 898, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/usr/src/app/pd_detection/ppdet/modeling/architectures/meta_arch.py", line 27, in forward
    out = self.get_loss()
  File "/usr/src/app/pd_detection/ppdet/modeling/architectures/yolo.py", line 101, in get_loss
    return self._forward()
  File "/usr/src/app/pd_detection/ppdet/modeling/architectures/yolo.py", line 64, in _forward
    neck_feats = self.neck(body_feats, self.for_mot)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 898, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/usr/src/app/pd_detection/ppdet/modeling/necks/yolo_fpn.py", line 997, in forward
    route, tip = self.fpn_blocksi
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 898, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/usr/src/app/pd_detection/ppdet/modeling/necks/yolo_fpn.py", line 417, in forward
    conv_left = self.conv_module(conv_left)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 898, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/container.py", line 97, in forward
    input = layer(input)
  File "/usr/local/lib/python3.7/dist-packages/paddle/fluid/dygraph/layers.py", line 898, in __call__
    outputs = self.forward(*inputs, **kwargs)
  File "/usr/src/app/pd_detection/ppdet/modeling/necks/yolo_fpn.py", line 205, in forward
    matrix = paddle.cast(paddle.rand(x.shape, x.dtype) < gamma, x.dtype)
  File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/random.py", line 722, in rand
    return uniform(shape, dtype, min=0.0, max=1.0, name=name)
  File "/usr/local/lib/python3.7/dist-packages/paddle/tensor/random.py", line 502, in uniform
    float(max), 'seed', seed, 'dtype', dtype)
SystemError: (Fatal) Operator uniform_random raises an std::runtime_error exception. The exception content is :random_device::random_device(const std::string&). (at /paddle/paddle/fluid/imperative/tracer.cc:192)
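The exception text names the constructor of the C++ standard library's std::random_device, which in libstdc++ typically throws when it cannot open its entropy source (commonly /dev/urandom). One possible cause in a long-running job, not confirmed anywhere in this thread, is that the process has run out of file descriptors. The sketch below is a Linux-only diagnostic under that assumption; the helper names `fd_usage` and `can_open_urandom` are made up for illustration and are not part of Paddle or PaddleDetection.

```python
import os
import resource


def fd_usage():
    """Report open file descriptors against the per-process limit.

    Assumes a Linux-style /proc filesystem; "open" is None elsewhere.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    fd_dir = "/proc/self/fd"
    # The count includes the descriptor used for the listing itself.
    open_fds = len(os.listdir(fd_dir)) if os.path.isdir(fd_dir) else None
    return {"open": open_fds, "soft_limit": soft, "hard_limit": hard}


def can_open_urandom(path="/dev/urandom"):
    """Return True if the entropy device can be opened and read."""
    try:
        with open(path, "rb") as f:
            return len(f.read(4)) == 4
    except OSError:
        return False


if __name__ == "__main__":
    print(fd_usage())
    print("urandom readable:", can_open_urandom())
```

Calling `fd_usage()` periodically from the training process (for example once per epoch) would show whether the open count climbs toward the soft limit before the crash. If it stays flat, this hypothesis can be ruled out.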

wangxinxin08 commented 3 years ago

Got it. Could you post your PaddlePaddle version, PaddleDetection version, CUDA version, and cuDNN version? We will track down the problem.
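The requested versions can be gathered in one place with a short script. This is a minimal sketch: `collect_versions` is a hypothetical helper, and it assumes the installed Paddle build exposes `paddle.version.cuda()` and `paddle.version.cudnn()`; where those are missing it reports "unknown" instead of failing. The PaddleDetection version is not queried here and would come from the release tag or git checkout.

```python
import platform


def collect_versions():
    """Collect Python, PaddlePaddle, CUDA and cuDNN versions for a bug report."""
    info = {"python": platform.python_version()}
    try:
        import paddle
    except ImportError:
        info["paddle"] = "not installed"
        return info

    info["paddle"] = getattr(paddle, "__version__", "unknown")
    # Assumption: paddle.version provides cuda() and cudnn(); fall back if not.
    version_mod = getattr(paddle, "version", None)
    for name in ("cuda", "cudnn"):
        fn = getattr(version_mod, name, None)
        info[name] = fn() if callable(fn) else "unknown"
    return info


if __name__ == "__main__":
    for key, value in collect_versions().items():
        print(f"{key}: {value}")
```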

PeterJaq commented 3 years ago

Here is my setup.

Got it. I am using PaddleDetection v2.1.0, downloaded from the release page at https://github.com/PaddlePaddle/PaddleDetection/releases/tag/v2.1.0, with PaddlePaddle 2.1.0 and CUDA 11.2.