Closed cheesezoella closed 1 year ago
Single-GPU training does not need paddle.distributed.launch.
Just run: CUDA_VISIBLE_DEVICES=0 python3.7 tools/train.py -c ${config} --amp --eval
OK, thanks! Do you have any suggestions for the settings in configs/pphuman/ppyoloe_crn_s_80e_smoking_visdrone.yml, such as learning rate, number of epochs, and batch size? I have roughly ten thousand images.
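The PP-YOLOE model was trained with mixed precision on 8 GPUs. If the number of GPUs or the batch size changes, you need to adjust the learning rate according to the formula lr_new = lr_default * (batch_size_new * GPU_number_new) / (batch_size_default * GPU_number_default).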
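As a rough illustration of the scaling rule above, here is a minimal Python sketch. The function name `scaled_lr` and the numbers in the usage line are hypothetical; the actual default learning rate and per-GPU batch size should be read from the config file you are training with.

```python
def scaled_lr(lr_default, batch_size_default, gpus_default,
              batch_size_new, gpus_new):
    """Linearly scale the learning rate with the total batch size.

    Implements lr_new = lr_default * (batch_size_new * gpus_new)
                        / (batch_size_default * gpus_default).
    Batch sizes are per-GPU values.
    """
    total_default = batch_size_default * gpus_default
    total_new = batch_size_new * gpus_new
    return lr_default * total_new / total_default


# Hypothetical values for illustration only: a default of 0.01 at
# 8 images per GPU on 8 GPUs, moved to a single GPU with the same
# per-GPU batch size.
print(scaled_lr(0.01, 8, 8, 8, 1))
```

With the same per-GPU batch size, going from 8 GPUs to 1 divides the total batch size by 8, so the learning rate is divided by 8 as well.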
Search before asking
Please ask your question
Hello, I am training on data I prepared myself, but I get an error. I am following the approach described in https://github.com/PaddlePaddle/PaddleDetection/blob/develop/docs/advanced_tutorials/customization/action_recognotion/idbased_det.md to train the model. Thank you!
Command run: python -m paddle.distributed.launch --gpus 0 tools/train.py -c configs/pphuman/ppyoloe_crn_s_80e_smoking_visdrone.yml --eval --amp
W0729 09:54:24.241298 42635 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 7.5, Driver API Version: 11.4, Runtime API Version: 11.2
W0729 09:54:24.253404 42635 device_context.cc:465] device: 0, cuDNN Version: 8.1.
[07/29 09:54:28] ppdet.utils.checkpoint INFO: The shape [10] in pretrained weight yolo_head.pred_cls.0.bias is unmatched with the shape [1] in model yolo_head.pred_cls.0.bias. And the weight yolo_head.pred_cls.0.bias will not be loaded
[07/29 09:54:28] ppdet.utils.checkpoint INFO: The shape [10, 384, 3, 3] in pretrained weight yolo_head.pred_cls.0.weight is unmatched with the shape [1, 384, 3, 3] in model yolo_head.pred_cls.0.weight. And the weight yolo_head.pred_cls.0.weight will not be loaded
[07/29 09:54:28] ppdet.utils.checkpoint INFO: The shape [10] in pretrained weight yolo_head.pred_cls.1.bias is unmatched with the shape [1] in model yolo_head.pred_cls.1.bias. And the weight yolo_head.pred_cls.1.bias will not be loaded
[07/29 09:54:28] ppdet.utils.checkpoint INFO: The shape [10, 192, 3, 3] in pretrained weight yolo_head.pred_cls.1.weight is unmatched with the shape [1, 192, 3, 3] in model yolo_head.pred_cls.1.weight. And the weight yolo_head.pred_cls.1.weight will not be loaded
[07/29 09:54:28] ppdet.utils.checkpoint INFO: The shape [10] in pretrained weight yolo_head.pred_cls.2.bias is unmatched with the shape [1] in model yolo_head.pred_cls.2.bias. And the weight yolo_head.pred_cls.2.bias will not be loaded
[07/29 09:54:28] ppdet.utils.checkpoint INFO: The shape [10, 96, 3, 3] in pretrained weight yolo_head.pred_cls.2.weight is unmatched with the shape [1, 96, 3, 3] in model yolo_head.pred_cls.2.weight. And the weight yolo_head.pred_cls.2.weight will not be loaded
[07/29 09:54:28] ppdet.utils.checkpoint INFO: Finish loading model weights: /home/.cache/paddle/weights/ppyoloe_crn_s_80e_visdrone.pdparams
Error: /paddle/paddle/fluid/operators/gather.cu.h:62 Assertion `index_value >= 0 && index_value < input_dims[j]` failed. The index is out of bounds, please check whether the dimensions of index and input meet the requirements. It should be less than [1] and greater than or equal to 0, but received [0]
Traceback (most recent call last):
File "tools/train.py", line 172, in <module>
main()
File "tools/train.py", line 168, in main
run(FLAGS, cfg)
File "tools/train.py", line 132, in run
trainer.train(FLAGS.eval)
File "/home/action/paddle_develop/ppdet/engine/trainer.py", line 467, in train
outputs = model(data)
File "/home/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
return self._dygraph_call_func(*inputs, **kwargs)
File "/home/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "/home/action/paddle_develop/ppdet/modeling/architectures/meta_arch.py", line 59, in forward
out = self.get_loss()
File "/home/action/paddle_develop/ppdet/modeling/architectures/yolo.py", line 124, in get_loss
return self._forward()
File "/home/action/paddle_develop/ppdet/modeling/architectures/yolo.py", line 88, in _forward
yolo_losses = self.yolo_head(neck_feats, self.inputs)
File "/home/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
return self._dygraph_call_func(*inputs, **kwargs)
File "/home/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "/home/action/paddle_develop/ppdet/modeling/heads/ppyoloe_head.py", line 216, in forward
return self.forward_train(feats, targets)
File "/home/action/paddle_develop/ppdet/modeling/heads/ppyoloe_head.py", line 158, in forward_train
return self.get_loss([
File "/home/action/paddle_develop/ppdet/modeling/heads/ppyoloe_head.py", line 322, in get_loss
self.assigner(
File "/home/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 917, in __call__
return self._dygraph_call_func(*inputs, **kwargs)
File "/home/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/layers.py", line 907, in _dygraph_call_func
outputs = self.forward(*inputs, **kwargs)
File "/home/anaconda3/envs/paddle/lib/python3.8/site-packages/decorator.py", line 232, in fun
return caller(func, *(extras + args), **kw)
File "/home/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/base.py", line 351, in _decorate_function
return func(*args, **kwargs)
File "/home/action/paddle_develop/ppdet/modeling/assigners/task_aligned_assigner.py", line 114, in forward
is_in_topk = gather_topk_anchors(
File "/home/action/paddle_develop/ppdet/modeling/assigners/utils.py", line 105, in gather_topk_anchors
return is_in_topk * topk_mask
File "/home/anaconda3/envs/paddle/lib/python3.8/site-packages/paddle/fluid/dygraph/math_op_patch.py", line 264, in impl
return math_op(self, other_var, 'axis', axis)
OSError: (External) CUDA error(719), unspecified launch failure.
[Hint: 'cudaErrorLaunchFailure'. An exception occurred on the device while executing a kernel. Common causes include dereferencing an invalid device pointer and accessing out of bounds shared memory. Less common cases can be system specific - more information about these cases can be found in the system specific user guide. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched.] (at /paddle/paddle/fluid/platform/gpu_info.cc:441)
[operator < elementwise_mul > error]
INFO 2022-07-29 09:54:36,315 launch_utils.py:341] terminate all the procs
ERROR 2022-07-29 09:54:36,315 launch_utils.py:602] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
INFO 2022-07-29 09:54:40,317 launch_utils.py:341] terminate all the procs
INFO 2022-07-29 09:54:40,317 launch.py:311] Local processes completed.
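The gather assertion in the log requires every index to be at least 0 and less than the dimension size, which is reported as 1. One check that might help narrow this down is whether the class indices fed to the model fall inside [0, num_classes). This is an assumption on my part and is not confirmed anywhere in this thread. The helper below is a hypothetical, framework-independent sketch; `out_of_range_labels` is not a PaddleDetection function.

```python
def out_of_range_labels(class_ids, num_classes):
    """Return the sorted, de-duplicated class ids outside [0, num_classes).

    class_ids: iterable of integer class indices, as the model would see
    them (after any category-id remapping done by the data loader).
    """
    bad = {int(c) for c in class_ids if not 0 <= int(c) < num_classes}
    return sorted(bad)


# Hypothetical labels for a single-class dataset (num_classes = 1):
# only index 0 is valid, so 1 and -1 are reported.
print(out_of_range_labels([0, 0, 1, 0, -1], num_classes=1))
```

An empty result would suggest the labels are in range and the cause lies elsewhere.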