Multi-GPU PP-YOLO training fails

I am using Paddle 2.1.3 and PaddleDetection 2.3. The installation check reports that NCCL is installed correctly.

GPU information

Multi-GPU training is launched with this command: python -m paddle.distributed.launch --gpus 0,1 train.py

The error output is as follows:

(venv) [root@yun218 PP-YOLO]# python -m paddle.distributed.launch --gpus 0,1 train.py
----------- Configuration Arguments -----------
gpus: 0,1
heter_worker_num: None
heter_workers:
http_port: None
ips: 127.0.0.1
log_dir: log
nproc_per_node: None
run_mode: None
server_num: None
servers:
training_script: train.py
training_script_args: []
worker_num: None
workers:

WARNING 2021-11-04 20:49:05,967 launch.py:359] Not found distinct arguments and compiled with cuda or xpu. Default use collective mode
launch train in GPU mode!
INFO 2021-11-04 20:49:05,968 launch_utils.py:621] Change selected_gpus into reletive values. --ips:0,1 will change into relative_ips:[0, 1] according to your CUDA_VISIBLE_DEVICES:['0', '1']
INFO 2021-11-04 20:49:05,969 launch_utils.py:510] Local start 2 processes. First process distributed environment info (Only For Debug):
+=======================================================================================+
| Distributed Envs                Value                                                 |
+---------------------------------------------------------------------------------------+
| PADDLE_TRAINER_ID               0                                                     |
| PADDLE_CURRENT_ENDPOINT         127.0.0.1:60216                                       |
| PADDLE_TRAINERS_NUM             2                                                     |
| PADDLE_TRAINER_ENDPOINTS        127.0.0.1:60216,127.0.0.1:38263                       |
| PADDLE_RANK_IN_NODE             0                                                     |
| PADDLE_LOCAL_DEVICE_IDS         0                                                     |
| PADDLE_WORLD_DEVICE_IDS         0,1                                                   |
| FLAGS_selected_gpus             0                                                     |
| FLAGS_selected_accelerators     0                                                     |
+=======================================================================================+
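The table above is printed only for the first process, so it does not show which device the second process (rank 1) was given. A minimal sketch for printing the same variables from inside each worker, assuming a few lines can be added near the top of train.py. The helper names `describe_dist_env` and `format_dist_env` are made up for illustration; the variable names come from the table, plus `CUDA_VISIBLE_DEVICES`, which the launcher log mentions.

```python
import os

# Variable names copied from the "Distributed Envs" table in the launcher
# output. CUDA_VISIBLE_DEVICES is not in the table but is mentioned in the log.
DIST_ENV_KEYS = (
    "PADDLE_TRAINER_ID",
    "PADDLE_CURRENT_ENDPOINT",
    "PADDLE_TRAINERS_NUM",
    "PADDLE_TRAINER_ENDPOINTS",
    "PADDLE_RANK_IN_NODE",
    "PADDLE_LOCAL_DEVICE_IDS",
    "PADDLE_WORLD_DEVICE_IDS",
    "FLAGS_selected_gpus",
    "FLAGS_selected_accelerators",
    "CUDA_VISIBLE_DEVICES",
)


def describe_dist_env(environ=None):
    """Return the launcher-related variables visible to this process."""
    environ = os.environ if environ is None else environ
    return {key: environ.get(key, "<unset>") for key in DIST_ENV_KEYS}


def format_dist_env(environ=None):
    """Format the variables one per line, prefixed with this process's rank."""
    env = describe_dist_env(environ)
    rank = env["PADDLE_TRAINER_ID"]
    return "\n".join(f"[rank {rank}] {key}={value}" for key, value in env.items())


if __name__ == "__main__":
    print(format_dist_env())
```

With this in place, each of log/workerlog.0 and log/workerlog.1 should contain one block per rank, which makes it easier to compare the device each process was assigned with the `CUDAPlace(...)` named in an error.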

INFO 2021-11-04 20:49:05,969 launch_utils.py:514] details abouts PADDLE_TRAINER_ENDPOINTS can be found in log/endpoints.log, and detail running logs maybe found in log/workerlog.0
launch proc_id:15922 idx:0
launch proc_id:15925 idx:1
/root/train/venv/lib/python3.7/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
MobileNetV3 : {'model_name': 'large', 'scale': 0.5, 'with_extra_blocks': False, 'extra_block_filters': [], 'feature_maps': [7, 13, 16]}
PPYOLOTinyFPN : {'detection_block_channels': [160, 128, 96], 'spp': True, 'drop_block': True}
YOLOv3Head : {'anchors': [[10, 15], [24, 36], [72, 42], [35, 87], [102, 96], [60, 170], [220, 125], [128, 222], [264, 266]], 'anchor_masks': [[6, 7, 8], [3, 4, 5], [0, 1, 2]], 'loss': 'YOLOv3Loss'}
YOLOv3Loss : {'ignore_thresh': 0.5, 'downsample': [32, 16, 8], 'label_smooth': False, 'scale_x_y': 1.05, 'iou_loss': 'IouLoss'}
IouLoss : {'loss_weight': 5.0, 'loss_square': True}
BBoxPostProcess : {'decode': {'name': 'YOLOBox', 'conf_thresh': 0.01, 'downsample_ratio': 32, 'clip_bbox': True, 'scale_x_y': 1.05}, 'nms': {'name': 'MatrixNMS', 'keep_top_k': 100, 'score_threshold': 0.01, 'post_threshold': 0.01, 'nms_top_k': -1, 'background_label': -1}}
YOLOv3 : {'backbone': 'MobileNetV3', 'neck': 'PPYOLOTinyFPN', 'yolo_head': 'YOLOv3Head', 'post_process': 'BBoxPostProcess'}
TrainReader : {'inputs_def': {'num_max_boxes': 100}, 'sample_transforms': [{'Decode': {}}, {'Mixup': {'alpha': 1.5, 'beta': 1.5}}, {'RandomDistort': {}}, {'RandomExpand': {'fill_value': [123.675, 116.28, 103.53]}}, {'RandomCrop': {}}, {'RandomFlip': {'prob': 0.5}}], 'batch_transforms': [{'BatchRandomResize': {'target_size': [192, 224, 256, 288, 320, 352, 384, 416, 448, 480, 512], 'random_size': True, 'random_interp': True, 'keep_ratio': False}}, {'NormalizeBox': {}}, {'PadBox': {'num_max_boxes': 100}}, {'BboxXYXY2XYWH': {}}, {'NormalizeImage': {'mean': [0.485, 0.456, 0.406], 'std': [0.229, 0.224, 0.225], 'is_scale': True}}, {'Permute': {}}, {'Gt2YoloTarget': {'anchor_masks': [[6, 7, 8], [3, 4, 5], [0, 1, 2]], 'anchors': [[10, 15], [24, 36], [72, 42], [35, 87], [102, 96], [60, 170], [220, 125], [128, 222], [264, 266]], 'downsample_ratios': [32, 16, 8]}}], 'batch_size': 16, 'shuffle': True, 'drop_last': True, 'mixup_epoch': 500, 'use_shared_memory': True}
EvalReader : {'collate_batch': False, 'sample_transforms': [{'Decode': {}}, {'Resize': {'target_size': [320, 320], 'keep_ratio': False, 'interp': 2}}, {'NormalizeImage': {'mean': [0.485, 0.456, 0.406], 'std': [0.229, 0.224, 0.225], 'is_scale': True}}, {'Permute': {}}], 'batch_size': 8}
TestReader : {'inputs_def': {'image_shape': [3, 320, 320]}, 'sample_transforms': [{'Decode': {}}, {'Resize': {'target_size': [320, 320], 'keep_ratio': False, 'interp': 2}}, {'NormalizeImage': {'mean': [0.485, 0.456, 0.406], 'std': [0.229, 0.224, 0.225], 'is_scale': True}}, {'Permute': {}}], 'batch_size': 1}
LearningRate : {'base_lr': 0.005, 'schedulers': [<ppdet.optimizer.PiecewiseDecay object at 0x7f44a36dacd0>, <ppdet.optimizer.LinearWarmup object at 0x7f44a36dab90>]}
OptimizerBuilder : {'optimizer': {'momentum': 0.9, 'type': 'Momentum'}, 'regularizer': {'factor': 0.0005, 'type': 'L2'}}
metric : VOC
map_type : 11point
num_classes : 1
TrainDataset : <ppdet.data.source.voc.VOCDataSet object at 0x7f44a36d4950>
EvalDataset : <ppdet.data.source.voc.VOCDataSet object at 0x7f44a36d4bd0>
TestDataset : <ppdet.data.source.dataset.ImageFolder object at 0x7f44a36d4750>
use_gpu : True
log_iter : 20
save_dir : output
snapshot_epoch : 5
architecture : YOLOv3
pretrain_weights : https://paddledet.bj.bcebos.com/models/pedestrian_yolov3_darknet.pdparams
norm_type : sync_bn
use_ema : True
ema_decay : 0.9998
epoch : 1500
worker_num : 16
weights : output/ppyolo_mbv3_large_qat/best_model
filename : ppyolo_tiny_650e_voc
W1104 20:49:09.136406 15922 device_context.cc:404] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 10.2, Runtime API Version: 10.2
W1104 20:49:09.141834 15922 device_context.cc:422] device: 0, cuDNN Version: 7.6.
INFO 2021-11-04 20:49:15,018 launch_utils.py:327] terminate all the procs
ERROR 2021-11-04 20:49:15,019 launch_utils.py:584] ABORT!!! Out of all 2 trainers, the trainer process with rank=[1] was aborted. Please check its log.
INFO 2021-11-04 20:49:18,022 launch_utils.py:327] terminate all the procs
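The launcher only reports that the process with rank 1 was aborted and points to the per-worker logs. A small sketch for pulling the last traceback out of each worker log, assuming the default `log` directory shown in the Configuration Arguments and files named `workerlog.<rank>`. The function names `last_traceback` and `collect_tracebacks` are hypothetical helpers, not part of Paddle.

```python
import glob
import os

TRACEBACK_HEADER = "Traceback (most recent call last)"


def last_traceback(path):
    """Return the lines from the last traceback header to the end of the file.

    Returns an empty list when the file contains no Python traceback.
    """
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        lines = handle.read().splitlines()
    starts = [i for i, line in enumerate(lines) if line.startswith(TRACEBACK_HEADER)]
    return lines[starts[-1]:] if starts else []


def collect_tracebacks(log_dir="log"):
    """Map each workerlog.* file name in log_dir to its last traceback."""
    pattern = os.path.join(log_dir, "workerlog.*")
    return {
        os.path.basename(path): last_traceback(path)
        for path in sorted(glob.glob(pattern))
    }


if __name__ == "__main__":
    for name, block in collect_tracebacks().items():
        print(f"== {name} ==")
        print("\n".join(block) if block else "(no traceback found)")
```

Running this from the directory that contains `log/` prints one section per worker, so the failing rank and its final exception line can be read without scrolling through the configuration dump.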

@yihuiluo235 Please check log/workerlog.0 to find the specific cause of the error.

/root/train/venv/lib/python3.7/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
MobileNetV3 : {'model_name': 'large', 'scale': 0.5, 'with_extra_blocks': False, 'extra_block_filters': [], 'feature_maps': [7, 13, 16]}
PPYOLOTinyFPN : {'detection_block_channels': [160, 128, 96], 'spp': True, 'drop_block': True}
YOLOv3Head : {'anchors': [[10, 15], [24, 36], [72, 42], [35, 87], [102, 96], [60, 170], [220, 125], [128, 222], [264, 266]], 'anchor_masks': [[6, 7, 8], [3, 4, 5], [0, 1, 2]], 'loss': 'YOLOv3Loss'}
YOLOv3Loss : {'ignore_thresh': 0.5, 'downsample': [32, 16, 8], 'label_smooth': False, 'scale_x_y': 1.05, 'iou_loss': 'IouLoss'}
IouLoss : {'loss_weight': 5.0, 'loss_square': True}
BBoxPostProcess : {'decode': {'name': 'YOLOBox', 'conf_thresh': 0.01, 'downsample_ratio': 32, 'clip_bbox': True, 'scale_x_y': 1.05}, 'nms': {'name': 'MatrixNMS', 'keep_top_k': 100, 'score_threshold': 0.01, 'post_threshold': 0.01, 'nms_top_k': -1, 'background_label': -1}}
YOLOv3 : {'backbone': 'MobileNetV3', 'neck': 'PPYOLOTinyFPN', 'yolo_head': 'YOLOv3Head', 'post_process': 'BBoxPostProcess'}
TrainReader : {'inputs_def': {'num_max_boxes': 100}, 'sample_transforms': [{'Decode': {}}, {'Mixup': {'alpha': 1.5, 'beta': 1.5}}, {'RandomDistort': {}}, {'RandomExpand': {'fill_value': [123.675, 116.28, 103.53]}}, {'RandomCrop': {}}, {'RandomFlip': {'prob': 0.5}}], 'batch_transforms': [{'BatchRandomResize': {'target_size': [192, 224, 256, 288, 320, 352, 384, 416, 448, 480, 512], 'random_size': True, 'random_interp': True, 'keep_ratio': False}}, {'NormalizeBox': {}}, {'PadBox': {'num_max_boxes': 100}}, {'BboxXYXY2XYWH': {}}, {'NormalizeImage': {'mean': [0.485, 0.456, 0.406], 'std': [0.229, 0.224, 0.225], 'is_scale': True}}, {'Permute': {}}, {'Gt2YoloTarget': {'anchor_masks': [[6, 7, 8], [3, 4, 5], [0, 1, 2]], 'anchors': [[10, 15], [24, 36], [72, 42], [35, 87], [102, 96], [60, 170], [220, 125], [128, 222], [264, 266]], 'downsample_ratios': [32, 16, 8]}}], 'batch_size': 16, 'shuffle': True, 'drop_last': True, 'mixup_epoch': 500, 'use_shared_memory': True}
EvalReader : {'collate_batch': False, 'sample_transforms': [{'Decode': {}}, {'Resize': {'target_size': [320, 320], 'keep_ratio': False, 'interp': 2}}, {'NormalizeImage': {'mean': [0.485, 0.456, 0.406], 'std': [0.229, 0.224, 0.225], 'is_scale': True}}, {'Permute': {}}], 'batch_size': 8}
TestReader : {'inputs_def': {'image_shape': [3, 320, 320]}, 'sample_transforms': [{'Decode': {}}, {'Resize': {'target_size': [320, 320], 'keep_ratio': False, 'interp': 2}}, {'NormalizeImage': {'mean': [0.485, 0.456, 0.406], 'std': [0.229, 0.224, 0.225], 'is_scale': True}}, {'Permute': {}}], 'batch_size': 1}
LearningRate : {'base_lr': 0.005, 'schedulers': [<ppdet.optimizer.PiecewiseDecay object at 0x7fe503808190>, <ppdet.optimizer.LinearWarmup object at 0x7fe503808310>]}
OptimizerBuilder : {'optimizer': {'momentum': 0.9, 'type': 'Momentum'}, 'regularizer': {'factor': 0.0005, 'type': 'L2'}}
metric : VOC
map_type : 11point
num_classes : 1
TrainDataset : <ppdet.data.source.voc.VOCDataSet object at 0x7fe50378d3d0>
EvalDataset : <ppdet.data.source.voc.VOCDataSet object at 0x7fe50378d450>
TestDataset : <ppdet.data.source.dataset.ImageFolder object at 0x7fe50378d6d0>
use_gpu : True
log_iter : 20
save_dir : output
snapshot_epoch : 5
architecture : YOLOv3
pretrain_weights : https://paddledet.bj.bcebos.com/models/pedestrian_yolov3_darknet.pdparams
norm_type : sync_bn
use_ema : True
ema_decay : 0.9998
epoch : 1500
worker_num : 16
weights : output/ppyolo_mbv3_large_qat/best_model
filename : ppyolo_tiny_650e_voc
Traceback (most recent call last):
  File "train.py", line 80, in <module>
    main()
  File "train.py", line 70, in main
    cfg = build_slim_model(cfg, FLAGS.slim_config)
  File "/root/train/venv/lib/python3.7/site-packages/paddledet-2.3.0-py3.7.egg/ppdet/slim/__init__.py", line 61, in build_slim_model
    model = create(cfg.architecture)
  File "/root/train/venv/lib/python3.7/site-packages/paddledet-2.3.0-py3.7.egg/ppdet/core/workspace.py", line 238, in create
    cls_kwargs.update(cls.from_config(config, **kwargs))
  File "/root/train/venv/lib/python3.7/site-packages/paddledet-2.3.0-py3.7.egg/ppdet/modeling/architectures/yolo.py", line 62, in from_config
    backbone = create(cfg['backbone'])
  File "/root/train/venv/lib/python3.7/site-packages/paddledet-2.3.0-py3.7.egg/ppdet/core/workspace.py", line 275, in create
    return cls(**cls_kwargs)
  File "/root/train/venv/lib/python3.7/site-packages/paddledet-2.3.0-py3.7.egg/ppdet/modeling/backbones/mobilenet_v3.py", line 371, in __init__
    name="conv1")
  File "/root/train/venv/lib/python3.7/site-packages/paddledet-2.3.0-py3.7.egg/ppdet/modeling/backbones/mobilenet_v3.py", line 66, in __init__
    bias_attr=False)
  File "/root/train/venv/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 646, in __init__
    data_format=data_format)
  File "/root/train/venv/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 135, in __init__
    default_initializer=_get_default_param_initializer())
  File "/root/train/venv/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 412, in create_parameter
    default_initializer)
  File "/root/train/venv/lib/python3.7/site-packages/paddle/fluid/layer_helper_base.py", line 374, in create_parameter
    **attr._to_kwargs(with_initializer=True))
  File "/root/train/venv/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2920, in create_parameter
    initializer(param, self)
  File "/root/train/venv/lib/python3.7/site-packages/paddle/fluid/initializer.py", line 366, in __call__
    stop_gradient=True)
  File "/root/train/venv/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2950, in append_op
    kwargs.get("stop_gradient", False))
  File "/root/train/venv/lib/python3.7/site-packages/paddle/fluid/dygraph/tracer.py", line 45, in trace_op
    not stop_gradient)
NotImplementedError: (Unimplemented) Place CUDAPlace(0) is not supported. Please check that your paddle compiles with WITH_GPU, WITH_XPU or WITH_ASCEND_CL option or check that your train process set the correct device id if you use Executor. (at /paddle/paddle/fluid/platform/device_context.cc:88)
  [operator < gaussian_random > error]

/root/train/venv/lib/python3.7/site-packages/paddle/tensor/creation.py:125: DeprecationWarning: np.object is a deprecated alias for the builtin object. To silence this warning, use object by itself. Doing this will not modify any behavior and is safe.
Deprecated in NumPy 1.20; for more details and guidance: https://numpy.org/devdocs/release/1.20.0-notes.html#deprecations
  if data.dtype == np.object:
MobileNetV3 : {'model_name': 'large', 'scale': 0.5, 'with_extra_blocks': False, 'extra_block_filters': [], 'feature_maps': [7, 13, 16]}
PPYOLOTinyFPN : {'detection_block_channels': [160, 128, 96], 'spp': True, 'drop_block': True}
YOLOv3Head : {'anchors': [[10, 15], [24, 36], [72, 42], [35, 87], [102, 96], [60, 170], [220, 125], [128, 222], [264, 266]], 'anchor_masks': [[6, 7, 8], [3, 4, 5], [0, 1, 2]], 'loss': 'YOLOv3Loss'}
YOLOv3Loss : {'ignore_thresh': 0.5, 'downsample': [32, 16, 8], 'label_smooth': False, 'scale_x_y': 1.05, 'iou_loss': 'IouLoss'}
IouLoss : {'loss_weight': 5.0, 'loss_square': True}
BBoxPostProcess : {'decode': {'name': 'YOLOBox', 'conf_thresh': 0.01, 'downsample_ratio': 32, 'clip_bbox': True, 'scale_x_y': 1.05}, 'nms': {'name': 'MatrixNMS', 'keep_top_k': 100, 'score_threshold': 0.01, 'post_threshold': 0.01, 'nms_top_k': -1, 'background_label': -1}}
YOLOv3 : {'backbone': 'MobileNetV3', 'neck': 'PPYOLOTinyFPN', 'yolo_head': 'YOLOv3Head', 'post_process': 'BBoxPostProcess'}
TrainReader : {'inputs_def': {'num_max_boxes': 100}, 'sample_transforms': [{'Decode': {}}, {'Mixup': {'alpha': 1.5, 'beta': 1.5}}, {'RandomDistort': {}}, {'RandomExpand': {'fill_value': [123.675, 116.28, 103.53]}}, {'RandomCrop': {}}, {'RandomFlip': {'prob': 0.5}}], 'batch_transforms': [{'BatchRandomResize': {'target_size': [192, 224, 256, 288, 320, 352, 384, 416, 448, 480, 512], 'random_size': True, 'random_interp': True, 'keep_ratio': False}}, {'NormalizeBox': {}}, {'PadBox': {'num_max_boxes': 100}}, {'BboxXYXY2XYWH': {}}, {'NormalizeImage': {'mean': [0.485, 0.456, 0.406], 'std': [0.229, 0.224, 0.225], 'is_scale': True}}, {'Permute': {}}, {'Gt2YoloTarget': {'anchor_masks': [[6, 7, 8], [3, 4, 5], [0, 1, 2]], 'anchors': [[10, 15], [24, 36], [72, 42], [35, 87], [102, 96], [60, 170], [220, 125], [128, 222], [264, 266]], 'downsample_ratios': [32, 16, 8]}}], 'batch_size': 16, 'shuffle': True, 'drop_last': True, 'mixup_epoch': 500, 'use_shared_memory': True}
EvalReader : {'collate_batch': False, 'sample_transforms': [{'Decode': {}}, {'Resize': {'target_size': [320, 320], 'keep_ratio': False, 'interp': 2}}, {'NormalizeImage': {'mean': [0.485, 0.456, 0.406], 'std': [0.229, 0.224, 0.225], 'is_scale': True}}, {'Permute': {}}], 'batch_size': 8}
TestReader : {'inputs_def': {'image_shape': [3, 320, 320]}, 'sample_transforms': [{'Decode': {}}, {'Resize': {'target_size': [320, 320], 'keep_ratio': False, 'interp': 2}}, {'NormalizeImage': {'mean': [0.485, 0.456, 0.406], 'std': [0.229, 0.224, 0.225], 'is_scale': True}}, {'Permute': {}}], 'batch_size': 1}
LearningRate : {'base_lr': 0.005, 'schedulers': [<ppdet.optimizer.PiecewiseDecay object at 0x7f1c2fb54c50>, <ppdet.optimizer.LinearWarmup object at 0x7f1c2fb54650>]}
OptimizerBuilder : {'optimizer': {'momentum': 0.9, 'type': 'Momentum'}, 'regularizer': {'factor': 0.0005, 'type': 'L2'}}
metric : VOC
map_type : 11point
num_classes : 1
TrainDataset : <ppdet.data.source.voc.VOCDataSet object at 0x7f1c2fb4e490>
EvalDataset : <ppdet.data.source.voc.VOCDataSet object at 0x7f1c63d8b310>
TestDataset : <ppdet.data.source.dataset.ImageFolder object at 0x7f1c2fb4e510>
use_gpu : True
log_iter : 20
save_dir : output
snapshot_epoch : 5
architecture : YOLOv3
pretrain_weights : https://paddledet.bj.bcebos.com/models/pedestrian_yolov3_darknet.pdparams
norm_type : sync_bn
use_ema : True
ema_decay : 0.9998
epoch : 1500
worker_num : 4
weights : output/ppyolo_mbv3_large_qat/best_model
filename : ppyolo_tiny_650e_voc
Traceback (most recent call last):
  File "train.py", line 80, in <module>
    main()
  File "train.py", line 70, in main
    cfg = build_slim_model(cfg, FLAGS.slim_config)
  File "/root/train/venv/lib/python3.7/site-packages/paddledet-2.3.0-py3.7.egg/ppdet/slim/__init__.py", line 61, in build_slim_model
    model = create(cfg.architecture)
  File "/root/train/venv/lib/python3.7/site-packages/paddledet-2.3.0-py3.7.egg/ppdet/core/workspace.py", line 238, in create
    cls_kwargs.update(cls.from_config(config, **kwargs))
  File "/root/train/venv/lib/python3.7/site-packages/paddledet-2.3.0-py3.7.egg/ppdet/modeling/architectures/yolo.py", line 62, in from_config
    backbone = create(cfg['backbone'])
  File "/root/train/venv/lib/python3.7/site-packages/paddledet-2.3.0-py3.7.egg/ppdet/core/workspace.py", line 275, in create
    return cls(**cls_kwargs)
  File "/root/train/venv/lib/python3.7/site-packages/paddledet-2.3.0-py3.7.egg/ppdet/modeling/backbones/mobilenet_v3.py", line 371, in __init__
    name="conv1")
  File "/root/train/venv/lib/python3.7/site-packages/paddledet-2.3.0-py3.7.egg/ppdet/modeling/backbones/mobilenet_v3.py", line 66, in __init__
    bias_attr=False)
  File "/root/train/venv/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 646, in __init__
    data_format=data_format)
  File "/root/train/venv/lib/python3.7/site-packages/paddle/nn/layer/conv.py", line 135, in __init__
    default_initializer=_get_default_param_initializer())
  File "/root/train/venv/lib/python3.7/site-packages/paddle/fluid/dygraph/layers.py", line 412, in create_parameter
    default_initializer)
  File "/root/train/venv/lib/python3.7/site-packages/paddle/fluid/layer_helper_base.py", line 374, in create_parameter
    **attr._to_kwargs(with_initializer=True))
  File "/root/train/venv/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2920, in create_parameter
    initializer(param, self)
  File "/root/train/venv/lib/python3.7/site-packages/paddle/fluid/initializer.py", line 366, in __call__
    stop_gradient=True)
  File "/root/train/venv/lib/python3.7/site-packages/paddle/fluid/framework.py", line 2950, in append_op
    kwargs.get("stop_gradient", False))
  File "/root/train/venv/lib/python3.7/site-packages/paddle/fluid/dygraph/tracer.py", line 45, in trace_op
    not stop_gradient)
NotImplementedError: (Unimplemented) Place CUDAPlace(0) is not supported. Please check that your paddle compiles with WITH_GPU, WITH_XPU or WITH_ASCEND_CL option or check that your train process set the correct device id if you use Executor. (at /paddle/paddle/fluid/platform/device_context.cc:88)

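This is the log from log/workerlog.1. The failure is probably caused by an error in the configuration file; I am still investigating the exact cause.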
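The error message itself suggests two things to check: whether the installed Paddle was built with GPU support, and whether the process selected the right device id. A minimal sketch of the first check, run in the same virtualenv as train.py. `check_paddle_build` is a hypothetical helper name; the Paddle calls used (`paddle.__version__`, `paddle.is_compiled_with_cuda()`, `paddle.get_device()`, `paddle.utils.run_check()`) are assumed to be available in the 2.x release in use. This only inspects the installation and does not by itself explain the failure.

```python
def check_paddle_build():
    """Collect basic facts about the Paddle build visible to this interpreter."""
    try:
        import paddle
    except ImportError:
        # Paddle is not importable from this environment at all.
        return {"installed": False}
    return {
        "installed": True,
        "version": paddle.__version__,
        # False here would mean a CPU-only wheel is installed.
        "compiled_with_cuda": paddle.is_compiled_with_cuda(),
        # Device the current process would use by default, e.g. "gpu:0" or "cpu".
        "current_device": paddle.get_device(),
    }


if __name__ == "__main__":
    info = check_paddle_build()
    for key, value in info.items():
        print(f"{key}: {value}")
    if info["installed"]:
        import paddle

        # Paddle's own installation self-test.
        paddle.utils.run_check()
```

If `compiled_with_cuda` is reported as True in this environment, the remaining suggestion in the error message (the device id chosen by the training process) is the more likely place to look.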

PaddlePaddle / PaddleDetection

Multi-GPU PP-YOLO training fails #4465