david8862 / keras-YOLOv3-model-set

end-to-end YOLOv4/v3/v2 object detection pipeline, implemented on tf.keras with different technologies
MIT License
640 stars 222 forks

gpu_num parameter during training throws error when set to more than 1 #68

Open mon28 opened 4 years ago

mon28 commented 4 years ago

Train Command:

python train.py --model_type=yolo4_darknet --annotation_file=data/trainval.txt --classes_path=data/obj.names --model_image_size=416x416 --multiscale --rescale_interval=50 --learning_rate=0.001 --transfer_epoch=0 --total_epoch=500 --eval_online --eval_epoch_interval=20 --save_eval_checkpoint --gpu_num=3

Error:

    main(args)
  File "train.py", line 146, in main
    model = get_train_model(args.model_type, anchors, num_classes, weights_path=args.weights_path, freeze_level=freeze_level, optimizer=optimizer, label_smoothing=args.label_smoothing, model_pruning=args.model_pruning, pruning_end_step=pruning_end_step)
  File "/data/mtare/keras-yolov4/yolo3/model.py", line 206, in get_yolo3_train_model
    model_body, backbone_len = get_yolo3_model(model_type, num_feature_layers, num_anchors, num_classes, model_pruning=model_pruning, pruning_end_step=pruning_end_step)
  File "/data/mtare/keras-yolov4/yolo3/model.py", line 174, in get_yolo3_model
    model_body = model_function(input_tensor, num_anchors//3, num_classes, weights_path=weights_path)
  File "/data/mtare/keras-yolov4/yolo4/models/yolo4_darknet.py", line 51, in yolo4_body
    darknet = Model(inputs, csp_darknet53_body(inputs))
  File "/data/mtare/keras-yolov4/yolo4/models/yolo4_darknet.py", line 40, in csp_darknet53_body
    x = DarknetConv2D_BN_Mish(32, (3,3))(x)
  File "/data/mtare/keras-yolov4/yolo4/models/layers.py", line 21, in <lambda>
    return reduce(lambda f, g: lambda *a, **kw: g(f(*a, **kw)), funcs)
  File "/data/mtare/keras-yolov4/yolo4/models/layers.py", line 21, in <lambda>
    return reduce(lambda f, g: lambda *a, **kw: g(f(*a, **kw)), funcs)
  File "/data/mtare/keras-yolov4/venv/lib/python3.6/site-packages/tensorflow/python/keras/engine/base_layer.py", line 922, in __call__
    outputs = call_fn(cast_inputs, *args, **kwargs)
  File "/data/mtare/keras-yolov4/venv/lib/python3.6/site-packages/tensorflow/python/keras/layers/normalization.py", line 741, in call
    outputs = self._fused_batch_norm(inputs, training=training)
  File "/data/mtare/keras-yolov4/venv/lib/python3.6/site-packages/tensorflow/python/keras/layers/normalization.py", line 604, in _fused_batch_norm
    _fused_batch_norm_inference)
  File "/data/mtare/keras-yolov4/venv/lib/python3.6/site-packages/tensorflow/python/keras/utils/tf_utils.py", line 65, in smart_cond
    pred, true_fn=true_fn, false_fn=false_fn, name=name)
  File "/data/mtare/keras-yolov4/venv/lib/python3.6/site-packages/tensorflow/python/framework/smart_cond.py", line 59, in smart_cond
    name=name)
  File "/data/mtare/keras-yolov4/venv/lib/python3.6/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/data/mtare/keras-yolov4/venv/lib/python3.6/site-packages/tensorflow/python/ops/control_flow_ops.py", line 1177, in cond
    return cond_v2.cond_v2(pred, true_fn, false_fn, name)
  File "/data/mtare/keras-yolov4/venv/lib/python3.6/site-packages/tensorflow/python/ops/cond_v2.py", line 84, in cond_v2
    op_return_value=pred)
  File "/data/mtare/keras-yolov4/venv/lib/python3.6/site-packages/tensorflow/python/framework/func_graph.py", line 981, in func_graph_from_py_func
    func_outputs = python_func(*func_args, **func_kwargs)
  File "/data/mtare/keras-yolov4/venv/lib/python3.6/site-packages/tensorflow/python/keras/layers/normalization.py", line 579, in _fused_batch_norm_training
    exponential_avg_factor=exponential_avg_factor)
  File "/data/mtare/keras-yolov4/venv/lib/python3.6/site-packages/tensorflow/python/ops/nn_impl.py", line 1544, in fused_batch_norm
    name=name)
  File "/data/mtare/keras-yolov4/venv/lib/python3.6/site-packages/tensorflow/python/ops/gen_nn_ops.py", line 4279, in fused_batch_norm_v3
    name=name)
  File "/data/mtare/keras-yolov4/venv/lib/python3.6/site-packages/tensorflow/python/framework/op_def_library.py", line 488, in _apply_op_helper
    (input_name, err))
ValueError: Tried to convert 'mean' to a tensor and failed. Error: Device assignment required for nccl collective ops

david8862 commented 4 years ago

Multi-GPU training support is currently experimental and has not been verified yet. Please refer to #56 for details.
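
Until multi-GPU support is verified, one possible workaround is to run the same training command on a single GPU. The sketch below is only an illustration, not something confirmed in this thread: it rebuilds the command from the report with `--gpu_num=1` and pins the process to one device through the standard CUDA environment variable `CUDA_VISIBLE_DEVICES`. The helper name `single_gpu_train_command` is hypothetical, and whether a single-GPU run avoids the error has not been checked here.

```python
import os
import shlex


def single_gpu_train_command(gpu_index=0):
    """Build (env, argv) for a single-GPU run of the command from the report.

    Hypothetical helper for illustration only; it is not part of the repository.
    """
    # Copy the current environment and expose only one GPU to the process.
    env = dict(os.environ)
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)

    # Same flags as in the report, except that --gpu_num is set to 1.
    argv = [
        "python", "train.py",
        "--model_type=yolo4_darknet",
        "--annotation_file=data/trainval.txt",
        "--classes_path=data/obj.names",
        "--model_image_size=416x416",
        "--multiscale",
        "--rescale_interval=50",
        "--learning_rate=0.001",
        "--transfer_epoch=0",
        "--total_epoch=500",
        "--eval_online",
        "--eval_epoch_interval=20",
        "--save_eval_checkpoint",
        "--gpu_num=1",
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = single_gpu_train_command(0)
    # Print the equivalent shell command instead of launching training.
    prefix = "CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"]
    print(prefix, " ".join(shlex.quote(a) for a in argv))
```

To actually launch the run from Python, the returned values could be passed to `subprocess.run(argv, env=env, check=True)` from the repository root.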