SpursLipu / YOLOv3v4-ModelCompression-MultidatasetTraining-Multibackbone


Multi-GPU training #102

Open chenxyyy opened 3 years ago

chenxyyy commented 3 years ago

Hi, every time I train with multiple GPUs I run into this problem:

Namespace(BN_Fold=False, FPGA=False, KDstr=-1, a_bit=8, adam=False, batch_size=16, bucket='', cache_images=False, cfg='./cfg/yolov4/yolov4.cfg', data='data/coco2017.data', device='0,1,2,4', ema=False, epochs=300, evolve=False, img_size=[320, 640], multi_scale=False, name='', nosave=False, notest=False, prune=0, pt=False, quantized=0, rect=False, resume=False, s=0.0001, single_cls=False, sr=True, t_cfg='', t_weights='', w_bit=8, weights='weights/yolo4_coco/qianyi_weight/best.pt')
Using CUDA Apex device0 _CudaDeviceProperties(name='Tesla T4', total_memory=15109MB)
                device1 _CudaDeviceProperties(name='Tesla T4', total_memory=15109MB)
                device2 _CudaDeviceProperties(name='Tesla T4', total_memory=15109MB)
                device3 _CudaDeviceProperties(name='Tesla T4', total_memory=15109MB)

Start Tensorboard with "tensorboard --logdir=runs", view at http://localhost:6006/
Model Summary: 327 layers, 6.43631e+07 parameters, 6.43631e+07 gradients
Optimizer groups: 110 .bias, 110 Conv2d.weight, 107 other
muti-gpus sparse
normal sparse training 
Image sizes 320 - 640 train, 640 test
Using 8 dataloader workers
Starting training for 300 epochs...

     Epoch   gpu_mem      GIoU       obj       cls     total   targets  img_size

  0%|          | 0/4381 [00:00<?, ?it/s]
  0%|          | 0/4381 [00:01<?, ?it/s]
Traceback (most recent call last):
  File "train.py", line 987, in <module>
    train(hyp)  # train normally
  File "train.py", line 330, in train
    pred, feature_s = model(imgs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 580, in forward
    output = self.gather(outputs, self.output_device)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/distributed.py", line 607, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/scatter_gather.py", line 68, in gather
    res = gather_map(outputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/scatter_gather.py", line 63, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/scatter_gather.py", line 55, in gather_map
    return Gather.apply(target_device, dim, *outputs)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/_functions.py", line 71, in forward
    return comm.gather(inputs, ctx.dim, ctx.target_device)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/parallel/comm.py", line 230, in gather
    return torch._C._gather(tensors, dim, destination)
RuntimeError: CUDA out of memory. Tried to allocate 400.00 MiB (GPU 0; 14.76 GiB total capacity; 13.07 GiB already allocated; 5.75 MiB free; 13.43 GiB reserved in total by PyTorch)

The training command I used is:

python train.py --data data/coco2017.data --batch-size 16 --cfg cfg/yolov4/yolov4.cfg --weights weights/yolo4_coco/qianyi_weight/best.pt --device 0,1,2,4 -sr --s 0.0001 --prune 0

I am using four Tesla T4 GPUs, and all four cards are idle, so why am I running out of GPU memory?
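For context, the DataParallel-style wrapper splits each batch across the visible cards, so a batch of 16 over four T4s is roughly 4 images per card. A minimal sketch for double-checking what PyTorch actually sees and how much memory is really free on each card before launching train.py (torch.cuda.mem_get_info needs a reasonably recent PyTorch release):

import torch

# How many CUDA devices PyTorch sees, and how much memory is free on each one.
for i in range(torch.cuda.device_count()):
    props = torch.cuda.get_device_properties(i)
    # free/total bytes on device i as reported by the CUDA driver
    # (includes memory held by other processes).
    free, total = torch.cuda.mem_get_info(i)
    print(f"cuda:{i} {props.name}: {free / 2**20:.0f} MiB free of {total / 2**20:.0f} MiB")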

SpursLipu commented 3 years ago

You could try reducing the batch size or the image size; YOLOv4 is just very memory-hungry.

chenxyyy commented 3 years ago

When training on a single T4, a batch size of 10 works fine. With four cards, a batch size of 16 fails, so it doesn't feel like a batch-size problem.
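The traceback shows where the imbalance comes from: it fails inside self.gather(outputs, self.output_device), i.e. after every card finishes its forward pass, the full prediction tensors from all replicas are copied back onto GPU 0 before the loss is computed. GPU 0 therefore has to hold its own activations plus every other card's outputs, which is why four cards at batch 16 can run out of memory on GPU 0 even though a single card handles batch 10. A minimal sketch of the usual workaround, with hypothetical names (this is not the repo's code): compute the loss inside the wrapped module so only one scalar per replica is gathered.

import torch
import torch.nn as nn

class ModelWithLoss(nn.Module):
    # Hypothetical wrapper: run the detector and its loss on the same card,
    # so the parallel wrapper gathers a single scalar per replica on GPU 0
    # instead of the full prediction tensors.
    def __init__(self, model, compute_loss):
        super().__init__()
        self.model = model                  # the YOLO model
        self.compute_loss = compute_loss    # assumed signature: (pred, targets) -> scalar loss

    def forward(self, imgs, targets):
        pred = self.model(imgs)
        loss = self.compute_loss(pred, targets)
        return loss.unsqueeze(0)            # shape (1,); cheap to copy back to GPU 0

# usage sketch:
#   parallel = nn.DataParallel(ModelWithLoss(model, compute_loss), device_ids=[0, 1, 2, 3])
#   loss = parallel(imgs, targets).mean()
#   loss.backward()

The other common fix is to launch one process per GPU (torch.distributed.launch with DistributedDataParallel using a single device per process), which avoids the gather step entirely.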

sharoseali commented 3 years ago

@chenxyyy did you solve that issue? I am pruning a model; I tried different pruning thresholds and even reduced my batch size to 1, but the same error keeps coming:

RuntimeError: CUDA out of memory. Tried to allocate 170.00 MiB (GPU 0; 7.79 GiB total capacity; 5.54 GiB already allocated; 43.25 MiB free; 6.11 GiB reserved in total by PyTorch)

@SpursLipu can you suggest something? Also, I tried pruning thresholds between 0.5 and 0.01, but every time the model's mAP after pruning is 0.0. What could the problem be?
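On the mAP question: with the network-slimming style pipeline this repo follows (sparse training on the BatchNorm scale factors, then channel pruning), mAP dropping to 0.0 right after pruning usually means either the cutoff removed far too many channels (sometimes an entire layer) or the pruned model simply has not been fine-tuned yet. A toy illustration, not the repo's pruning code, of how a global cutoff on BatchNorm gammas can leave a layer with almost nothing:

import torch
import torch.nn as nn

# Toy example only: pretend sparse training (-sr --s 0.0001) shrank most
# BatchNorm gammas below 0.05, then count how many channels survive
# different global cutoffs.
bn = nn.BatchNorm2d(64)
with torch.no_grad():
    bn.weight.copy_(torch.rand(64) * 0.05)

for cutoff in (0.5, 0.1, 0.01):
    kept = int((bn.weight.abs() > cutoff).sum())
    print(f"cutoff {cutoff}: {kept}/64 channels kept")

With a cutoff of 0.5 or 0.1 nothing survives in this toy layer, which is the kind of situation that drives mAP to 0; even with a sensible cutoff, some fine-tuning after pruning is normally needed before mAP recovers.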