drsrinathsridhar / xnocs

Multiview Aggregation for Learning Category-Specific Shape Reconstruction, NeurIPS 2019

Multi-GPU training is giving error #3

Closed deepak242424 closed 2 years ago

deepak242424 commented 4 years ago

Hi,

I am trying to run single-view training on multiple GPUs, but I get the following error:

File "/1scratch/deepack/geomni/15-xnocs/xnocs/noxray/nxm/../../models/SegNet.py", line 97, in init_vgg16_params
units = [conv_block.conv1.cbr_unit, conv_block.conv2.cbr_unit]
AttributeError: 'DataParallel' object has no attribute 'conv1'
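For context, this error usually appears because nn.DataParallel wraps the original network, so the wrapped module's attributes are only reachable through its .module field. A minimal sketch of that behaviour (with a hypothetical Conv2d layer, not the actual SegNet block):

    import torch.nn as nn

    # nn.DataParallel wraps the original module, so attributes defined on the
    # wrapped module are only reachable through .module (hypothetical layer).
    layer = nn.DataParallel(nn.Conv2d(3, 16, kernel_size=3))

    # layer.weight               # AttributeError: 'DataParallel' object has no attribute 'weight'
    weight = layer.module.weight  # works: the wrapped Conv2d lives under .module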

I am running the following command from the xnocs/noxray/nxm directory: python nxm.py --mode train --input-dir shapenetplain_v1 --output-dir ../output --expt-name XNOCS_SV --category cars --arch SegNetSkip --seed 0 --gpu {1,2,3}

After searching for the above error on Google, one suggested solution was to change conv_block to conv_block.module in the SegNet.py file. After doing that, I get the following error:

  File "/home/deepack/.conda/envs/wave/lib/python3.6/site-packages/tk3dv/ptTools/ptNets.py", line 194, in fit
    Output = self.forward(DataTD)
  File "/1scratch/deepack/geomni/15-xnocs/xnocs/noxray/nxm/../../models/SegNet.py", line 57, in forward
    down1, indices_1, unpool_shape1, FM1 = self.down1(inputs)
  File "/home/deepack/.conda/envs/wave/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/deepack/.conda/envs/wave/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 153, in forward
    return self.gather(outputs, self.output_device)
  File "/home/deepack/.conda/envs/wave/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 165, in gather
    return gather(outputs, output_device, dim=self.dim)
  File "/home/deepack/.conda/envs/wave/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 67, in gather
    return gather_map(outputs)
  File "/home/deepack/.conda/envs/wave/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/deepack/.conda/envs/wave/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
  File "/home/deepack/.conda/envs/wave/lib/python3.6/site-packages/torch/nn/parallel/scatter_gather.py", line 62, in gather_map
    return type(out)(map(gather_map, zip(*outputs)))
TypeError: zip argument #1 must support iteration
[ WARN ]: Exception detected. *NOT* saving checkpoint. zip argument #1 must support iteration
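For what it's worth, this second failure seems to come from DataParallel's gather step: it can merge tensors (and nested tuples of tensors) returned by each replica, but the SegNet down blocks also return non-tensor values such as the unpooling shape, a torch.Size of plain ints that gather ends up trying to zip. A rough repro sketch under that assumption (hypothetical module, not the project's code):

    import torch
    import torch.nn as nn

    # On more than one GPU, DataParallel gathers each replica's outputs, and
    # gather only knows how to merge tensors and containers of tensors. A
    # torch.Size is a tuple of ints, and recursing into its int elements raises
    # "zip argument #1 must support iteration" (assumed cause, hypothetical module).
    class Block(nn.Module):
        def forward(self, x):
            return x * 2, x.shape  # a torch.Size returned alongside a tensor

    block = nn.DataParallel(Block()).cuda()
    out = block(torch.randn(8, 3, 32, 32).cuda())  # TypeError when using >1 GPU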

Can you please let me know if I need to do anything else for multi-GPU training? I am using PyTorch version 1.1.0.

drsrinathsridhar commented 4 years ago

I haven't extensively tested the multi-GPU version since the speedup gain was marginal. I would recommend sticking with the single-GPU version, but I will try to fix this bug at some point.

deepak242424 commented 4 years ago

> I haven't extensively tested the multi-GPU version since the speedup gain was marginal. I would recommend sticking with the single-GPU version, but I will try to fix this bug at some point.

Thanks @drsrinathsridhar for the quick reply. One last thing: can you please tell me how long it took to train your models? It took me more than 3 days to train the multiview model with 5 views on a single GPU for 100 epochs on the shapenetplain_v1 dataset. Does that seem reasonable?

drsrinathsridhar commented 4 years ago

What GPU are you using? On an Nvidia V100, I think it's about 2 days for ShapeNetCOCO. @davrempe could give you an exact number.

deepak242424 commented 4 years ago

> What GPU are you using? On an Nvidia V100, I think it's about 2 days for ShapeNetCOCO. @davrempe could give you an exact number.

I am using an NVIDIA 1080 Ti. And you ran for 100 epochs with batch_size=1 (as mentioned in the paper), right?

davrempe commented 4 years ago

Yes, we ran 100 epochs with a batch size of 1 on a single V100. For 5 views, the most time-consuming category is chairs, which took about 2 days for ShapeNetCOCO. Cars and planes took a little over 1 day each. Considering the differences between the 1080 Ti and the V100, 3 days does not seem unreasonable.

akaganeite commented 2 years ago

Hi, when I tried to do multi-view training on multiple GPUs, I got the same error: AttributeError: 'DataParallel' object has no attribute 'conv1'. But when I tried single-GPU training, CUDA ran out of memory. I have 3 Tesla M40 GPUs, each with 24 GB of memory. Is there a possible solution? Thanks.

drsrinathsridhar commented 2 years ago

Please make sure to use a small batch size (flag: --batch-size 1). You will run out of memory if you use larger batch sizes. 24 GB should be more than enough for a batch size of 1.
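As an illustration, a single-GPU run with the smallest batch size would look something like this, based on the command quoted earlier in this thread (assuming --gpu also accepts a single index; the paths and category are just the ones used above):

    python nxm.py --mode train --input-dir shapenetplain_v1 --output-dir ../output --expt-name XNOCS_SV --category cars --arch SegNetSkip --seed 0 --gpu 0 --batch-size 1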