V2AI / Det3D

World's first general purpose 3D object detection codebase.
https://arxiv.org/abs/1908.09492
Apache License 2.0

multi-GPU training error #36

Closed. muzi2045 closed this issue 4 years ago

muzi2045 commented 4 years ago

I'm trying to train CBGS on 8 GPUs (2080 Ti) with the latest repo code, using the following command to start:

python3 -m torch.distributed.launch --nproc_per_node=8 ./tools/train.py examples/cbgs/configs/nusc_all_vfev3_spmiddleresnetfhd_rpn2_mghead_syncbn.py --work_dir=/home/ubuntu/Documents/Det3D/trained_model

The error appears to happen in SyncBN:

    return SyncBatchnormFunction.apply(input, z, self.weight, self.bias, self.running_mean, self.running_var, self.eps, self.training or not self.track_running_stats, exponential_average_factor, self.process_group, self.channel_last, self.fuse_relu)
  File "/home/ubuntu/.local/lib/python3.6/site-packages/apex/parallel/optimized_sync_batchnorm_kernel.py", line 26, in forward
    mean, var_biased = syncbn.welford_mean_var(input)
RuntimeError: Dimension out of range (expected to be in range of [-2, 1], but got 2) (maybe_wrap_dim at /pytorch/c10/core/WrapDimMinimal.h:20)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x33 (0x7f92265a1813 in /home/ubuntu/.local/lib/python3.6/site-packages/torch/lib/libc10.so)

By the way, the server environment is PyTorch 1.3.1 + CUDA 10.1 + Python 3.6. @poodarchu

poodarchu commented 4 years ago

Check your data. The error says: Dimension out of range (expected to be in range of [-2, 1], but got 2)

muzi2045 commented 4 years ago

The error occurs in the backbone (SpMiddleResNetModule, which contains BN layers). Does that mean I should check the forward pipeline in this part?

        # input: # [41, 1600, 1408]
        sparse_shape = np.array(input_shape[::-1]) + [1, 0, 0]
        print("sparse shape:", sparse_shape)

        coors = coors.int()
        ret = spconv.SparseConvTensor(voxel_features, coors, sparse_shape, batch_size)
        ret = self.middle_conv(ret)
        ret = ret.dense()

        N, C, D, H, W = ret.shape
        ret = ret.view(N, C * D, H, W)

        return ret

poodarchu commented 4 years ago

Yes, 1d conv should be used here (the SyncBN layer).

muzi2045 commented 4 years ago

The SyncBN input tensor shapes:

syncbn input shape: torch.Size([75100, 16])
syncbn input shape: torch.Size([76985, 16])
syncbn input shape: torch.Size([79617, 16])
syncbn input shape: torch.Size([81551, 16])
syncbn input shape: torch.Size([79113, 16])
syncbn input shape: torch.Size([68469, 16])
syncbn input shape: torch.Size([61096, 16])
syncbn input shape: torch.Size([77184, 16])
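For context, each of the shapes above is two-dimensional (number of voxels, channels), and a rank-2 tensor only accepts dimension indices from -2 to 1, which matches the range quoted in the RuntimeError. The following is a minimal pure-Python sketch of that range check. `wrap_dim` is a hypothetical stand-in written for illustration, not PyTorch's actual `maybe_wrap_dim` implementation.

```python
def wrap_dim(dim: int, rank: int) -> int:
    """Map a possibly negative dimension index onto [0, rank - 1].

    Hypothetical helper mirroring the range check reported in the traceback.
    """
    lo, hi = -rank, rank - 1
    if not lo <= dim <= hi:
        raise IndexError(
            f"Dimension out of range (expected to be in range of [{lo}, {hi}], but got {dim})"
        )
    return dim + rank if dim < 0 else dim


syncbn_input_shape = (75100, 16)  # first shape from the log above
rank = len(syncbn_input_shape)

print(wrap_dim(-1, rank))  # last dimension of a rank-2 tensor
try:
    wrap_dim(2, rank)  # a third dimension does not exist on a rank-2 tensor
except IndexError as err:
    print(err)
```

Read this way, the error would be consistent with a kernel that expects at least three dimensions being handed the (N, 16) feature matrix, although the thread itself does not confirm where the index 2 comes from.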

I tried to print the shape info of the spconv.SparseConvTensor but couldn't, so I just used this:

print(" ret shape:", ret.dense().shape)

here is the output:

ret shape: torch.Size([4, 5, 41, 1008, 1008])
 ret shape: torch.Size([4, 5, 41, 1008, 1008])
 ret shape: torch.Size([4, 5, 41, 1008, 1008])
 ret shape: torch.Size([4, 5, 41, 1008, 1008])
 ret shape: torch.Size([4, 5, 41, 1008, 1008])
 ret shape: torch.Size([4, 5, 41, 1008, 1008])
 ret shape: torch.Size([4, 5, 41, 1008, 1008])

Is there any way to inspect the shape info of a spconv.SparseConvTensor? @poodarchu
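One possible way to inspect a sparse tensor without densifying it is to read its attributes directly. The sketch below assumes the tensor exposes `features`, `indices`, `spatial_shape` and `batch_size` attributes, as `SparseConvTensor` does in spconv 1.x. `describe_sparse` is a hypothetical helper, and the stand-in objects (with illustrative values) replace real tensors so the example runs without spconv or PyTorch.

```python
from types import SimpleNamespace


def describe_sparse(tensor) -> dict:
    """Collect shape information from a sparse tensor-like object.

    Assumes attributes named features, indices, spatial_shape and batch_size;
    missing attributes are reported as None rather than raising.
    """
    features = getattr(tensor, "features", None)
    indices = getattr(tensor, "indices", None)
    spatial_shape = getattr(tensor, "spatial_shape", None)
    return {
        "features": tuple(features.shape) if features is not None else None,
        "indices": tuple(indices.shape) if indices is not None else None,
        "spatial_shape": list(spatial_shape) if spatial_shape is not None else None,
        "batch_size": getattr(tensor, "batch_size", None),
    }


# Stand-in for a SparseConvTensor: 75100 active voxels with 16 channels each,
# and one index row (batch index plus three spatial coordinates) per voxel.
fake = SimpleNamespace(
    features=SimpleNamespace(shape=(75100, 16)),
    indices=SimpleNamespace(shape=(75100, 4)),
    spatial_shape=[41, 1600, 1408],
    batch_size=4,
)
print(describe_sparse(fake))
```

In the forward pass this could be called as `print(describe_sparse(ret))` before `ret.dense()`, provided the installed spconv version uses those attribute names.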

poodarchu commented 4 years ago

I guess you can run it fine on a single GPU, so the problem is related to the BatchNorm layer; you should use BatchNorm1d.

muzi2045 commented 4 years ago

It's strange: in the code, the BN layers are set to BatchNorm1d

# norm_cfg == None
if norm_cfg is None:
    norm_cfg = dict(type="BN1d", eps=1e-3, momentum=0.01)

but the printed structure of the whole network is:

(backbone): SpMiddleResNetFHD(
      (middle_conv): SparseSequential(
        (0): SubMConv3d()
        (1): SyncBatchNorm(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
        (2): ReLU()
        (3): SparseBasicBlock(
          (conv1): SubMConv3d()
          (bn1): SyncBatchNorm(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
          (relu): ReLU()
          (conv2): SubMConv3d()
          (bn2): SyncBatchNorm(16, eps=0.001, momentum=0.01, affine=True, track_running_stats=True)
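A fallback such as the `norm_cfg is None` check quoted earlier only applies when no `norm_cfg` is passed in, so a config that supplies its own value would bypass it. The config name in the launch command ends in `_syncbn`, which suggests (though the thread does not confirm) that this is what happens here. The sketch below illustrates the pattern in plain Python. `resolve_norm_cfg` and the `"SyncBN"` type string are hypothetical and are not taken from Det3D's code.

```python
from typing import Optional


def resolve_norm_cfg(norm_cfg: Optional[dict] = None) -> dict:
    """Return the normalization config, falling back to BN1d only when unset."""
    if norm_cfg is None:
        norm_cfg = dict(type="BN1d", eps=1e-3, momentum=0.01)
    return norm_cfg


# No override: the BN1d default is used.
print(resolve_norm_cfg()["type"])

# Hypothetical override from a config file: the default is never reached.
print(resolve_norm_cfg(dict(type="SyncBN", eps=1e-3, momentum=0.01))["type"])
```

Printing the resolved config at model-build time is one way to check which normalization layer a given run actually requested.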

See this issue in apex: https://github.com/NVIDIA/apex/issues/194 @poodarchu

muzi2045 commented 4 years ago

Reinstalling apex fixed it.