dvlab-research / FocalsConv

Focal Sparse Convolutional Networks for 3D Object Detection (CVPR 2022, Oral)
https://arxiv.org/abs/2204.12463
Apache License 2.0

Backward error when training on nuScenes #13

Closed: klightz closed this issue 2 years ago

klightz commented 2 years ago

Thanks for contributing this wonderful work.

Previously, when I ran FocalsConv on KITTI, everything was fine. However, when I try to train on nuScenes using nusc_centerpoint_voxelnet_0075voxel_fix_bn_z_focal, I get an error:

File "det3d/torchie/apis/train.py", line 337, in train_detector
    trainer.run(data_loaders, cfg.workflow, cfg.total_epochs, local_rank=cfg.local_rank)
  File "det3d/torchie/trainer/trainer.py", line 553, in run
    epoch_runner(data_loaders[i], self.epoch, **kwargs)
  File "det3d/torchie/trainer/trainer.py", line 428, in train
    self.call_hook("after_train_iter")
  File "det3d/torchie/trainer/trainer.py", line 335, in call_hook
    getattr(hook, fn_name)(self)
  File "det3d/core/utils/dist_utils.py", line 54, in after_train_iter
    runner.outputs["loss"].backward()
  File "torch/_tensor.py", line 484, in backward
    torch.autograd.backward(
  File "torch/autograd/__init__.py", line 191, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [148285, 16]], which is output 0 of ReluBackward0, is at version 12; expected version 0 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!

I also tried the config of the normal CenterPoint voxel net, nusc_centerpoint_voxelnet_0075voxel_fix_bn_z.py, in this repo, and it trains smoothly. So I guess the problem occurs somewhere in the Focal Conv layer. Any idea about this problem? Any hint or suggestion about where to look for the possible error would also help. Thanks a lot.

yukang2017 commented 2 years ago

Thanks for your interest in our work.

This is a bug that appears with PyTorch versions later than 1.7.0. I have just updated the code and fixed it. I tested the code on PyTorch 1.11.0 and it runs. Please use the updated version of this file: https://github.com/dvlab-research/FocalsConv/blob/master/CenterPoint/det3d/models/fusion/voxel_with_point_projection.py
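
For illustration only (this is not the repository's actual code): the general pattern behind this RuntimeError is an in-place write into a tensor that autograd has saved for the backward pass, such as the output of a ReLU. Cloning before the in-place write, or using out-of-place operations, avoids it. A minimal sketch:

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 16, requires_grad=True)

# Problematic pattern: ReluBackward0 saves the ReLU output, so writing
# into it in place bumps its version counter and backward() raises
# "one of the variables needed for gradient computation has been
# modified by an inplace operation".
y = F.relu(x)
# y[:, :8] = 0.0            # <- uncommenting this makes backward() fail

# Safe pattern: clone (or compute out-of-place) before modifying.
y_safe = F.relu(x).clone()
y_safe[:, :8] = 0.0
y_safe.sum().backward()     # gradients flow without the error
```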

klightz commented 2 years ago

Thanks a lot, it works for me.

klightz commented 2 years ago

A quick additional question: I find that training on nuScenes 1/4 with Focal-multimodal is much slower than CenterPoint (4x V100, 40 CPUs with 256GB RAM). In particular, the forward and backward passes become extremely slow for some batches (it does not seem to be a data loading issue) and are normal for other batches.

Any idea what could potentially be causing this? I need faster training speed rather than the best performance, so I could slightly modify the code if possible. Thanks a lot!

yukang2017 commented 2 years ago

Would you please provide some logs or hints on this problem?

klightz commented 2 years ago

Never mind, it should be a data loading problem; I am just not sure why that time is counted as forward time. I will take a detailed look and temporarily close this issue. If I cannot solve it, I may paste some log information here. Thanks a lot for the quick reply!
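
For anyone hitting the same confusion: with an asynchronous CUDA pipeline, time spent waiting on the dataloader or on queued kernels is easily attributed to the forward pass unless the GPU is synchronized around each measurement. A rough, generic timing sketch (the model, data_loader and prepare_batch names are placeholders, not the det3d API):

```python
import time
import torch

def profile_loader_vs_forward(model, data_loader, prepare_batch):
    """Rough per-iteration breakdown of data-loading vs. forward time.

    `model`, `data_loader` and `prepare_batch` are placeholders for
    whatever the training script actually builds; only standard
    PyTorch calls are used here.
    """
    it = iter(data_loader)
    for step in range(len(data_loader)):
        t0 = time.perf_counter()
        batch = next(it)               # blocks here if workers fall behind
        t1 = time.perf_counter()

        inputs = prepare_batch(batch)  # e.g. move tensors to the GPU
        with torch.no_grad():
            model(inputs)
        torch.cuda.synchronize()       # flush queued kernels before timing
        t2 = time.perf_counter()

        print(f"step {step}: data {t1 - t0:.3f}s  forward {t2 - t1:.3f}s")
```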

yukang2017 commented 2 years ago

Thanks for the information. Please feel free to reopen it.

klightz commented 2 years ago

Thanks a lot. Maybe one more question regarding this issue: how much RAM and how many CPUs are you using for the 4-GPU nuScenes training? It would be helpful for my timing bottleneck analysis.

yukang2017 commented 2 years ago

We use 4 NVIDIA V100 GPUs and 32 CPU cores for training.
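
For reference when tuning the CPU-side bottleneck: in det3d/mmdet-style configs the dataloader worker count is usually set in the data dict; the exact keys and values should be verified against the actual nuScenes config file. Something along these lines:

```python
# Sketch only: worker settings commonly found in det3d/mmdet-style configs;
# check the actual config file before changing them.
data = dict(
    samples_per_gpu=4,    # batch size per GPU
    workers_per_gpu=8,    # CPU worker processes per GPU for data loading
    # train=..., val=..., test=...  (dataset definitions omitted)
)
```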