aim-uofa / AdelaiDet

AdelaiDet is an open source toolbox for multiple instance-level detection and recognition tasks.
https://git.io/AdelaiDet

ABCNet training ERROR with custom datasets #129

Closed Eurus-Holmes closed 4 years ago

Eurus-Holmes commented 4 years ago

Hi, I am training with custom datasets, following this issue.

@shuangyichen @Yuliang-Liu Run train_net.py with the command "OMP_NUM_THREADS=1 python tools/train_net.py --config-file configs/BAText/TotalText/attn_R_50.yaml --num-gpus 1"

dataset directory layout: datasets

specify the training images and annotations in "builtin.py": "mydataset_train": ("mydataset/train_img", "mydataset/annotations/train.json")

specify the training config in "configs/BAText/TotalText/Base-TotalText.yaml": DATASETS: TRAIN: ("mydataset_train",) TEST: ("mydataset_train",)

Originally posted by @chenyangMl in https://github.com/aim-uofa/AdelaiDet/issues/100#issuecomment-644056170
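
For concreteness, a rough sketch of what that registration step in builtin.py could look like is below. It uses detectron2's generic register_coco_instances helper; AdelaiDet's actual builtin.py may register ABCNet's text/bezier annotations through its own helper, so treat the names and paths (taken from the quoted setup) as an illustration only.

# Hypothetical sketch of the registration step described above; AdelaiDet's
# builtin.py may use a text-specific helper instead of the generic COCO one.
from detectron2.data.datasets import register_coco_instances

register_coco_instances(
    "mydataset_train",                              # the name referenced by DATASETS.TRAIN
    {},                                             # extra metadata (left empty here)
    "datasets/mydataset/annotations/train.json",    # COCO-style annotation file
    "datasets/mydataset/train_img",                 # image root
)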

But when I run training with this setup, it crashes with the following error:

Traceback (most recent call last):
  File "tools/train_net.py", line 243, in <module>
    args=(args,),
  File "./AdelaiDet/env/lib/python3.7/site-packages/detectron2/engine/launch.py", line 57, in launch
    main_func(*args)
  File "tools/train_net.py", line 231, in main
    return trainer.train()
  File "tools/train_net.py", line 113, in train
    self.train_loop(self.start_iter, self.max_iter)
  File "tools/train_net.py", line 102, in train_loop
    self.run_step()
  File "./AdelaiDet/env/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 228, in run_step
    losses.backward()
  File "./AdelaiDet/env/lib/python3.7/site-packages/torch/tensor.py", line 198, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "./AdelaiDet/env/lib/python3.7/site-packages/torch/autograd/__init__.py", line 100, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: cuDNN error: CUDNN_STATUS_EXECUTION_FAILED (_cudnn_rnn_backward_input at /pytorch/aten/src/ATen/native/cudnn/RNN.cpp:931)
frame #0: c10::Error::Error(c10::SourceLocation, std::string const&) + 0x46 (0x7f0388a65536 in ./AdelaiDet/env/lib/python3.7/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xf55aa7 (0x7f0389e16aa7 in ./AdelaiDet/env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #2: at::native::_cudnn_rnn_backward(at::Tensor const&, c10::ArrayRef<at::Tensor>, long, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, at::Tensor const&, long, long, long, bool, double, bool, bool, c10::ArrayRef<long>, at::Tensor const&, at::Tensor const&, std::array<bool, 4ul>) + 0x1a9 (0x7f0389e18db9 in ./AdelaiDet/env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #3: <unknown function> + 0xfdab4d (0x7f0389e9bb4d in ./AdelaiDet/env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xfdc2e3 (0x7f0389e9d2e3 in ./AdelaiDet/env/lib/python3.7/site-packages/torch/lib/libtorch_cuda.so)
frame #5: <unknown function> + 0x2b08450 (0x7f03c327b450 in ./AdelaiDet/env/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #6: <unknown function> + 0x2b7b8a3 (0x7f03c32ee8a3 in ./AdelaiDet/env/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #7: torch::autograd::generated::CudnnRnnBackward::apply(std::vector<at::Tensor, std::allocator<at::Tensor> >&&) + 0x708 (0x7f03c302fd28 in ./AdelaiDet/env/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x2d89c05 (0x7f03c34fcc05 in ./AdelaiDet/env/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #9: torch::autograd::Engine::evaluate_function(std::shared_ptr<torch::autograd::GraphTask>&, torch::autograd::Node*, torch::autograd::InputBuffer&) + 0x16f3 (0x7f03c34f9f03 in ./AdelaiDet/env/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::autograd::Engine::thread_main(std::shared_ptr<torch::autograd::GraphTask> const&, bool) + 0x3d2 (0x7f03c34face2 in ./AdelaiDet/env/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #11: torch::autograd::Engine::thread_init(int) + 0x39 (0x7f03c34f3359 in ./AdelaiDet/env/lib/python3.7/site-packages/torch/lib/libtorch_cpu.so)
frame #12: torch::autograd::python::PythonEngine::thread_init(int) + 0x38 (0x7f03cfc32828 in ./AdelaiDet/env/lib/python3.7/site-packages/torch/lib/libtorch_python.so)
frame #13: <unknown function> + 0xee0f (0x7f03d081ee0f in ./AdelaiDet/env/lib/python3.7/site-packages/torch/_C.cpython-37m-x86_64-linux-gnu.so)
frame #14: <unknown function> + 0x76ba (0x7f03d2ec46ba in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #15: clone + 0x6d (0x7f03d2bfa41d in /lib/x86_64-linux-gnu/libc.so.6)

However, I can run the ABCNet demo successfully (without changing anything), so what could be going wrong here?
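
(A generic debugging step, not something suggested in this thread: disabling cuDNN, or launching kernels synchronously, often turns an opaque CUDNN_STATUS_EXECUTION_FAILED into a clearer error from the fallback or failing kernel.)

# Debugging sketch (assumption: not part of the original report). With cuDNN
# disabled, PyTorch falls back to native RNN kernels, which typically raise a
# more descriptive error than CUDNN_STATUS_EXECUTION_FAILED.
import torch
torch.backends.cudnn.enabled = False

# Alternatively, make kernel launches synchronous so the traceback points at the
# actual failing op:
#   CUDA_LAUNCH_BLOCKING=1 OMP_NUM_THREADS=1 python tools/train_net.py \
#       --config-file configs/BAText/TotalText/attn_R_50.yaml --num-gpus 1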

Eurus-Holmes commented 4 years ago

BTW, my environment:

torch.cuda.is_available() = True
torch.backends.cudnn.is_available() = True
torch.backends.cudnn.version() = 7603
CUDA Version: 10.1
torch                  1.5.0+cu101
torchvision            0.6.0+cu101

I don't think this is an environment problem.
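
For reference, the values above come from PyTorch's standard introspection calls:

# Quick environment report (the same calls that produced the values above).
import torch, torchvision
print("torch:", torch.__version__, "torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
print("cuDNN available:", torch.backends.cudnn.is_available())
print("cuDNN version:", torch.backends.cudnn.version())
print("CUDA version:", torch.version.cuda)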

Yuliang-Liu commented 4 years ago

@Eurus-Holmes This doesn't seem like an environment problem, assuming you have been able to run the provided demo.

Can you make sure the json file is correct (same structure as the one we provide) and also try to include more training samples? Could you also try using a pretrained model to finetune on your custom dataset?
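
A quick way to check the json structure is to load the custom file and the provided one and compare their keys; the exact per-annotation fields ABCNet expects (e.g. the text/bezier entries) should be verified against the official TotalText json, and the paths below are placeholders.

# Hypothetical structure check: compare a custom json against the official one.
import json

def describe(path):
    with open(path) as f:
        data = json.load(f)
    ann_keys = sorted(data["annotations"][0].keys()) if data["annotations"] else []
    return sorted(data.keys()), ann_keys, len(data["images"]), len(data["annotations"])

print("official:", describe("datasets/totaltext/train.json"))              # placeholder path
print("custom:  ", describe("datasets/mydataset/annotations/train.json"))  # placeholder path
# Top-level keys and per-annotation keys should match; missing or differently
# named fields are a common cause of broken training on custom data.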

Eurus-Holmes commented 4 years ago

@Yuliang-Liu Yes, I can make sure my JSON file is correct. What I am doing is using a pretrained model (weights/batext/pretrain_attn_R_50.pth) to finetune on my custom dataset.

After updating AdelaiDet and detectron2, this weird cuDNN error is gone, but now I am running into the same problem as this issue.

Eurus-Holmes commented 4 years ago

@stan-haochen Following this pull request, I have changed the adet/data/augmentation.py, adet/data/dataset_mapper.py, and adet/data/detection_utils.py files, and the problem from that issue is solved. However, the original cuDNN error occurred again...

stan-haochen commented 4 years ago

@stan-haochen Following this pull request, I have changed the adet/data/augmentation.py, adet/data/dataset_mapper.py, and adet/data/detection_utils.py files, and the problem from that issue is solved. However, the original cuDNN error occurred again...

Could you check if training with the official data works?

Eurus-Holmes commented 4 years ago

@stan-haochen I have reinstalled the latest AdelaiDet and detectron2, and now I can train on the official TotalText dataset with a pretrained model (weights/batext/pretrain_attn_R_50.pth). However, when I train on the custom dataset, a new error occurs:

Traceback (most recent call last):
  File "tools/train_net2.py", line 244, in <module>
    args=(args,),
  File "./ABCNet/AdelaiDet/env/lib/python3.7/site-packages/detectron2/engine/launch.py", line 62, in launch
    main_func(*args)
  File "tools/train_net2.py", line 232, in main
    return trainer.train()
  File "tools/train_net2.py", line 114, in train
    self.train_loop(self.start_iter, self.max_iter)
  File "tools/train_net2.py", line 103, in train_loop
    self.run_step()
  File "./ABCNet/AdelaiDet/env/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 218, in run_step
    self._detect_anomaly(losses, loss_dict)
  File "./ABCNet/AdelaiDet/env/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 241, in _detect_anomaly
    self.iter, loss_dict
FloatingPointError: Loss became infinite or NaN at iteration=2!
loss_dict = {'rec_loss': tensor(nan, device='cuda:0', grad_fn=<MulBackward0>), 'loss_fcos_cls': tensor(0.2020, device='cuda:0', grad_fn=<DivBackward0>), 'loss_fcos_loc': tensor(0.1354, device='cuda:0', grad_fn=<DivBackward0>), 'loss_fcos_ctr': tensor(0.6088, device='cuda:0', grad_fn=<DivBackward0>), 'loss_fcos_bezier': tensor(2.2358, device='cuda:0', grad_fn=<DivBackward0>)}

Should I modify other configurations? PS: could you please reopen this issue? Thx!
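
For context, the error above comes from a per-iteration finiteness check on the summed losses inside detectron2's trainer; a minimal illustration of that check (not the library's exact code) looks like this:

# Minimal illustration of the check behind the FloatingPointError above: every
# iteration, the trainer verifies that the total loss is finite.
import torch

def check_losses(loss_dict, iteration):
    losses = sum(loss_dict.values())
    if not torch.isfinite(losses).all():
        raise FloatingPointError(
            f"Loss became infinite or NaN at iteration={iteration}!\nloss_dict = {loss_dict}"
        )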

stan-haochen commented 4 years ago

I don't think this problem is related to the code. It seems that you should check your data and adjust the hyperparams carefully.

One suggestion is to modify the loss weights so that they stay at the same scale as on the official datasets.

Eurus-Holmes commented 4 years ago

@stan-haochen Hi, what do you mean by modifying the loss weights to keep them at the same scale as on the official datasets? The only loss weight I found is in adet/config/defaults.py: _C.MODEL.BASIS_MODULE.LOSS_WEIGHT = 0.3. Under what conditions should this value be changed, and to what?

stan-haochen commented 4 years ago

You are free to change the code to make it work for your case.

Eurus-Holmes commented 4 years ago

@stan-haochen Hi, I have solved this problem by reducing the learning rate and increasing the loss weight, thanks for your help!
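
For anyone applying the same fix, here is a sketch of how such overrides can be made without editing the yaml files. The key names and values below (SOLVER.BASE_LR 0.0001, MODEL.BASIS_MODULE.LOSS_WEIGHT 0.5) are placeholders, so pick the keys from adet/config/defaults.py and the BAText configs that actually drive your loss terms.

# Hypothetical override sketch: smaller learning rate, larger loss weight.
from adet.config import get_cfg   # AdelaiDet's config, which extends detectron2's defaults

cfg = get_cfg()
cfg.merge_from_file("configs/BAText/TotalText/attn_R_50.yaml")
cfg.merge_from_list([
    "SOLVER.BASE_LR", "0.0001",                # reduced learning rate (placeholder value)
    "MODEL.BASIS_MODULE.LOSS_WEIGHT", "0.5",   # increased loss weight (placeholder value)
])

# The same key/value pairs can usually be appended to the training command after
# the config file, e.g.: python tools/train_net.py --config-file ... SOLVER.BASE_LR 0.0001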

innat commented 4 years ago

FloatingPointError: Loss became infinite or NaN

I would like to share something on this. I tried to pre-train with the provided synthetic samples and this error occurred. The thing I changed is IMG_PER_BATCH: 2 instead of 8 (the default), because I am using a single Tesla T4 GPU; I assume the original experiment was done in a more powerful environment. However, I set the lr to 0.001 (by default 0.0), and it solved this error.

Eurus-Holmes commented 4 years ago

FloatingPointError: Loss became infinite or NaN

I would like to share something on this. I tried to pre-train with the provided synthetic samples and this error occurred. The thing I changed is IMG_PER_BATCH: 2 instead of 8 (the default), because I am using a single Tesla T4 GPU; I assume the original experiment was done in a more powerful environment. However, I set the lr to 0.001 (by default 0.0), and it solved this error.

That's right. I also changed LOSS_WEIGHT.