facebookresearch / detectron2

Detectron2 is a platform for object detection, segmentation and other visual recognition tasks.
https://detectron2.readthedocs.io/en/latest/
Apache License 2.0

"AssertionError" at the end of the training #82

Closed: yiningzeng closed this issue 4 years ago

yiningzeng commented 4 years ago
[10/15 13:57:56 d2.engine.hooks]: Overall training speed: 22047 iterations in 2:33:41 (0.4183 s / it)
[10/15 13:57:56 d2.engine.hooks]: Total training time: 2:37:09 (0:03:28 on hooks)
Traceback (most recent call last):
  File "tools/train_net.py", line 161, in <module>
    args=(args,),
  File "/usr/local/lib/python3.6/dist-packages/detectron2/engine/launch.py", line 52, in launch
    main_func(*args)
  File "tools/train_net.py", line 149, in main
    return trainer.train()
  File "/usr/local/lib/python3.6/dist-packages/detectron2/engine/defaults.py", line 329, in train
    super().train(self.start_iter, self.max_iter)
  File "/usr/local/lib/python3.6/dist-packages/detectron2/engine/train_loop.py", line 132, in train
    self.run_step()
  File "/usr/local/lib/python3.6/dist-packages/detectron2/engine/train_loop.py", line 212, in run_step
    loss_dict = self.model(data)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/detectron2/modeling/meta_arch/rcnn.py", line 82, in forward
    proposals, proposal_losses = self.proposal_generator(images, features, gt_instances)
  File "/usr/local/lib/python3.6/dist-packages/torch/nn/modules/module.py", line 541, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/dist-packages/detectron2/modeling/proposal_generator/rpn.py", line 172, in forward
    outputs.predict_proposals(),
  File "/usr/local/lib/python3.6/dist-packages/detectron2/modeling/proposal_generator/rpn_outputs.py", line 416, in predict_proposals
    pred_anchor_deltas_i, anchors_i.tensor
  File "/usr/local/lib/python3.6/dist-packages/detectron2/modeling/box_regression.py", line 76, in apply_deltas
    assert torch.isfinite(deltas).all().item()
AssertionError
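
For context, the assertion that fails is detectron2's sanity check in box_regression.py: every predicted box regression delta coming out of the RPN must be finite. A minimal sketch of the same check in isolation (illustrative values, not detectron2 code):

import torch

# Hypothetical RPN output where one delta has already become NaN.
deltas = torch.tensor([[0.5, float("nan"), 0.1, -0.2]])

# The same check apply_deltas performs; any NaN/Inf delta trips it.
assert torch.isfinite(deltas).all().item()  # raises AssertionError

So the AssertionError is a symptom: something upstream in the network has already produced NaN or Inf values.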

Environment

---------------------  --------------------------------------------------
Python                 3.6.8 (default, Oct  7 2019, 12:59:55) [GCC 8.3.0]
Detectron2 Compiler    GCC 7.4
DETECTRON2_ENV_MODULE  <not set>
PyTorch                1.3.0
PyTorch Debug Build    False
CUDA available         True
GPU 0                  GeForce GTX 1060 6GB
Pillow                 6.2.0
cv2                    4.1.1
---------------------  --------------------------------------------------
PyTorch built with:
  - GCC 7.3
  - Intel(R) Math Kernel Library Version 2019.0.4 Product Build 20190411 for Intel(R) 64 architecture applications
  - Intel(R) MKL-DNN v0.20.5 (Git Hash 0125f28c61c1f822fd48570b4c1066f96fcb9b2e)
  - OpenMP 201511 (a.k.a. OpenMP 4.5)
  - NNPACK is enabled
  - CUDA Runtime 10.1
  - NVCC architecture flags: -gencode;arch=compute_35,code=sm_35;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_61,code=sm_61;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_50,code=compute_50
  - CuDNN 7.6.3
  - Magma 2.5.1
  - Build settings: BLAS=MKL, BUILD_NAMEDTENSOR=OFF, BUILD_TYPE=Release, CXX_FLAGS= -Wno-deprecated -fvisibility-inlines-hidden -fopenmp -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -O2 -fPIC -Wno-narrowing -Wall -Wextra -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-sign-compare -Wno-unused-parameter -Wno-unused-variable -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Wno-stringop-overflow, DISABLE_NUMA=1, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, USE_CUDA=True, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=ON, USE_NNPACK=ON, USE_OPENMP=ON, USE_STATIC_DISPATCH=OFF,
ppwwyyxx commented 4 years ago

That's not the end of training. The training has stopped because your model has diverged to NaN or infinite values.
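
One way to catch this a few iterations earlier than the assert is to check each loss for finiteness on every step. A hedged sketch, assuming the loss_dict of named scalar tensors that detectron2's trainer builds in run_step (the helper name is made up):

import torch

def check_losses_finite(loss_dict, iteration):
    # Hypothetical helper: fail fast with a readable message instead of
    # waiting for a downstream assert once NaN/Inf has propagated.
    for name, value in loss_dict.items():
        if not torch.isfinite(value).all():
            raise FloatingPointError(
                f"Loss '{name}' became {value} at iteration {iteration}"
            )

Calling it right after loss_dict = self.model(data) would point at the first loss to blow up rather than at the box regression assert.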

yiningzeng commented 4 years ago

> That's not the end of training. The training has stopped because your model has diverged to NaN or infinite values.

Does that mean there is something wrong with my custom dataset?

ppwwyyxx commented 4 years ago

I can only say there is something wrong in your training -- which is a combination of your dataset, your model, and your configuration.

yiningzeng commented 4 years ago

> I can only say there is something wrong in your training -- which is a combination of your dataset, your model, and your configuration.

Thanks, I will check.

yoosan commented 4 years ago

I got the same error when training on the COCO 2017 dataset. P.S. I have not modified any config or code.

fpoms commented 4 years ago

I've also noticed the same issue when training out of the box for LVIS Instance Segmentation (specifically mask_rcnn_R_101_FPN_1x.yaml). The only modification I made was changing IMS_PER_BATCH from 16 to 4.

yoosan commented 4 years ago

> I got the same error when training on the COCO 2017 dataset. P.S. I have not modified any config or code.

Hi @yiningzeng, could you reopen this issue?

ppwwyyxx commented 4 years ago

> The only modification I made was changing IMS_PER_BATCH from 16 to 4.

That definitely sounds like a modification that could lead to this issue.

> I got the same error when training on the COCO 2017 dataset. P.S. I have not modified any config or code.

If you run into this issue with unmodified config and code, please include details following the issue template, with full command and full logs.

yoosan commented 4 years ago

I would like to say that assert torch.isfinite(deltas).all().item() is very sensitive to hyperparameter changes, such as learning rate, batch size, etc. Setting the learning rate to half (0.02 -> 0.01) solved this problem.
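
Halving the learning rate when the effective batch size halves is consistent with the linear scaling rule the reference configs are tuned around. A hedged sketch, assuming the COCO baselines' pairing of IMS_PER_BATCH = 16 with BASE_LR = 0.02:

from detectron2.config import get_cfg

cfg = get_cfg()
# Assumed reference point: batch size 16 at learning rate 0.02.
cfg.SOLVER.IMS_PER_BATCH = 8  # e.g. halved to fit fewer GPUs
cfg.SOLVER.BASE_LR = 0.02 * cfg.SOLVER.IMS_PER_BATCH / 16  # scales to 0.01

The same rule suggests why dropping IMS_PER_BATCH from 16 to 4 without touching the learning rate, as above, can push training into divergence.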

yoosan commented 4 years ago

> The only modification I made was changing IMS_PER_BATCH from 16 to 4.
>
> That definitely sounds like a modification that could lead to this issue.
>
> I got the same error when training on the COCO 2017 dataset. P.S. I have not modified any config or code.
>
> If you run into this issue with unmodified config and code, please include details following the issue template, with full command and full logs.

I did indeed change the number of GPUs from 8 to 4; that may have led to this error.

yiningzeng commented 4 years ago

> The only modification I made was changing IMS_PER_BATCH from 16 to 4.
>
> That definitely sounds like a modification that could lead to this issue.
>
> I got the same error when training on the COCO 2017 dataset. P.S. I have not modified any config or code.
>
> If you run into this issue with unmodified config and code, please include details following the issue template, with full command and full logs.
>
> I did indeed change the number of GPUs from 8 to 4; that may have led to this error.

I changed the dataset, set GPU = 5 and IMS_PER_BATCH = 40, and ran on 6 GPUs. It now works normally.

ppwwyyxx commented 4 years ago

Correct me if that is not the case, but it seems to me that everyone who encounters this error has made changes to the default training settings. And many have also fixed it after tuning the settings a bit more. This does not sound like a detectron2 problem, therefore closing.