BTW, my environment:
torch.cuda.is_available() = True
torch.backends.cudnn.is_available() = True
torch.backends.cudnn.version() = 7603
CUDA Version: 10.1
torch 1.5.0+cu101
torchvision 0.6.0+cu101
I don't think it is an environment problem.
@Eurus-Holmes This doesn't seem like an environment problem, assuming you have passed the provided demo.
Can you make sure the JSON file is correct (same structure as we provided) and also try to include more training samples? Could you also try using a pretrained model to finetune on your custom dataset?
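A minimal structural sanity check along these lines may help. This sketch assumes a COCO-style JSON with ABCNet-style bezier_pts/rec fields, so the exact field names and lengths should be verified against the provided example annotations:

```python
import json

# Hypothetical path to a custom training annotation file.
ANN_PATH = "datasets/mydataset/annotations/train.json"

with open(ANN_PATH) as f:
    data = json.load(f)

# Top-level keys expected in a COCO-style file.
for key in ("images", "annotations", "categories"):
    assert key in data, f"missing top-level key: {key}"

image_ids = {img["id"] for img in data["images"]}
for ann in data["annotations"]:
    # Every annotation must point at an existing image.
    assert ann["image_id"] in image_ids, f"orphan annotation {ann['id']}"
    # ABCNet-style fields (assumption: 16 bezier control-point values and
    # a list of integer character indices in 'rec').
    if "bezier_pts" in ann:
        assert len(ann["bezier_pts"]) == 16, f"bad bezier_pts in annotation {ann['id']}"
    if "rec" in ann:
        assert all(isinstance(c, int) for c in ann["rec"]), f"bad rec in annotation {ann['id']}"

print(f"{len(data['images'])} images, {len(data['annotations'])} annotations look structurally OK")
```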
@Yuliang-Liu
Yes, I can confirm my JSON file is correct. What I am doing is using a pretrained model (weights/batext/pretrain_attn_R_50.pth) to finetune on my custom dataset.
After updating AdelaiDet and detectron2, this weird cuDNN error is gone, but I ran into the same problem as in this issue.
@stan-haochen According to this pull request, I have changed the adet/data/augmentation.py, adet/data/dataset_mapper.py, and adet/data/detection_utils.py files, and the problem from this issue is solved.
However, the original cuDNN error occurred again...
Could you check if training with the official data works?
@stan-haochen I have reinstalled the latest AdelaiDet and detectron2, and now I can train on the official TotalText dataset with a pretrained model (weights/batext/pretrain_attn_R_50.pth).
However, when I train on my custom dataset, a new error occurred:
Traceback (most recent call last):
File "tools/train_net2.py", line 244, in <module>
args=(args,),
File "./ABCNet/AdelaiDet/env/lib/python3.7/site-packages/detectron2/engine/launch.py", line 62, in launch
main_func(*args)
File "tools/train_net2.py", line 232, in main
return trainer.train()
File "tools/train_net2.py", line 114, in train
self.train_loop(self.start_iter, self.max_iter)
File "tools/train_net2.py", line 103, in train_loop
self.run_step()
File "./ABCNet/AdelaiDet/env/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 218, in run_step
self._detect_anomaly(losses, loss_dict)
File "./ABCNet/AdelaiDet/env/lib/python3.7/site-packages/detectron2/engine/train_loop.py", line 241, in _detect_anomaly
self.iter, loss_dict
FloatingPointError: Loss became infinite or NaN at iteration=2!
loss_dict = {'rec_loss': tensor(nan, device='cuda:0', grad_fn=<MulBackward0>), 'loss_fcos_cls': tensor(0.2020, device='cuda:0', grad_fn=<DivBackward0>), 'loss_fcos_loc': tensor(0.1354, device='cuda:0', grad_fn=<DivBackward0>), 'loss_fcos_ctr': tensor(0.6088, device='cuda:0', grad_fn=<DivBackward0>), 'loss_fcos_bezier': tensor(2.2358, device='cuda:0', grad_fn=<DivBackward0>)}
Should I modify other configurations? PS: could you please reopen this issue? Thx!
I don't think this problem is related to the code. It seems that you should check your data and adjust the hyperparams carefully.
One suggestion is to modify loss weights to keep them the same scale as in the official datasets.
@stan-haochen Hi, what do you mean by modifying the loss weights to keep them on the same scale as in the official datasets?
The only loss weight I found is in adet/config/defaults.py: _C.MODEL.BASIS_MODULE.LOSS_WEIGHT = 0.3. Under what conditions should this value be chosen?
You are free to change the code to make it work for your case.
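For anyone looking for where such changes go, here is a minimal sketch with hypothetical values, assuming the standard detectron2/AdelaiDet config mechanism; the right numbers depend on your data, so treat it only as an illustration of where to change them:

```python
from adet.config import get_cfg  # AdelaiDet's extended config

cfg = get_cfg()
cfg.merge_from_file("configs/BAText/TotalText/attn_R_50.yaml")
cfg.MODEL.WEIGHTS = "weights/batext/pretrain_attn_R_50.pth"

# Hypothetical values for illustration only.
cfg.SOLVER.BASE_LR = 0.001                  # lower the learning rate
cfg.MODEL.BASIS_MODULE.LOSS_WEIGHT = 1.0    # raise the loss weight
cfg.SOLVER.IMS_PER_BATCH = 2                # fit a single GPU
```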
@stan-haochen Hi, I have solved this problem by reducing the learning rate and increasing the loss weight, thanks for your help!
FloatingPointError: Loss became infinite or NaN
I'd like to share something on this. I tried to pre-train with the provided synthetic samples and this error occurred. The thing I changed is IMG_PER_BATCH 2 instead of 8 (the default), because I am using a single Tesla T4 GPU; I assume the original experiment was done in a more powerful environment. However, I set lr to 0.001 (by default 0.0), and that solved this error.
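For reference, a common heuristic for picking the learning rate after shrinking the batch size is the linear scaling rule; the base values in this sketch are assumptions for illustration, not ABCNet's official settings:

```python
# Linear scaling rule: when shrinking the batch size by some factor,
# shrink the learning rate by the same factor.
base_batch, base_lr = 8, 0.01   # assumed reference values
my_batch = 2                    # single-GPU batch size used above

scaled_lr = base_lr * my_batch / base_batch
print(scaled_lr)  # 0.0025
```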
That's right. I also changed LOSS_WEIGHT.
Hi, I am training with custom datasets, following this issue.
@shuangyichen @Yuliang-Liu I run train_net.py with the command "OMP_NUM_THREADS=1 python tools/train_net.py --config-file configs/BAText/TotalText/attn_R_50.yaml --num-gpus 1"
Dataset directory: datasets
The training images and annotations are specified in "builtin.py": "mydataset_train": ("mydataset/train_img", "mydataset/annotations/train.json")
The training config is specified in "configs/BAText/TotalText/Base-TotalText.yaml": DATASETS: TRAIN: ("mydataset_train",) TEST: ("mydataset_train",)
Originally posted by @chenyangMl in https://github.com/aim-uofa/AdelaiDet/issues/100#issuecomment-644056170
But an error occurred:
However, I can run the ABCNet demo successfully (without changing anything). So, what is going wrong here?
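For a quick check that the JSON and image paths line up, detectron2's generic COCO registration helper can be used. ABCNet's text datasets are registered through adet/data/builtin.py with their own metadata, so this is only an illustrative sketch with hypothetical paths, not the ABCNet-specific registration:

```python
from detectron2.data.datasets import register_coco_instances
from detectron2.data import DatasetCatalog

# Hypothetical paths matching the layout described above.
register_coco_instances(
    "mydataset_train",
    {},
    "datasets/mydataset/annotations/train.json",
    "datasets/mydataset/train_img",
)

# Force-load the dataset once to surface path/format errors early.
records = DatasetCatalog.get("mydataset_train")
print(f"loaded {len(records)} records")
```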