mozpp opened this issue 5 years ago
@ruinmessi
I can't reproduce your error. I notice your test size is 240; in YOLOv3 we only use test sizes that are divisible by 32.
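For reference, a minimal sketch of clamping a requested size to a multiple of 32 (the backbone's total downsampling stride); the helper name `round_to_stride` is made up for illustration, it is not part of the ASFF repo:

```python
def round_to_stride(size, stride=32):
    """Round a requested image size down to the nearest multiple of the stride."""
    rounded = (size // stride) * stride
    return max(rounded, stride)

print(round_to_stride(240))  # 224 -- 240 is not divisible by 32
print(round_to_stride(416))  # 416 -- already a valid size
```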
I also hit this error with PyTorch 1.2.0, but after I switched to PyTorch 1.1.0 the error disappeared.
ssh://root@10.10.6.29:10074/usr/local/bin/python -u /project/ASFF/main.py --cfg=config/yolov3_baseline.cfg -d=VOC --tfboard --checkpoint=weights/darknet53_feature_mx.pth --start_epoch=0 --half --log_dir log/VOC -s=240 --checkpoint=
Setting Arguments.. : Namespace(asff=False, cfg='config/yolov3_baseline.cfg', checkpoint='', dataset='VOC', debug=False, distributed=False, dropblock=False, eval_interval=10, half=True, local_rank=0, log_dir='log/VOC', n_cpu=4, ngpu=2, no_wd=False, rfb=False, save_dir='save', start_epoch=0, test=False, test_size=240, testset=False, tfboard=True, use_cuda=True, vis=False)
successfully loaded config file: {'MODEL': {'TYPE': 'YOLOv3', 'BACKBONE': 'darknet53'}, 'TRAIN': {'LR': 0.001, 'MOMENTUM': 0.9, 'DECAY': 0.0005, 'BURN_IN': 5, 'MAXEPOCH': 300, 'COS': True, 'SYBN': True, 'MIX': True, 'NO_MIXUP_EPOCHS': 30, 'LABAL_SMOOTH': True, 'BATCHSIZE': 4, 'IMGSIZE': 608, 'IGNORETHRE': 0.7, 'RANDRESIZE': True}, 'TEST': {'CONFTHRE': 0.01, 'NMSTHRE': 0.6, 'IMGSIZE': 608}}
Training YOLOv3 strong baseline!
using cuda
using tfboard
Traceback (most recent call last):
  File "/project/ASFF/main.py", line 455, in <module>
    main()
  File "/project/ASFF/main.py", line 389, in main
    optimizer.backward(loss)
  File "/project/ASFF/utils/fp16_utils/fp16_optimizer.py", line 483, in backward
    self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
  File "/project/ASFF/utils/fp16_utils/loss_scaler.py", line 45, in backward
    scaled_loss.backward(retain_graph=retain_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/tensor.py", line 118, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/usr/local/lib/python3.6/dist-packages/torch/autograd/__init__.py", line 93, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [2, 4, 76, 76, 25]], which is output 0 of CloneBackward, is at version 9; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).
Process finished with exit code 1
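As the hint at the end of the traceback suggests, anomaly detection can point at the in-place operation that invalidated the saved tensor. A minimal sketch using the standard PyTorch API (the repro below is illustrative, not ASFF code):

```python
import torch

# With anomaly detection enabled, the failing backward() also reports the
# forward op that produced the tensor later found to be modified in place.
torch.autograd.set_detect_anomaly(True)

# Minimal reproduction of the same class of error:
x = torch.randn(4, requires_grad=True)
y = x.clone()        # y is the output of a clone, like the tensor in the error message
z = (y * y).sum()    # the backward of y * y needs y's original values
y += 1               # in-place update bumps y's version counter
z.backward()         # RuntimeError: ... modified by an inplace operation
```

The usual fix is to replace the offending in-place update (`+=`, `masked_fill_`, slice assignment, etc.) with an out-of-place equivalent, or to clone the tensor before modifying it.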
Same error... may I ask, did you solve it?
...torch==1.5.1 works fine.
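If you want to confirm which version is actually active in your environment before re-running, for example:

```python
import torch
print(torch.__version__)  # should report 1.5.1 after upgrading
```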