avBuffer / UNet3plus_pth

UNet3+ / UNet++ / UNet, used in Deep Automatic Portrait Matting, in PyTorch
243 stars 39 forks

out_scale/grads_have_scale,ZeroDivisionError: float division by zero #20

Open chengzhen123 opened 1 year ago

chengzhen123 commented 1 year ago

Traceback (most recent call last):
  File "train.py", line 202, in <module>
    lr=args.lr, device=device, img_scale=args.scale, val_percent=args.val / 100)
  File "train.py", line 95, in train_net
    scaled_loss.backward()
  File "D:\ProgramData\Anaconda3\envs\UNet3plus\lib\contextlib.py", line 119, in __exit__
    next(self.gen)
  File "D:\ProgramData\Anaconda3\envs\UNet3plus\lib\site-packages\apex\amp\handle.py", line 123, in scale_loss
    optimizer._post_amp_backward(loss_scaler)
  File "D:\ProgramData\Anaconda3\envs\UNet3plus\lib\site-packages\apex\amp\_process_optimizer.py", line 249, in post_backward_no_master_weights
    post_backward_models_are_masters(scaler, params, stashed_grads)
  File "D:\ProgramData\Anaconda3\envs\UNet3plus\lib\site-packages\apex\amp\_process_optimizer.py", line 135, in post_backward_models_are_masters
    scale_override=(grads_have_scale, stashed_have_scale, out_scale))
  File "D:\ProgramData\Anaconda3\envs\UNet3plus\lib\site-packages\apex\amp\scaler.py", line 183, in unscale_with_stashed
    out_scale/grads_have_scale,
ZeroDivisionError: float division by zero

This error appears after 2 epochs. I found advice online saying that lowering the lr by an order of magnitude fixes it, so I changed the lr from 0.01 to 0.001, but the same error came back at epoch 26. Is there another way to get rid of this error? And what causes it? Thanks.
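As for the likely cause: apex's dynamic loss scaler backs off (typically halving) the loss scale every time it detects Inf/NaN gradients. If the training is unstable enough that overflows happen on essentially every step, the scale keeps halving until it underflows to exactly 0.0, and then `out_scale / grads_have_scale` in `scaler.py` divides by zero. The following is a minimal pure-Python sketch of that mechanism, not apex's actual code:

```python
def update_loss_scale(scale, overflow_detected, backoff_factor=0.5):
    """Sketch of apex-style dynamic loss scaling: on overflow, back off
    the scale (and the optimizer step is skipped); otherwise keep it."""
    if overflow_detected:
        return scale * backoff_factor
    return scale

scale = 65536.0  # a common initial dynamic loss scale (2**16)

# If NaN/Inf gradients occur on every step (lr too high, bad batch,
# numerically unstable loss), the scale halves each time until the
# float underflows to exactly 0.0 -- the denominator in the error.
for _ in range(2000):
    scale = update_loss_scale(scale, overflow_detected=True)

print(scale)  # 0.0 -> out_scale / grads_have_scale then raises ZeroDivisionError
```

So lowering the lr only helps indirectly, by making overflows rarer; anything that stops the gradients from blowing up (clipping, a more stable loss, checking the data for bad samples) addresses the same root cause.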

Susu0812 commented 1 year ago


I've run into the same problem. Did you ever solve it?

lxy5513 commented 6 months ago

Keep lowering the lr...