Gradient explosion when training with BN folding in the quantization IAO module

666DZY666 / micronet

micronet, a model compression and deployment library. Compression: 1. quantization: quantization-aware training (QAT), high-bit (>2b) (DoReFa / Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference), low-bit (≤2b) / ternary and binary (TWN/BNN/XNOR-Net); post-training quantization (PTQ), 8-bit (tensorrt); 2. pruning: normal, regular, and group convolutional channel pruning; 3. group convolution structure; 4. batch-normalization fuse for quantization. Deploy: tensorrt, fp32/fp16/int8 (ptq-calibration), op-adapt (upsample), dynamic_shape

MIT License

2.2k stars, 478 forks

Gradient explosion when training with BN folding in the quantization IAO module #49

Open yhl1010 opened 3 years ago

yhl1010 commented 3 years ago

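I made no changes to the code. Setting bn_fold to 1 makes training produce loss=nan. Changing the batch size and the learning rate makes no difference. How can I fix this?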

ghost commented 3 years ago

Hello, may I ask where you found this BN folding code?

yhl1010 commented 3 years ago

It is the BNFold_Conv2d_Q function in IAO/models/util_wqaq. Training keeps ending up with loss=nan, and I don't know why.
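For readers unfamiliar with the term, BN folding merges a batch-normalization layer into the preceding convolution: each output channel's weights are scaled by gamma / sqrt(var + eps), and the bias is shifted to match. The sketch below shows only the standard inference-time version using running statistics. It is not the repository's BNFold_Conv2d_Q, which folds during quantization-aware training and may differ in detail; the helper name `fold_bn_into_conv` is hypothetical.

```python
import torch
from torch import nn


def fold_bn_into_conv(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Return a Conv2d equivalent to `conv` followed by `bn` in eval mode.

    Hypothetical helper for illustration; assumes zero padding mode and
    a BatchNorm2d with affine parameters and tracked running statistics.
    """
    # eps keeps the denominator strictly positive even when a channel's
    # running variance is zero.
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std  # one scale factor per output channel

    fused = nn.Conv2d(
        conv.in_channels, conv.out_channels, conv.kernel_size,
        stride=conv.stride, padding=conv.padding,
        dilation=conv.dilation, groups=conv.groups, bias=True,
    )
    with torch.no_grad():
        # Scale every output-channel filter by its BN factor.
        fused.weight.copy_(conv.weight * scale.reshape(-1, 1, 1, 1))
        conv_bias = (conv.bias if conv.bias is not None
                     else torch.zeros_like(bn.running_mean))
        # Fold the mean shift and the BN bias into the conv bias.
        fused.bias.copy_(bn.bias + (conv_bias - bn.running_mean) * scale)
    return fused
```

In eval mode the fused layer should match `bn(conv(x))` up to floating-point error, which makes a simple sanity check before suspecting the folding arithmetic itself.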

ghost commented 3 years ago

I suggest trying a different model. I trained with my own model and did not run into vanishing gradients.

yhl1010 commented 3 years ago

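Did you also replace the conv+bn in the original model with BNFold_Conv2d_Q? Is your model large? Which area is it used for: classification, detection, or segmentation? Thanks.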

ghost commented 3 years ago

Yes. I am using vgg16, for detection.

yhl1010 commented 3 years ago

Thanks, I will try it on a segmentation task!

ghost commented 3 years ago

Hello, have you tried taking the final trained model and running bn_folding.py and bn_floding_test.py on it for compression and quantization?

yhl1010 commented 3 years ago

Not yet. I have only just started looking into this project, sorry!

ghost commented 3 years ago

If you try it later and it works, please let me know. Thanks.

WangQiangItachi commented 3 years ago

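Hello, I also get loss=nan after a few epochs of training when bn_fold is set to 1. If I continue quantization training from an already trained model, the backward pass of the very first batch gives nan gradients for some parameters, so training cannot continue. Has this problem been solved? Thanks.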
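When only some parameters receive nan gradients, listing which ones can narrow down the layer where the non-finite value first appears. The sketch below is a generic PyTorch check, not code from this repository, and the helper name `find_nonfinite_grads` is hypothetical. Call it right after `loss.backward()` and before the optimizer step.

```python
from typing import List

import torch
from torch import nn


def find_nonfinite_grads(model: nn.Module) -> List[str]:
    """Return names of parameters whose gradient contains nan or inf.

    Hypothetical debugging helper; parameters without a gradient are skipped.
    """
    bad = []
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name)
    return bad


# Toy reproduction: sqrt has an infinite derivative at 0, and inf * 0
# in the linear layer's backward pass yields nan.
model = nn.Linear(2, 1, bias=False)
out = torch.sqrt(model(torch.zeros(1, 2)))
out.sum().backward()
print(find_nonfinite_grads(model))
```

The toy case is only one way a nan can arise (a square root or division evaluated at zero); it is not a diagnosis of this issue. PyTorch's `torch.autograd.set_detect_anomaly(True)` can additionally report the forward operation whose backward pass produced the nan, at the cost of slower training.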

666DZY666 commented 3 years ago

Fixed.

yanfang12 commented 3 years ago

Hello, I also ran into nan gradients after some epochs when training a mobilenet detection model. How did you fix it?
