alibaba / TinyNeuralNetwork

TinyNeuralNetwork is an efficient and easy-to-use deep learning model compression framework.
MIT License

Loading model weights in quantizer.quantize() reports Missing key(s), Unexpected key(s), size mismatch #74

Closed YeahTech closed 2 years ago

YeahTech commented 2 years ago
    main_worker(args)
  File "post_train_quantization.py", line 78, in main_worker
    ptq_model = quantizer.quantize()
  File "/home/yaoxinghua/miniconda3/envs/yolo/lib/python3.7/site-packages/TinyNeuralNetwork-0.1.0.20220512170349+d0053782b9ca90a8554d211660f82c7da1e36962-py3.7.egg/tinynn/graph/quantization/quantizer.py", line 194, in quantize
    rewritten_model.load_state_dict(torch.load(model_weights_path))
  File "/home/yaoxinghua/miniconda3/envs/yolo/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1407, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Stereo_qat:
    Missing key(s) in state_dict: "stereo_stem_4_0.weight", "stereo_stem_4_0.bias", "stereo_stem_4_0_1.weight", "stereo_stem_4_0_1.bias", "stereo_cost_agg_conv_agg_1_1_activate.weight", "stereo_cost_agg_conv_agg_1_1_activate.bias", "stereo_cost_agg_conv_agg_1_1_activate.running_mean", "stereo_cost_agg_conv_agg_1_1_activate.running_var".
    Unexpected key(s) in state_dict: "stereo_feature_block0_0_0_conv_dw.weight", "stereo_feature_block0_0.0.conv_dw.weight", "stereo_feature_block0_0.0.bn1.weight", "stereo_feature_block0_0.0.bn1.bias", "stereo_feature_block0_0.0.bn1.running_mean", "stereo_feature_block0_0.0.bn1.running_var", "stereo_feature_block0_0.0.bn1.num_batches_tracked", "stereo_feature_block0_0.0.conv_pw.weight", "stereo_feature_block0_0.0.bn2.weight", "stereo_feature_block0_0.0.bn2.bias", "stereo_feature_block0_0.0.bn2.running_mean", "stereo_feature_block0_0.0.bn2.running_var", "stereo_feature_block0_0.0.bn2.num_batches_tracked", "stereo_feature_block0_0_0_conv_dw_1.weight", "stereo_feature_block0_0_1.0.conv_dw.weight", "stereo_feature_block0_0_1.0.bn1.weight", "stereo_feature_block0_0_1.0.bn1.bias", "stereo_feature_block0_0_1.0.bn1.running_mean", "stereo_feature_block0_0_1.0.bn1.running_var", "stereo_feature_block0_0_1.0.bn1.num_batches_tracked", "stereo_feature_block0_0_1.0.conv_pw.weight", "stereo_feature_block0_0_1.0.bn2.weight", "stereo_feature_block0_0_1.0.bn2.bias", "stereo_feature_block0_0_1.0.bn2.running_mean", "stereo_feature_block0_0_1.0.bn2.running_var", "stereo_feature_block0_0_1.0.bn2.num_batches_tracked", "stereo_stem_4_0.conv.weight", "stereo_stem_4_0.bn.weight", "stereo_stem_4_0.bn.bias", "stereo_stem_4_0.bn.running_mean", "stereo_stem_4_0.bn.running_var", "stereo_stem_4_0.bn.num_batches_tracked", "stereo_stem_4_0_1.conv.weight", "stereo_stem_4_0_1.bn.weight", "stereo_stem_4_0_1.bn.bias", "stereo_stem_4_0_1.bn.running_mean", "stereo_stem_4_0_1.bn.running_var", "stereo_stem_4_0_1.bn.num_batches_tracked".
    size mismatch for stereo_cost_agg_conv_skip_1_bn.weight: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([960]).
    size mismatch for stereo_cost_agg_conv_skip_1_bn.bias: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([960]).
    size mismatch for stereo_cost_agg_conv_skip_1_bn.running_mean: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([960]).
    size mismatch for stereo_cost_agg_conv_skip_1_bn.running_var: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([960]).

YeahTech commented 2 years ago

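The mobilenet example runs without any problem, but when I try to quantize my own model, an error is raised at rewritten_model.load_state_dict(torch.load(model_weights_path)): the weights no longer match the model definition. I don't know why. Is the graph rewrite going wrong?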
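For readers hitting the same error, the three categories in the message (missing keys, unexpected keys, size mismatches) can be reproduced by comparing parameter names and shapes on both sides. The following is a minimal sketch, not TinyNeuralNetwork code: `diff_state_dicts` is a hypothetical helper, and it works on plain `{name: shape}` dictionaries so that it runs without PyTorch. The key names come from the traceback above; apart from the `(16,)` versus `(960,)` pair, the shapes are made up for illustration.

```python
def diff_state_dicts(checkpoint_shapes, model_shapes):
    """Compare two {parameter_name: shape} mappings.

    Returns the same three categories that a strict load_state_dict
    reports: keys the model expects but the checkpoint lacks, keys the
    checkpoint has but the model does not, and keys whose shapes differ.
    """
    missing = sorted(k for k in model_shapes if k not in checkpoint_shapes)
    unexpected = sorted(k for k in checkpoint_shapes if k not in model_shapes)
    mismatched = {
        k: (checkpoint_shapes[k], model_shapes[k])
        for k in checkpoint_shapes
        if k in model_shapes and checkpoint_shapes[k] != model_shapes[k]
    }
    return missing, unexpected, mismatched


# With PyTorch, such a mapping could be built from a real state dict with
#   {k: tuple(v.shape) for k, v in state_dict.items()}
checkpoint = {
    "stereo_stem_4_0.conv.weight": (16, 3, 3, 3),    # shape is hypothetical
    "stereo_cost_agg_conv_skip_1_bn.weight": (16,),  # shape from the traceback
}
model = {
    "stereo_stem_4_0.weight": (16, 3, 3, 3),          # shape is hypothetical
    "stereo_cost_agg_conv_skip_1_bn.weight": (960,),  # shape from the traceback
}

missing, unexpected, mismatched = diff_state_dicts(checkpoint, model)
print("missing:", missing)
print("unexpected:", unexpected)
print("mismatched:", mismatched)
```

In this toy input, the checkpoint key still carries the nested `.conv` sub-module name while the model expects a flat module, which is the kind of naming disagreement visible in the report.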

peterjc123 commented 2 years ago

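@yaoxinghua Could you try generating a model description file based on this example https://github.com/alibaba/TinyNeuralNetwork/blob/main/examples/tracer/tracer_example.py and check whether it runs correctly? If you can, please share the generated model description with us.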

YeahTech commented 2 years ago

Thanks for the reply. Following your suggestion, I exported the code with generate_code, and the error is the same as described above. The model description file and the .pth are here: link: https://pan.baidu.com/s/139z_3-NOfQAbzaZhdNi9gg extraction code: ec6b

peterjc123 commented 2 years ago

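@yaoxinghua In that case you will probably need to share the original model definition. It looks like something goes wrong when the model's computation graph is captured.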

YeahTech commented 2 years ago

Thanks. The original model definition is here: link: https://pan.baidu.com/s/1L4zyuVgEPRHTlVqUKrOPIA extraction code: nrtj

Command used for the tinynn export: python post_train_quantization.py --netconfig configs/stereo/cfg_coex-crestereo-input640-disp64-costAA2-GW-shenzhenbatch12.yaml

peterjc123 commented 2 years ago

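@yaoxinghua I probably won't be able to look at this until later, because I can't install Baidu Netdisk on my work computer. GitHub actually accepts zip uploads directly: you can just drag the file into the comment box.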

YeahTech commented 2 years ago

@peterjc123 Sorry, I should have thought of that. The attachment is here: coex-share.zip

Command used for the tinynn export: python post_train_quantization.py --netconfig configs/stereo/cfg_coex-crestereo-input640-disp64-costAA2-GW-shenzhenbatch12.yaml

peterjc123 commented 2 years ago

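OK, found the cause. While your code constructs its modules, some of them get freed, so the object address id(obj) is reused. As a result, the previous signature is used when the modules are generated. A fix has been submitted: https://github.com/alibaba/TinyNeuralNetwork/pull/76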
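The address-reuse effect described above can be illustrated in isolation. This is a sketch under stated assumptions, not the tracer's actual code and not the change made in pull/76: `Module`, `SignatureRegistry`, and `trace_temporaries` are hypothetical names. The point is that `id(obj)` is only unique among objects that are alive at the same time, so a table keyed by `id` can return a stale entry once an object has been freed, and holding a strong reference is one way to rule that out.

```python
class Module:
    """Hypothetical stand-in for a network module."""

    def __init__(self, kind):
        self.kind = kind


class SignatureRegistry:
    """Caches a signature per object, keyed by id(obj)."""

    def __init__(self, keep_alive):
        self.keep_alive = keep_alive
        self.signatures = {}
        self._refs = []  # strong references, used only when keep_alive is True

    def register(self, obj):
        key = id(obj)
        if key not in self.signatures:
            self.signatures[key] = obj.kind
        if self.keep_alive:
            # A live object can never share its id with another live object.
            self._refs.append(obj)
        return self.signatures[key]


def trace_temporaries(registry, kinds):
    """Register one short-lived module per kind and collect the signatures."""
    results = []
    for kind in kinds:
        # Without keep_alive, the module is freed right after this call, and
        # the next module may be allocated at the same address.
        results.append(registry.register(Module(kind)))
    return results


kinds = ["conv", "bn", "relu"]
fixed = trace_temporaries(SignatureRegistry(keep_alive=True), kinds)
buggy = trace_temporaries(SignatureRegistry(keep_alive=False), kinds)
print("keep_alive=True :", fixed)
# On CPython this frequently repeats the first signature, but address reuse
# is an implementation detail and is not guaranteed.
print("keep_alive=False:", buggy)
```

Only the `keep_alive=True` result is deterministic. The stale result depends on the allocator, which is also why this kind of bug tends to show up with some models and not others.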

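Also, I took a look at your model. It contains conv3d and conv3d_transpose, which have no quantized implementation in tflite, so they will probably have to be reverted to floating-point implementations later in the converter.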

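By the way, I also ran convert_tflite2.py, and the generated model looks decent, except that it contains the pattern shown below. I think this can be optimized: if a 4D mean over axis 1 is wrapped with a transpose on each side, the transpose can propagate below the concat and collapse into a single one. (Update: submitted https://github.com/alibaba/TinyNeuralNetwork/pull/77) [image]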
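The rewrite suggested above relies on an identity: for a 4D tensor in NCHW layout, a mean over axis 1 gives the same values as transposing to NHWC, taking the mean over the last axis, and transposing back. The sketch below checks that identity on a tiny nested-list tensor. It is dependency-free on purpose, all function names are hypothetical, and it does not reproduce the converter's optimization pass or the propagation through concat.

```python
def nchw_to_nhwc(x):
    """Permute a 4D nested list from (N, C, H, W) to (N, H, W, C)."""
    n, c, h, w = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[[x[i][k][j][l] for k in range(c)]
              for l in range(w)]
             for j in range(h)]
            for i in range(n)]


def nhwc_to_nchw(y):
    """Permute a 4D nested list from (N, H, W, C) to (N, C, H, W)."""
    n, h, w, c = len(y), len(y[0]), len(y[0][0]), len(y[0][0][0])
    return [[[[y[i][j][l][k] for l in range(w)]
              for j in range(h)]
             for k in range(c)]
            for i in range(n)]


def mean_axis1_nchw(x):
    """Mean over the channel axis (axis 1), keeping it as a size-1 axis."""
    n, c, h, w = len(x), len(x[0]), len(x[0][0]), len(x[0][0][0])
    return [[[[sum(x[i][k][j][l] for k in range(c)) / c for l in range(w)]
              for j in range(h)]]
            for i in range(n)]


def mean_last_axis_nhwc(y):
    """Mean over the last axis (channels in NHWC), kept as a size-1 axis."""
    n, h, w, c = len(y), len(y[0]), len(y[0][0]), len(y[0][0][0])
    return [[[[sum(y[i][j][l]) / c] for l in range(w)]
             for j in range(h)]
            for i in range(n)]


# A (1, 2, 2, 2) tensor in NCHW layout.
x = [[[[1, 2], [3, 4]],
      [[5, 6], [7, 8]]]]

direct = mean_axis1_nchw(x)
wrapped = nhwc_to_nchw(mean_last_axis_nhwc(nchw_to_nhwc(x)))
print(direct)
print(wrapped)
```

Both print statements show the same (1, 1, 2, 2) result, which is what makes it safe to move the transposes around the mean.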

YeahTech commented 2 years ago

Wow, thank you for the thoughtful help. You not only solved my problem but also suggested an optimization. That's really great ~(≧▽≦)/~

peterjc123 commented 2 years ago

@yaoxinghua Both PRs have been merged, so I'm closing this issue. If you run into anything else, please open a new issue. Thanks.