quantizer.quantize(): loading model weights fails with Missing key(s), Unexpected key(s), size mismatch

YeahTech commented 2 years ago

    main_worker(args)
  File "post_train_quantization.py", line 78, in main_worker
    ptq_model = quantizer.quantize()
  File "/home/yaoxinghua/miniconda3/envs/yolo/lib/python3.7/site-packages/TinyNeuralNetwork-0.1.0.20220512170349+d0053782b9ca90a8554d211660f82c7da1e36962-py3.7.egg/tinynn/graph/quantization/quantizer.py", line 194, in quantize
    rewritten_model.load_state_dict(torch.load(model_weights_path))
  File "/home/yaoxinghua/miniconda3/envs/yolo/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1407, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Stereo_qat:
    Missing key(s) in state_dict: "stereo_stem_4_0.weight", "stereo_stem_4_0.bias", "stereo_stem_4_0_1.weight", "stereo_stem_4_0_1.bias", "stereo_cost_agg_conv_agg_1_1_activate.weight", "stereo_cost_agg_conv_agg_1_1_activate.bias", "stereo_cost_agg_conv_agg_1_1_activate.running_mean", "stereo_cost_agg_conv_agg_1_1_activate.running_var".
    Unexpected key(s) in state_dict: "stereo_feature_block0_0_0_conv_dw.weight", "stereo_feature_block0_0.0.conv_dw.weight", "stereo_feature_block0_0.0.bn1.weight", "stereo_feature_block0_0.0.bn1.bias", "stereo_feature_block0_0.0.bn1.running_mean", "stereo_feature_block0_0.0.bn1.running_var", "stereo_feature_block0_0.0.bn1.num_batches_tracked", "stereo_feature_block0_0.0.conv_pw.weight", "stereo_feature_block0_0.0.bn2.weight", "stereo_feature_block0_0.0.bn2.bias", "stereo_feature_block0_0.0.bn2.running_mean", "stereo_feature_block0_0.0.bn2.running_var", "stereo_feature_block0_0.0.bn2.num_batches_tracked", "stereo_feature_block0_0_0_conv_dw_1.weight", "stereo_feature_block0_0_1.0.conv_dw.weight", "stereo_feature_block0_0_1.0.bn1.weight", "stereo_feature_block0_0_1.0.bn1.bias", "stereo_feature_block0_0_1.0.bn1.running_mean", "stereo_feature_block0_0_1.0.bn1.running_var", "stereo_feature_block0_0_1.0.bn1.num_batches_tracked", "stereo_feature_block0_0_1.0.conv_pw.weight", "stereo_feature_block0_0_1.0.bn2.weight", "stereo_feature_block0_0_1.0.bn2.bias", "stereo_feature_block0_0_1.0.bn2.running_mean", "stereo_feature_block0_0_1.0.bn2.running_var", "stereo_feature_block0_0_1.0.bn2.num_batches_tracked", "stereo_stem_4_0.conv.weight", "stereo_stem_4_0.bn.weight", "stereo_stem_4_0.bn.bias", "stereo_stem_4_0.bn.running_mean", "stereo_stem_4_0.bn.running_var", "stereo_stem_4_0.bn.num_batches_tracked", "stereo_stem_4_0_1.conv.weight", "stereo_stem_4_0_1.bn.weight", "stereo_stem_4_0_1.bn.bias", "stereo_stem_4_0_1.bn.running_mean", "stereo_stem_4_0_1.bn.running_var", "stereo_stem_4_0_1.bn.num_batches_tracked".
    size mismatch for stereo_cost_agg_conv_skip_1_bn.weight: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([960]).
    size mismatch for stereo_cost_agg_conv_skip_1_bn.bias: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([960]).
    size mismatch for stereo_cost_agg_conv_skip_1_bn.running_mean: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([960]).
    size mismatch for stereo_cost_agg_conv_skip_1_bn.running_var: copying a param with shape torch.Size([16]) from checkpoint, the shape in current model is torch.Size([960]).

YeahTech commented 2 years ago

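The MobileNet example runs without any problems, but when I try to quantize my own model, the call rewritten_model.load_state_dict(torch.load(model_weights_path)) fails because the weights no longer match the model definition. I don't know why. Is the graph rewrite going wrong?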
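
To see at a glance which entries disagree, the three categories in the error can be reproduced by diffing the parameter names and shapes of the checkpoint against those of the rewritten model. The following is a minimal sketch, not part of TinyNeuralNetwork: `diff_state_dicts` is a hypothetical helper, the key names are borrowed from the error message, and the conv weight shape is made up for illustration.

```python
def diff_state_dicts(checkpoint_shapes, model_shapes):
    """Compare two {parameter name: shape} mappings.

    Returns (missing, unexpected, mismatched), mirroring the three
    categories reported by load_state_dict.
    """
    # Keys the model expects but the checkpoint does not provide.
    missing = sorted(k for k in model_shapes if k not in checkpoint_shapes)
    # Keys the checkpoint provides but the model does not define.
    unexpected = sorted(k for k in checkpoint_shapes if k not in model_shapes)
    # Keys present on both sides whose shapes differ.
    mismatched = {
        k: (checkpoint_shapes[k], model_shapes[k])
        for k in checkpoint_shapes
        if k in model_shapes and checkpoint_shapes[k] != model_shapes[k]
    }
    return missing, unexpected, mismatched


# Toy data modeled on a few keys from the error message.
# The shape (8, 3, 3, 3) is hypothetical.
checkpoint = {
    "stereo_stem_4_0.conv.weight": (8, 3, 3, 3),
    "stereo_cost_agg_conv_skip_1_bn.weight": (16,),
}
model = {
    "stereo_stem_4_0.weight": (8, 3, 3, 3),
    "stereo_cost_agg_conv_skip_1_bn.weight": (960,),
}

missing, unexpected, mismatched = diff_state_dicts(checkpoint, model)
print("missing:   ", missing)
print("unexpected:", unexpected)
print("mismatched:", mismatched)
```

With real objects, the two mappings could be built as `{k: tuple(v.shape) for k, v in state_dict.items()}` from the loaded checkpoint and from `rewritten_model.state_dict()`, assuming the checkpoint file stores a plain state dict.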

peterjc123 commented 2 years ago

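@yaoxinghua Could you try generating a model description file based on this example https://github.com/alibaba/TinyNeuralNetwork/blob/main/examples/tracer/tracer_example.py and check whether it runs correctly? If it does, please share the generated model description with us.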

YeahTech commented 2 years ago

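Thanks for the reply. Following your suggestion, I exported the model with generate_code, and the error is the same as described above. The model description file and the .pth are here: link: https://pan.baidu.com/s/139z_3-NOfQAbzaZhdNi9gg extraction code: ec6b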

peterjc123 commented 2 years ago

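@yaoxinghua In that case you will probably need to provide the original model definition. It looks like the computation graph capture for the model is going wrong.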

YeahTech commented 2 years ago

Thanks. The original model definition is here: link: https://pan.baidu.com/s/1L4zyuVgEPRHTlVqUKrOPIA extraction code: nrtj

The command used for the tinynn export: python post_train_quantization.py --netconfig configs/stereo/cfg_coex-crestereo-input640-disp64-costAA2-GW-shenzhenbatch12.yaml

peterjc123 commented 2 years ago

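@yaoxinghua I probably won't be able to look at it until later, because I can't install Baidu Netdisk on my work computer. GitHub actually lets you upload zip files directly: you can just drag the file into the comment box.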

YeahTech commented 2 years ago

@peterjc123 Sorry, I didn't think of that. The attachment is below: coex-share.zip

The command used for the tinynn export: python post_train_quantization.py --netconfig configs/stereo/cfg_coex-crestereo-input640-disp64-costAA2-GW-shenzhenbatch12.yaml

peterjc123 commented 2 years ago

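OK, found the cause. While your code constructs its modules, some of them get freed, so the address id(obj) of an object gets reused. As a result, a previously recorded signature is used when the modules are generated. A fix has been submitted: https://github.com/alibaba/TinyNeuralNetwork/pull/76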
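
The failure mode can be illustrated outside TinyNeuralNetwork. In CPython, `id(obj)` is the object's memory address and is only guaranteed to be unique among objects that are alive at the same time, so a table keyed by `id(obj)` can return a stale entry once the original object has been freed. The sketch below is an assumption-laden toy, not the tracer's code and not necessarily how the fix in the PR works: `Layer`, `register_unsafe`, and `register_safe` are hypothetical names.

```python
class Layer:
    """Stand-in for a module object (hypothetical, not a TinyNeuralNetwork class)."""

    def __init__(self, name):
        self.name = name


def register_unsafe(names):
    """Key signatures by id(obj) without keeping the objects alive."""
    signatures = {}
    collisions = 0
    for name in names:
        layer = Layer(name)  # the previous Layer is freed once `layer` is rebound
        key = id(layer)
        if key in signatures:
            # The address was reused: a lookup would return the old signature.
            collisions += 1
        else:
            signatures[key] = name
    return signatures, collisions


def register_safe(names):
    """Same table, but each entry holds a strong reference to its object."""
    signatures = {}
    for name in names:
        layer = Layer(name)
        # Keeping `layer` in the table pins its address, so no later
        # object can receive the same id while the table exists.
        signatures[id(layer)] = (layer, name)
    return signatures


names = ["conv_a", "conv_b", "conv_c", "conv_d"]
unsafe, collisions = register_unsafe(names)
safe = register_safe(names)
print("unsafe entries:", len(unsafe), "collisions:", collisions)
print("safe entries:  ", len(safe))
```

Whether the unsafe variant actually collides depends on the allocator, so the collision count may differ between runs and interpreters. The safe variant always ends up with one entry per name.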

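Also, I took a look at your model. It contains conv3d and conv3d_transpose, which have no quantized implementation in TFLite, so they will probably have to be reverted to floating-point implementations later in the converter.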

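I also ran convert_tflite2.py along the way. The generated model looks decent, except that it contains the following pattern, which could be optimized later: wrap a 4D mean over axis 1 in transposes on both sides, so that the transpose can be propagated below the concat and the transposes merge into a single one. (Update: submitted https://github.com/alibaba/TinyNeuralNetwork/pull/77 )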
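
The equivalences behind this idea can be checked numerically. The sketch below uses NumPy rather than the converter itself and reflects one reading of the comment: it assumes the mean keeps its reduced dimension (`keepdims=True`), that the tensors are NCHW on the PyTorch side and NHWC on the TFLite side, and the shapes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 4, 3))  # NCHW input, hypothetical shape

# Original pattern: 4D mean over axis 1 (the channel axis in NCHW).
direct = x.mean(axis=1, keepdims=True)  # shape (2, 1, 4, 3)

# Wrapped pattern: NCHW -> NHWC, mean over the last axis, NHWC -> NCHW.
nhwc = np.transpose(x, (0, 2, 3, 1))
wrapped = np.transpose(nhwc.mean(axis=3, keepdims=True), (0, 3, 1, 2))
mean_equivalent = np.allclose(direct, wrapped)

# Two branches that each end in an NHWC -> NCHW transpose, followed by a
# concat on axis 1, equal a concat on axis 3 followed by one transpose.
a = rng.standard_normal((2, 4, 3, 1))  # NHWC branch outputs
b = rng.standard_normal((2, 4, 3, 6))
per_branch = np.concatenate(
    [np.transpose(a, (0, 3, 1, 2)), np.transpose(b, (0, 3, 1, 2))], axis=1
)
single = np.transpose(np.concatenate([a, b], axis=3), (0, 3, 1, 2))
concat_equivalent = np.array_equal(per_branch, single)

print(mean_equivalent, concat_equivalent)
```

The second check is what allows several per-branch transposes to be replaced by one transpose after the concat.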

YeahTech commented 2 years ago

Wow, thank you for the thoughtful help. You not only solved my problem but also suggested an optimization. That's really great ~(≧▽≦)/~

peterjc123 commented 2 years ago

@yaoxinghua Both PRs have been merged, so I'll close this issue for now. If you run into other problems, please open a new issue. Thanks.

alibaba / TinyNeuralNetwork

quantizer.quantize(): loading model weights fails with Missing key(s), Unexpected key(s), size mismatch #74