HonglinChu / SiamTrackers

(2020-2022) The PyTorch version of SiamFC, SiamRPN, DaSiamRPN, UpdateNet, SiamDW, SiamRPN++, SiamMask, SiamFC++, SiamCAR, SiamBAN, Ocean, LightTrack, TrTr, NanoTrack; visual object tracking based on deep learning
Apache License 2.0
1.33k stars 268 forks

Converting nanotrackv3 to an ncnn model (pth to pt): the head's inference results are abnormal (NaN) #159

Closed Sherlock-hh closed 4 months ago

Sherlock-hh commented 4 months ago

The original model is ./models/pretrained/nanotrackv3.pth. The conversion steps are pth to pt, then pnnx to convert the pt models to ncnn. Attached files: backbone127 pt model, backbone255 pt model, head pt model, ncnn backbone127 bin, ncnn backbone127 param, ncnn backbone255 bin, ncnn backbone255 param, ncnn head bin, ncnn head param.

1. Adding the following pth-to-pt conversion code causes the head's inference results in the Python code to become NaN; removing these two lines avoids it.
trace_model = torch.jit.trace(model, (torch.Tensor(1, 48, 8, 8), torch.Tensor(1, 48, 16, 16)))
trace_model.save('./models/pt/head.pt')

The converted ncnn model then behaves the same way: the head's inference results are NaN.

All relevant code is shown below.

1. Modify NanoTrack/nanotrack/utils/model_load.py, adding the following below line 71:
    model.load_state_dict(pretrained_dict, strict=False)
# backbone: template feature extraction model
backbone_net=model.backbone
head_net=model.ban_net
trace_model = torch.jit.trace(backbone_net, torch.Tensor(1, 3, 127, 127))
trace_model.save('./models/pt/backbone_127.pt')

# backbone: search-image feature extraction model
trace_model = torch.jit.trace(backbone_net, torch.Tensor(1, 3, 255, 255))
trace_model.save('./models/pt/backbone_255.pt')

# head model
trace_model = torch.jit.trace(head_net, (torch.Tensor(1, 96, 8, 8), torch.Tensor(1, 96, 16, 16)))
trace_model.save('./models/pt/head.pt')
2. pnnx conversion
pip install pnnx

pnnx backbone_255.pt inputshape=[1,3,255,255]
pnnx backbone_127.pt inputshape=[1,3,127,127]
pnnx head.pt inputshape=[[1,96,8,8],[1,96,16,16]]

3. ncnn invocation: the input and extract blob names are changed to in0, out0, in1, out1, and so on. The data coming straight out of the model is NaN (screenshot attached).
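
For anyone reproducing the NaN check, here is a minimal sketch of calling the converted head through the ncnn Python binding (pip install ncnn). The file names, blob names (in0/in1/out0/out1) and shapes are assumptions based on the steps above, and the fp16 switches are just a common thing to rule out on the ncnn side, not a confirmed cause; real tracking would feed backbone features here rather than random data.

import numpy as np
import ncnn  # pip install ncnn

# Sketch only: file names, blob names and shapes are assumptions from the steps above.
net = ncnn.Net()
# If NaN only appears on the ncnn side, disabling fp16 is a cheap thing to rule out first.
net.opt.use_fp16_packed = False
net.opt.use_fp16_storage = False
net.opt.use_fp16_arithmetic = False
net.load_param('head.ncnn.param')
net.load_model('head.ncnn.bin')

zf = np.random.rand(96, 8, 8).astype(np.float32)    # template feature placeholder (ncnn blobs drop the batch dim)
xf = np.random.rand(96, 16, 16).astype(np.float32)  # search feature placeholder

ex = net.create_extractor()
ex.input('in0', ncnn.Mat(zf))   # recent ncnn Python builds accept a contiguous float32 numpy array
ex.input('in1', ncnn.Mat(xf))
_, cls = ex.extract('out0')     # classification output (assumed blob name)
_, loc = ex.extract('out1')     # regression output (assumed blob name)
print('cls contains NaN:', np.isnan(np.array(cls)).any())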

Sherlock-hh commented 4 months ago

After testing, I found that these two pth-to-pt conversion lines only cause NaN values intermittently, and in PyCharm's debug mode, if I set a breakpoint in the forward function of ban_v3, the NaN never appears. I really have no idea why this happens, nor why it reproduces in the ncnn model. I would like to know whether the author has another way to convert to ncnn. (I have already tried exporting to ONNX and converting with the 2023 version of ncnn, which crashed with a core dump, while convertmodel.com does not support onnxsim.)
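
For reference, the ONNX route I understand this to mean would look roughly like the sketch below (torch.onnx.export, then onnxsim, then ncnn's onnx2ncnn tool). The opset version, input/output names and file names are my assumptions, and head_net is the same model.ban_net used in the snippet above.

import torch

# Sketch of the ONNX route mentioned above (opset, names and paths are assumptions).
def export_head_to_onnx(head_net, out_path='head.onnx'):
    head_net.eval()
    zf = torch.randn(1, 96, 8, 8)    # dummy template feature
    xf = torch.randn(1, 96, 16, 16)  # dummy search feature
    torch.onnx.export(head_net, (zf, xf), out_path, opset_version=11,
                      input_names=['zf', 'xf'], output_names=['cls', 'loc'])

# Afterwards, simplify and convert with ncnn's own tools:
#   python -m onnxsim head.onnx head_sim.onnx
#   onnx2ncnn head_sim.onnx head.param head.bin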

Sherlock-hh commented 4 months ago

So far it looks like jit.trace affects the model's outputs, especially when torch.Tensor() is used, which causes NaN with high probability. So I moved the tracing code into the inference stage and traced a deep copy of the model. I verified the saved models by loading them back with jit.load(new_model); the results in the Python code are correct, but the ncnn results are still inconsistent. Have you tried converting v3 to ncnn on your side? Roughly what workflow did you use?
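
One possible explanation (an assumption, not something verified here): torch.Tensor(...) returns uninitialized memory, tracing runs a real forward pass, and if the model has not been switched to eval mode its BatchNorm running statistics get updated with whatever garbage happened to be in that memory, poisoning both the live model and the saved trace. A more defensive tracing setup could look like the sketch below.

import copy
import torch

# Sketch of a more defensive tracing setup (not a verified fix): deep-copy the model,
# switch to eval mode, use initialized dummy inputs and disable autograd while tracing.
def trace_head(model, out_path='./models/pt/head.pt'):
    head = copy.deepcopy(model.ban_net).eval()   # never touch the model used for tracking
    zf = torch.randn(1, 96, 8, 8)                # initialized dummy template feature
    xf = torch.randn(1, 96, 16, 16)              # initialized dummy search feature
    with torch.no_grad():
        traced = torch.jit.trace(head, (zf, xf))
    traced.save(out_path)
    return traced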

Sherlock-hh commented 4 months ago

By comparing the Python and ncnn code, I found that although the feature values output by the backbone do not seem to match, the head outputs are basically the same. In the end, the post-processing still needs to be modified. In addition to the values other people have provided:

score_size=15
window_influence=0.455
penalty_k=0.138
lr=0.348

Also, remove all the diff_xs and diff_ys related code from the update function and use only pred_xs, and modify the create_grid() function to match the Python generate_point function.
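
As a reference for that last change, here is a minimal sketch of a center-origin point grid in the spirit of the Python generate_point mentioned above; the function name, stride value and exact layout are my assumptions, so check the repository's Python implementation for the values actually used (e.g. score_size=15 above).

import numpy as np

# Sketch of a center-origin point grid (stride and layout assumed; verify against
# the repo's Python generate_point implementation).
def generate_points(stride, size):
    ori = -(size // 2) * stride  # origin offset so the grid is centered on (0, 0)
    xs, ys = np.meshgrid([ori + stride * dx for dx in range(size)],
                         [ori + stride * dy for dy in range(size)])
    points = np.stack([xs.flatten(), ys.flatten()], axis=1).astype(np.float32)
    return points  # shape (size * size, 2), e.g. (225, 2) for size=15

# e.g. generate_points(16, 15) for score_size=15 (a stride of 16 is an assumption)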

CV-poo commented 2 months ago


@Sherlock-hh Hello, I have also been working on deploying NanoTrackV3 recently, using an RKNN model, and I ran into basically the same problems as you. Could you explain what you mean by removing the diff_xs and diff_ys part? Don't we need them to compute the absolute coordinates?

ogvalt commented 2 months ago

@CV-poo how did you solve the NaN problem?

CV-poo commented 2 months ago

Sorry, I have not solved this problem yet. If you solve it, please tell me. Thanks.

CV-poo commented 2 months ago


I can tell you my experience. First of all, the way nanotrackv3 generates grid coordinates is different from nanotrackv2: it uses the center as the origin of the coordinate system. Therefore, when performing the coordinate transformation, diff_xs and diff_ys are not used; when processing coordinates, just add pred_xs. If your experiment is successful, please let me know, thank you!
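
To illustrate what this means (a sketch with hypothetical variable names and assumed scale handling, not the repository's exact code): because the v3 grid is already centered on the previous target position, the predicted offset can be added to the previous center directly, without the diff_xs/diff_ys correction a corner-origin grid would need.

# Sketch of the v3-style center update described above (names are hypothetical and the
# scale handling is an assumption; the offsets come from the best-scoring grid point).
def update_center(prev_cx, prev_cy, pred_x, pred_y, scale_z):
    cx = prev_cx + pred_x / scale_z  # pred_x/pred_y are offsets in the (center-origin)
    cy = prev_cy + pred_y / scale_z  # search-region coordinate system
    return cx, cy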

ogvalt commented 2 months ago


Of course!

Another question: have you tried quantizing nanotrackv3 or deploying it on the RK3566 NPU?

CV-poo commented 2 months ago


I haven't experimented on the RK3566 yet, but I did test on the RK3588, which should be very similar. I did not quantize when I set up NanoTrack with RKNN; I chose the non-quantized mode. I wish you a successful test. Good luck!
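
For what it's worth, a minimal sketch of a non-quantized conversion with rknn-toolkit2, which is how I read the comment above; the target platform, file names and the ONNX input are assumptions, and each NanoTrack sub-model (backbone_127, backbone_255, head) would be converted separately.

from rknn.api import RKNN

# Sketch of a non-quantized RKNN conversion (file names and target platform assumed).
rknn = RKNN()
rknn.config(target_platform='rk3588')        # or 'rk3566', depending on the board
rknn.load_onnx(model='backbone_255.onnx')    # export the sub-model to ONNX first
rknn.build(do_quantization=False)            # non-quantized mode, as described above
rknn.export_rknn('backbone_255.rknn')
rknn.release()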

ogvalt commented 2 months ago


What kind of results did you get? Because mine aren't that good.

CV-poo commented 2 months ago


The result I got was not aligned either; the target box was offset. So I still need to modify other parts, but I haven't found them yet. If you find anything, please let me know. Thank you!