Tjyy-1223 / Neurosurgeon

Cloud-edge collaborative inference 📚 Neurosurgeon: Collaborative Intelligence Between the Cloud and Mobile Edge

Multiple test runs with the same model and the same parameters do not produce identical results #10

Open Sunwb-star opened 9 months ago

Sunwb-star commented 9 months ago

Hello, I ran into two main problems while deploying the code.

  1. I trained an AlexNet on the cats-vs-dogs dataset and obtained the network parameters. I loaded those parameters into the model on both the cloud side and the edge side, split the model, fed in real cat and dog images, and printed the binary classification probabilities once cloud-side inference finished. The predicted class is basically correct every time, but the concrete probabilities differ between runs, e.g. tensor([0.92, 0.08]) the first time and tensor([0.89, 0.11]) the second time. This never changes the final classification, but the predicted probabilities are indeed not identical (a minimal sketch of this behaviour follows this list).
  2. Your original code runs successfully on both Windows and Ubuntu. However, after only changing the size of the input data and loading my own parameters into the model, without modifying the getdata function under net_utils, the cloud-side cloud_api oddly receives edge_out on Windows but not on Ubuntu, so the program hangs and cloud-side inference never proceeds. Do you have any suggestions? Those are my questions; I would be grateful for your advice. Thank you.
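
A minimal sketch of the behaviour described in point 1, assuming a torchvision AlexNet with a two-class head and a hypothetical checkpoint path (alexnet_catdog.pth). A freshly constructed module is in training mode, so its Dropout layers make repeated forward passes on the same input disagree:

```python
import torch
from torchvision.models import alexnet

# Hypothetical two-class AlexNet standing in for the cat/dog model.
model = alexnet(num_classes=2)
# model.load_state_dict(torch.load("alexnet_catdog.pth"))  # hypothetical checkpoint path

x = torch.randn(1, 3, 224, 224)       # stand-in for a preprocessed cat/dog image
p1 = torch.softmax(model(x), dim=1)   # the model is still in training mode here,
p2 = torch.softmax(model(x), dim=1)   # so Dropout is active on each forward pass
print(p1, p2)                         # the two probability vectors generally differ
```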
Tjyy-1223 commented 8 months ago

Hello, here are some thoughts on your questions:

1. Feed the same input on the edge side and run it twice to check whether a single device already produces two identical outputs. Given how model partitioning works, the parameters on the cloud and edge devices are identical, so the split itself should not cause this problem. Check whether you have called model.eval() to suppress the randomness introduced by dropout() (see the sketch after this list).

2. Does it still work correctly if you do not change the size of the input data? From your description, the cause may lie at the Ubuntu level. You can add some print statements inside the getdata function to pinpoint exactly which line the program blocks on.
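
A minimal sketch of the single-device check suggested in point 1, under the same assumptions (a two-class torchvision AlexNet and a hypothetical checkpoint path). Once model.eval() is set, Dropout is disabled and two runs on the same input should agree:

```python
import torch
from torchvision.models import alexnet

model = alexnet(num_classes=2)
# model.load_state_dict(torch.load("alexnet_catdog.pth"))  # hypothetical checkpoint path
model.eval()                            # disables Dropout, making inference repeatable

x = torch.randn(1, 3, 224, 224)         # the same input, fed twice
with torch.no_grad():
    p1 = torch.softmax(model(x), dim=1)
    p2 = torch.softmax(model(x), dim=1)
print(torch.equal(p1, p2))              # expected: True once eval() is set
```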
Sunwb-star commented 8 months ago

Thank you for your reply. The first problem was indeed as you said: after I added model.eval(), the model produces the same output every time. Many thanks. The output is as follows:

successfully connection :<socket.socket fd=1852, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 9999), raddr=('127.0.0.1', 53495)>
get model type successfully.
get partition point successfully.
get edge_output and transfer latency successfully.
short message , transfer latency has been sent successfully
short message , cloud latency has been sent successfully
Probability: tensor([[0.4158, 0.5842]], device='cuda:0')
Predicted class: dog
================= DNN Collaborative Inference Finished. ===================

successfully connection :<socket.socket fd=968, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 9999), raddr=('127.0.0.1', 53504)>
get model type successfully.
get partition point successfully.
get edge_output and transfer latency successfully.
short message , transfer latency has been sent successfully
short message , cloud latency has been sent successfully
Probability: tensor([[0.4158, 0.5842]], device='cuda:0')
Predicted class: dog
================= DNN Collaborative Inference Finished. ===================

successfully connection :<socket.socket fd=1840, family=AddressFamily.AF_INET, type=SocketKind.SOCK_STREAM, proto=0, laddr=('127.0.0.1', 9999), raddr=('127.0.0.1', 53517)>
get model type successfully.
get partition point successfully.
get edge_output and transfer latency successfully.
short message , transfer latency has been sent successfully
short message , cloud latency has been sent successfully
Probability: tensor([[0.4158, 0.5842]], device='cuda:0')
Predicted class: dog
================= DNN Collaborative Inference Finished. ===================

As for the second problem, I will run some more tests later and then discuss it with you. Thanks again for your help!