Azure / iot-edge-modbus

Modbus protocol module for use with the Azure IoT Edge
Other
90 stars 71 forks source link

After restart Docker (rebooted laptop) not all data is read anymore #27

Closed sandervandevelde closed 6 years ago

sandervandevelde commented 6 years ago

I rebooted my laptop and had to restart Docker (does not start automatically).

This is what I see when I run the logging of the module:

40001: 4759 00002: 1 00001: 1 40002: 4757 40001: 4754 00001: 1 00002: 1 40002: 4743 Something wrong with the socket, disposing... A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond Something wrong with the socket, disposing... Object reference not set to an instance of an object. Something wrong with the socket, disposing... Object reference not set to an instance of an object. Something wrong with the socket, disposing... Object reference not set to an instance of an object. Connection lost, reconnecting... 00001: 1 00001: 1 00001: 1 00001: 1 00001: 1 00001: 1 00001: 1 etc.

Before the reboot, I received 4 values (as configured). After the reboot, I only received 1 value.

Configuration is:

"desired": { "PublishInterval": "10000", "SlaveConfigs": { "Slave01": { "SlaveConnection": "192.168.1.142", "HwId": "Wise4012E-142", "Operations": { "Op01": { "PollingInterval": "10000", "UnitId": "1", "StartAddress": "40001", "Count": "1", "DisplayName": "KnobOne" }, "Op02": { "PollingInterval": "10000", "UnitId": "1", "StartAddress": "40002", "Count": "1", "DisplayName": "KnobTwo" }, "Op03": { "PollingInterval": "10000", "UnitId": "1", "StartAddress": "00001", "Count": "1", "DisplayName": "SwitchOne" }, "Op04": { "PollingInterval": "10000", "UnitId": "1", "StartAddress": "00002", "Count": "1", "DisplayName": "SwitchTwo" } } } },

I expected that the module is tollerant against restarting the Docker environment.

stephenctw commented 6 years ago

I'll look in to this. Could you provide details to reproduce the issue? What's the rate you hit this issue every time you reboot? Are you running the module with IoT Edge or as standalone container?

sandervandevelde commented 6 years ago

Hello, I think I have found the source of the issue. I have replicated the issue twice within an hour with several other attempts.

I tested this issue using an Edge module (using the IoT Hub Edge Device modules feature):

20.32 receiving 4 values (this is a good thing) 20:33 restarted docker using docker settings 20:35 agent comes up, only; no other images 20:36 iotedgectl setup run 20:36 iotedgectl start 20:38 receiving 4 values again 20:39 quit docker 20:40 start docker 20:44 agent only; need setup again 20:49 4 values again 20:50 surface pro sleep en open 20:51 4 values again 20:54 restart surface pro; starting docker; all modules back 20:56 4 values again 20:57 shutdown + start surface pro; start docker 40002: 4832 00001: 0 40001: 4832 00002: 0 Something wrong with the socket, disposing... An existing connection was forcibly closed by the remote host Something wrong with the socket, disposing... Object reference not set to an instance of an object. Something wrong with the socket, disposing... Object reference not set to an instance of an object. Something wrong with the socket, disposing... Object reference not set to an instance of an object. Connection lost, reconnecting... 00002: 0 00002: 0 00002: 0 00002: 0 00002: 0 00002: 0 00002: 0 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!! 21:00 running iotedgectl setup again; did not work ; modbus in use by another process? had to reboot machine 21:09 edge running again 21:10 4 values again 21:11 shutdown and restart of surface pro 21:12 docker start 21:13 4 values again 21:18 restart surface pro 21:19 start docker 21:21 docker ps returns no containers???? no edge... 21:22 iotedgectl setup + start 21:24 4 values again 21:24 shutdown + start surface pro 21:25 docker start 21:26 4 values 21:28 shutdown + start surface pro 21:29 docker start 21:31 4 values 21:32 shutdown + start surface pro; changed to other WIFI router (I have 2 routers in house, both two bands (N + AC) so 4 choices) 21:34 start docker 21:36 4 values 40002: 4830 00002: 0 00001: 0 40001: 4827 Something wrong with the socket, disposing... An existing connection was forcibly closed by the remote host Connection lost, reconnecting... 00001: 0 40001: 4830 40002: 4827 00002: 0 00001: 0 21:37 shutdown + start surface pro; changed to other WIFI router 21:38 dokcker start 00002: 0 00001: 0 40001: 4827 40002: 4827 Something wrong with the socket, disposing... An existing connection was forcibly closed by the remote host Something wrong with the socket, disposing... Object reference not set to an instance of an object. Something wrong with the socket, disposing... Object reference not set to an instance of an object. Something wrong with the socket, disposing... Object reference not set to an instance of an object. Connection lost, reconnecting... 00001: 0 00001: 0 00001: 0 00001: 0 00001: 0 00001: 0 !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

It looks like switching from WIFI network (from 192.168.1.X to 192.168.2.Y or from wifi-N to wifi-AC) seems to be the issue. This could cause (sometimes) the "Object reference not set to an instance of an object."

stephenctw commented 6 years ago

Hi, I think I found the root cause of your issue, could you pull the latest source code and verify the fix? Thanks.

sandervandevelde commented 6 years ago

Hello Stephen,

No success. I connected a regular IoTHub device as told in https://github.com/Azure/iot-edge-modbus and ran my application.

I use a Advantech Wise 4012E IoT Developer kit. I can ping the device from my laptop.

IoT Hub module client initialized. Desired property change: {"PublishInterval":"5000","SlaveConfigs":{"Slave01":{"SlaveConnection":"192.168.1.142","HwId":"Wise4012E-142","Operations":{"Op01":{"PollingInterval":"5000","UnitId":"1","StartAddress":"40001","Count":"1","DisplayName":"KnobOne"},"Op02":{"PollingInterval":"5000","UnitId":"1","StartAddress":"40002","Count":"1","DisplayName":"KnobTwo"},"Op03":{"PollingInterval":"5000","UnitId":"1","StartAddress":"00001","Count":"1","DisplayName":"SwitchOne"},"Op04":{"PollingInterval":"5000","UnitId":"1","StartAddress":"00002","Count":"1","DisplayName":"SwitchTwo"}}}},"$version":2} Attempt to load configuration: {"PublishInterval":"5000","SlaveConfigs":{"Slave01":{"SlaveConnection":"192.168.1.142","HwId":"Wise4012E-142","Operations":{"Op01":{"PollingInterval":"5000","UnitId":"1","StartAddress":"40001","Count":"1","DisplayName":"KnobOne"},"Op02":{"PollingInterval":"5000","UnitId":"1","StartAddress":"40002","Count":"1","DisplayName":"KnobTwo"},"Op03":{"PollingInterval":"5000","UnitId":"1","StartAddress":"00001","Count":"1","DisplayName":"SwitchOne"},"Op04":{"PollingInterval":"5000","UnitId":"1","StartAddress":"00002","Count":"1","DisplayName":"SwitchTwo"}}}},"$version":2} 40002: 4819

00002: 1

40001: 4843

00001: 1

Something wrong with the socket, disposing...
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
Connection lost, reconnecting...
00002: 1

40001: 4843

Something wrong with the socket, disposing...
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
Connection lost, reconnecting...
Something wrong with the socket, disposing...
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
Connection lost, reconnecting...
00002: 1

40001: 4838

00001: 1

Something wrong with the socket, disposing...
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
Connection lost, reconnecting...
40002: 4835

40001: 4824

00001: 1

Something wrong with the socket, disposing...
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
Connection lost, reconnecting...
Something wrong with the socket, disposing...
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
Connection lost, reconnecting...
etc.

The values are ok but a lot of reconnects an a lot of values missing.

I used the same configuration as shown above, but now on that device instead of the modbus module.

Have run it several times al with the same outcome.

stephenctw commented 6 years ago

Hi, Looks like we get rid of "Object reference not set to an instance of an object." error with the fix. Now you lost some data and constantly reconnecting. Are you reproducing the issue with same steps?(e.g. switching from WIFI network (from 192.168.1.X to 192.168.2.Y or from wifi-N to wifi-AC)). I am wondering if you still see the issue when using a wired connection. Could you test it? Also, it would be helpful if you can capture the packets with wireshark when the issue happen.

Thanks.

sandervandevelde commented 6 years ago

This is just from turning the application on and watching the communication. I have not switched yet. I think I need a steady flow of telemetry before I can start doing some tests with changing the network. Will try to use wireshark next time.

stephenctw commented 6 years ago

Would like to confirm one thing. From your initial statement, the issue happens when you switch between Wi-Fi network, but now the issue happens when the application starts, right? Would like to know if my fix makes it worse. Can you verify this by switching to the code before fix?

Since all your reads are with the same Modbus device, they share the same TCP socket. So this doesn't make sense while some registers are read fine and some failed. I am guessing this might be related to the wireless network is not stable. I've tested the module with disconnect/reconnect scenario in a wired network but can't reproduce your issue. Below is my log.

400001: 17416
400002: 3723
400001: 4931
400002: 17934
Something wrong with the socket, disposing...
An established connection was aborted by the software in your host machine
Connection lost, reconnecting...
Connect Slave failed
No connection could be made because the target machine actively refused it 127.0.0.1:502
Connection lost, reconnecting...
Connect Slave failed
No connection could be made because the target machine actively refused it 127.0.0.1:502
Connection lost, reconnecting...
400001: 24848
400002: 11634
400001: 30850
400002: 10311

All reads resume correctly. With wireshark capture we can verify if it's networking issue or module issue.

sandervandevelde commented 6 years ago

Yes, according to the tests I described, I presume that change between (wifi) networks seems to have an impact on the quality of the module.

The Modbus device I have and use for testing supports Wifi only.

This evening I have checked out the latest version again.

I am sitting in the same room as the Wifi router and connected using a network cable. And I also switched between the wired network and two Wifi networks. My Wise module is showing it has a very good reception (all Wifi indication LEDs are lit).

The connection is better but still, a lot of the communication intervals, say 25-50 percent, have one or more values missing. The communication tries to reconnect but there does not seem to be a pattern.

Attached is a Wireshark export file. It is from one run, wired network (for the laptop) only. I hope this will make more sense.

wise4012EModbusExportSandervandeVelde.zip

stephenctw commented 6 years ago

Hi, See another bug in the code. The application is trying to read response too fast before the response arrived. This will cause the exception you saw earlier. I am not seeing the issue because I am using wired network, much less delay then wireless network. I now make a maximum 250ms wait for reading the response, should resolve your issue.

Please pull the latest code and give it a try. Thanks.

EDIT(20180201): the wait is actually 500ms, 250 is miscaculation.

sandervandevelde commented 6 years ago

Is it interesting to make this variable part of the desired properties?

Anyway, the output is a bit cleaner but I still see a lot of missing values:

40002: 4735

00002: 1

00001: 0

Clearing socket receive buffer...
Clearing socket receive buffer...
00001: 1

Clearing socket receive buffer...
40001: 4743

40002: 4738

00002: 1

I added another wireshark file. I hope this will show you the reason of the dropped values.

Configuration has not changed: wired laptop, wireless Wise module, Wifi router nearby.

wise4012EModbusExportSandervandeVelde-updated20180201.zip

stephenctw commented 6 years ago

Yes, I will make this variable configurable. Looks like the same issue to me yet the 500ms wait I add is not enough. Please pull latest code and add two values "RetryCount" and "RetryInterval" in your configuration as below.

{
   "Slave01": {
      "SlaveConnection": "127.0.0.1",
      "HwId": "PowerMeter-0a:01:01:01:01:01",
      "RetryCount": "10",
      "RetryInterval": "100",
}

"RetryCount" indicates the max retry attempt when socket receive buffer is empty. "RetryInterval" indicates the wait interval in ms between each retry.

If the issue persists, increase either of the values and try again. There should exists a value sufficient enough to eliminate the issue.

You can also find evidence in your wireshark capture. Packet #51 is requesting coil value no.1 at time 15.567130, the response of that request arrive as packet #56 at 16.201111. The time difference is 633ms and exceeds 500ms.

sandervandevelde commented 6 years ago

I tried the latest version, feels good!

Stability is improved dramatically. No drop-outs occurred during the tests I just performed.

I'm not sure if the two settings had any effect, even without them everything worked fine.

I have attached two files. One shows communication with less optimal wireless conditions (and switching between network). The other one shows good (wireless) conditions and providing new Retry interval (75 and 50).

Would like to see this being available as an IoT Edge module.

wise4012EModbusExportSandervandeVelde-updated20180202.zip

wise4012EModbusExportSandervandeVelde-updated20180202-lessgoodcondition.zip

stephenctw commented 6 years ago

Glad to hear the good news. If you are confident the issue is resolved with current fix, you can close the issue. And I will prepare a release as an IoT Edge module on Docker Hub.

sandervandevelde commented 6 years ago

Thanks for sorting this out. Will close issue now and looking forward to the updated module.

stephenctw commented 6 years ago

Now available on Docker Hub. IoT Edge Modbus 1.0.2 Preview.