Heartbeat logic for shdr-adapter prevents correct syncronization with adapters

victoririzarx3d commented 1 month ago

Hi @PatrickRitchie

I found an issue with the shdr-adapter on v6.4.1. Tested on Ubuntu 20.04

I have an adapter running locally on port 7878 for my device data. The shdr-adapter module configuration on the agent looks like this

- shdr-adapter:
    deviceKey: H350
    hostname: localhost
    port: 7878

I keep noticing that the adapter connects and disconnects at fixed intervals.

The agent logs show this every 30 seconds more or less

2024-07-31 15:55:24.5688|DEBUG|modules.shdr-adapter|ID = adapter_shdr_56697b5624 : Listening for connection from localhost on Port 7878
2024-07-31 15:55:24.5688|INFO|modules.shdr-adapter|ID = adapter_shdr_56697b5624 : Connected to Adapter at localhost on Port 7878
2024-07-31 15:55:24.5688|DEBUG|modules.shdr-adapter|ID = adapter_shdr_56697b5624 : Initial PING sent to : localhost on Port 7878
2024-07-31 15:55:24.5806|DEBUG|modules.shdr-adapter|ID = adapter_shdr_56697b5624 : PONG Received from : localhost on Port 7878 : Heartbeat = 10000ms
2024-07-31 15:55:54.5544|DEBUG|modules.shdr-adapter|ID = adapter_shdr_56697b5624 : PING sent to : localhost on Port 7878
2024-07-31 15:56:04.5556|DEBUG|modules.shdr-adapter|ID = adapter_shdr_56697b5624 : Unable to write data to the transport connection: Broken pipe.
2024-07-31 15:56:04.5556|INFO|modules.shdr-adapter|ID = adapter_shdr_56697b5624 : Disconnected from localhost on Port 7878

On my adapter side what I notice is that the heartbeat thread keeps timing out. From the logs you can see that the time measured between the agent receiving the PONG response to the time it tries to send the PING is 30 secs. The default timeout is 10000 ms so the adapter is waiting at most 20 seconds, therefore it times out and closes the socket, which then triggers the disconnection on the agent side because it is trying to write the PING on a closed socket

I tried to change the timeout on the agent side but still it was always sending the PING late

Looking at the code I found that the logic controlling the sending of the PING is preventing the round trip of the heartbeat to be successful because it is sending only data when the difference between the last message and the current time is larger than the timeout value AND when the difference between the last heartbeat and current time is larger than the timeout value. This makes the sending of the PING dependable on the message timing. Changing the condition to OR solves the issue

https://github.com/TrakHound/MTConnect.NET/blob/48e861a5617de5436680e535f9264c27dd47564a/libraries/MTConnect.NET-SHDR/Shdr/ShdrClient.cs#L256-L263

Making the agent honor the timeout value coming from the adapter

On the MTConnect adapter agent protocol docs it is stated that the agent should honor the adapter's timeout value once it receives a correct PONG response

The agent and the adapter have a heartbeat that makes sure each is responsive to properly handle disconnects in a timely manner. The Heartbeat frequency is set by the adapter and honored by the agent. When the agent connects to the adapter, it first sends a PING and then expects the response PONG where is specified in milliseconds

Currently the agent just checks that the PING text matches the pattern, parses the timeout integer, and logs the value. I have made some changes to add heartbeat as a class member that is defaulted to 10000 and, if the PONG response is correct, parse the timeout value and update the heartbeat member it and use it where needed.

I can create a PR for all of the above if it suits

PatrickRitchie commented 1 month ago

Thank you for looking into this and for the information.

It looks like you are correct that I was not setting the Heartbeat based on the response from the Adapter. This looks like something I initially had implemented but when I reorganized the class I separated the ListenForAdapter() and ProcessResponse() methods and mistakenly set the "heartbeat" variable as local in each method which broke that functionality. I believe this may be the same (or similar) problem that was mentioned in Issue #62.

If you want to create a PR to properly set the heartbeat, that would be great!

I am not sure about the suggestion to change the AND to OR on line 257 as that would cause the PING to be sent even if SHDR data was received within the Heartbeat. There seems to be some differences between the MTCUP, GitHub, and (older) MTC_SHDR.doc documentation. I believe where I got the idea that the Agent should not send a PING if data was received is from the below text taken from the older MTC_SHDR.doc document:

The agent and the adapter should consider the other side dead if it does not receive a response or a heartbeat within 2 * interval.

The full section:

B.5. Heartbeats The new MTConnect agent supports heartbeats to detect when the adapter is no longer connected. This functionality was added since the adapter could become silent simply because the machine is not active and data is not changing. This would appear the same as a process that was either frozen or a machine was turned off without closing the socket properly. The adapter may choose the heartbeat interval and the agent will respect it. The agent and the adapter should consider the other side dead if it does not receive a response or a heartbeat within 2 interval. For example, if the adapter responds to the heartbeat request with 10000 meaning 10000 ms or 10 seconds, the agent will consider the adapter dead if it does not respond within 20 seconds. Similarly, the adapter will consider the agent dead if it does not heartbeat in 20 seconds. The protocol is rather simple, the agent sends the adapter a request on a single line as follows: ``` PING Any line that begins with a * is considered a command and is not parsed as a key/value pair stream. When the PING is received the adapter responds with a PONG as follows: * PONG 10000``` The number after the pong is the heartbeat interval in milliseconds, in this case 10 seconds. If the adapter does not respond with a PONG the agent will consider the adapter a legacy adapter and will disable heartbeats and timeouts. The current implementation will have the server send pings every x seconds and disconnect if it does not receive any data in 20 seconds. The adapter should check the read side of the socket after every full attempt to get data from the device and if it receives a PING it should immediately write a PONG on the socket. If a multithreaded adapter is used, the write must be mutexed to avoid interspersing the data. After the agent disconnects or times out, the adapter should not exit, but should wait for another connection from the agent. As well, the agent will attempt to reconnect when a connection to the adapter disconnects or times out.

If this is causing an issue then we can make the change you suggested. Otherwise I will create an issue within the SC Agent Working Group to get clarification on the expected behavior.

victoririzarx3d commented 1 month ago

Thanks for the info @PatrickRitchie

This clarifies the behavior we see. Our adapter is based on some older python code that probably was interacting with the old C++ mtconnect agent that required the heartbeat to be sent regardless if data has been exchanged i between the heartbeat interval. If the change is non breaking on the agent side then It would be nice to have it there, I am not sure If you have seen similar issues when the agent interacts with other shdr adapters. Otherwise we will path the adapter on our side or path the agent code for our system.

I will create a PR for the heartbeat setting this week

PatrickRitchie commented 1 month ago

Yes I should be able to add this as a configuration option to force the Ping regardless of whether data has been received.

I will add this to the next release.

TrakHound / MTConnect.NET

Heartbeat logic for shdr-adapter prevents correct syncronization with adapters #70

Making the agent honor the timeout value coming from the adapter