bitfocus / companion-module-generic-swp08

Needs a keep alive message to use in routed networks #21

Open phillipivan opened 5 days ago

phillipivan commented 5 days ago

The lack of a keep alive message means the module can't be used safely in routed networks: the session will be closed out after a period of inactivity.

JeLuF commented 5 days ago

Why is there a difference between routed networks and local ones?

phillipivan commented 5 days ago

When a silent session is held open for a protracted period, it is liable to be marked as stale and closed out, depending on the network infrastructure it crosses, causing a session reset.

That would be problem enough. However, for reasons I don't totally understand, the session silently closed by an intermediary is consistently detected by the client, and never* by the host. So the client reconnects and the host accepts the new session, but the old one remains held open.

And thus the cycle repeats at a more or less fixed interval (for a consistently quiet session) until the host cannot accept any more parallel sessions, and all connectivity is lost. If you are working with software, you often have a separate way to connect to the host (such as ssh) and restart the software; hardware, however, almost invariably requires a reboot to resolve.

This is exactly what I saw when testing it with our Ross Ultrix today, but I have seen this exact pattern play out with many other devices and protocols (I made a PR to fix it in the BMD Teranex module, and Andrew was good enough to fix it in the Yamaha RCP module).

Ultimately I consider this a design defect of the protocol, but since we rarely have the possibility of changing that, implementing a Keep Alive message after a short period of inactivity (say 30s) is a simple and safe band-aid. In the absence of a formal KA message, a benign query or sometimes even a faulty message will suffice.
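A minimal sketch of the idle-triggered keep alive described above, not the module's actual code. `IdleKeepAlive`, its method names, and the `sendKeepAlive` callback are hypothetical; what gets sent (a formal KA message or a benign query) is protocol-specific and left to the caller. Time is passed in explicitly so the logic can be checked without real timers.

```typescript
// Hypothetical helper: names and the 30 s default are illustrative only.
class IdleKeepAlive {
  private readonly sendKeepAlive: () => void
  private readonly idleMs: number
  private lastActivityMs: number

  constructor(sendKeepAlive: () => void, startMs: number, idleMs: number = 30000) {
    this.sendKeepAlive = sendKeepAlive
    this.idleMs = idleMs
    this.lastActivityMs = startMs
  }

  // Call whenever real traffic is sent or received on the session.
  noteActivity(nowMs: number): void {
    this.lastActivityMs = nowMs
  }

  // Call periodically. Sends a keep alive only if the session has been
  // quiet for at least idleMs; returns true when one was sent.
  poll(nowMs: number): boolean {
    if (nowMs - this.lastActivityMs < this.idleMs) {
      return false
    }
    this.sendKeepAlive()
    this.lastActivityMs = nowMs // the keep alive itself counts as activity
    return true
  }
}

// Wiring sketch (assumes some connected socket-like object and a
// protocol-appropriate `benignQuery` buffer, both hypothetical):
//   const ka = new IdleKeepAlive(() => socket.write(benignQuery), Date.now())
//   setInterval(() => ka.poll(Date.now()), 1000)
```

Resetting the idle clock on real traffic means a busy session sends no extra messages; the keep alive only appears when the link would otherwise go silent.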

Since I don't have the luxury of working in flat networks, any module/system exhibiting these characteristics is a significant liability.

*Really never. I've seen probably a dozen different instances of this problem and I've never seen the host close out the old session.

JeLuF commented 5 days ago

Sounds like you have some firewalls or NAT gateways in your network. A plain router doesn't need to track sessions. It works at OSI layer 3, and TCP sessions are a layer 4 thing.

NAT gateways or firewalls would explain why you don't see the host closing the connection. The entry in the connection table of the network device would expire and there wouldn't be any notification about this to the two communication partners. The moment one of the two would try to send a message, it would receive a "reset" from the gateway/firewall. The other communication partner would still not be aware that the connection has been closed.
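As a rough illustration of that expiry behaviour, here is a toy model of a connection table with an idle timeout. `ConnTable` and its methods are hypothetical names, the timeout is arbitrary, and real gateways differ (some drop packets for expired entries silently instead of answering with a reset).

```typescript
// Toy model of a stateful gateway's connection table; all names are hypothetical.
class ConnTable {
  private readonly idleTimeoutMs: number
  private readonly lastSeenMs = new Map<string, number>()

  constructor(idleTimeoutMs: number) {
    this.idleTimeoutMs = idleTimeoutMs
  }

  // A new session is established through the gateway.
  open(flow: string, nowMs: number): void {
    this.lastSeenMs.set(flow, nowMs)
  }

  // A packet arrives for `flow`. If the entry is missing or has been idle
  // for longer than the timeout, the sender gets a reset. Nothing was sent
  // to either endpoint at the moment the entry expired.
  forward(flow: string, nowMs: number): 'forwarded' | 'reset' {
    const seen = this.lastSeenMs.get(flow)
    if (seen === undefined || nowMs - seen > this.idleTimeoutMs) {
      this.lastSeenMs.delete(flow)
      return 'reset'
    }
    this.lastSeenMs.set(flow, nowMs) // traffic refreshes the entry
    return 'forwarded'
  }
}
```

In this model, traffic at intervals shorter than the timeout keeps the entry alive indefinitely, while a long silence is only discovered by whichever side sends next.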

Thanks for explaining. I was just curious about this and will now have to check "my" modules. I guess some of them don't do heartbeats/keepalives.

phillipivan commented 4 days ago

> Sounds like you have some firewalls or NAT gateways in your network. A plain router doesn't need to track sessions. It works at OSI layer 3, and TCP sessions are a layer 4 thing.

Yeah that's right, plain routing isn't really a thing in our infrastructure - and I suspect that is true of many larger institutions/systems these days.

> NAT gateways or firewalls would explain why you don't see the host closing the connection. The entry in the connection table of the network device would expire and there wouldn't be any notification about this to the two communication partners. The moment one of the two would try to send a message, it would receive a "reset" from the gateway/firewall. The other communication partner would still not be aware that the connection has been closed.
>
> Thanks for explaining. I was just curious about this and will now have to check "my" modules. I guess some of them don't do heartbeats/keepalives.

My observation is that the client side seems to find out the session has been closed pretty quickly. That can't be because of failed packet sends: if the client were sending even semi-regularly, this wouldn't have happened in the first place. I have observed logs where this fault repeats at a very regular time interval, so we can infer that when the system is not in use (i.e. overnight) there is no traffic after some initial connect and query.

At any rate, because the consequences are so severe (loss of control, often requiring a reboot to restore), I think we (module devs) should actively try to mitigate it wherever it is a possibility.