crossbario / crossbar

Crossbar.io - WAMP application router
https://crossbar.io/

Allow TCP user timeout configuration in endpoint #1929

Closed · Bastian-Krause closed this issue 2 years ago

Bastian-Krause commented 2 years ago

TCP_USER_TIMEOUT (see man tcp) is a socket option that causes a TCP connection to be closed if transmitted data is not acknowledged within the given time (in milliseconds). This allows failing fast on dead connections instead of waiting up to 20 minutes for ACKs that will never arrive.

A use case for this is an embedded device connecting to Crossbar.io. If the device is shut down forcibly, the connection hangs and subsequent procedure registrations fail until the connection is finally closed, which can take up to 20 minutes.
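
For reference, this is roughly what the option looks like at the plain socket level (just a sketch, not the code from this PR; host, port and the 10 s value are made up):

    import socket

    # Sketch: set TCP_USER_TIMEOUT (Linux >= 2.6.37, exposed as
    # socket.TCP_USER_TIMEOUT since Python 3.6) on a client socket, so that
    # data left unacknowledged for more than 10 s aborts the connection.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 10000)  # milliseconds
    sock.connect(("router.example.net", 8080))
    # If the peer silently disappears, pending writes now fail with ETIMEDOUT
    # after roughly 10 s instead of hanging for up to ~20 minutes.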

oberstet commented 2 years ago

Cool! this makes sense, and I already had a peek into the PR which looks great and complete. Will review that later ..

Anyways, couple of comments:

  • How do I set the system default on Linux?
  • Why does a user want to use the new option in Crossbar.io to configure per-TCP connection TCP timeouts, rather than the host admin having it configured at the system level?
  • How does that work on a containerized (Docker) Crossbar.io?

TCP_USER_TIMEOUT (since Linux 2.6.37)
      This option takes an unsigned int as an argument.  When
      the value is greater than 0, it specifies the maximum
      amount of time in milliseconds that transmitted data may
      remain unacknowledged, or buffered data may remain
      untransmitted (due to zero window size) before TCP will
      forcibly close the corresponding connection and return
      ETIMEDOUT to the application.  If the option value is
      specified as 0, TCP will use the system default.

      Increasing user timeouts allows a TCP connection to
      survive extended periods without end-to-end connectivity.
      Decreasing user timeouts allows applications to "fail
      fast", if so desired.  Otherwise, failure may take up to
      20 minutes with the current system defaults in a normal
      WAN environment.

I guess having the option to set this via node config is sth good to have in any case.
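
eg the endpoint part of a transport in the node config could then look roughly like this (just a sketch, written as a Python dict - the key name "user_timeout" and its placement are assumptions on my side, the actual naming is whatever the PR defines):

    # Hypothetical sketch of a TCP listening endpoint with the new option.
    # "user_timeout" is an assumed key name, not necessarily the one used
    # by the PR; value in milliseconds, 0 meaning "system default".
    endpoint = {
        "type": "tcp",
        "port": 8080,
        "user_timeout": 10000,
    }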

However, my personal view/experience with mobile clients and TCP connections (incl. WAMP/WebSocket/RawSocket) is:

  • one needs both client initiated and router initiated WebSocket ping-pongs or even WAMP level bidirectional "ping-pong" (eg call "something.echo" on each side and vice-versa)
  • only this allows fast detection of broken TCP both at the client and the router side
  • one needs timeouts both at the client and router level of course
  • Crossbar.io and Autobahn support bidirectional WebSocket keep-alive heartbeating

Further, on mobile networks, a client modem can go into different levels of energy-saving deep sleep, where the TCP connection is apparently alive, but no traffic is hitting the device - this is handled via the signaling/control channel of the mobile network.

And the actual parameters for the 3GPP-defined deep sleep and wakeup algorithms are defined by the mobile carrier.

Once in deep sleep, sending a WAMP event or call result can take 30s.

To prevent that, to detect full TCP connection failures fast, and to keep the connection in a snappy state, one needs to incur actual TCP-level payload traffic - with requirements both on the maximum time between traffic and on bandwidth!

eg do sth with 4KB traffic volume every 15s

again, this is easy with Crossbar.io and Autobahn, as you can specify the ping-pong payload size via configuration - exactly for this use case. You couldn't do that with the new option ..
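
for reference, this is roughly what those knobs look like on the Autobahn|Python side (values are just examples; note a WebSocket ping frame payload is capped at 125 bytes by the RFC, so the "4KB every 15s" kind of traffic needs app level echo calls on top):

    from autobahn.twisted.websocket import WebSocketClientFactory

    # Sketch: Autobahn's built-in WebSocket keep-alive ("auto ping") knobs.
    # The values below are examples only.
    factory = WebSocketClientFactory("ws://router.example.net:8080/ws")
    factory.setProtocolOptions(
        autoPingInterval=15,  # send a WebSocket ping every 15 s
        autoPingTimeout=5,    # drop the connection if no pong arrives within 5 s
        autoPingSize=125,     # ping payload size in bytes (max. 125)
    )

Crossbar.io exposes corresponding auto-ping options in the WebSocket transport options of the node config.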

oberstet commented 2 years ago

here is a reference for the mobile networks and power saving states aspects https://mailarchive.ietf.org/arch/msg/hybi/DCNqIF6HSMBryccBEgVCR2e2hNo/

Bastian-Krause commented 2 years ago

Cool! this makes sense, and I already had a peek into the PR which looks great and complete. Will review that later ..

Thanks!

Anyways, couple of comments:

Ok!

Right, should I mention that somewhere?

  • How do I set the system default on Linux?

I read that part too and immediately thought: that's the way to go. I've looked at the kernel code, but could not find anything there. Google didn't help either. Now I am not sure whether that sentence actually means a system-wide user timeout default or just refers to the other TCP timeouts that exist.

  • Why does a user want to use the new option in Crossbar.io to configure per-TCP connection TCP timeouts, rather than the host admin having it configured at the system level?

I've implemented this since I could not find a way to set this system-wide. If you find a way to do that, please let me know.

  • How does that work on a containerized (Docker) Crossbar.io?

What exactly? Setting this system-wide? Then no idea, see above.

TCP_USER_TIMEOUT (since Linux 2.6.37)
      This option takes an unsigned int as an argument.  When
      the value is greater than 0, it specifies the maximum
      amount of time in milliseconds that transmitted data may
      remain unacknowledged, or buffered data may remain
      untransmitted (due to zero window size) before TCP will
      forcibly close the corresponding connection and return
      ETIMEDOUT to the application.  If the option value is
      specified as 0, TCP will use the system default.

      Increasing user timeouts allows a TCP connection to
      survive extended periods without end-to-end connectivity.
      Decreasing user timeouts allows applications to "fail
      fast", if so desired.  Otherwise, failure may take up to
      20 minutes with the current system defaults in a normal
      WAN environment.

I guess having the option to set this via node config is sth good to have in any case.

However, my personal view/experience with mobile clients and TCP connections (incl. WAMP/WebSocket/RawSocket) is:

  • one needs both client initiated and router initiated WebSocket ping-pongs or even WAMP level bidirectional "ping-pong" (eg call "something.echo" on each side and vice-versa)
  • only this allows fast detection of broken TCP both at the client and the router side
  • one needs timeouts both at the client and router level of course
  • Crossbar.io and Autobahn support bidirectional WebSocket keep-alive heartbeating

Yes, ping/pong is a valid approach, and in fact the TCP user timeout does not do anything if no data is transmitted. Then you would need something like keep-alive probing on top.

I've implemented this for our use case in labgrid. We cannot rely on auto-ping-pong there, since some components, like our pytest plugin, do not use asynchronous code execution. Since we want to use the same endpoint and port for all components, the TCP user timeout approach made the most sense to us.
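
For completeness, the "keep-alive probing on top" part can also be done at the plain TCP level, next to the user timeout (a sketch with made-up values, not what labgrid or the PR does):

    import socket

    # Sketch: combine TCP keepalive probes (to generate traffic on an idle
    # connection) with TCP_USER_TIMEOUT (to bound how long unacknowledged
    # data may linger). All values are examples.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 15)   # start probing after 15 s idle
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5)   # probe every 5 s
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # give up after 3 missed probes
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 10000)  # 10 s, in milliseconds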

Further, on mobile networks, a client modem can go into different levels of energy-saving deep sleep, where the TCP connection is apparently alive, but no traffic is hitting the device - this is handled via the signaling/control channel of the mobile network.

And the actual parameters for the 3GPP-defined deep sleep and wakeup algorithms are defined by the mobile carrier.

Once in deep sleep, sending a WAMP event or call result can take 30s.

To prevent that, to detect full TCP connection failures fast, and to keep the connection in a snappy state, one needs to incur actual TCP-level payload traffic - with requirements both on the maximum time between traffic and on bandwidth!

eg do sth with 4KB traffic volume every 15s

again, this is easy with Crossbar.io and Autobahn, as you can specify the ping-pong payload size via configuration - exactly for this use case. You couldn't do that with the new option ..

I don't know too much about the mobile aspect. The use case I described is a small embedded board located in a testing rack, so it is not mobile. But it can happen that the board is powered down by pulling the power plug due to a reconfiguration of the testing hardware in the rack.

oberstet commented 2 years ago

Right, should I mention that somewhere?

yes, I think it would be good to mention that

and then just say: "0 behaves like not configured at all"
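
ie sth along these lines (just a sketch, not the actual PR code):

    import socket

    def apply_user_timeout(sock: socket.socket, user_timeout: int) -> None:
        # Sketch: treat 0 (or None) the same as "option not configured",
        # ie leave the kernel default untouched.
        if user_timeout:
            sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT,
                            user_timeout)  # milliseconds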

Now I am not sure whether that sentence actually means a system-wide user timeout default or just using the other TCP timeouts that exist.

yeah, the Linux man page is ambiguous .. could be a kernel compile define? anyways, I also tried to find it .. unsuccessfully. It might even be worth mentioning this in the feature docs as well, to save a 3rd person some time, or to trigger someone to clarify ..

I've implemented this since I could not find a way to set this system-wide. If you find a way to do that, please let me know.

me neither. and that's worth mentioning, because it's a very good motivation for the feature;)

What exactly? Setting this system-wide? Then no idea, see above.

if you run Crossbar.io in a Docker container, can that node set the socket option, and does that take effect? some socket types/options require host root, or specific capabilities on the Docker daemon, or ...
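
one cheap way to check that empirically inside the container (a sketch, nothing Crossbar.io-specific): set the option and read it back with getsockopt:

    import socket

    # Sketch: verify inside the container that the option actually sticks.
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, 10000)
    print(sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT))  # expect 10000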

Since we want to use the same endpoint and port for all components, the TCP user timeout approach made the most sense to us.

Ok, I see. Crossbar.io also supports "universal endpoints" which auto-detect Web vs WebSocket vs RawSocket - but we only have heartbeating on incoming connections that turn out to be WebSocket.

TCP user timeout approach made the most sense to us

yes, sounds good!

I don't know too much about the mobile aspect. The use case I described is a small embedded board located in a testing rack, so it is not mobile.

ah, right, then forget my comments. on a wired connection, things behave differently .. "pulling the power plug": yes, you will want fast TCP connection loss detection then - if the device and router are on different subnets / ethernet segments ..