m-stein closed this issue 4 months ago.
I'm going to integrate the fixes in #4728 plus your debug commit into my Sculpt system and will report any insights.
Update: I had one aborted connection yesterday but the log did not reveal any new information, as I hadn't enabled the verbose packet log yet. But after I integrated your additional debug commit, networking broke completely with the following error.
[runtime -> nic_router] Error: Uncaught exception of type 'Net::Interface::Bad_transport_protocol'
[runtime -> nic_router] Warning: abort called - thread: ep
Thus, I investigated and finally 7d1779ef39612fcac1ebe3ea41f6f9c4d9325061 fixed the uncaught exception.
[runtime -> nic_router] Warning: unknown transport layer protocol
[runtime -> nic_router] Warning: unknown transport layer protocol
I'm curious how this may happen and how you ensure the exception is caught at all other places.
Small update: I use Sculpt 23.04 and had an SSH interruption this morning with the following extended error message.
packet_write_wait: Connection to <IP address> port 22: Broken pipe
packet_write_wait: Connection to UNKNOWN port 65535: Broken pipe
The second line bothers me. It may stem from the nature of the connection, which uses the SSH proxy mechanism, but could also hint at the issue we are looking for.
I set up a Linux-based bridge to monitor all network traffic of my Sculpt machine and sighted two SSH connection interruptions. In both cases the TCP source port of the established connection on the Sculpt side suddenly changes, which the server side rejects with an RST. Further packets from the SSH server to the original TCP port are answered by the NIC router with ICMP Destination unreachable (Network unreachable). @m-stein I can provide you with the PCAP files.
This morning I provoked stress on the NIC router with sudo nmap -sS -O <LAN router IP>
and triggered an SSH interruption with comparable symptoms after a few seconds.
No. Time Source SPort Destination DPort Protocol Length Info
672353 2023-12-05 07:41:00,134210878 10.0.0.30 51419 10.0.0.6 22 TCP 66 51419 → 22 [ACK] Seq=131622 Ack=6142062 Win=161664 Len=0 TSval=2000755302 TSecr=1528207574
672461 2023-12-05 07:41:30,178468389 10.0.0.30 51419 10.0.0.6 22 SSHv2 118 Client: Encrypted packet (len=52)
672462 2023-12-05 07:41:30,178846776 10.0.0.6 22 10.0.0.30 51419 SSHv2 94 Server: Encrypted packet (len=28)
1 672463 2023-12-05 07:41:30,179274264 10.0.0.30 51419 10.0.0.6 22 TCP 66 51419 → 22 [ACK] Seq=131674 Ack=6142090 Win=162048 Len=0 TSval=2000785277 TSecr=1528237619
2 674726 2023-12-05 07:42:00,157076594 10.0.0.30 54273 10.0.0.6 22 SSH 118 Client: Encrypted packet (len=52)
3 674727 2023-12-05 07:42:00,157343544 10.0.0.6 22 10.0.0.30 54273 TCP 60 22 → 54273 [RST] Seq=1 Win=0 Len=0
4 674728 2023-12-05 07:42:00,163496730 10.0.0.6 22 10.0.0.30 51419 SSHv2 478 Server: Encrypted packet (len=412)
5 674729 2023-12-05 07:42:00,163788823 10.0.0.30 22 10.0.0.6 51419 ICMP 70 Destination unreachable (Network unreachable)
674734 2023-12-05 07:42:00,370934689 10.0.0.6 22 10.0.0.30 51419 TCP 478 [TCP Retransmission] 22 → 51419 [PSH, ACK] Seq=6142090 Ack=131674 Win=64128 Len=412 TSval=1528267812 TSecr=2000785277
674735 2023-12-05 07:42:00,371266450 10.0.0.30 22 10.0.0.6 51419 ICMP 70 Destination unreachable (Network unreachable)
674742 2023-12-05 07:42:00,578927168 10.0.0.6 22 10.0.0.30 51419 TCP 478 [TCP Retransmission] 22 → 51419 [PSH, ACK] Seq=6142090 Ack=131674 Win=64128 Len=412 TSval=1528268020 TSecr=2000785277
- NIC router rejects with ICMP but source and destination port seem mixed up?
Let me update my interpretation here: there is nothing mixed up, it's just the original TCP packet embedded in the ICMP message. Nevertheless, I think Destination unreachable (Network unreachable) is not the correct error reply here. According to RFC 1812:
If a packet is to be forwarded to a host on a network that is directly connected to the router (i.e., the router is the last-hop router) and the router has ascertained that there is no path to the destination host then the router MUST generate a Destination Unreachable, Code 1 (Host Unreachable) ICMP message.
I propose to change the nic_router as follows.
+++ b/repos/os/src/server/nic_router/interface.cc
@@ -1396,7 +1396,7 @@ void Interface::_handle_ip(Ethernet_frame ð,
if(not ip.dst().is_multicast()) {
_send_icmp_dst_unreachable(local_intf, eth, ip,
- Icmp_packet::Code::DST_NET_UNREACHABLE);
+ Icmp_packet::Code::DST_HOST_UNREACHABLE);
}
if (_config().verbose()) {
log("[", local_domain, "] unroutable packet"); }
After looking at the captured packet traffic during four connection drops, I'm certain that the NIC router decides to drop the link for no reason related to the traffic itself.
@chelmuth Thanks a lot for gathering and providing all this detailed information! As discussed offline, I'll continue with this issue as soon as the File Vault has settled on a presentable state again. The ICMP-code modification you suggest for the router sounds sensible to me!
Thanks to the wonderful trace recorder, I was able to create a pcap trace in sculpt and debug the issue in wireshark.
In a setup where I have an open ssh connection and then run nmap -sS -O
I found a disappointingly simple explanation for the events: the nmap scan causes the NIC router to run into resource exhaustion within the session at some point. So, the internal link state of the SSH connection is thrown away in an attempt to free resources for the nmap stress. When SSH eventually becomes active again, the NIC router creates a new link state with a different port.
I can only guess that the reason this is not a frequent problem is that less security-aware applications may just work around a changing source port (ignore it, open a new connection). Anyway, I'm not sure yet what to do about the SSH issue. One solution would be to give a NIC-router client the opportunity to resolve resource exhaustion by updating the session quota before anything is thrown away. A kind of band-aid would be to make garbage collection smarter, in case it actually makes a difference which link state to throw away and which to keep.
Regarding our offline discussion about network timeouts, it seems worthwhile to look into Linux. With Linux as client or server host I played around with sudo netstat -ncow --tcp
and issued several short-lived ssh -t <server host> true
sessions. The server side always dropped the connection immediately while the client entered TIME_WAIT
for 60s. Also noteworthy is the server output for long-lived ssh -t <server host> bash
sessions, which alternates between the following lines (with differing timer values).
tcp 0 0 <server IP>:22 <client IP>:56988 ESTABLISHED keepalive (6218,43/0/0)
tcp 0 164 <server IP>:22 <client IP>:56988 ESTABLISHED on (0,20/0/0)
I did some reading of online resources on the topic. Here are some things I found:
Timeouts
Resource exhaustion
Prevention and recovery
This paper (https://netdevconf.info/2.1/papers/conntrack.pdf) elaborates on the topic for nf_conntrack:
Furthermore, I found an article series about nf_conntrack (https://thermalcircle.de/doku.php?id=blog:linux:connection_tracking_3_state_and_examples) that gives some insight into "early dropping":
We should also keep in mind that the above-referenced examples are talking about very different limits than we do. While they usually accept at least several tens of thousands of connections, I observed limits like 170-270 connections with a session to the NIC router.
Thanks for the exhaustive review.
@m-stein given those findings, do you already have an actionable plan?
If not, for addressing the concrete issue at hand, I'd suggest two steps:
I did some reading of online resources on the topic. Here are some things I found:
Timeouts
Just some links for reference, showing the Linux kernel's default values.
https://elixir.bootlin.com/linux/latest/source/net/netfilter/nf_conntrack_proto_icmp.c#L25
https://elixir.bootlin.com/linux/latest/source/net/netfilter/nf_conntrack_proto_udp.c#L27
https://elixir.bootlin.com/linux/latest/source/net/netfilter/nf_conntrack_proto_tcp.c#L61
A unidirectional UDP timeout of 30s looks quite reasonable to me and may be implemented first following @nfeske's plan.
@nfeske Thanks for your feedback!
On 23.05.24 13:49, Norman Feske wrote:
Thanks for the exhaustive review.
@m-stein given those findings, do you already have an actionable plan?
So far:
I've re-implemented basic garbage collection without exceptions and inline. The latter means that the router doesn't jump out of packet handling, free resources, and try handling the packet again from the beginning, but instead frees resources where the exhaustion happens and continues.
I've implemented that only as much quota as needed is freed.
I'm at implementing proper TCP connection-state tracking, as the current one is very rudimentary and not sufficient for determining something like IPS_ASSURED.
My plan is to use an IPS_ASSURED-like member in link objects which is always false for ICMP, true with a timeout after request-reply-request for UDP, and true for TCP in ESTABLISHED state. Furthermore, the router should try to evict ICMP first, then UDP, and TCP last.
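A minimal sketch of that eviction preference (hypothetical types and names, not the router's actual data model), assuming each link carries an IPS_ASSURED-like flag and a last-activity tick:

```cpp
#include <cassert>
#include <list>

enum class Protocol { ICMP, UDP, TCP };

struct Link {
    Protocol      proto;
    bool          assured;        /* IPS_ASSURED-like flag */
    unsigned long last_activity;  /* monotonic tick of last packet */
};

/*
 * Pick the eviction victim: prefer non-assured links, among those prefer
 * ICMP, then UDP, then TCP (ICMP < UDP < TCP in the enum), and break
 * remaining ties by least-recent activity.
 */
static const Link *pick_victim(const std::list<Link> &links)
{
    const Link *victim = nullptr;
    auto worse = [](const Link &a, const Link *b) {
        if (!b)                       return true;
        if (a.assured != b->assured)  return !a.assured;
        if (a.proto   != b->proto)    return a.proto < b->proto;
        return a.last_activity < b->last_activity;
    };
    for (const Link &l : links)
        if (worse(l, victim))
            victim = &l;
    return victim;
}
```

This is only meant to make the stated priority order concrete; the actual router keeps per-protocol link lists and dictionaries rather than one flat list.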
If not, for addressing the concrete issue at hand, I'd suggest two steps:
Keeping the pool of UDP-related meta data separate from TCP-related meta data. So UDP cannot interfere with the connection state of TCP-based protocols.

These pools are already separate. What kind of interference do you mean?
Evicting connection meta data for non-IPS_ASSURED connections in a least-recently-used fashion. If all connections are marked with IPS_ASSURED, evict the least recently used one.
I'll implement the first but would advise against the latter. As far as I learned, other appliances simply drop new packets in this case in order to prevent the issue that @chelmuth ran into. What do you think of the probing approach instead?
Martin
@chelmuth Thanks for these helpful references! My suggestion would be to use all nf_conntrack timeouts as defaults in the nic_router and actively probe established TCP connections, say every 5 minutes, in order to cut down the 5 days.
My suggestion would be to use all nf_conntrack timeouts as defaults in the nic_router and actively probe established TCP connections, say every 5 minutes, in order to cut down the 5 days.
Probing sounds interesting. How does it work?
From https://netdevconf.info/2.1/papers/conntrack.pdf:
Instead of just closing a connection without warning, it would be possible to actively probe endpoints similar to what is done by the SO_KEEPALIVE mechanism described in the tcp manual page[7] by injecting packets after the connection has been idle for some time.
So, in essence, the router would do the same as any Linux host with TCP keepalive, but much more frequently rather than after an eternity.
From https://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/:
... send your peer a keepalive probe packet with no data in it and the ACK flag turned on. You can do this because of the TCP/IP specifications, as a sort of duplicate ACK, and the remote endpoint will have no arguments, as TCP is a stream-oriented protocol. On the other hand, you will receive a reply from the remote host (which doesn't need to support keepalive at all, just TCP/IP), with no data and the ACK set.
As far as I understand it:
There's also a paragraph in the latter regarding the more general topic of this issue:
The other useful goal of keepalive is to prevent inactivity from disconnecting the channel. It's a very common issue, when you are behind a NAT proxy or a firewall, to be disconnected without a reason. This behavior is caused by the connection tracking procedures implemented in proxies and firewalls, which keep track of all connections that pass through them. Because of the physical limits of these machines, they can only keep a finite number of connections in their memory. The most common and logical policy is to keep newest connections and to discard old and inactive connections first. ... periodically sending packets over the network is a good way to always be in a polar position with a minor risk of deletion.
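Just as a Linux user-space reference for the SO_KEEPALIVE mechanism quoted above (not router code; the helper name make_keepalive_socket is made up), the knobs controlling idle time before probing, probe interval, and probe count look like this:

```cpp
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/*
 * Create a TCP socket with keepalive probing enabled, using the
 * Linux-specific TCP_KEEP* options. Returns the file descriptor
 * or -1 on error.
 */
int make_keepalive_socket(int idle_sec, int intvl_sec, int probes)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    int on = 1;
    /* start probing after idle_sec of inactivity, send a probe every
     * intvl_sec, give up after 'probes' unanswered probes */
    if (setsockopt(fd, SOL_SOCKET,  SO_KEEPALIVE,  &on,        sizeof on)        ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPIDLE,  &idle_sec,  sizeof idle_sec)  ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_sec, sizeof intvl_sec) ||
        setsockopt(fd, IPPROTO_TCP, TCP_KEEPCNT,   &probes,    sizeof probes))
        return -1;

    return fd;
}
```

A router in the middle obviously cannot use the socket API for this; it would have to craft the zero-length ACK segments itself, but the parameters it would need to choose are the same three.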
@chelmuth Thanks for cross-referencing.
Regarding probing:
A common method that is applicable only to TCP is to preferentially abandon sessions for crashed endpoints, followed by closed TCP connections and partially open connections. A NAT can check if an endpoint for a session has crashed by sending a TCP keep-alive packet and receiving a TCP RST packet in response. If the NAT cannot determine whether the endpoint is active, it should not abandon the session until the TCP connection has been idle for some time. Note that an established TCP connection can stay idle (but live) indefinitely; hence, there is no fixed value for an idle-timeout that accommodates all applications. However, a large idle-timeout motivated by recommendations in [RFC1122] can reduce the chances of abandoning a live session.
In fact, TCP permits you to handle a stream, not packets, and so a zero-length data packet is not dangerous for the user program.
TCP Keep-Alive Set when the segment size is zero or one, the current sequence number is one byte less than the next expected sequence number, and none of SYN, FIN, or RST are set.
TCP Keep-Alive ACK Set when all of the following are true: The segment size is zero. The window size is non-zero and hasn’t changed. The current sequence number is the same as the next expected sequence number. The current acknowledgment number is the same as the last-seen acknowledgment number. The most recently seen packet in the reverse direction was a keepalive. The packet is not a SYN, FIN, or RST.
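The Wireshark definition of a keep-alive segment quoted above reduces to a small predicate; here is a sketch with hypothetical types (neither Wireshark's nor the router's actual code):

```cpp
#include <cstdint>

struct Tcp_seg {
    uint32_t seq;      /* sequence number of the segment */
    uint32_t seg_len;  /* payload length */
    bool syn, fin, rst;
};

/*
 * Heuristic from the Wireshark docs: a keep-alive carries zero or one
 * byte of data, a sequence number one below the next expected one, and
 * none of SYN, FIN, or RST. Unsigned arithmetic handles seq wrap-around.
 */
bool is_tcp_keepalive(const Tcp_seg &s, uint32_t next_expected_seq)
{
    return s.seg_len <= 1
        && s.seq == next_expected_seq - 1
        && !s.syn && !s.fin && !s.rst;
}
```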
@chelmuth Regarding our offline discussion:
Currently, a NIC session request at the router comes in with 3.5M quota, of which only 205K remain after create_session.
The sizes of dynamically allocated session objects:
So, the theoretical max number of connections (without counting DHCP/ARP objects or meta data) is 485.
Furthermore, I have to correct myself regarding the current idle timeouts:
Suggestions:
- I would drop TCP links right after the close handshake.
- I would lower non-established TCP to 30 seconds as 60 seconds seem too much to me.
- I would slightly increase the default quota of NIC sessions because 205K doesn't seem like much to me.
- Although RFCs and OSs suggest 7440 seconds, I'd stay with the 10 minutes for established TCP as we had no problems with this so far.
I agree to all four points but request your special attention regarding the impact of increased default resource requirements in point 3. Automatic tests may need to be adapted and Sculpt integration tested.
@chelmuth Thanks for your feedback. I'll keep an eye on the tests.
This commit series should solve this issue and #4534:
2b16fd9337 nic_router: destroy timed out ARP waiters
6dd39946ef nic_router: drop closed tcp links immediately
0098391380 nic_router: lower non-open tcp timeout to 30 sec
c7e678631f nic_router: mark tcp open only with full handshake
0f57a4eb6d nic_router: remove reference utilities
05bd3c2e06 nic_router: smarter emergency free on exhaustion
9fd9cae21f nic_router: fix leak on domain deinit
0434abc2ab nic_router: remove Invalid exceptions
d2cc2ec648 nic_router: remove pointer utilities
e3484f034e nic_router: no Ip_config_static exception
09f81021bb nic_router: no Never_reached exception
228c9f7604 nic_router: no Mac_allocator::Alloc_failed
389d068783 nic_router: remove Bad_send_dhcp_args exception
fd37e59412 nic_router: no Bad_transport_protocol exception
bfb11eda3e nic_router: remove bit-array/alloc exceptions
90384c2e8b nic_router: remove Retry_without_domain exception
6e32e4f6e0 nic_router: remove Report::Empty exception
6eaa17a8b4 nic_router: don't throw Nonexistent_attribute
5189310c09 nic_router: don't throw Nonexistent_sub_node
1cf557e583 nic_router: don't throw Option_not_found (DHCP)
62fa100361 nic_router: don't throw Deref_unconstructed_object
9f6f7fc96a nic_router: don't throw Pointer::Invalid
1129b700d9 nic_router: remove Dhcp_allocation_tree exceptions
0013867ac9 nic_router: remove Keep_ip_config exception
991e74f007 nic_router: remove Packet_postponed exception
dfcb14cc6d nicrouter: remove unused Dismiss* exceptions
07bd56568a nic_router: remove Alloc_dhcp_msg_buffer_failed
27f70f9d7f nic_router: remove Port_allocator exceptions
d2aab1d0c0 nic_router: remove Alloc_ip_failed exception
38d91326f9 nic_router: remove No_next_hop exception
07ed56fe87 nic_router: remove Bad_network_protocol exception
b1ae7412de nic_router: remove Drop_packet exception
891f012cb4 nic_router: remove Resource_exhaustion exception
4b75398902 nic_router: keep links on resource exhaustion
3c1af9304c net/port.h: default constructor
c9cf3a9d8a os: raise nic connection ram quota
fb87898d35 xml_node: support attribute access via lambda
I kept the git history for the C++-exception-related commits quite detailed because these commits apply subtle changes to the execution flow in the router. I imagine tracking down hidden long-term bugs that might come from these changes is a lot easier with the smaller commits.
@m-stein Great! Could you please publish a Sculpt image compatible with 24.04.1 (despite the slight base API changes)?
After thorough reconsideration, I'm going to defer commit c9cf3a9d8ac51608e91c76b2a9e576c71b36694b until the fixes are merged, intensively tested, and we are then still facing issues with resource shortage. Even then, I'm now convinced that the default resource quotas should address clients like the archive fetch for depot in Sculpt, but more demanding clients like vbox may express their needs explicitly.
For now, the merge is stalled by @nfeske's comment, which has a point IMO. @m-stein could you check if attribute_value() fits your use cases? It should, as each configuration needs a sane default value, or not?
@m-stein Could you please publish a Sculpt image compatible with 24.04.1 (despite the slight base API changes)?
Your published image references everything from depot user mstein, which is impractical to test on my working machine. Could you please update the boot image just with the fixed nic_router, following the guide at https://genodians.org/nfeske/2023-11-10-modding-sculpt#A_system_image_for_the_PC? Note that --depot-auto-update
must not be enabled in build.conf to keep versions intact.
@chelmuth I'll try, and I'd like to add that the image I published is not ready for productive use yet. I'll keep you informed.
@chelmuth @nfeske I've tried to meet all of your requests, re-pushed a merge_to_staging and published a tested Sculpt image.
I'm using the published image just now. What I learned so far:
<ram-quota avail="139330"/>
ab -c 50 -n 10000 https://fast.com/
may lead to states where networking appears stuck, but after a couple of minutes everything seems fine again. Example runtime/nic_router/state:
<state>
<ram quota="20928146" used="10940416" shared="4096"/>
<cap quota="289" used="54" shared="1"/>
<domain name="default" rx_bytes="41396" tx_bytes="38135" ipv4="10.0.1.1/24" gw="0.0.0.0">
<tcp-links>
<destroyed value="1"/>
</tcp-links>
<udp-links>
<destroyed value="1"/>
</udp-links>
<dhcp-allocations>
<destroyed value="1"/>
</dhcp-allocations>
<interface label="update -> tcpip -> " link_state="true">
<ram-quota used="3321856" limit="3526722" avail="204866"/>
<cap-quota used="4" limit="7" avail="3"/>
<tcp-links>
<dissolved_timeout_closed value="1"/>
</tcp-links>
<udp-links>
<dissolved_timeout_open value="1"/>
</udp-links>
<dhcp-allocations>
<alive value="1"/>
</dhcp-allocations>
</interface>
<interface label="sculpt_vm_vbox6 -> vbox -> 0" link_state="true">
<ram-quota used="3387392" limit="3526722" avail="139330"/>
<cap-quota used="5" limit="7" avail="2"/>
<tcp-links>
<refused_for_ram value="7048"/>
<refused_for_ports value="2021"/>
<opening value="207"/>
<dissolved_timeout_closing value="13"/>
<dissolved_timeout_closed value="4014"/>
<dissolved_no_timeout value="3105"/>
<destroyed value="7132"/>
</tcp-links>
<udp-links>
<refused_for_ram value="150"/>
<dissolved_timeout_opening value="5"/>
<dissolved_timeout_open value="21"/>
<dissolved_no_timeout value="8"/>
<destroyed value="34"/>
</udp-links>
<icmp-links>
<refused_for_ram value="18"/>
</icmp-links>
<arp-waiters>
<destroyed value="24"/>
</arp-waiters>
<dhcp-allocations>
<alive value="1"/>
</dhcp-allocations>
</interface>
</domain>
<domain name="http" rx_bytes="0" tx_bytes="0" ipv4="10.0.80.1/24" gw="0.0.0.0"/>
<domain name="telnet" rx_bytes="0" tx_bytes="0" ipv4="10.0.23.1/24" gw="0.0.0.0"/>
<domain name="uplink" rx_bytes="13868" tx_bytes="25631" ipv4="10.0.0.30/24" gw="10.0.0.1">
<dns ip="10.0.0.2"/>
<dns-domain name="genode.labs"/>
<interface label="nic -> eth0" link_state="true">
<ram-quota used="3387392" limit="3527557" avail="140165"/>
<cap-quota used="5" limit="7" avail="2"/>
<arp-waiters>
<destroyed value="159"/>
</arp-waiters>
</interface>
</domain>
</state>
What bothers me here is:
- <open> connections never show up (which may be because those are not reported).
- <icmp-links> <refused_for_ram value="18"/> </icmp-links> despite there still being RAM available.
Now I'm already at
<interface label="sculpt_vm_vbox6 -> vbox -> 0" link_state="true">
<ram-quota used="3387392" limit="3526722" avail="139330"/>
<cap-quota used="5" limit="7" avail="2"/>
<tcp-links>
<refused_for_ram value="11768"/>
<refused_for_ports value="2527"/>
<opening value="3"/>
<dissolved_timeout_opening value="581"/>
<dissolved_timeout_closing value="44"/>
<dissolved_timeout_closed value="88864"/>
<dissolved_no_timeout value="5279"/>
<destroyed value="94768"/>
</tcp-links>
...
which means 11768 connections were refused due to RAM shortage while 139330 bytes were still available.
@m-stein I've tried to meet all of your requests, re-pushed a merge_to_staging and published a tested Sculpt image.
Did you forget to actually push your branch?
Using the updated Sculpt 24.04.1, @chelmuth found a new issue: Under stress, the router eventually refuses new TCP/UDP/ICMP connections as expected but, at this point, the relevant NIC session still has around 140K of session RAM quota left.
This comes from the fact that the heap uses exponentially increasing chunk sizes for its back-end allocations. So, after some allocations the heap aims for significantly large dataspaces. In addition, the router's Session_env (session-local RAM allocator and region manager) always tries to reserve the worst-case cost of an operation before doing the operation. That said, on the first failing attempt of the heap to expand itself, the session is left with quite an amount of quota that is rendered useless, as the heap has no means of accessing it once the chunk size has grown that much.
One approach would be to modify the default session quota to a value that minimizes the wastage given this specific use case. However, this would not account for other NIC-session use cases.
Another approach is to replace the session-local heap in the router with a combination of a sliced heap (back end) and TSLABs (for the 5 types that sessions allocate dynamically). I just tried this approach but it results in higher CAP-quota requirements. Looking only at session creation, the heap version requires 5 caps while the sliced-heap version requires 11 caps (default quota is 8). The additional caps come from one dataspace for the packet-allocator bits, one dataspace for some other packet-stream-rx-related meta data (not buffers), and the initial blocks for the 5 TSLABs.
Of course, we could raise the default CAP quota in order to solve that.
One other approach would be to stay with the heap and make it shrink its chunk size when it fails to allocate a dataspace. However, we settled on closing this issue without trying this approach.
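To illustrate the stranded-quota effect with a toy model (not the actual Genode heap implementation; the chunk sizes and quota value are made up), consider a heap whose back-end chunk size doubles on every growth step:

```cpp
#include <cstddef>

/*
 * Toy model of a heap with exponentially growing back-end chunks.
 * Once a growth step fails, the remaining quota is stranded: it is
 * smaller than the next chunk the heap would request, so the heap
 * can never make use of it.
 */
struct Toy_heap {
    size_t quota;             /* remaining session quota in bytes */
    size_t chunk = 16*1024;   /* size of the next back-end dataspace */

    bool grow()
    {
        if (chunk > quota)
            return false;     /* growth fails, quota stays unusable */
        quota -= chunk;
        chunk *= 2;           /* next request will be twice as large */
        return true;
    }
};
```

With an initial quota in the few-megabytes range, the chunks 16K, 32K, ..., 1M consume about 2M before the next 2M request fails, stranding well over 1M of quota in this model; the real router showed the same pattern with roughly 140K left over.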
In my last posting, I meant 140K of RAM quota, not 14K.
nic_router/state snapshot of the day
<domain name="default" rx_bytes="450459707" tx_bytes="31371499" ipv4="10.0.1.1/24" gw="0.0.0.0">
<interface label="sculpt_vm_vbox6 -> vbox -> 0" link_state="true">
<ram-quota used="3387392" limit="3526722" avail="139330"/>
<cap-quota used="5" limit="7" avail="2"/>
<tcp-links>
<opening value="13"/>
<dissolved_timeout_opening value="30"/>
<dissolved_timeout_closing value="118"/>
<dissolved_timeout_closed value="2005"/>
<destroyed value="2153"/>
</tcp-links>
<udp-links>
<open value="2"/>
<dissolved_timeout_opening value="31"/>
<dissolved_timeout_open value="2651"/>
<dissolved_no_timeout value="399"/>
<destroyed value="3081"/>
</udp-links>
<icmp-links>
<dissolved_timeout_open value="1"/>
<destroyed value="1"/>
</icmp-links>
<arp-waiters>
<alive value="18446744073709551606"/> <!-- this is hex 0xfffffffffffffff6, which makes me curious -->
<destroyed value="34"/>
</arp-waiters>
<dhcp-allocations>
<alive value="1"/>
</dhcp-allocations>
</interface>
</domain>
@chelmuth I've tested and published a new Sculpt (2024-06-11) and pushed a corresponding merge_to_staging.
nic_router/state snapshot of the day
Oops. Yeah, I think I found the cause. Will provide a fix soon.
Oops. Yeah, I think I found the cause. Will provide a fix soon.
Are you planning to update your image with the fix? Then I'll wait with the upgrade.
Yes I'll update it as well.
@chelmuth With my latest Sculpt (2024-06-12) I cannot reproduce bogus ARP stats anymore. I also updated my merge_to_staging accordingly.
I updated my sculpt system and merged the commits to staging. Experiences with the previous version were already quite good - stable SSH for 3 days - despite the small arp-waiter report hiccup.
@chelmuth 8a1bfaa944 should fix the fetchurl_lxip regression.
@chelmuth This 93fa8aba03 should fix the regression with run/nic_router_ipv4_fram.
@chelmuth Debugging the failing nic_router_flood test, I found that it is actually a regression caused by this issue. I had to change the original series in order to add two fixups that should fix the regression:
4e5aaf5301 nic_router: fix interface-local quota reporting
4c0e584333 nic_router: destroy timed out ARP waiters
c82aeb5ea7 nic_router: drop closed tcp links immediately (updated)
5fd26cf912 nic_router: lower non-open tcp timeout to 30 sec
013dc53d10 fixup "nic_router: mark tcp open only with full handshake"
9b0ff9652b nic_router: mark tcp open only with full handshake
bce341291f nic_router: remove reference utilities (updated)
989deccb66 nic_router: fix leak on domain deinit
5ce2646e68 fixup "nic_router: smarter emergency free on exhaustion"
...
I kind of merged the series via an inverse rebase.
@chelmuth Thanks!
Fixed in master.
@chelmuth reported that SSH connections from his Sculpt VM towards a server on a remote machine sporadically end up broken (without router re-configuration involved).