m-stein commented 1 year ago

@chelmuth reported that SSH connections from his Sculpt VM towards a server on a remote machine sporadically end up broken (without router re-configuration involved).

chelmuth commented 1 year ago

I'm going to integrate the fixes in #4728 plus your debug commit into my Sculpt system and will report any insights.

chelmuth commented 1 year ago

Update: I had one aborted connection yesterday, but the log did not reveal any new information because I had not enabled the verbose packet log yet. However, after I integrated your additional debug commit, networking broke completely with the following error.

[runtime -> nic_router] Error: Uncaught exception of type 'Net::Interface::Bad_transport_protocol'
[runtime -> nic_router] Warning: abort called - thread: ep

Thus, I investigated and finally 7d1779ef39612fcac1ebe3ea41f6f9c4d9325061 fixed the uncaught exception.

[runtime -> nic_router] Warning: unknown transport layer protocol
[runtime -> nic_router] Warning: unknown transport layer protocol

I'm curious how this may happen and how you ensure the exception is caught at all other places.

chelmuth commented 1 year ago

Small update: I use Sculpt 23.04 and had an SSH interruption this morning with the following extended error message.

packet_write_wait: Connection to <IP address> port 22: Broken pipe
packet_write_wait: Connection to UNKNOWN port 65535: Broken pipe

The second line bothers me. It may stem from the nature of the connection, which uses the SSH proxy mechanism, but it could also hint at the issue we are looking for.

chelmuth commented 11 months ago

I set up a Linux-based bridge to monitor all network traffic of my Sculpt machine and observed two SSH connection interruptions. In both cases, the TCP source port of the established connection on the Sculpt side suddenly changes, which the server side rejects with an RST. Further packets from the SSH server to the original TCP port are rejected by the NIC router with ICMP Destination unreachable (Network unreachable). @m-stein I can provide you with the PCAP files.

This morning I provoked stress on the NIC router with sudo nmap -sS -O <LAN router IP> and triggered SSH interruption with comparable symptoms after some seconds.

    No.     Time                           Source     SPort  Destination  DPort  Protocol  Length  Info
    672353  2023-12-05 07:41:00,134210878  10.0.0.30  51419  10.0.0.6     22     TCP       66      51419 → 22 [ACK] Seq=131622 Ack=6142062 Win=161664 Len=0 TSval=2000755302 TSecr=1528207574
    672461  2023-12-05 07:41:30,178468389  10.0.0.30  51419  10.0.0.6     22     SSHv2     118     Client: Encrypted packet (len=52)
    672462  2023-12-05 07:41:30,178846776  10.0.0.6   22     10.0.0.30    51419  SSHv2     94      Server: Encrypted packet (len=28)
1   672463  2023-12-05 07:41:30,179274264  10.0.0.30  51419  10.0.0.6     22     TCP       66      51419 → 22 [ACK] Seq=131674 Ack=6142090 Win=162048 Len=0 TSval=2000785277 TSecr=1528237619
2   674726  2023-12-05 07:42:00,157076594  10.0.0.30  54273  10.0.0.6     22     SSH       118     Client: Encrypted packet (len=52)
3   674727  2023-12-05 07:42:00,157343544  10.0.0.6   22     10.0.0.30    54273  TCP       60      22 → 54273 [RST] Seq=1 Win=0 Len=0
4   674728  2023-12-05 07:42:00,163496730  10.0.0.6   22     10.0.0.30    51419  SSHv2     478     Server: Encrypted packet (len=412)
5   674729  2023-12-05 07:42:00,163788823  10.0.0.30  22     10.0.0.6     51419  ICMP      70      Destination unreachable (Network unreachable)
    674734  2023-12-05 07:42:00,370934689  10.0.0.6   22     10.0.0.30    51419  TCP       478     [TCP Retransmission] 22 → 51419 [PSH, ACK] Seq=6142090 Ack=131674 Win=64128 Len=412 TSval=1528267812 TSecr=2000785277
    674735  2023-12-05 07:42:00,371266450  10.0.0.30  22     10.0.0.6     51419  ICMP      70      Destination unreachable (Network unreachable)
    674742  2023-12-05 07:42:00,578927168  10.0.0.6   22     10.0.0.30    51419  TCP       478     [TCP Retransmission] 22 → 51419 [PSH, ACK] Seq=6142090 Ack=131674 Win=64128 Len=412 TSval=1528268020 TSecr=2000785277

1. Last successful ACK of the SSH connection (source port 51419)
2. Sudden source-port change to 54273 (which is just 1 more than the source port of the last nmap stress packet)
3. SSH server rejects 54273 by RST
4. SSH server sends to original 51419
5. NIC router rejects with ICMP but source and destination port seem mixed up?

chelmuth commented 10 months ago

> NIC router rejects with ICMP but source and destination port seem mixed up?

Let me update my interpretation here: There is nothing mixed up, it's just the original TCP packet embedded in the ICMP message. Nevertheless, I think Destination unreachable (Network unreachable) is not the correct error reply here. According to RFC 1812:

> If a packet is to be forwarded to a host on a network that is directly connected to the router (i.e., the router is the last-hop router) and the router has ascertained that there is no path to the destination host then the router MUST generate a Destination Unreachable, Code 1 (Host Unreachable) ICMP message.

I propose to change the nic_router as follows.

+++ b/repos/os/src/server/nic_router/interface.cc
@@ -1396,7 +1396,7 @@ void Interface::_handle_ip(Ethernet_frame          &eth,
    if(not ip.dst().is_multicast()) {

        _send_icmp_dst_unreachable(local_intf, eth, ip,
-                                  Icmp_packet::Code::DST_NET_UNREACHABLE);
+                                  Icmp_packet::Code::DST_HOST_UNREACHABLE);
    }
    if (_config().verbose()) {
        log("[", local_domain, "] unroutable packet"); }

After looking at the captured packet traffic during four connection drops, I'm certain that the NIC router decides to drop the link for no reason related to the traffic itself.

m-stein commented 10 months ago

@chelmuth Thanks a lot for gathering and providing all this detailed information! As discussed offline, I'll continue with this issue as soon as the File Vault has settled on a presentable state again. The ICMP-code modification you suggest for the router sounds sensible to me!

m-stein commented 6 months ago

Thanks to the wonderful trace recorder, I was able to create a pcap trace in Sculpt and debug the issue in Wireshark.

In a setup where I have an open SSH connection and then run nmap -sS -O, the connection breaks quickly. It does so because, after a pause in the SSH exchange, my vbox guest sends a valid SSH packet again (maybe a keep-alive) that the NIC router NATs to a source port other than the one used before in the connection. The server doesn't like this and replies with a TCP packet that has the reset flag set, addressed to the new port. The reply is NATed back to the correct guest port, so my SSH client has no chance of understanding why the connection was cancelled.

I found a disappointingly simple explanation for the events: The nmap causes the nic router to run into resource exhaustion with the session at some point. So, the internal link state of the ssh connection is thrown away in an attempt to free resources for the nmap stress. When ssh eventually becomes active again, the nic router creates a new link state with a different port.
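
The effect described above can be reproduced with a toy model. The following is a minimal sketch, not the nic_router code: `Toy_nat`, its members, and the starting port are hypothetical names and values chosen for illustration. It only shows why forgetting a link state and re-creating it later yields a different NAT source port.

```cpp
#include <cstdint>
#include <map>
#include <utility>

// Hypothetical, heavily simplified NAT table (illustration only)
struct Toy_nat
{
	using Client = std::pair<std::uint32_t, std::uint16_t>; // (guest IP, guest port)

	std::map<Client, std::uint16_t> links { };  // guest endpoint -> NAT source port
	std::uint16_t next_port { 51419 };          // arbitrary starting port

	// Return the NAT source port of a client, creating a link state on demand
	std::uint16_t translate(Client const &client)
	{
		auto it = links.find(client);
		if (it != links.end())
			return it->second;

		std::uint16_t const port = next_port++;
		links.emplace(client, port);
		return port;
	}

	// Emergency free under resource exhaustion: forget the link state
	void evict(Client const &client) { links.erase(client); }
};
```

As long as the link state exists, `translate` keeps returning the same port. Once `evict` has removed it, the next packet of the same guest connection is mapped to whatever port is handed out next, which the remote server sees as an unknown connection.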

I can only guess that this is not a frequent problem because less security-aware applications may simply work around a changing source port (by ignoring it or by opening a new connection). Anyway, I'm not sure yet what to do about the SSH issue. One solution would be to give a NIC router client the opportunity to resolve resource exhaustion by upgrading the session quota before anything is thrown away. A band-aid would be to make garbage collection smarter, in case it actually makes a difference which link state is thrown away.

chelmuth commented 5 months ago

Regarding our offline discussion about network timeouts, it seems worthwhile to look into Linux. With Linux as client or server host, I played around with sudo netstat -ncow --tcp and issued several short-lived ssh -t <server host> true sessions. The server side always dropped the connection immediately while the client entered TIME_WAIT for 60s. Also noteworthy is the server output for long-lived ssh -t <server host> bash sessions, which alternates between the following lines (with differing timeout values).

tcp        0      0 <server IP>:22             <client IP>:56988         ESTABLISHED keepalive (6218,43/0/0)
tcp        0    164 <server IP>:22             <client IP>:56988         ESTABLISHED on (0,20/0/0)

m-stein commented 5 months ago

I took some reading into online resources regarding the topic. Here are some things I found:

Timeouts

RFC 1122 "Requirements for internet hosts" paragraph 4.2.3.6 requests that the TCP keep-alive timeout must default to no less than 2 hours (https://datatracker.ietf.org/doc/html/rfc1122#page-101).
RFC 5382 "NAT behavioral requirements for TCP" section 5 consequently requests that a NAT router should not abandon an idle session in "established" state for at least 2 hours and 4 minutes. Note that the timeouts for other states are significantly smaller, like the "time_wait" state with 60 seconds (https://datatracker.ietf.org/doc/html/rfc5382#section-5).
It seems that most OSes (I looked at BSD, Windows, macOS and Linux) indeed use a default TCP keep-alive timeout of 2 hours, which I could also observe with netstat. However, with SSHv2, for instance, I see a 3-way (SSH-SSH-TCP_ACK) keep-alive every 60 seconds.
MikroTik RouterOS keeps states for idle "established" TCP connections for 1 day (https://help.mikrotik.com/docs/display/ROS/Connection+tracking).
Cisco also uses a 1 day idle timeout for such connections (https://community.cisco.com/t5/networking-knowledge-base/how-to-configure-a-nat-translation-timeout/ta-p/3109488).

Resource exhaustion

nf_conntrack and VyOS simply drop new packets when hitting a configurable max number of connections and provide the user with descriptive warnings (https://www.pc-freak.net/blog/resolving-nf_conntrack-table-full-dropping-packet-flood-message-in-dmesg-linux-kernel-log/ and https://forum.vyos.io/t/bgpd-10gbps-nf-conntrack-table-full-dropping-packet/5060). However, there are also approaches for so-called early-drops in order to prevent this from happening.
With Cisco it looks similar (https://www.cisco.com/c/en/us/support/docs/ip/network-address-translation-nat/8605-13.html#toc-hId-1514464423 and https://community.cisco.com/t5/routing/nat-translation-table-filling-up/td-p/3952652).

Prevention and recovery

This paper (https://netdevconf.info/2.1/papers/conntrack.pdf) elaborates on the topic for nf_conntrack:

Those connections might be dropped early under stress:
- Connections that have not seen traffic in both directions yet
- TCP that has not seen the whole 3-way handshake yet
- UDP that has not seen at least one Request-Reply-Request sequence yet
- TCP in WAIT states
Use recommended timeouts by default and only under stress check connections against smaller timeouts
Allow for more efficient flow-specific timeouts, e.g., per port or rule
Let the router probe connections in order to detect abandoned ones (timeout- or stress-triggered)

Furthermore I found an article series about nf_conntrack (https://thermalcircle.de/doku.php?id=blog:linux:connection_tracking_3_state_and_examples) that gives some insight about "early dropping":

Connections are marked with IPS_ASSURED when they should not be early-dropped.
ICMP connections never have this flag
For UDP and TCP the flag is set according to the suggestions in the above paper

We should also keep in mind that the examples referenced above are talking about very different limits than we do. While they usually accept at least tens of thousands of connections, I observed limits of about 170-270 connections with a session to the NIC router.

nfeske commented 5 months ago

Thanks for the exhaustive review.

@m-stein given those findings, do you already have an actionable plan?

If not, for addressing the concrete issue at hand, I'd suggest two steps:

Keeping the pool of UDP-related meta data separate from TCP-related meta data. So UDP cannot interfere with the connection state of TCP-based protocols.
Evicting connection meta data for non-IPS_ASSURED connections in a least-recently used fashion. If all connections are marked with IPS_ASSURED, evict the least recently used one.
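
The second step can be sketched as a victim-selection function. This is only an illustration under assumptions: the `Link` record, its fields, and `pick_victim` are hypothetical names and do not reflect the router's data structures.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical link record (illustration only)
struct Link
{
	std::uint64_t last_used;  // time of the most recent packet
	bool          assured;    // IPS_ASSURED-like flag
};

// Pick the index of the link to evict: the least recently used link that is
// not assured or, if all links are assured, the least recently used one.
std::optional<std::size_t> pick_victim(std::vector<Link> const &links)
{
	std::optional<std::size_t> best_unassured, best_any;

	for (std::size_t i = 0; i < links.size(); i++) {

		if (!best_any || links[i].last_used < links[*best_any].last_used)
			best_any = i;

		if (!links[i].assured &&
		    (!best_unassured || links[i].last_used < links[*best_unassured].last_used))
			best_unassured = i;
	}
	return best_unassured ? best_unassured : best_any;
}
```

The linear scan keeps the sketch short. A real implementation would more likely maintain the links in usage order so that the victim can be found without visiting every link.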

chelmuth commented 5 months ago

> I took some reading into online resources regarding the topic. Here are some things I found:
>
> Timeouts

Just some links for reference depicting Linux kernel default values.

https://elixir.bootlin.com/linux/latest/source/net/netfilter/nf_conntrack_proto_icmp.c#L25
https://elixir.bootlin.com/linux/latest/source/net/netfilter/nf_conntrack_proto_udp.c#L27
https://elixir.bootlin.com/linux/latest/source/net/netfilter/nf_conntrack_proto_tcp.c#L61

A unidirectional UDP timeout of 30s looks quite reasonable to me and may be implemented first following @nfeske's plan.

m-stein commented 5 months ago

@nfeske Thanks for your feedback!

On 23.05.24 13:49, Norman Feske wrote:

> Thanks for the exhaustive review.
>
> @m-stein given those findings, do you already have an actionable plan?

So far:

I've re-implemented basic garbage collection without exceptions and inline. The latter means that the router doesn't jump out of packet handling, free resources, and retry handling the packet from the beginning, but instead frees resources where exhaustion happens and continues.
I've implemented that only as much quota as needed is freed.
I'm currently implementing proper TCP connection-state tracking, as the existing tracking is very rudimentary and not sufficient for determining something like IPS_ASSURED.
My plan is to use an IPS_ASSURED-like member in link objects, which is always false for ICMP, true with timeout after request-reply-request for UDP, and true for TCP in ESTABLISHED state. Furthermore, the router should try to evict ICMP first, then UDP, and TCP last.
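
The last point of the plan can be sketched as follows. All type and function names are hypothetical and the actual link classes of the router look different. The timeout that would clear the UDP flag again is left out for brevity.

```cpp
enum class Protocol { ICMP, UDP, TCP };

// Hypothetical per-link tracking state (illustration only)
struct Link_state
{
	Protocol protocol;
	bool     tcp_established;    // TCP: complete handshake seen
	unsigned udp_packets_in_seq; // UDP: length of the request-reply-request run
};

// IPS_ASSURED-like predicate following the plan above
bool assured(Link_state const &link)
{
	switch (link.protocol) {
	case Protocol::ICMP: return false;
	case Protocol::UDP:  return link.udp_packets_in_seq >= 3;
	case Protocol::TCP:  return link.tcp_established;
	}
	return false; // not reached for valid enum values
}

// Lower rank means "evict earlier": ICMP first, then UDP, TCP last
int eviction_rank(Protocol protocol)
{
	switch (protocol) {
	case Protocol::ICMP: return 0;
	case Protocol::UDP:  return 1;
	case Protocol::TCP:  return 2;
	}
	return 2; // not reached for valid enum values
}
```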

> If not, for addressing the concrete issue at hand, I'd suggest two steps:
>
> Keeping the pool of UDP-related meta data separate from TCP-related meta data. So UDP cannot interfere with the connection state of TCP-based protocols.

These pools are separate. What kind of interference do you mean?

> Evicting connection meta data for non-IPS_ASSURED connections in a least-recently used fashion. If all connections are marked with IPS_ASSURED, evict the least recently used one.

I'll implement the first but would advise against the latter. As far as I learned, other appliances simply drop new packets in this case in order to prevent the issue that @chelmuth ran into. What do you think of the probing approach instead?

Martin

m-stein commented 5 months ago

@chelmuth Thanks for these helpful references! My suggestion would be to use all nf_conntrack timeouts as default in the nic_router and actively probe established TCP, say every 5 minutes, in order to cut down the 5 days.

chelmuth commented 5 months ago

> My suggestion would be to use all nf_conntrack timeouts as default in the nic_router and actively probe established TCP, say every 5 minutes, in order to cut down the 5 days.

Probing sounds interesting. How does it work?

m-stein commented 5 months ago

From https://netdevconf.info/2.1/papers/conntrack.pdf:

> Instead of just closing a connection without warning, it would be possible to actively probe endpoints similar to what is done by the SO_KEEPALIVE mechanism described in the tcp manual page[7] by injecting packets after the connection has been idle for some time.

So, in essence, the router would do the same as any Linux with TCP keepalive, not after an eternity but much more frequently.

From https://tldp.org/HOWTO/html_single/TCP-Keepalive-HOWTO/:

> ... send your peer a keepalive probe packet with no data in it and the ACK flag turned on. You can do this because of the TCP/IP specifications, as a sort of duplicate ACK, and the remote endpoint will have no arguments, as TCP is a stream-oriented protocol. On the other hand, you will receive a reply from the remote host (which doesn't need to support keepalive at all, just TCP/IP), with no data and the ACK set.

As far as I understand it:

The router would send keepalives to both peers.
In case that a peer is not reachable anymore, a short reply-timeout at the router would eventually remove the connection state and the router could optionally also close the connection at the other peer by sending an RST.
In case that a peer is reachable but somehow lost awareness of the connection (e.g. hard reset) this peer responds to the keepalive with an RST packet, causing the router and the other peer to close the connection.
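
The three cases above can be condensed into a small decision function. This is a sketch of the idea only: `Probe_result`, `Action`, and `on_probe_result` are hypothetical names, and the sending and timing of the probes are not modelled.

```cpp
enum class Probe_result { ACK, RST, NO_REPLY };

enum class Action {
	KEEP_LINK,            // peer is alive, leave the link state untouched
	DROP_AND_FORWARD_RST, // peer forgot the connection, pass its RST on
	DROP_LINK,            // peer unreachable, remove the link state
	DROP_AND_SEND_RST     // peer unreachable, additionally reset the other peer
};

// Decide what to do with a link state after probing one peer.
// 'reset_other_peer' models the optional RST mentioned above.
Action on_probe_result(Probe_result result, bool reset_other_peer)
{
	switch (result) {
	case Probe_result::ACK:      return Action::KEEP_LINK;
	case Probe_result::RST:      return Action::DROP_AND_FORWARD_RST;
	case Probe_result::NO_REPLY: return reset_other_peer ? Action::DROP_AND_SEND_RST
	                                                     : Action::DROP_LINK;
	}
	return Action::KEEP_LINK; // not reached for valid enum values
}
```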

There's also a paragraph in the latter regarding the more general topic of this issue:

> The other useful goal of keepalive is to prevent inactivity from disconnecting the channel. It's a very common issue, when you are behind a NAT proxy or a firewall, to be disconnected without a reason. This behavior is caused by the connection tracking procedures implemented in proxies and firewalls, which keep track of all connections that pass through them. Because of the physical limits of these machines, they can only keep a finite number of connections in their memory. The most common and logical policy is to keep newest connections and to discard old and inactive connections first. ... periodically sending packets over the network is a good way to always be in a polar position with a minor risk of deletion.

chelmuth commented 5 months ago

#4534 seems related to this discussion (ARP waiter removal).

m-stein commented 5 months ago

@chelmuth Thanks for cross-referencing.

m-stein commented 5 months ago

Regarding probing:

@chelmuth added that interfering with the connection traffic might cause problems if not done right and that we should first find out how others do that.
It's not easy to find resources on simple TCP probing implementations in routers (only for checking liveness), and other probing seems quite elaborate.
However, RFC 5382 also encourages the approach:

> A common method that is applicable only to TCP is to preferentially abandon sessions for crashed endpoints, followed by closed TCP connections and partially open connections. A NAT can check if an endpoint for a session has crashed by sending a TCP keep-alive packet and receiving a TCP RST packet in response. If the NAT cannot determine whether the endpoint is active, it should not abandon the session until the TCP connection has been idle for some time. Note that an established TCP connection can stay idle (but live) indefinitely; hence, there is no fixed value for an idle-timeout that accommodates all applications. However, a large idle-timeout motivated by recommendations in [RFC1122] can reduce the chances of abandoning a live session.

Furthermore, from tldp.org:

> In fact, TCP permits you to handle a stream, not packets, and so a zero-length data packet is not dangerous for the user program.

And the Wireshark docs give insight into the layout of keepalive packets (we'd have to keep track of the last sequence number):

> TCP Keep-Alive: Set when the segment size is zero or one, the current sequence number is one byte less than the next expected sequence number, and none of SYN, FIN, or RST are set.
>
> TCP Keep-Alive ACK: Set when all of the following are true: The segment size is zero. The window size is non-zero and hasn’t changed. The current sequence number is the same as the next expected sequence number. The current acknowledgment number is the same as the last-seen acknowledgment number. The most recently seen packet in the reverse direction was a keepalive. The packet is not a SYN, FIN, or RST.
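
The first of the two Wireshark rules translates almost literally into code. The following sketch uses a hypothetical `Tcp_segment` record instead of a real TCP header and shows why the router would have to remember the next expected sequence number per direction.

```cpp
#include <cstdint>

// Hypothetical, reduced view of a TCP segment (illustration only)
struct Tcp_segment
{
	std::uint32_t seq;      // sequence number of this segment
	std::uint32_t payload;  // payload length in bytes
	bool syn, fin, rst;
};

// Keep-alive heuristic as quoted above: zero or one byte of payload, sequence
// number one less than the next expected one, and none of SYN, FIN, RST set.
bool looks_like_keepalive(Tcp_segment const &seg, std::uint32_t next_expected_seq)
{
	// the subtraction wraps modulo 2^32, like TCP sequence numbers do
	std::uint32_t const one_back = static_cast<std::uint32_t>(next_expected_seq - 1);

	bool const small_payload = seg.payload <= 1;
	bool const no_ctrl_flags = !seg.syn && !seg.fin && !seg.rst;

	return small_payload && seg.seq == one_back && no_ctrl_flags;
}
```

A probe generated by the router would be built the other way around: an empty segment whose sequence number is one less than the next sequence number the receiving peer expects.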

m-stein commented 5 months ago

@chelmuth Regarding our offline discussion:

Currently, a NIC session request at the router comes in with 3.5M quota, of which only 205K remain after create_session.

The sizes of dynamically allocated session objects:

TCP Link: 432
UDP Link: 424
ICMP Link: 424
ARP Waiter: 72
DHCP Allocation: 240

So, theoretical max number of connections (without counting DHCP/ARP objects or meta data) is 485.
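
One way to arrive at the figure of 485 is to divide the remaining quota by the size of a TCP link. The sketch below assumes that "205K" means 205 KiB, which is an interpretation and not stated above. The constant and function names are made up for this example.

```cpp
#include <cstddef>

// Object sizes in bytes as listed above
constexpr std::size_t TCP_LINK_SIZE = 432;
constexpr std::size_t UDP_LINK_SIZE = 424;

// Remaining session quota; reading "205K" as 205 KiB is an assumption
constexpr std::size_t REMAINING_QUOTA = 205 * 1024;

// Upper bound for the number of links, ignoring any allocator meta data
constexpr std::size_t max_links(std::size_t quota, std::size_t link_size)
{
	return quota / link_size;
}

static_assert(max_links(REMAINING_QUOTA, TCP_LINK_SIZE) == 485,
              "209920 bytes / 432 bytes per TCP link");
```

With the slightly smaller UDP or ICMP links, the same division yields a marginally higher bound.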

Furthermore, I have to correct myself regarding the current idle timeouts:

established TCP: 10 minutes
other TCP: 1 minute (note: also after a complete close handshake as RFC recommends it just in case)
UDP: 30 seconds
ICMP: 10 seconds

Suggestions:

I would drop TCP links right after the close handshake.
I would lower non-established TCP to 30 seconds as 60 seconds seem too much to me.
I would slightly increase the default quota of NIC sessions because 205K doesn't seem too much to me.
Although RFCs and OSs suggest 7440 seconds, I'd stay with the 10 minutes for established TCP as we had no problems with this so far.
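
Taken together, the current values and the suggestions would result in the following idle timeouts. This is a summary sketch with hypothetical names (`Link_kind`, `idle_timeout`), not the router's configuration interface.

```cpp
#include <chrono>

enum class Link_kind { TCP_ESTABLISHED, TCP_OTHER, TCP_CLOSED, UDP, ICMP };

// Idle timeouts if all suggestions above were adopted
constexpr std::chrono::seconds idle_timeout(Link_kind kind)
{
	using namespace std::chrono_literals;

	switch (kind) {
	case Link_kind::TCP_ESTABLISHED: return 10min; // unchanged
	case Link_kind::TCP_OTHER:       return 30s;   // lowered from 1 minute
	case Link_kind::TCP_CLOSED:      return 0s;    // dropped right after the close handshake
	case Link_kind::UDP:             return 30s;   // unchanged
	case Link_kind::ICMP:            return 10s;   // unchanged
	}
	return 0s; // not reached for valid enum values
}

// The 7440 seconds mentioned above are the 2 hours and 4 minutes of RFC 5382
static_assert(std::chrono::hours(2) + std::chrono::minutes(4) == std::chrono::seconds(7440),
              "2 h 4 min expressed in seconds");
```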

chelmuth commented 5 months ago

> Suggestions:
>
> I would drop TCP links right after the close handshake.
>
> I would lower non-established TCP to 30 seconds as 60 seconds seem too much to me.
>
> I would slightly increase the default quota of NIC sessions because 205K doesn't seem too much to me.
>
> Although RFCs and OSs suggest 7440 seconds, I'd stay with the 10 minutes for established TCP as we had no problems with this so far.

I agree to all four points but request your special attention regarding the impact of increased default resource requirements in point 3. Automatic tests may need to be adapted and Sculpt integration tested.

m-stein commented 5 months ago

@chelmuth Thanks for your feedback. I'll keep an eye on the tests.

m-stein commented 4 months ago

This commit series should solve this issue and #4534:

2b16fd9337 nic_router: destroy timed out ARP waiters
6dd39946ef nic_router: drop closed tcp links immediately
0098391380 nic_router: lower non-open tcp timeout to 30 sec
c7e678631f nic_router: mark tcp open only with full handshake
0f57a4eb6d nic_router: remove reference utilities
05bd3c2e06 nic_router: smarter emergency free on exhaustion
9fd9cae21f nic_router: fix leak on domain deinit
0434abc2ab nic_router: remove Invalid exceptions
d2cc2ec648 nic_router: remove pointer utilities
e3484f034e nic_router: no Ip_config_static exception
09f81021bb nic_router: no Never_reached exception
228c9f7604 nic_router: no Mac_allocator::Alloc_failed
389d068783 nic_router: remove Bad_send_dhcp_args exception
fd37e59412 nic_router: no Bad_transport_protocol exception
bfb11eda3e nic_router: remove bit-array/alloc exceptions
90384c2e8b nic_router: remove Retry_without_domain exception
6e32e4f6e0 nic_router: remove Report::Empty exception
6eaa17a8b4 nic_router: don't throw Nonexistent_attribute
5189310c09 nic_router: don't throw Nonexistent_sub_node
1cf557e583 nic_router: don't throw Option_not_found (DHCP)
62fa100361 nic_router: don't throw Deref_unconstructed_object
9f6f7fc96a nic_router: don't throw Pointer::Invalid
1129b700d9 nic_router: remove Dhcp_allocation_tree exceptions
0013867ac9 nic_router: remove Keep_ip_config exception
991e74f007 nic_router: remove Packet_postponed exception
dfcb14cc6d nicrouter: remove unused Dismiss* exceptions
07bd56568a nic_router: remove Alloc_dhcp_msg_buffer_failed
27f70f9d7f nic_router: remove Port_allocator exceptions
d2aab1d0c0 nic_router: remove Alloc_ip_failed exception
38d91326f9 nic_router: remove No_next_hop exception
07ed56fe87 nic_router: remove Bad_network_protocol exception
b1ae7412de nic_router: remove Drop_packet exception
891f012cb4 nic_router: remove Resource_exhaustion exception
4b75398902 nic_router: keep links on resource exhaustion
3c1af9304c net/port.h: default constructor
c9cf3a9d8a os: raise nic connection ram quota
fb87898d35 xml_node: support attribute access via lambda

I kept the git history for the C++-exception-related commits quite detailed because these commits apply subtle changes to the execution flow in the router. I imagine, tracking down hidden long-term bugs that might come from these changes is a lot easier with the smaller commits.

chelmuth commented 4 months ago

@m-stein Great! Could you please publish a Sculpt image compatible to 24.04.1 (despite the slight base API changes)?

chelmuth commented 4 months ago

After thorough reconsideration, I'm going to defer commit c9cf3a9d8ac51608e91c76b2a9e576c71b36694b until the fixes are merged, intensively tested, and we are then still facing issues with resource shortage. Even then, I'm now convinced that the default resource quotas should address clients like the archive fetch for depot in Sculpt, but more demanding clients like vbox may express their needs explicitly.

For now, the merge is stalled by @nfeske's comment which has a point IMO. @m-stein could you check if attribute_value() fits your use cases? It should, as each configuration needs a sane default value, or not?

chelmuth commented 4 months ago

> @m-stein Could you please publish a Sculpt image compatible to 24.04.1 (despite the slight base API changes)?

Your published image references everything from depot user mstein, which is impractical to test on my working machine. Could you please update the boot image just with the fixed nic_router following the guide at https://genodians.org/nfeske/2023-11-10-modding-sculpt#A_system_image_for_the_PC? Note, --depot-auto-update must not be enabled in build.conf to keep versions intact.

m-stein commented 4 months ago

@chelmuth I'll try. I'd also like to add that the image I published is not ready for productive use yet. I'll let you know.

m-stein commented 4 months ago

@chelmuth @nfeske I've tried to meet all of your requests, re-pushed a merge_to_staging and published a tested Sculpt image.

chelmuth commented 4 months ago

I'm using the published image just now. What I learned so far:

My ssh connection was not aborted during my tests up to now.
Sometimes TCP, UDP, or ICMP connections are refused although the session still reports <ram-quota avail="139330"/>.
Stress testing with ab -c 50 -n 10000 https://fast.com/ may lead to states where networking appears stuck, but after a couple of minutes everything seems fine again.
Did I note, my ssh connection is still alive?

Example runtime/nic_router/state

<state>
    <ram quota="20928146" used="10940416" shared="4096"/>
    <cap quota="289" used="54" shared="1"/>
    <domain name="default" rx_bytes="41396" tx_bytes="38135" ipv4="10.0.1.1/24" gw="0.0.0.0">
        <tcp-links>
            <destroyed value="1"/>
        </tcp-links>
        <udp-links>
            <destroyed value="1"/>
        </udp-links>
        <dhcp-allocations>
            <destroyed value="1"/>
        </dhcp-allocations>
        <interface label="update -> tcpip -> " link_state="true">
            <ram-quota used="3321856" limit="3526722" avail="204866"/>
            <cap-quota used="4" limit="7" avail="3"/>
            <tcp-links>
                <dissolved_timeout_closed value="1"/>
            </tcp-links>
            <udp-links>
                <dissolved_timeout_open value="1"/>
            </udp-links>
            <dhcp-allocations>
                <alive value="1"/>
            </dhcp-allocations>
        </interface>
        <interface label="sculpt_vm_vbox6 -> vbox -> 0" link_state="true">
            <ram-quota used="3387392" limit="3526722" avail="139330"/>
            <cap-quota used="5" limit="7" avail="2"/>
            <tcp-links>
                <refused_for_ram value="7048"/>
                <refused_for_ports value="2021"/>
                <opening value="207"/>
                <dissolved_timeout_closing value="13"/>
                <dissolved_timeout_closed value="4014"/>
                <dissolved_no_timeout value="3105"/>
                <destroyed value="7132"/>
            </tcp-links>
            <udp-links>
                <refused_for_ram value="150"/>
                <dissolved_timeout_opening value="5"/>
                <dissolved_timeout_open value="21"/>
                <dissolved_no_timeout value="8"/>
                <destroyed value="34"/>
            </udp-links>
            <icmp-links>
                <refused_for_ram value="18"/>
            </icmp-links>
            <arp-waiters>
                <destroyed value="24"/>
            </arp-waiters>
            <dhcp-allocations>
                <alive value="1"/>
            </dhcp-allocations>
        </interface>
    </domain>
    <domain name="http" rx_bytes="0" tx_bytes="0" ipv4="10.0.80.1/24" gw="0.0.0.0"/>
    <domain name="telnet" rx_bytes="0" tx_bytes="0" ipv4="10.0.23.1/24" gw="0.0.0.0"/>
    <domain name="uplink" rx_bytes="13868" tx_bytes="25631" ipv4="10.0.0.30/24" gw="10.0.0.1">
        <dns ip="10.0.0.2"/>
        <dns-domain name="genode.labs"/>
        <interface label="nic -> eth0" link_state="true">
            <ram-quota used="3387392" limit="3527557" avail="140165"/>
            <cap-quota used="5" limit="7" avail="2"/>
            <arp-waiters>
                <destroyed value="159"/>
            </arp-waiters>
        </interface>
    </domain>
</state>

What bothers me here is:

I never saw any <open> connections (which may be because those are not reported).
<icmp-links> <refused_for_ram value="18"/> </icmp-links> despite there's still RAM available.

Now I'm already at

        <interface label="sculpt_vm_vbox6 -> vbox -> 0" link_state="true">
            <ram-quota used="3387392" limit="3526722" avail="139330"/>
            <cap-quota used="5" limit="7" avail="2"/>
            <tcp-links>
                <refused_for_ram value="11768"/>
                <refused_for_ports value="2527"/>
                <opening value="3"/>
                <dissolved_timeout_opening value="581"/>
                <dissolved_timeout_closing value="44"/>
                <dissolved_timeout_closed value="88864"/>
                <dissolved_no_timeout value="5279"/>
                <destroyed value="94768"/>
            </tcp-links>
...

which means 11768 connections were refused due to RAM shortage with 139330 available bytes.

chelmuth commented 4 months ago

> @m-stein I've tried to meet all of your requests, re-pushed a merge_to_staging and published a tested Sculpt image.

Did you forget to actually push your branch?

m-stein commented 4 months ago

Using the updated Sculpt 24.04.1, @chelmuth found a new issue: Under stress, the router eventually refuses new TCP/UDP/ICMP connections as expected but, at this point, the relevant NIC session still has around 140K of session RAM quota left.

This comes from the fact that the heap uses exponentially increasing chunk sizes for its back-end allocations. So, after some allocations, the heap requests significantly larger dataspaces. In addition, the router's Session_env (session-local RAM allocator and region manager) always tries to reserve the worst-case costs of an operation before performing it. Consequently, on the first failed attempt of the heap to expand itself, the session is left with a considerable amount of quota that is rendered useless, as the heap has no means of accessing it once the chunk size has grown that much.
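
The effect can be illustrated with a toy model. The sketch below assumes a heap that doubles its chunk size on every expansion and a session that refuses any expansion it cannot pay for in full. The function name and all numbers are made up, and Genode's heap uses its own growth parameters.

```cpp
#include <cstddef>

struct Result
{
	std::size_t chunks;    // number of successful heap expansions
	std::size_t stranded;  // quota left over that the heap cannot use anymore
};

// Toy model of exponential chunk growth against a fixed session quota
Result expand_until_refused(std::size_t quota, std::size_t first_chunk)
{
	if (first_chunk == 0)
		return { 0, quota };

	std::size_t chunk = first_chunk, chunks = 0;

	// expand as long as the next chunk still fits into the remaining quota
	while (chunk <= quota) {
		quota -= chunk;
		chunks++;
		chunk *= 2;
	}
	return { chunks, quota };
}
```

For example, `expand_until_refused(200000, 4096)` performs five expansions (4096 up to 65536 bytes) and then fails to obtain 131072 bytes although 73024 bytes of quota are still available.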

One approach would be to modify the default session quota to a value that minimizes the wastage given this specific use case. However, this would not account for other NIC-session use cases.

Another approach is to replace the session-local heap in the router with a combination of a sliced heap (back end) and TSLABs (for the 5 types that sessions allocate dynamically). I just tried this approach, but it results in higher CAP quota requirements. Looking only at session creation, the heap version requires 5 caps while the sliced-heap version requires 11 caps (default quota is 8). The additional caps come from one dataspace for the packet allocator bits, one dataspace for some other packet-stream-rx-related meta data (not buffers), and the initial blocks for the 5 TSLABs.

Of course, we could raise the default CAP quota in order to solve that.

One other approach would be to stay with the heap and implement that it shrinks its chunk-size when it fails to allocate a dataspace. However, we settled on closing this issue without trying this approach.

m-stein commented 4 months ago

In my last posting, I meant 140K of RAM quota not 14K.

chelmuth commented 4 months ago

nic_router/state snapshot of the day

<domain name="default" rx_bytes="450459707" tx_bytes="31371499" ipv4="10.0.1.1/24" gw="0.0.0.0">
  <interface label="sculpt_vm_vbox6 -> vbox -> 0" link_state="true">
    <ram-quota used="3387392" limit="3526722" avail="139330"/>
    <cap-quota used="5" limit="7" avail="2"/>
    <tcp-links>
      <opening value="13"/>
      <dissolved_timeout_opening value="30"/>
      <dissolved_timeout_closing value="118"/>
      <dissolved_timeout_closed value="2005"/>
      <destroyed value="2153"/>
    </tcp-links>
    <udp-links>
      <open value="2"/>
      <dissolved_timeout_opening value="31"/>
      <dissolved_timeout_open value="2651"/>
      <dissolved_no_timeout value="399"/>
      <destroyed value="3081"/>
    </udp-links>
    <icmp-links>
      <dissolved_timeout_open value="1"/>
      <destroyed value="1"/>
    </icmp-links>
    <arp-waiters>
      <alive value="18446744073709551606"/> <!-- this is hex 0xfffffffffffffff6 which makes me curious -->
      <destroyed value="34"/>
    </arp-waiters>
    <dhcp-allocations>
      <alive value="1"/>
    </dhcp-allocations>
  </interface>
</domain>
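
The reported value equals 2^64 - 10, which is what an unsigned 64-bit counter shows once it has been decremented ten times more often than it was incremented. Whether this is the actual cause is not stated here, so the following is only a guess at the mechanism, with a hypothetical `Stats` type.

```cpp
#include <cstdint>

// Hypothetical counter pair (illustration only)
struct Stats
{
	std::uint64_t alive     { 0 };
	std::uint64_t destroyed { 0 };

	void create()  { alive++; }

	// no check that 'alive' is non-zero, so the counter can wrap around
	void destroy() { alive--; destroyed++; }
};
```

Ten calls of `destroy()` without a matching `create()` leave `alive` at 18446744073709551606, the value seen in the report.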

m-stein commented 4 months ago

@chelmuth I've tested and published a new Sculpt (2024-06-11) and pushed a corresponding merge_to_staging.

m-stein commented 4 months ago

> nic_router/state snapshot of the day

Oops. Yeah, I think I found the cause. Will provide a fix soon.

chelmuth commented 4 months ago

> Oops. Yeah, I think I found the cause. Will provide a fix soon.

Are you planning to update your image with the fix? Then I'll wait with the upgrade.

m-stein commented 4 months ago

Yes I'll update it as well.

m-stein commented 4 months ago

@chelmuth With my latest Sculpt (2024-06-12) I cannot reproduce bogus ARP stats anymore. I also updated my merge_to_staging accordingly.

chelmuth commented 4 months ago

I updated my Sculpt system and merged the commits to staging. Experiences with the previous version were already quite good (stable SSH for 3 days) despite the small arp-waiter report hiccup.

m-stein commented 4 months ago

@chelmuth 8a1bfaa944 should fix the fetchurl_lxip regression.

m-stein commented 4 months ago

@chelmuth 93fa8aba03 should fix the regression with run/nic_router_ipv4_fram.

m-stein commented 4 months ago

@chelmuth Debugging the failing nic_router_flood test, I found that it is actually a regression caused by this issue. I had to change the original series in order to add two fixups that should fix the regression:

4e5aaf5301 nic_router: fix interface-local quota reporting
4c0e584333 nic_router: destroy timed out ARP waiters
c82aeb5ea7 nic_router: drop closed tcp links immediately (updated)
5fd26cf912 nic_router: lower non-open tcp timeout to 30 sec
013dc53d10 fixup "nic_router: mark tcp open only with full handshake"
9b0ff9652b nic_router: mark tcp open only with full handshake
bce341291f nic_router: remove reference utilities (updated)
989deccb66 nic_router: fix leak on domain deinit
5ce2646e68 fixup "nic_router: smarter emergency free on exhaustion"
...

chelmuth commented 4 months ago

Kind of merged the series via inverse rebase.

m-stein commented 4 months ago

@chelmuth Thanks!

nfeske commented 4 months ago

Fixed in master.

genodelabs / genode

nic_router: broken SSH pipes #4729
