genodelabs / genode

Genode OS Framework
https://genode.org/

nic_router: strange packet stalls or drops on continuous operation #2953

Closed chelmuth closed 5 years ago

chelmuth commented 6 years ago

Since nic_router was integrated into Sculpt, I have experienced connection issues after 8-9 hours during a normal working day. The symptoms are connection timeouts (though not all attempts fail), while already established connections keep working flawlessly. After discussion with @cproc, we suspect that the issue originates not from the actual packet count or the amount of transferred data but from the number of connections. Note that this does not mean an overly high number of connections per second, as the issue only appears after hours. Nevertheless, I successfully forced the issue with Apache bench as follows.

ab -n 1000 -c 1 http://some.local.server

After about 800 connections, the test progress stalls (which may hint at a max-connections-per-second issue too), and repeated execution leads to the symptom described above. Using more connection requests in parallel (e.g., -c 4) makes things even worse.

Note 1: I did not investigate the packets on the wire or between nic_drv and nic_router.
Note 2: Since established connections still work, I do not suspect the ipxe_nic_drv.
Note 3: ICMP ping also suffers from the same symptom in the error situation: sometimes it works, sometimes it hangs (also in between two pings of one run).

Is there any means in the router to investigate statistics about number of connections etc.?

m-stein commented 6 years ago

Currently there are no statistics about connections but it wouldn't be hard to implement.

Does this commit series from my merge_to_staging solve the problem for you? (For my Sculpt it does):
d2c3a01 fixup "nic_router: improve handling of TCP termination"
49c615a nic_router: improve handling of TCP termination
51e5ace nic_router: limit packets handled per signal
082002a nic_router_flood: fix CRC error, make more precise
f39e237 nic_router: allow ld_verbose attribute
8124750 nic_router: destroy links on insufficient resource
7c41e18 nic_router: "packet alloc" error only when verbose
36abbca nic_router.run: test multi-client http-server
59a3aa5 nic_router.run: more systematic naming scheme

m-stein commented 6 years ago

PS: I've tested your ab command line.

chelmuth commented 6 years ago

Thank you. I integrated this series into my sculpt update to give it some real-world exposure ;-)

chelmuth commented 6 years ago

First tests with ab stall after 900 requests...

> ab -n 1000 -c 1 http://some.local.server
This is ApacheBench, Version 2.3 <$Revision: 1706008 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking mogli.genode.labs (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests

chelmuth commented 6 years ago

What seems to be working now is that this situation recovers after a short time (maybe 120 seconds?).

m-stein commented 6 years ago

I'm working on further commits that round out the resource-management fixes in the NIC router. For instance, with the above commit series, old connections are removed only when the RAM does not suffice for a new one; when the ports are insufficient, the new connection still gets dropped. There are also still cases where the session quota cannot be given back to a client once it closes the session.
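To illustrate the asymmetry described above, here is a minimal sketch, not taken from the nic_router sources. The names `Link`, `Link_table`, and `create_link`, as well as counting RAM quota in whole "link units", are assumptions made purely for illustration. It shows a table that evicts its oldest link when RAM is exhausted but simply refuses a new link when no port is free.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <optional>

// Hypothetical stand-in for one tracked connection ("link").
struct Link { unsigned port; };

// Hypothetical link table; not the nic_router implementation.
class Link_table
{
    std::size_t          _ram_left;       // remaining quota, counted in links
    std::deque<unsigned> _free_ports {};  // ports currently not in use
    std::deque<Link>     _links {};       // oldest link at the front

    // Drop the oldest link and return its RAM and port to the pools.
    void _evict_oldest()
    {
        _free_ports.push_back(_links.front().port);
        _links.pop_front();
        _ram_left++;
    }

  public:

    Link_table(std::size_t ram_links, unsigned first_port, unsigned nr_of_ports)
    : _ram_left(ram_links)
    {
        for (unsigned i = 0; i < nr_of_ports; i++)
            _free_ports.push_back(first_port + i);
    }

    std::optional<Link> create_link()
    {
        // No free port: the new connection is dropped, nothing is evicted.
        if (_free_ports.empty())
            return std::nullopt;

        // No RAM left: make room by evicting the oldest link.
        if (_ram_left == 0) {
            if (_links.empty())
                return std::nullopt;
            _evict_oldest();
        }
        Link const link { _free_ports.front() };
        _free_ports.pop_front();
        _ram_left--;
        _links.push_back(link);
        return link;
    }

    std::size_t nr_of_links() const { return _links.size(); }
};
```

With a table limited to two links' worth of RAM, a third `create_link()` succeeds by evicting the first link. With a table limited to two ports, the third call returns an empty optional, which corresponds to the dropped connection mentioned above.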

chelmuth commented 6 years ago

So, today I experienced the same fatal connection issue after 7-8 hours. I tried several configuration tweaks at runtime like changing the uplink as well as reconfiguring the uplink and downlink domains. What I can tell is that Xubuntu running in VirtualBox is no longer able to finish DHCP with the nic_router. The /report/runtime/nic_router/state file reflects increasing tx_bytes for the default domain, though.

After VirtualBox shutdown, exit, and restart, the still running nic_router instance works flawlessly again for the virtual machine.

chelmuth commented 6 years ago

As discussed offline, I got it again after 6-7 hours, but today only UDP seemed to be stuck. So I tried reconfiguring something UDP-related and reduced the number of UDP ports in the uplink domain to 999. After that, DNS magically worked again. However, I also restarted VirtualBox in between.

chelmuth commented 6 years ago

The drama continues, but thanks to @m-stein's statistics patch I have some data for today's failure of operation. First, the output of netstat in vbox, and second, /report/runtime/nic_router/state.

m-stein commented 6 years ago

@chelmuth This commit series on my nic_router_chelmuth_sculpt branch should fix your problem:
360c308 nic_router: rework quota accounting
7b4cbf9 fixup "nic_router: improve handling of TCP termination"
8fb282d nic_router: improve handling of TCP termination
3782dee nic_router: limit packets handled per signal
0551d0f nic_router_flood: fix CRC error, make more precise
8afe234 nic_router: allow ld_verbose attribute
103adef nic_router: destroy links on insufficient resource
5d1aeee nic_router: "packet alloc" error only when verbose
85218f1 Fixup "Xml_generator: fix and test missing '\0'"
543bf69 Xml_generator: fix and test missing '\0'
19fa2c8 Xml_generator: fix exception handling in Node(...)
1596821 heap: free DS on exceptions during attach

There is a problem left in the SLAB with falsely reported dangling allocations but I have no stable solution for this by now. Anyway, this should not influence the sculpt experience ;)

m-stein commented 6 years ago

PS: These commits are not yet meant to be merged! I would wait for your feedback and clean them up afterwards.

chelmuth commented 6 years ago

The following remark frightens me.

> There is a problem left in the SLAB with falsely reported dangling allocations but I have no stable solution for this by now. Anyway, this should not influence the sculpt experience ;)

Please clarify and open a separate issue, so we can investigate promptly. The slab implementation is used all over the code base.

m-stein commented 6 years ago

Sorry, I was in a hurry. It's in the Allocator AVL, in _revert_allocations_and_ranges, where dangling_allocations is raised also for allocations that contain metadata of the metadata allocator (if the metadata allocator uses the AVL allocator as its own metadata allocator). These allocations can't be freed up to this point, but that's fine, so the warning is misleading. For the Allocator_avl_tpl, the metadata allocator is a Tslab. I added a method virtual bool Allocator_avl::_metdata_of_md_alloc(addr) { return false; }, which in Allocator_avl_tpl returns _metdata.metadata(addr). The latter is implemented in the Slab, which walks through all its blocks and compares them with the address. But it didn't work out as expected so far.
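The idea can be sketched roughly as follows. This is a simplified model under stated assumptions, not Genode's allocator code: the types `Range_allocator_sketch` and `Slab_backed_sketch`, the members `live` and `slab_blocks`, and the method name `metadata_of_md_alloc` are hypothetical. The base class counts every remaining allocation as dangling unless a virtual hook reports that the address belongs to the metadata allocator itself.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using addr_t = std::uintptr_t;

// Hypothetical base allocator: tracks addresses still allocated at teardown.
struct Range_allocator_sketch
{
    std::vector<addr_t> live {};

    virtual ~Range_allocator_sketch() = default;

    // Default: no address is considered metadata of the metadata allocator.
    virtual bool metadata_of_md_alloc(addr_t) const { return false; }

    // Count remaining allocations, skipping the metadata allocator's own blocks.
    unsigned dangling_allocations() const
    {
        unsigned count = 0;
        for (addr_t const addr : live)
            if (!metadata_of_md_alloc(addr))
                count++;
        return count;
    }
};

// Hypothetical variant whose metadata lives in slab blocks.
struct Slab_backed_sketch : Range_allocator_sketch
{
    std::vector<addr_t> slab_blocks {};

    // Walk all slab blocks and compare each with the given address.
    bool metadata_of_md_alloc(addr_t addr) const override
    {
        for (addr_t const block : slab_blocks)
            if (block == addr)
                return true;
        return false;
    }
};
```

In this model, an allocator with three live addresses of which one is a slab block reports two dangling allocations instead of three, which is the kind of warning suppression the comment aims at.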

m-stein commented 6 years ago

I can open an issue as soon as I'm back.

chelmuth commented 6 years ago

I did not see any stalls with the current series for three days, but do not take this as an all-clear signal yet.

m-stein commented 6 years ago

This commit series is my suggestion for staging:
b87f9c7 nic_router_flood: reworked to stress/analyze more
f94c1ac net_flood: fix CRC error, make more precise
e3ba14d nic_router: rework quota accounting
0315c66 nic_router: improve handling of TCP termination
f13584e nic_router: limit packets handled per signal
b2be39f nic_router: allow ld_verbose attribute
07f1456 nic_router: destroy links on insufficient resource
2e947d4 nic_router: "packet alloc" error only when verbose
f943fd9 Xml_generator: fix and test missing '\0'
d409769 Xml_generator: fix exception handling in Node(...)
d1d8cf0 allocator_avl: fix dangling-allocations warning
aa87736 heap: fix exception handling in _allocate_dataspace

The NIC router flood test has three malicious clients (icmp, udp, tcp) that create connections as fast as possible and one good ping client with a 1-second interval. It tests the following:

I've adapted the '\0' commit for the Xml_generator so that it only appends a '\0' in Xml_generator(...) which already fixes the xml_generator test but should not break other things.
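As a rough illustration of terminating the output in exactly one place, here is a small sketch. It is not Genode's Xml_generator: the class `Text_generator_sketch`, its `append` and `used` methods, and the functor-based constructor are assumptions for this example. The writer always keeps one byte in reserve, and the constructor appends the single '\0' after the fill function has run.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical generator that writes text into a caller-provided buffer.
class Text_generator_sketch
{
    char        *_dst;
    std::size_t  _size;
    std::size_t  _used = 0;

  public:

    template <typename FN>
    Text_generator_sketch(char *dst, std::size_t size, FN const &fill)
    : _dst(dst), _size(size)
    {
        fill(*this);

        // The terminator is appended only here, once all content is written.
        if (_size > 0)
            _dst[_used] = '\0';
    }

    // Copy as much of 's' as fits while reserving one byte for the '\0'.
    void append(char const *s)
    {
        for (; *s && _used + 1 < _size; s++)
            _dst[_used++] = *s;
    }

    std::size_t used() const { return _used; }
};
```

Because `append` never consumes the last byte, the constructor can write the terminator unconditionally for any non-empty buffer, even when the content was truncated.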

m-stein commented 6 years ago

@chelmuth This fixup reverts picky changes that are not really necessary in the original commit: b0f2a02 Fixup "Xml_generator: fix exception handling in Node(...)"

chelmuth commented 6 years ago

From my side this issue can be closed except that the statistics feature (d91ff02747c849831e66c50be3780f30d9f631d1) is not yet available on the staging branch.

m-stein commented 5 years ago

I'd like to take the time to rethink the format of the statistics. Thus, I'd address them in an extra issue.

m-stein commented 5 years ago

@chelmuth: Alex has a fixup e809e16 for 32552a07b1daa1c600253819aaccb2a78267a5ca. I think it's a good idea to merge it. Thanks @alex-ab!

m-stein commented 5 years ago

This is an explanation: https://github.com/genodelabs/genode/commit/32552a07b1daa1c600253819aaccb2a78267a5ca#commitcomment-33527756