Closed m-stein closed 1 year ago
On my 3992_nic_router_run_sporadic_failures branch, I've created a test that triggers the failure pretty fast. I found that simply re-creating the server socket after each successful transmission causes the system to get stuck at some point, given that the client just keeps sending endlessly. In contrast, when the server socket is not re-created while the client sends endlessly, the fault is not triggered unless it already occurred before the first transmission. This indicates that the problem is related to socket creation at the server side.
Furthermore, the log shows that the server eventually responds with an ICMP 3 3 packet (type 3, code 3), which means "destination unreachable -> port unreachable", although the dst port remained the same:
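The two server variants described above can be sketched roughly as follows. This is only an illustration on the loopback interface using Python's standard `socket` module, not the actual test from the branch; the names `serve`, `client`, and `demo` as well as the timeout value are assumptions made for this sketch.

```python
import socket
import threading

HOST = "127.0.0.1"  # loopback only; the real test runs across the NIC router


def serve(port, loops, recreate):
    """Answer `loops` datagrams; with `recreate`, use a fresh socket for each one."""
    served = 0
    sock = None
    for _ in range(loops):
        if sock is None:
            sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            sock.bind((HOST, port))
        data, peer = sock.recvfrom(2048)
        sock.sendto(b"reply to " + data, peer)
        served += 1
        if recreate:
            # Variant that provokes the fault in the issue: drop the socket
            # after every successful transmission and create a new one.
            sock.close()
            sock = None
    if sock is not None:
        sock.close()
    return served


def client(port, stop):
    """Keep sending until `stop` is set, tolerating lost datagrams."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(0.05)  # hypothetical value, short to keep the sketch fast
        while not stop.is_set():
            try:
                sock.sendto(b"ping", (HOST, port))
                sock.recvfrom(2048)
            except OSError:
                pass  # timeout or port unreachable while no server socket exists


def demo(loops, recreate):
    """Run one client against one server and return the number of answered datagrams."""
    # Let the OS pick a free port, then reuse its number for the server.
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        probe.bind((HOST, 0))
        port = probe.getsockname()[1]
    stop = threading.Event()
    sender = threading.Thread(target=client, args=(port, stop), daemon=True)
    sender.start()
    try:
        return serve(port, loops, recreate)
    finally:
        stop.set()
        sender.join()
```

Calling `demo(20, recreate=True)` corresponds to the variant that gets stuck on Genode, while `demo(20, recreate=False)` corresponds to the variant that keeps one server socket for all transmissions.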
Genode 20.11-99-gf97bb2bd17
467 MiB RAM and 63253 caps assigned to init
[init -> acpi_drv] Found MADT
[init -> acpi_drv] MADT IRQ 0 -> GSI 2 flags: 0
[init -> acpi_drv] MADT IRQ 5 -> GSI 5 flags: 13
[init -> acpi_drv] MADT IRQ 9 -> GSI 9 flags: 13
[init -> acpi_drv] MADT IRQ 10 -> GSI 10 flags: 13
[init -> acpi_drv] MADT IRQ 11 -> GSI 11 flags: 13
[init -> acpi_drv] Found MCFG
[init -> acpi_drv] MCFG BASE 0xb0000000 seg 0x0 bus 0x0-0xff
[init -> acpi_drv] RSDT OEM 'BOCHS ', table id 'BXPCRSDT', revision 1, creator 'BXPC' (1)
[init -> acpi_drv] SMBIOS table (entry point: 0x175b00 structures: 0xf5b20)
[init -> platform_drv] ECAM/MMCONF range 00000000:00000000.0-000000ff:0000001f.7 - addr [00000000b0000000,00000000c0000000)
[init -> platform_drv] Root bridge: 00000000:00000000.0
[init -> nic_router] [uplink] static IP config: interface 10.0.2.55/24, gateway 10.0.2.1 P2P 0
[init -> nic_router] [uplink] NIC sessions: 0
[init -> nic_router] [t2_d1] static IP config: interface 18.17.16.14/24, gateway 0.0.0.0 P2P 0
[init -> nic_router] [t2_d1] NIC sessions: 0
[init -> nic_bridge] --- NIC bridge started (mac=01:02:03:04:05:06) ---
[init -> nic_bridge] vmac = 02:02:02:02:02:00 ip = 10.0.2.55
[init -> nic_router] [uplink] NIC sessions: 1
[init -> nic_bridge] vmac = 02:02:02:02:02:01 ip = 10.0.2.212
[init -> t2_d0_c1_udp] lwIP Nic interface down
[init -> t2_d0_c1_udp] lwIP Nic interface up address=10.0.2.212 netmask=0.0.0.0 gateway=0.0.0.0
[init -> nic_router] [t2_d1] NIC sessions: 1
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > ff:ff:ff:ff:ff:ff IPV4 10.0.2.212 > 255.255.255.255 UDP 68 > 67 DHCP 02:02:02:02:02:01 > 01
[init -> nic_router] [uplink] rcv ETH 01:02:03:04:05:06 > ff:ff:ff:ff:ff:ff IPV4 10.0.2.212 > 255.255.255.255 UDP 68 > 67 DHCP 02:02:02:02:02:01 > 01
[init -> t2_d1_s1_udp] lwIP Nic interface down
[init -> t2_d1_s1_udp] lwIP Nic interface up address=18.17.16.15 netmask=0.0.0.0 gateway=0.0.0.0
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > ff:ff:ff:ff:ff:ff ARP 02:02:02:02:02:01 18.17.16.15 > 00:00:00:00:00:00 18.17.16.15 cmd 1
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > ff:ff:ff:ff:ff:ff IPV4 18.17.16.15 > 255.255.255.255 UDP 68 > 67 DHCP 02:02:02:02:02:01 > 01
[init -> t2_d1_s1_udp] Error: starting server loop #1
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > ff:ff:ff:ff:ff:ff ARP 02:02:02:02:02:01 10.0.2.212 > 00:00:00:00:00:00 10.0.2.55 cmd 1
[init -> nic_router] [uplink] snd ETH 02:02:02:02:02:00 > 02:02:02:02:02:01 ARP 02:02:02:02:02:00 10.0.2.55 > 02:02:02:02:02:01 10.0.2.212 cmd 2
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > 01:02:03:04:05:06 IPV4 10.0.2.212 > 10.0.2.55 UDP 49188 > 1
[init -> nic_router] [t2_d1] snd ETH 02:02:02:02:02:00 > ff:ff:ff:ff:ff:ff ARP 02:02:02:02:02:00 18.17.16.14 > ff:ff:ff:ff:ff:ff 18.17.16.15 cmd 1
[init -> nic_router] [uplink] rcv ETH 01:02:03:04:05:06 > 02:02:02:02:02:00 IPV4 10.0.2.212 > 10.0.2.55 UDP 49188 > 1
[init -> nic_router] [t2_d1] snd ETH 02:02:02:02:02:00 > ff:ff:ff:ff:ff:ff ARP 02:02:02:02:02:00 18.17.16.14 > ff:ff:ff:ff:ff:ff 18.17.16.15 cmd 1
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 ARP 02:02:02:02:02:01 18.17.16.15 > 02:02:02:02:02:00 18.17.16.14 cmd 2
[init -> nic_router] [uplink] rcv ETH 01:02:03:04:05:06 > 02:02:02:02:02:00 IPV4 10.0.2.212 > 10.0.2.55 UDP 49188 > 1
[init -> nic_router] [t2_d1] snd ETH 02:02:02:02:02:00 > 02:02:02:02:02:01 IPV4 10.0.2.212 > 18.17.16.15 UDP 49188 > 1
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > 01:02:03:04:05:06 IPV4 10.0.2.212 > 10.0.2.55 UDP 49188 > 1
[init -> nic_router] [t2_d1] snd ETH 02:02:02:02:02:00 > 02:02:02:02:02:01 IPV4 10.0.2.212 > 18.17.16.15 UDP 49188 > 1
[init -> t2_d1_s1_udp] Error: starting server loop #2
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 ARP 02:02:02:02:02:01 18.17.16.15 > 02:02:02:02:02:00 18.17.16.14 cmd 2
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 18.17.16.15 > 10.0.2.212 UDP 1 > 49188
[init -> nic_router] [uplink] snd ETH 02:02:02:02:02:00 > ff:ff:ff:ff:ff:ff ARP 02:02:02:02:02:00 10.0.2.55 > ff:ff:ff:ff:ff:ff 10.0.2.212 cmd 1
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:00 > ff:ff:ff:ff:ff:ff ARP 02:02:02:02:02:00 10.0.2.55 > ff:ff:ff:ff:ff:ff 10.0.2.212 cmd 1
[init -> nic_router] [uplink] rcv ETH 01:02:03:04:05:06 > 02:02:02:02:02:00 ARP 01:02:03:04:05:06 10.0.2.212 > 02:02:02:02:02:00 10.0.2.55 cmd 2
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 18.17.16.15 > 10.0.2.212 UDP 1 > 49188
[init -> nic_router] [uplink] snd ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49188
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49188
[init -> t2_d0_c1_udp] Received "UDP server at 10.0.2.55:1 ..."
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 ARP 02:02:02:02:02:01 10.0.2.212 > 02:02:02:02:02:00 10.0.2.55 cmd 2
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 18.17.16.15 > 10.0.2.212 ICMP 3 3
[init -> nic_router] [uplink] snd ETH 02:02:02:02:02:01 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 ICMP 3 3
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 ICMP 3 3
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 10.0.2.212 > 10.0.2.55 UDP 49189 > 1
[init -> nic_router] [t2_d1] snd ETH 02:02:02:02:02:00 > 02:02:02:02:02:01 IPV4 10.0.2.212 > 18.17.16.15 UDP 49189 > 1
[init -> t2_d1_s1_udp] Error: starting server loop #3
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 18.17.16.15 > 10.0.2.212 UDP 1 > 49189
[init -> nic_router] [uplink] snd ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49189
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49189
[init -> t2_d0_c1_udp] Received "UDP server at 10.0.2.55:1 ..."
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 10.0.2.212 > 10.0.2.55 UDP 49190 > 1
[init -> nic_router] [t2_d1] snd ETH 02:02:02:02:02:00 > 02:02:02:02:02:01 IPV4 10.0.2.212 > 18.17.16.15 UDP 49190 > 1
[init -> t2_d1_s1_udp] Error: starting server loop #4
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 18.17.16.15 > 10.0.2.212 UDP 1 > 49190
[init -> nic_router] [uplink] snd ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49190
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49190
[init -> t2_d0_c1_udp] Received "UDP server at 10.0.2.55:1 ..."
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 10.0.2.212 > 10.0.2.55 UDP 49191 > 1
[init -> nic_router] [t2_d1] snd ETH 02:02:02:02:02:00 > 02:02:02:02:02:01 IPV4 10.0.2.212 > 18.17.16.15 UDP 49191 > 1
[init -> t2_d1_s1_udp] Error: starting server loop #5
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 18.17.16.15 > 10.0.2.212 UDP 1 > 49191
[init -> nic_router] [uplink] snd ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49191
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49191
[init -> t2_d0_c1_udp] Received "UDP server at 10.0.2.55:1 ..."
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 10.0.2.212 > 10.0.2.55 UDP 49192 > 1
[init -> nic_router] [t2_d1] snd ETH 02:02:02:02:02:00 > 02:02:02:02:02:01 IPV4 10.0.2.212 > 18.17.16.15 UDP 49192 > 1
[init -> t2_d1_s1_udp] Error: starting server loop #6
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 18.17.16.15 > 10.0.2.212 UDP 1 > 49192
[init -> nic_router] [uplink] snd ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49192
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49192
[init -> t2_d0_c1_udp] Received "UDP server at 10.0.2.55:1 ..."
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 10.0.2.212 > 10.0.2.55 UDP 49193 > 1
[init -> nic_router] [t2_d1] snd ETH 02:02:02:02:02:00 > 02:02:02:02:02:01 IPV4 10.0.2.212 > 18.17.16.15 UDP 49193 > 1
[init -> t2_d1_s1_udp] Error: starting server loop #7
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 18.17.16.15 > 10.0.2.212 UDP 1 > 49193
[init -> nic_router] [uplink] snd ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49193
[init -> t2_d0_c1_udp] Received "UDP server at 10.0.2.55:1 ..."
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49193
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 10.0.2.212 > 10.0.2.55 UDP 49194 > 1
[init -> nic_router] [t2_d1] snd ETH 02:02:02:02:02:00 > 02:02:02:02:02:01 IPV4 10.0.2.212 > 18.17.16.15 UDP 49194 > 1
[init -> t2_d1_s1_udp] Error: starting server loop #8
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 18.17.16.15 > 10.0.2.212 UDP 1 > 49194
[init -> nic_router] [uplink] snd ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49194
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49194
[init -> t2_d0_c1_udp] Received "UDP server at 10.0.2.55:1 ..."
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 10.0.2.212 > 10.0.2.55 UDP 49195 > 1
[init -> nic_router] [t2_d1] snd ETH 02:02:02:02:02:00 > 02:02:02:02:02:01 IPV4 10.0.2.212 > 18.17.16.15 UDP 49195 > 1
[init -> t2_d1_s1_udp] Error: starting server loop #9
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 18.17.16.15 > 10.0.2.212 UDP 1 > 49195
[init -> nic_router] [uplink] snd ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49195
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49195
[init -> t2_d0_c1_udp] Received "UDP server at 10.0.2.55:1 ..."
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 10.0.2.212 > 10.0.2.55 UDP 49196 > 1
[init -> nic_router] [t2_d1] snd ETH 02:02:02:02:02:00 > 02:02:02:02:02:01 IPV4 10.0.2.212 > 18.17.16.15 UDP 49196 > 1
[init -> t2_d1_s1_udp] Error: starting server loop #10
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 18.17.16.15 > 10.0.2.212 UDP 1 > 49196
[init -> nic_router] [uplink] snd ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49196
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49196
[init -> t2_d0_c1_udp] Received "UDP server at 10.0.2.55:1 ..."
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 10.0.2.212 > 10.0.2.55 UDP 49197 > 1
[init -> nic_router] [t2_d1] snd ETH 02:02:02:02:02:00 > 02:02:02:02:02:01 IPV4 10.0.2.212 > 18.17.16.15 UDP 49197 > 1
[init -> t2_d1_s1_udp] Error: starting server loop #11
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 18.17.16.15 > 10.0.2.212 UDP 1 > 49197
[init -> nic_router] [uplink] snd ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49197
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:00 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 UDP 1 > 49197
[init -> t2_d0_c1_udp] Received "UDP server at 10.0.2.55:1 ..."
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 10.0.2.212 > 10.0.2.55 UDP 49198 > 1
[init -> nic_router] [t2_d1] snd ETH 02:02:02:02:02:00 > 02:02:02:02:02:01 IPV4 10.0.2.212 > 18.17.16.15 UDP 49198 > 1
[init -> nic_router] [t2_d1] rcv ETH 02:02:02:02:02:01 > 02:02:02:02:02:00 IPV4 18.17.16.15 > 10.0.2.212 ICMP 3 3
[init -> nic_router] [uplink] snd ETH 02:02:02:02:02:01 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 ICMP 3 3
[init -> nic_router] [uplink] rcv ETH 02:02:02:02:02:01 > 01:02:03:04:05:06 IPV4 10.0.2.55 > 10.0.2.212 ICMP 3 3
Btw., the number of socket re-creations at the server side (i.e., the number of server loops / transmissions) at which the error occurs differs from run to run but is normally below 100 AFAIS.
Thanks, let's sum this up. It seems that in some cases UDP socket creation or port binding gets stuck in the libc. So, the IP stack is still responsive (see the ICMP packets) but the application makes no progress. I think we can pick up this issue with your findings later.
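To compare the loop count at failure across runs, a log like the one above could be summarized with a small script. This is a rough sketch that assumes the line formats visible in the log ("starting server loop #N" and "[t2_d1] rcv ... ICMP 3 3"); the function name `summarize` is made up for illustration.

```python
import re

LOOP_RE = re.compile(r"starting server loop #(\d+)")


def summarize(log_text):
    """Return (last server loop seen, loops after which the server sent ICMP 3 3)."""
    last_loop = 0
    unreachable_after = []
    for line in log_text.splitlines():
        match = LOOP_RE.search(line)
        if match:
            last_loop = int(match.group(1))
        elif "[t2_d1] rcv" in line and line.rstrip().endswith("ICMP 3 3"):
            # Count only the packet as received from the server's domain,
            # not its forwarded copies on the uplink.
            unreachable_after.append(last_loop)
    return last_loop, unreachable_after
```

A run that ends with a port-unreachable entry for the last loop and no further "starting server loop" line matches the stuck state described in this thread.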
This issue may be related to the complex PCB handling mentioned in #3835.
The test looks good now.
@nfeske I re-enabled test 2 in the run script ...
proc enable_test_2 { } { return 1 }
... and called this ...
x86_64$ while make run/nic_router KERNEL=nova BOARD=pc; do :; done
... and it still triggers this issue after a handful of iterations. Should I re-open the issue?
Thank you @m-stein for pointing this out.
I must admit that I'm divided. On the one hand, the failed test points at a possible deficiency, which would - in principle - call for investigation. On the other hand, the issue remained stale for two years. Apparently nobody was bothered enough by it during that time. By being present yet unattended, it merely remains as a distraction for those who try to keep track of the open issues and monitor the results of our automated tests.
> Should I re-open the issue?
If you are going to attack it, this would be the best way. If you don't, and nobody else does, I'd remove it from our view to keep the noise at bay. How would you decide?
Thanks for your detailed feedback! I think we should go with the approach you described. I'll leave the issue closed and re-open it as soon as I want to tackle it. The problem remains documented in the run script and isn't related to the main objective of the test anyway.
Sometimes, the nic_router test fails without errors. It turned out that the failing test is normally test 2, with one UDP client/server pair. The test seems to get stuck during or shortly after initialization of the two libc components. I've created a recipe for the depot autopilot and stripped it down to test 2 in order to achieve better reproducibility. This is the log output when it fails:
A successful run, on the other hand, would look like this: