dce-umip-nemo.cc does not return

tomhenderson commented 3 years ago

In preparing for possible DCE 1.11 release on Ubuntu 16.04, I encountered the following problem with a umip example. The version 1.11 targets are presently pulling the tip of the development repositories.

$ git clone https://gitlab.com/tomhenderson/bake.git
$ cd bake
$ git checkout -b dce-1.11 origin/dce.111
$ export PATH=$PATH:`pwd`/build/bin
$ export DCE_PATH=`pwd`/build/bin_dce:`pwd`/build/sbin
$ export LD_LIBRARY_PATH=`pwd`/build/lib
$ export PYTHONPATH=`pwd`/build/lib
$ ./bake.py configure -e dce-linux-1.11 -e dce-umip-1.11 -e dce-quagga-1.11
$ ./bake.py download
$ ./bake.py build
$ cd source/ns-3-dce
$ ./waf --run dce-umip-nemo

and the simulation does not return.

The problem is isolated to the dce-umip-nemo.cc example. The simulation enters an infinite loop at time 280.025306697s and not reach the scheduled stop time of 300s. The following command will yield the following output:

$ NS_LOG="DceRrTaskScheduler:DceTaskManager" NS_GLOBAL_VALUE="RngRun=1" ./waf --run dce-umip-nemo

...
+280.025306697s 5 DceRrTaskScheduler:DequeueNext(0x10a7c10)
+280.025306697s 5 DceTaskManager:Wakeup(0x10ab0a0, 0x7f7a6c015ea0, 2)
+280.025306697s 5 DceRrTaskScheduler:Enqueue(0x10a7c10, 0x7f7a6c015ea0)
+280.025306697s 5 DceTaskManager:Yield(0x10ab0a0, 0x10964b0)
+280.025306697s 5 DceRrTaskScheduler:Enqueue(0x10a7c10, 0x10964b0)
+280.025306697s 5 DceTaskManager:Schedule(0x10ab0a0)
+280.025306697s 5 DceRrTaskScheduler:PeekNext(): [DEBUG] next=0x7f7a6c015ea0
+280.025306697s 5 DceTaskManager:Schedule(): [DEBUG] Leaving 0x10964b0, delay +0ns next = 0x7f7a6c015ea0 entering main
+280.025306697s 5 DceTaskManager:Schedule(): [DEBUG] Delayed schedule +0ns
+280.025306697s 5 DceTaskManager:GarbageCollectDeadTasks(0x10ab0a0)
+280.025306697s 5 DceTaskManager:Schedule(0x10ab0a0)
+280.025306697s 5 DceRrTaskScheduler:PeekNext(): [DEBUG] next=0x7f7a6c015ea0
+280.025306697s 5 DceTaskManager:Schedule(): [DEBUG] Leaving main, entering 0x7f7a6c015ea0
+280.025306697s 5 DceRrTaskScheduler:DequeueNext(0x10a7c10)
+280.025306697s 5 DceTaskManager:Sleep(0x10ab0a0, 0x7f7a6c015ea0)
+280.025306697s 5 DceTaskManager:Schedule(0x10ab0a0)
+280.025306697s 5 DceRrTaskScheduler:PeekNext(): [DEBUG] next=0x10964b0
+280.025306697s 5 DceTaskManager:Schedule(): [DEBUG] Leaving 0x7f7a6c015ea0, delay +0ns next = 0x10964b0 entering main
+280.025306697s 5 DceTaskManager:Schedule(): [DEBUG] Delayed schedule +0ns
+280.025306697s 5 DceTaskManager:GarbageCollectDeadTasks(0x10ab0a0)
+280.025306697s 5 DceTaskManager:Schedule(0x10ab0a0)
+280.025306697s 5 DceRrTaskScheduler:PeekNext(): [DEBUG] next=0x10964b0
+280.025306697s 5 DceTaskManager:Schedule(): [DEBUG] Leaving main, entering 0x10964b0
+280.025306697s 5 DceRrTaskScheduler:DequeueNext(0x10a7c10)
+280.025306697s 5 DceTaskManager:Wakeup(0x10ab0a0, 0x7f7a6c015ea0, 2)

(repeats endlessly)

It seems from this logging that the same task is dequeued from the queue and rescheduled, over and over.

The ns-3 simulation event that occurs at this time is WifiNetDevice::ForwardUp(). This simulation is sending some pings over wifi, and the receipt and a forward up of one of these packets to the Linux stack is putting it into this loop.

If I change the fiber manager (Pthread or Ucontext) it makes no difference.

However, if I change the ns-3 random number stream (i.e., the random seed), the behavior disappears. For instance, this completes successfully for me:

$ NS_GLOBAL_VALUE="RngRun=2" ./waf --run dce-umip-nemo

By changing the run number, the random variable streams are different, and this affects the mobility (random walk) and probably also some Wi-Fi contention resolution, such that the problematic event is not triggered. However, it may be the case that if we let the simulation run for longer time, it would get similarly stuck at a later time. I also experienced some successful runs with RngRun=1 so the problem does not seem deterministic.

Parth debugged this a bit and provided the following information (I am quoting below):

Based on what I could understand from the log file gdb.txt, it's a polling operation which is being enqueued every time.

As far as I understand the dce implementation, the function void LinuxSocketImpl::Poll () should be invoked with the returned event that occurred. But it seems like it's never being called. Now, I'm not very sure if the linux socket factory is being used in this case, but I'm just guessing based on what I currently understand.

Also, I tried to output the m_active list's size every time the scheduler enqueues or dequeues something, and I could see that initially the queue size would go up to 12 or so, but later on it would remain to max 2, but majority of the times it would move back and forth from 1 (dequeue-next) to 0 (enqueue). This is probably because we keep adding the polling job to the queue and keep dequeuing it again and again.

With the condition of RngRun=2 it does work without a gdb attached, but runs infinitely under the gdb, similar to how RngRun=1 works without the gdb. It's actually pretty tough to determine if the current gdb execution could be trusted.

tomhenderson commented 3 years ago

I instrumented dce-umip-nemo.cc to output full logging for all DCE classes.

Using tcpdump, I found that on node 5 (which is the mobile node that is running the ping process), the last successful ping is at time 273:

$ tcpdump -r dce-umip-nemo-5-0.pcap -nn -tt

273.026578 IP6 2001:1:2:5:200:ff:fe00:8 > 2001:1:2:6::7: ICMP6, echo request, seq 254, length 64
273.026666 Acknowledgment RA:00:00:00:00:00:08
273.027276 IP6 2001:1:2:6::7 > 2001:1:2:5:200:ff:fe00:8: ICMP6, echo reply, seq 254, length 64
273.027292 Acknowledgment RA:00:00:00:00:00:07

In the above, the 'Acknowledgment' packets are Wi-Fi acks (which we can ignore).

Between time 273 and time 279, the mobile node's position is such that it only sends 'echo request' packets but doesn't receive a reply again until time 280.

At time 280, we can observe that the ICMPv6 echo reply is received, and this induces a loop so that the simulator doesn't move forward from this time (e.g. the Acknowledgment packet is never sent for this echo reply).

279.026578 IP6 2001:1:2:5:200:ff:fe00:8 > 2001:1:2:6::7: ICMP6, echo request, seq 260, length 64
279.026666 Acknowledgment RA:00:00:00:00:00:08
280.025306 IP6 2001:1:2:6::7 > 2001:1:2:5:200:ff:fe00:8: ICMP6, echo reply, seq 260, length 64

Now, if we look at the log output for these two cases (time 273 and time 280), we see:

+273.027276807s 5 DceTaskManager:Wakeup(0xe72170, 0xd97890, 2)
+273.027276807s 5 DceRrTaskScheduler:Enqueue(0xe71020, 0xd97890)
+273.027276807s 5 DceKernelSocketFdFactory:Recvmsg(): Received 64
+273.027276807s 5 Dce:dce_gettimeofday(0x1116f30, 5)
+273.027276807s 5 Dce:dce_inet_ntop(0x1116f30, 5, 10, 0x7fbc7d57ed48, 2001:1:2:6::7, 64)
+273.027276807s 5 DceStdio:dce_printf(0x1116f30, 5, "%d bytes from %s:")
+273.027276807s 5 DceStdio:dce_printf(0x1116f30, 5, " icmp_seq=%u")
+273.027276807s 5 DceStdio:dce_printf(0x1116f30, 5, " ttl=%d")
+273.027276807s 5 DceStdio:dce_printf(0x1116f30, 5, " time=%ld.%03ld ms")
+273.027276807s 5 DceStdio:dce_putchar(0x1116f30, 5,
)
+273.027276807s 5 DceStdio:dce_fflush(0x1116f30, 5, 0x111a5e0)
+273.027276807s 5 DceFd:dce_write(0x1116f30, 5, 1, 0x1118290, 63)
+273.027276807s 5 DceUnixFileFd:Write(0x111c060, 0x1116f30, 0x1118290, 63)
+273.027276807s 5 Dce:dce_gettimeofday(0x1116f30, 5)
+273.027276807s 5 DceFd:dce_recvmsg(0x1116f30, 5, 3, 0x7fbc7d57ed00, 0)
+273.027276807s 5 DceKernelSocketFdFactory:Recvmsg(0x1116f30, 5, 0x7fbc86c1b010, 0x7fbc7d57ed00, 0)
+273.027276807s 5 DceTaskManager:Sleep(0xe72170, 0x11ec3b0)
+273.027276807s 5 DceTaskManager:Schedule(0xe72170)
+273.027276807s 5 DceRrTaskScheduler:PeekNext(): next=0xd97890

and then at time 280:

+280.025306697s 5 DceKernelSocketFdFactory:Recvmsg(): Received 64
+280.025306697s 5 Dce:dce_gettimeofday(0x1116f30, 5)
+280.025306697s 5 Dce:dce_inet_ntop(0x1116f30, 5, 10, 0x7fbc7d57ed48, 2001:1:2:6::7, 64)
+280.025306697s 5 DceStdio:dce_printf(0x1116f30, 5, "%d bytes from %s:")
+280.025306697s 5 DceStdio:dce_printf(0x1116f30, 5, " icmp_seq=%u")
+280.025306697s 5 DceStdio:dce_printf(0x1116f30, 5, " ttl=%d")
+280.025306697s 5 DceStdio:dce_printf(0x1116f30, 5, " time=%ld ms")
+280.025306697s 5 DceStdio:dce_putchar(0x1116f30, 5,
)
+280.025306697s 5 DceStdio:dce_fflush(0x1116f30, 5, 0x111a5e0)
+280.025306697s 5 DceFd:dce_write(0x1116f30, 5, 1, 0x1118290, 61)
+280.025306697s 5 DceUnixFileFd:Write(0x111c060, 0x1116f30, 0x1118290, 61)
+280.025306697s 5 Dce:dce_gettimeofday(0x1116f30, 5)
+280.025306697s 5 Dce:dce_sched_yield(0x1116f30, 5)
+280.025306697s 5 DceTaskManager:Yield(0xe72170, 0x11ec3b0)
+280.025306697s 5 DceRrTaskScheduler:Enqueue(0xe71020, 0x11ec3b0)
+280.025306697s 5 DceTaskManager:Schedule(0xe72170)
+280.025306697s 5 DceRrTaskScheduler:PeekNext(): next=0xd97890
+280.025306697s 5 DceTaskManager:Schedule(): Leaving 0x11ec3b0, delay +0ns next = 0xd97890 entering main

Right after dce_gettimeofday(), there is a 'dce_sched_yield' call. This is the first instance in the logging trace that this appears. It then appears subsequently in the rest of the trace (until I control-C exited from the program).

By gdb breaking at this point, I traced this back to this code in iputils/ping_common.c (line 580).

                /* Technical part. Looks wicked. Could be dropped,
                 * if everyone used the newest kernel.
                 * Its purpose is:
                 * 1. Provide intervals less than resolution of scheduler.
                 *    Solution: spinning.
                 * 2. Avoid use of poll(), when recvmsg() can provide
                 *    timed waiting (SO_RCVTIMEO). */
                polling = 0;
                if ((options & (F_ADAPTIVE|F_FLOOD_POLL)) || next<SCHINT(interval)) {
                        int recv_expected = in_flight();

                        /* If we are here, recvmsg() is unable to wait for
                         * required timeout. */
                        if (1000*next <= 1000000/(int)HZ) {
                                /* Very short timeout... So, if we wait for
                                 * something, we sleep for MININTERVAL.
                                 * Otherwise, spin! */
                                if (recv_expected) {
                                        next = MININTERVAL;
                                } else {
                                        next = 0;
                                        /* When spinning, no reasons to poll.
                                         * Use nonblocking recvmsg() instead. */
                                        polling = MSG_DONTWAIT;
                                        /* But yield yet. */
                                        sched_yield();
                                }
                        }

In this case, 'next' is 1, and in_flight() is zero, so the code enters the branch that calls sched_yield(), which seems to send the program into a spin loop with nothing to move (simulation) time forward.

Setting next = MININTERVAL regardless of the value of recv_expected can avoid the spin loop. There is already a small iputils patch in ns-3-dce/utils, which can be extended to cover this.

tomhenderson commented 3 years ago

Patch to fix: https://github.com/direct-code-execution/ns-3-dce/pull/124

direct-code-execution / ns-3-dce-umip

dce-umip-nemo.cc does not return #2