ARMmbed / sockets

mbed sockets library abstraction layer

Repeatable crash in tcp_sent() #41

Closed · lws-team closed 8 years ago

lws-team commented 8 years ago

After the Nagle fix, the libwebsockets test server on mbed3 can be tested intensively.

It sends about 3000 packets/s containing an incrementing number in ASCII, using the websockets protocol. All is well until around packet 1054452, or 1121224, or 722709, etc., i.e. somewhat random depending on the run, after which it dies, seemingly always in sal-stack-lwip tcp_out.c, in tcp_output():

...
    /* do not queue empty segments on the unacked list */
    } else {
      tcp_seg_free(seg);
    }
    seg = pcb->unsent;   <<<<====== line 1022
  }
#if TCP_OVERSIZE
  if (pcb->unsent == NULL) {
    /* last unsent has been removed, reset unsent_oversize */
    pcb->unsent_oversize = 0;
  }
#endif /* TCP_OVERSIZE */
...

The backtrace is this:

#0  HardFault_Handler () at /home/agreen/projects/mbed/lws-test-server/yotta_modules/mbed-hal-k64f/source/bootstrap_gcc/startup_MK64F12.S:259
#1  <signal handler called>
#2  0x00007236 in tcp_output (pcb=pcb@entry=0x1fffb8c0 <memp_memory+424>)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/sal-stack-lwip/source/lwip/core/tcp_out.c:1022
#3  0x000115b6 in lwipv4_socket_send (socket=<optimized out>, buf=<optimized out>, len=<optimized out>)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/sal-stack-lwip/source/asynch_socket.c:617
#4  0x00003b1c in lws_ssl_capable_write_no_ssl (wsi=wsi@entry=0x2000b0f0, buf=buf@entry=0x2000b9aa "\201\006\067\062\062\067\060\071", len=len@entry=8)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/lws-plat-mbed3.cpp:123
#5  0x00002878 in lws_issue_raw (wsi=0x2000b0f0, buf=0x2000b9aa "\201\006\067\062\062\067\060\071", len=8)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/output.c:125
#6  0x00002b9c in lws_write (wsi=0x2000b0f0, buf=0x2000b9ac "722709", len=<optimized out>, protocol=<optimized out>)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/output.c:489
#7  0x00001e92 in callback_dumb_increment (context=0x20007a18, wsi=0x2000b0f0, reason=<optimized out>, user=0x2000b998, in=0x0 <__isr_vector>, len=0)
    at /home/agreen/projects/mbed/lws-test-server/source/app.cpp:194
#8  0x0001024c in user_callback_handle_rxflow (callback_function=0x1e51 <callback_dumb_increment(lws_context*, lws*, lws_callback_reasons, void*, void*, size_t)>, 
    context=context@entry=0x20007a18, wsi=wsi@entry=0x2000b0f0, reason=reason@entry=LWS_CALLBACK_SERVER_WRITEABLE, user=0x2000b998, in=in@entry=0x0 <__isr_vector>, 
    len=len@entry=0) at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/libwebsockets.c:657
#9  0x00004b52 in lws_calllback_as_writeable (wsi=0x2000b0f0, context=0x20007a18)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/service.c:41
#10 lws_handle_POLLOUT_event (context=context@entry=0x20007a18, wsi=wsi@entry=0x2000b0f0, pollfd=pollfd@entry=0x1fffa0c8)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/service.c:272
#11 0x00004cd6 in lws_service_fd (context=0x20007a18, pollfd=pollfd@entry=0x1fffa0c8)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/service.c:515
#12 0x000106be in lws_conn::onSent (this=<optimized out>, s=<optimized out>, len=<optimized out>)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/lws-plat-mbed3.cpp:286
#13 0x0000b440 in call (arg=<optimized out>, this=<optimized out>)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/core-util/core-util/FunctionPointerBase.h:80
#14 call (this=<optimized out>) at /home/agreen/projects/mbed/lws-test-server/yotta_modules/core-util/core-util/FunctionPointerBind.h:44
#15 operator() (this=<optimized out>) at /home/agreen/projects/mbed/lws-test-server/yotta_modules/core-util/core-util/FunctionPointerBind.h:104
#16 minar::SchedulerData::start (this=0x20007088) at /home/agreen/projects/mbed/lws-test-server/yotta_modules/minar/source/minar.cpp:471
#17 0x0000b468 in minar::Scheduler::start () at /home/agreen/projects/mbed/lws-test-server/yotta_modules/minar/source/minar.cpp:295
#18 0x00005766 in main () at /home/agreen/projects/mbed/lws-test-server/yotta_modules/mbed-drivers/source/retarget.cpp:458

It's certainly possible my side is doing something bad, but my part of this stack:

a) really tries to avoid malloc or new in its activities; there is one new per connection, but no new connections arrive after this test starts

b) regulates its packet sends by only ever having one in flight at a time on a connection; it does not send another until onSent() arrives (a sketch of this pattern follows below)

If it's not related to OOM or to packets piling up, then it seems like it might be a probability / stability issue in the networking stack?
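For reference, a minimal sketch (hypothetical names, not the actual lws code) of the send regulation in (b): at most one packet is in flight per connection, and the next one is only issued once the stack's onSent() completion fires.

    #include <stddef.h>

    /* hypothetical async send primitive standing in for the platform
     * socket send; completion is signalled back via conn_on_sent() */
    extern int async_send(void *sock, const void *buf, size_t len);

    struct conn {
        void *sock;
        int   in_flight;   /* 1 while a send awaits its completion */
    };

    /* returns 0 if the packet was issued, -1 if one is already in flight */
    int conn_send(struct conn *c, const void *buf, size_t len)
    {
        if (c->in_flight)
            return -1;             /* never more than one packet queued */
        c->in_flight = 1;
        return async_send(c->sock, buf, len);
    }

    /* called from the stack's onSent() completion callback */
    void conn_on_sent(struct conn *c)
    {
        c->in_flight = 0;          /* the next packet may now be sent */
    }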

bremoran commented 8 years ago

One additional piece of information: it looks like the physical link dropped somewhere in there. My link-state-change callback fired before the hardfault.

bremoran commented 8 years ago

I have a new traceback, which might provide more insight:

#0  _exit (status=1) at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/mbed-drivers/source/retarget.cpp:467
#1  0x00020fa8 in abort ()
#2  0x0001f43c in __assert_func ()
#3  0x0001654e in enet_hal_init_rxbds (rxBds=0x20006ff0, buffer=0x0 <__isr_vector>, isLastBd=false)
    at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/mbed-hal-ksdk-mcu/source/TARGET_KSDK_CODE/hal/enet/fsl_enet_hal.c:193
#4  0x0001229e in k64f_rxqueue_pbuf (k64f_enet=0x1fffd4fc <k64f_enetdata>, p=0x20001264 <ram_heap+14860>, bidx=8)
    at /Users/bremor01/dev/yotta/import/sal-driver-lwip-k64f-eth/source/k64f_emac.c:146
#5  0x00012806 in k64f_low_level_input (netif=0x1fffa2fc <eth>, idx=8) at /Users/bremor01/dev/yotta/import/sal-driver-lwip-k64f-eth/source/k64f_emac.c:438
#6  0x0001282e in k64f_enetif_input (netif=0x1fffa2fc <eth>, idx=8) at /Users/bremor01/dev/yotta/import/sal-driver-lwip-k64f-eth/source/k64f_emac.c:470
#7  0x00012c32 in enet_mac_rx_isr (enetIfPtr=0x1fffd590 <enetDevIf>) at /Users/bremor01/dev/yotta/import/sal-driver-lwip-k64f-eth/source/k64f_emac.c:790
#8  0x00012cfc in ENET_Receive_IRQHandler () at /Users/bremor01/dev/yotta/import/sal-driver-lwip-k64f-eth/source/k64f_emac.c:833
#9  <signal handler called>
#10 0x00011ed8 in sys_arch_unprotect (lev=31 '\037') at /Users/bremor01/dev/yotta/import/sal-stack-lwip/source/arch/sys_arch.c:87
#11 0x0000c212 in memp_free (type=MEMP_RAW_PCB, mem=0x0 <__isr_vector>) at /Users/bremor01/dev/yotta/import/sal-stack-lwip/source/lwip/core/memp.c:472
#12 0xd47ce3d2 in ?? ()
#13 0xd47ce3d2 in ?? ()
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

bremoran commented 8 years ago

It appears that I'm seeing consistent errors caused by this same race condition:

#0  HardFault_Handler () at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/mbed-hal-k64f/source/bootstrap_gcc/startup_MK64F12.S:259
#1  <signal handler called>
#2  0x0100a8c0 in ?? ()
#3  0x0000ba98 in udp_input (p=0x20000940 <ram_heap+12520>, inp=0x1fffa2fc <eth>) at /Users/bremor01/dev/yotta/import/sal-stack-lwip/source/lwip/core/udp.c:330
#4  0x0000e64c in ip_input (p=0x20000940 <ram_heap+12520>, inp=0x1fffa2fc <eth>) at /Users/bremor01/dev/yotta/import/sal-stack-lwip/source/lwip/core/ipv4/ip.c:499
#5  0x000085ee in ethernet_input (p=0x20000940 <ram_heap+12520>, netif=0x1fffa2fc <eth>) at /Users/bremor01/dev/yotta/import/sal-stack-lwip/source/lwip/netif/etharp.c:1282
#6  0x00012862 in k64f_enetif_input (netif=0x1fffa2fc <eth>, idx=9) at /Users/bremor01/dev/yotta/import/sal-driver-lwip-k64f-eth/source/k64f_emac.c:506
#7  0x00012c2e in enet_mac_rx_isr (enetIfPtr=0x1fffd590 <enetDevIf>) at /Users/bremor01/dev/yotta/import/sal-driver-lwip-k64f-eth/source/k64f_emac.c:810
#8  0x00012cf8 in ENET_Receive_IRQHandler () at /Users/bremor01/dev/yotta/import/sal-driver-lwip-k64f-eth/source/k64f_emac.c:853
#9  <signal handler called>
#10 0x00011e50 in sys_arch_unprotect (lev=31 '\037') at /Users/bremor01/dev/yotta/import/sal-stack-lwip/source/arch/sys_arch.c:87
#11 0x0000c176 in memp_free (type=MEMP_RAW_PCB, mem=0x0 <__isr_vector>) at /Users/bremor01/dev/yotta/import/sal-stack-lwip/source/lwip/core/memp.c:472
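Both backtraces show the ENET RX ISR re-entering the lwIP core while thread-context code is still inside memp_free(), and the lwIP core is not re-entrant. A minimal sketch (hypothetical names) of the deferred-processing mitigation mentioned later in this thread: the ISR only records that work is pending, and the actual lwIP input call runs from the scheduler in thread context.

    #include <stdbool.h>

    /* hypothetical helpers: scheduler_post() runs a function later in
     * thread context (the role minar plays in the first backtrace);
     * netif_process_rx() drains the RX ring and feeds packets to lwIP */
    extern void scheduler_post(void (*fn)(void));
    extern void netif_process_rx(void);

    static volatile bool rx_pending;

    static void rx_deferred(void)
    {
        rx_pending = false;        /* thread context: safe to enter lwIP */
        netif_process_rx();
    }

    /* interrupt context: never call into the lwIP core from here */
    void enet_rx_isr(void)
    {
        if (!rx_pending) {
            rx_pending = true;     /* coalesce bursts into one deferral */
            scheduler_post(rx_deferred);
        }
    }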

bremoran commented 8 years ago

I think I've made some progress: I managed to catch the board after it stopped sending packets but before it crashed. It was in another endless loop, in sys_check_timeouts. I checked the loop conditions and noticed that one of the control variables, diff, is an unsigned integer, and no check is performed before subtracting a value from it.

When I broke into the loop, the value of diff (used in the snippet below) was greater than 3621842728. Each time around the loop, tmptimeout->time contained only 100, which easily explains the lockup, and why it eventually leads to DHCP requests when execution escapes from the loop.

    do
    {
      had_one = 0;
      tmptimeout = next_timeout;
      if (tmptimeout->time <= diff) {
        /* timeout has expired */
        had_one = 1;
        timeouts_last_time = now;
        diff -= tmptimeout->time; // <========This is bad
        next_timeout = tmptimeout->next;
        handler = tmptimeout->h;
        arg = tmptimeout->arg;
#if LWIP_DEBUG_TIMERNAMES
        if (handler != NULL) {
          LWIP_DEBUGF(TIMERS_DEBUG, ("sct calling h=%s arg=%p\n",
            tmptimeout->handler_name, arg));
        }
#endif /* LWIP_DEBUG_TIMERNAMES */
        memp_free(MEMP_SYS_TIMEOUT, tmptimeout);
        if (handler != NULL) {
          handler(arg);
        }
      }
    /* repeat until all expired timers have been called */
    } while (had_one);
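To put a number on the lockup, a standalone arithmetic sketch using the values observed above: with diff above 3621842728 and tmptimeout->time at 100 on every pass, the loop takes the expired branch roughly diff/100 times before it can exit.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint32_t diff   = 3621842728u;  /* value seen in the debugger */
        uint32_t period = 100;          /* tmptimeout->time each pass */

        /* ~36 million timer callbacks fire before the loop escapes,
         * which from the outside looks like a hang */
        printf("expired-timer iterations: ~%lu\n",
               (unsigned long)(diff / period));
        return 0;
    }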

I'm running a test now to prove that diff overflows. If I succeed in demonstrating that, then the next step will be to replace the marked line above with:

    diff = (diff < tmptimeout->time ? 0 : diff - tmptimeout->time);

See here: https://github.com/ARMmbed/sal-stack-lwip/blob/405bf2b2e89e11f47b1f58faebd97fe3d0621300/source/lwip/core/timers.c#L377

bremoran commented 8 years ago

My previous assumption was incorrect. The cause of the endless loop in sys_check_timeouts is a timer wrap case:

At line 368, diff is initialized with 4290672578, because now = 12 and timeouts_last_time = 4294730.

This is because

    #define LWIP_U32_DIFF(a, b) (((a) >= (b)) ? ((a) - (b)) : (((a) + ((b) ^ 0xFFFFFFFF) + 1)))

But the timer feeding LwIP is a microsecond ticker, so it wraps early, leaving an enormous spurious elapsed time for the timeout loop to work off.

LwIP requires a millisecond ticker.
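As a standalone check of those numbers (the macro is copied from above; a counter driven by a 32-bit microsecond ticker actually wraps at 2^32/1000 ms, not at 2^32):

    #include <stdint.h>
    #include <stdio.h>

    #define LWIP_U32_DIFF(a, b) \
      (((a) >= (b)) ? ((a) - (b)) : (((a) + ((b) ^ 0xFFFFFFFF) + 1)))

    int main(void)
    {
        uint32_t now = 12;                     /* ms, just after the wrap */
        uint32_t timeouts_last_time = 4294730; /* ms, just before it      */

        /* the macro assumes wrap at 2^32, so it reports
         * 12 + (2^32 - 4294730) = 4290672578 instead of the
         * true elapsed 12 + 4294967 - 4294730 = 249 ms */
        printf("diff = %lu\n",
               (unsigned long)LWIP_U32_DIFF(now, timeouts_last_time));
        return 0;
    }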

lws-team commented 8 years ago

Wow... great catch.

bremoran commented 8 years ago

I have tried replacing the initialization of diff with:

    diff = now >= timeouts_last_time ? now - timeouts_last_time : now + ULONG_MAX/1000 - timeouts_last_time;
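Plugging the observed values into that expression (a standalone check; assumes the 32-bit unsigned long of the K64F, so ULONG_MAX/1000 == 4294967, the wrap period of the microsecond-driven millisecond counter):

    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        unsigned long now = 12;
        unsigned long timeouts_last_time = 4294730;

        /* the wrapped case adds the real modulus back in:
         * 12 + 4294967 - 4294730 = 249 ms, a sane elapsed time */
        unsigned long diff = now >= timeouts_last_time
            ? now - timeouts_last_time
            : now + ULONG_MAX / 1000 - timeouts_last_time;

        printf("diff = %lu ms\n", diff);
        return 0;
    }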

I'm also using all the deferred processing patches I mentioned earlier. I'll let you know when there are results.

bremoran commented 8 years ago

It has now been running for 13 hours without crashes. The fixes were: deferred packet processing, and a correction in sys_check_timeouts to handle the microsecond timer wrap instead of a millisecond timer wrap.

bremoran commented 8 years ago

After 18 hours, it died with:

11:00:01.813935 IP 192.168.0.110.80 > 192.168.0.102.52486: Flags [P.], seq 23656785:23656793, ack 3, win 2918, length 8
11:00:01.813960 IP 192.168.0.102.52486 > 192.168.0.110.80: Flags [.], ack 23656793, win 65535, length 0
11:00:01.888190 ARP, Request who-has 192.168.0.102 tell 192.168.0.110, length 46
11:00:01.888203 ARP, Reply 192.168.0.102 is-at 80:e6:50:28:37:3a, length 28
11:00:03.321955 ARP, Request who-has 192.168.0.102 tell 192.168.0.110, length 46
11:00:03.321971 ARP, Reply 192.168.0.102 is-at 80:e6:50:28:37:3a, length 28
11:00:06.394220 ARP, Request who-has 192.168.0.102 tell 192.168.0.110, length 46
11:00:06.394236 ARP, Reply 192.168.0.102 is-at 80:e6:50:28:37:3a, length 28
11:00:12.640908 ARP, Request who-has 192.168.0.102 tell 192.168.0.110, length 46
11:00:12.640921 ARP, Reply 192.168.0.102 is-at 80:e6:50:28:37:3a, length 28
11:00:25.134400 ARP, Request who-has 192.168.0.102 tell 192.168.0.110, length 46
11:00:25.134414 ARP, Reply 192.168.0.102 is-at 80:e6:50:28:37:3a, length 28
11:00:50.121668 ARP, Request who-has 192.168.0.102 tell 192.168.0.110, length 46
11:00:50.121681 ARP, Reply 192.168.0.102 is-at 80:e6:50:28:37:3a, length 28
11:01:20.946777 ARP, Request who-has 192.168.0.110 tell 192.168.0.106, length 28
11:01:21.561331 ARP, Request who-has 192.168.0.110 tell 192.168.0.106, length 28
11:01:22.585430 ARP, Request who-has 192.168.0.110 tell 192.168.0.106, length 28
11:01:40.096045 ARP, Request who-has 192.168.0.102 tell 192.168.0.110, length 46
11:01:40.096058 ARP, Reply 192.168.0.102 is-at 80:e6:50:28:37:3a, length 28
11:02:21.059412 ARP, Request who-has 192.168.0.110 tell 192.168.0.106, length 28
11:03:20.044870 ARP, Request who-has 192.168.0.102 tell 192.168.0.110, length 46
11:03:20.044883 ARP, Reply 192.168.0.102 is-at 80:e6:50:28:37:3a, length 28
11:04:21.079848 ARP, Request who-has 192.168.0.110 tell 192.168.0.106, length 28
11:04:21.591924 ARP, Request who-has 192.168.0.110 tell 192.168.0.106, length 28
11:06:39.839979 ARP, Request who-has 192.168.0.102 tell 192.168.0.110, length 46
11:06:39.839991 ARP, Reply 192.168.0.102 is-at 80:e6:50:28:37:3a, length 28
11:08:21.863438 ARP, Request who-has 192.168.0.110 (00:02:f7:f0:00:00) tell 192.168.0.102, length 28
11:08:21.865226 ARP, Reply 192.168.0.110 is-at 00:02:f7:f0:00:00, length 50
11:08:21.865242 IP 192.168.0.102.52486 > 192.168.0.110.80: Flags [F.], seq 3, ack 23656793, win 65535, length 0
11:08:21.879735 IP 192.168.0.110.80 > 192.168.0.102.52486: Flags [.], ack 4, win 2917, length 0
11:08:21.879774 IP 192.168.0.102.52486 > 192.168.0.110.80: Flags [.], ack 23656793, win 65535, length 0
11:09:59.747943 IP 192.168.0.110.80 > 192.168.0.102.52486: Flags [P.], seq 23656793:23656801, ack 4, win 2917, length 8
11:09:59.747984 IP 192.168.0.102.52486 > 192.168.0.110.80: Flags [R], seq 2183129202, win 0, length 0

The DUT reported the following message, but I do not think it was particularly useful, since I can see the reset packet in the tcpdump.

Socket Error: Connection reset by peer (21)