ARMmbed / sockets

mbed sockets library abstraction layer

Repeatable crash in tcp_sent() #41

Closed: lws-team closed this issue 8 years ago

lws-team commented 8 years ago

After the Nagle fix, the libwebsockets test server on mbed3 can be tested intensively.

The server sends about 3000 packets/s, each containing an incrementing number in ASCII, over the websocket protocol. All is well until around packet 1054452, or 1121224, or 722709, etc., i.e., fairly random depending on the run, after which it dies, seemingly always in sal-stack-lwip tcp_out.c, in tcp_output():

...
    /* do not queue empty segments on the unacked list */
    } else {
      tcp_seg_free(seg);
    }
    seg = pcb->unsent;   <<<<====== line 1022
  }
#if TCP_OVERSIZE
  if (pcb->unsent == NULL) {
    /* last unsent has been removed, reset unsent_oversize */
    pcb->unsent_oversize = 0;
  }
#endif /* TCP_OVERSIZE */
...

The backtrace is this:

#0  HardFault_Handler () at /home/agreen/projects/mbed/lws-test-server/yotta_modules/mbed-hal-k64f/source/bootstrap_gcc/startup_MK64F12.S:259
#1  <signal handler called>
#2  0x00007236 in tcp_output (pcb=pcb@entry=0x1fffb8c0 <memp_memory+424>)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/sal-stack-lwip/source/lwip/core/tcp_out.c:1022
#3  0x000115b6 in lwipv4_socket_send (socket=<optimized out>, buf=<optimized out>, len=<optimized out>)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/sal-stack-lwip/source/asynch_socket.c:617
#4  0x00003b1c in lws_ssl_capable_write_no_ssl (wsi=wsi@entry=0x2000b0f0, buf=buf@entry=0x2000b9aa "\201\006\067\062\062\067\060\071", len=len@entry=8)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/lws-plat-mbed3.cpp:123
#5  0x00002878 in lws_issue_raw (wsi=0x2000b0f0, buf=0x2000b9aa "\201\006\067\062\062\067\060\071", len=8)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/output.c:125
#6  0x00002b9c in lws_write (wsi=0x2000b0f0, buf=0x2000b9ac "722709", len=<optimized out>, protocol=<optimized out>)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/output.c:489
#7  0x00001e92 in callback_dumb_increment (context=0x20007a18, wsi=0x2000b0f0, reason=<optimized out>, user=0x2000b998, in=0x0 <__isr_vector>, len=0)
    at /home/agreen/projects/mbed/lws-test-server/source/app.cpp:194
#8  0x0001024c in user_callback_handle_rxflow (callback_function=0x1e51 <callback_dumb_increment(lws_context*, lws*, lws_callback_reasons, void*, void*, size_t)>, 
    context=context@entry=0x20007a18, wsi=wsi@entry=0x2000b0f0, reason=reason@entry=LWS_CALLBACK_SERVER_WRITEABLE, user=0x2000b998, in=in@entry=0x0 <__isr_vector>, 
    len=len@entry=0) at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/libwebsockets.c:657
#9  0x00004b52 in lws_calllback_as_writeable (wsi=0x2000b0f0, context=0x20007a18)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/service.c:41
#10 lws_handle_POLLOUT_event (context=context@entry=0x20007a18, wsi=wsi@entry=0x2000b0f0, pollfd=pollfd@entry=0x1fffa0c8)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/service.c:272
#11 0x00004cd6 in lws_service_fd (context=0x20007a18, pollfd=pollfd@entry=0x1fffa0c8)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/service.c:515
#12 0x000106be in lws_conn::onSent (this=<optimized out>, s=<optimized out>, len=<optimized out>)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/websockets/lib/lws-plat-mbed3.cpp:286
#13 0x0000b440 in call (arg=<optimized out>, this=<optimized out>)
    at /home/agreen/projects/mbed/lws-test-server/yotta_modules/core-util/core-util/FunctionPointerBase.h:80
#14 call (this=<optimized out>) at /home/agreen/projects/mbed/lws-test-server/yotta_modules/core-util/core-util/FunctionPointerBind.h:44
#15 operator() (this=<optimized out>) at /home/agreen/projects/mbed/lws-test-server/yotta_modules/core-util/core-util/FunctionPointerBind.h:104
#16 minar::SchedulerData::start (this=0x20007088) at /home/agreen/projects/mbed/lws-test-server/yotta_modules/minar/source/minar.cpp:471
#17 0x0000b468 in minar::Scheduler::start () at /home/agreen/projects/mbed/lws-test-server/yotta_modules/minar/source/minar.cpp:295
#18 0x00005766 in main () at /home/agreen/projects/mbed/lws-test-server/yotta_modules/mbed-drivers/source/retarget.cpp:458

It's certainly possible my side is doing something bad, but my part of this stack:

a) really tries to avoid malloc or new in its activities; there is one new per connection, but no new connections arrive after the test starts

b) regulates its packet sends by only ever having one in flight at a time on a connection; it does not send another until onSent() arrives (a sketch of this pattern follows below)

If it's not related to OOM or to packets piling up, then it seems like it might be a probabilistic stability issue in the networking stack?
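
For clarity, this is the one-in-flight regulation meant in (b), as a minimal runnable C sketch. The names (conn, try_send, on_sent) are hypothetical stand-ins, not the libwebsockets or mbed sockets code, and the "asynchronous" send completes inline so the sketch runs anywhere; on the device the completion arrives later as the onSent() event.

#include <stdbool.h>
#include <stdio.h>

struct conn {
    bool in_flight;        /* at most one packet outstanding */
    unsigned counter;      /* the incrementing ASCII payload */
};

static void on_sent(struct conn *c);

/* stand-in for the asynchronous socket send */
static void socket_send_async(struct conn *c)
{
    printf("send %u\n", c->counter++);
    on_sent(c);            /* completes inline here; async on the device */
}

static void try_send(struct conn *c)
{
    if (c->in_flight)
        return;            /* never two packets in flight at once */
    c->in_flight = true;
    socket_send_async(c);
}

static void on_sent(struct conn *c)
{
    c->in_flight = false;  /* previous packet handed off to the stack */
    if (c->counter < 5)
        try_send(c);       /* only now is the next send issued */
}

int main(void)
{
    struct conn c = { false, 0 };
    try_send(&c);
    return 0;
}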

bogdanm commented 8 years ago

Hi,

Are you running these tests on K64F?

Thanks, Bogdan

lws-team commented 8 years ago

Yes, it's a FRDM K64F using its onboard ethernet.

bogdanm commented 8 years ago

OK, thanks. When you say "the Nagle fix", I assume you mean https://github.com/ARMmbed/sal-stack-lwip/pull/35/files. This is a long shot, but what happens if you comment out lines 616-618 from https://github.com/ARMmbed/sal-stack-lwip/pull/35/files#diff-29ead84cc847c042f4e60fcd26691663R616 ?

ciarmcom commented 8 years ago

ARM Internal Ref: IOTSFW-1417

lws-team commented 8 years ago

Yes, I mean applying the patches that allow disabling Nagle.

If I kill the stanza you mention (the tcp_output call when Nagle is disabled), then it's back to a 500 ms delay between sending anything. I can't tell whether that affects the crash problem or not, since it would need me to wait ~500 ksec to find out :-O
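
To illustrate the mechanism behind that 500 ms, here is a toy model, not lwIP code: once a small segment is queued, it only goes out when something invokes the output step, and with the Nagle-disabled push removed the only remaining caller is the periodic TCP timer. The interval constant below is just the delay observed above, not a quoted lwIP value.

#include <stdbool.h>
#include <stdio.h>

#define TIMER_INTERVAL_MS 500   /* the observed flush cadence */

static bool queued;
static int queued_at_ms;

static void output(int now_ms, const char *who)
{
    if (queued) {
        printf("segment queued at %4d ms sent at %4d ms (by %s)\n",
               queued_at_ms, now_ms, who);
        queued = false;
    }
}

static void send_small_segment(int now_ms, bool push_when_nagle_off)
{
    queued = true;
    queued_at_ms = now_ms;
    if (push_when_nagle_off)
        output(now_ms, "send path");    /* the stanza under discussion */
}

int main(void)
{
    send_small_segment(10, true);       /* with the push: no added latency */
    send_small_segment(60, false);      /* without it: waits for the timer */
    output(TIMER_INTERVAL_MS, "timer"); /* periodic tick flushes it late */
    return 0;
}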

bogdanm commented 8 years ago

That's very interesting, thank you. This looks more and more like a memory corruption problem, one that will probably be quite difficult to debug :(

lws-team commented 8 years ago

Later I'll adapt the simple http test app from the other issues and confirm the crash still occurs without libwebsockets in the picture. Although if it's "that kind of bug", it's difficult to quickly draw firm conclusions from an apparent negative.

If it still occurs, it would also be good to check whether it occurs on a different toolchain; I'm on gcc 5.2 from Fedora.

lws-team commented 8 years ago

Please try to build

https://github.com/lws-team/mbed3-dumb-http-test

When it runs, it will use DHCP and then listen on port 80.

If you do, eg

$ echo "x" | nc 192.168.2.205 80

or whatever your K64F's IP is, then it should spam "aha12345" endlessly.

But it doesn't here. It might report something like

onRX: error 1 Socket Error: Null pointer (4)

or spam a few dozen and stop.

Can you see a problem in the test app? It's pretty simple and just has mbed3 core pieces as dependencies.

bogdanm commented 8 years ago

I can't see any obvious problem with that code. Do you get any disconnects while running it? I'll try to reproduce the problem on my side later.

bremoran commented 8 years ago

When I try to test this as described, I get a disconnect. This can easily be captured with tcpdump.

What you have shown here is correct behavior for an attempt to send on a disconnected socket. LwIP automatically frees its disconnected tcp_pcbs under some circumstances, so the disconnect handler zeros the impl and generates an onDisconnect event.

The problem is that there's a potential race condition in disconnect: if execution has passed the null check in a socket API before the disconnect happens, it's vulnerable (a sketch of the window follows below). This will require some rework of how disconnect is handled.
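
To make that window concrete, here is a minimal runnable sketch of the check-then-use race; the types are illustrative stand-ins, not the sockets-library source, and the on_disconnect() call simulates the disconnect event winning the race at the worst possible moment.

#include <stddef.h>
#include <stdio.h>

struct socket { void *impl; };    /* impl: stands in for the lwIP pcb */

static int do_stack_send(void *pcb, const void *buf, size_t len)
{
    (void)buf;
    printf("sending %zu bytes via pcb %p\n", len, pcb);
    return 0;
}

/* simulates the disconnect handler zeroing impl */
static void on_disconnect(struct socket *s) { s->impl = NULL; }

static int socket_send(struct socket *s, const void *buf, size_t len)
{
    if (s->impl == NULL)      /* (1) the null check passes */
        return -1;

    on_disconnect(s);         /* (2) disconnect fires inside the window */

    /* (3) impl is now NULL (or a freed pcb): the crash path */
    return do_stack_send(s->impl, buf, len);
}

int main(void)
{
    struct socket s = { (void *)&s };   /* any non-NULL stand-in pcb */
    char byte = 'x';
    return socket_send(&s, &byte, 1);
}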

lws-team commented 8 years ago

Well, I should have added -i 500s or whatever to nc, to keep it from timing out, in case it's the source of the disconnects.

However, when I try the patch above, I still get a bunch of outcomes, none of which are what I expected.

$ echo "x" | nc -i 500s 192.168.2.205 80
aha12345Ncat: Connection reset by peer.
$ echo "x" | nc -i 500s 192.168.2.205 80
aha12345Ncat: Connection reset by peer.
$ echo "x" | nc -i 500s 192.168.2.205 80
Ncat: Connection reset by peer.
$ echo "x" | nc -i 500s 192.168.2.205 80
^C

The last one just sits there; it's stuck in tcp.c, it seems:

#0  tcp_slowtmr () at /home/agreen/projects/mbed/test1/yotta_modules/sal-stack-lwip/source/lwip/core/tcp.c:952
#1  0x0000401c in tcp_tmr () at /home/agreen/projects/mbed/test1/yotta_modules/sal-stack-lwip/source/lwip/core/tcp.c:117
#2  0x00004342 in tcpip_tcp_timer (arg=<optimized out>) at /home/agreen/projects/mbed/test1/yotta_modules/sal-stack-lwip/source/lwip/core/timers.c:82
#3  0x000044ca in sys_check_timeouts () at /home/agreen/projects/mbed/test1/yotta_modules/sal-stack-lwip/source/lwip/core/timers.c:389
#4  <signal handler called>
#5  minar::SchedulerData::start (this=0x20006a50) at /home/agreen/projects/mbed/test1/yotta_modules/minar/source/minar.cpp:357
#6  0x00008200 in minar::Scheduler::start () at /home/agreen/projects/mbed/test1/yotta_modules/minar/source/minar.cpp:295
#7  0x00002462 in main () at /home/agreen/projects/mbed/test1/yotta_modules/mbed-drivers/source/retarget.cpp:458

The corresponding output at the K64F for the four tests above is:

Starting on port 80...
Socket Error: Connection aborted (20)
Socket Error: Connection aborted (20)
onRX: error 1
Socket Error: Connection aborted (20)
Socket Error: No data available (15)

Is this also what you're seeing? Or does it work there, and I need to worry about the toolchain?

bremoran commented 8 years ago

Here are the observations I have made:

  1. With the following command, it works as expected:

    echo "longer string" | nc -i 1 192.168.0.102 80 
  2. With the following command, I get an immediate disconnect as you've seen above.

    echo "longer string" | nc 192.168.0.102 80 

With the latter, the messages from the mbed:

Starting on port 80...
Server IP Address is 192.168.0.102:80
Socket Error: Connection aborted (20)
Socket Error: Connection aborted (20)
Socket Error: Socket not connected (8)
Socket Error: Connection aborted (20)

I would suggest verifying netcat's behaviour via tcpdump, since I'm seeing the following:

HOST->MBED [SYN]
MBED->HOST [SYN,ACK]
HOST->MBED [ACK]
HOST->MBED [PSH,ACK] (payload = "longer string")
HOST->MBED [FIN,ACK]
MBED->HOST [ACK]
MBED->HOST (payload = aha12345)

That looks to me like netcat isn't doing what I think it should.

bremoran commented 8 years ago

One comment: netcat closes the send side of the connection in both cases. However, it appears that the receive side stays open with the -i argument.
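
For reference, this half-close is presumably netcat doing shutdown(SHUT_WR) on stdin EOF: the FIN goes out on the send side while the receive side keeps reading, matching the [FIN,ACK] followed by continued server data in the capture above. A minimal POSIX sketch of that pattern (the address is the K64F's from this thread):

#include <stdio.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa = { 0 };

    sa.sin_family = AF_INET;
    sa.sin_port = htons(80);
    inet_pton(AF_INET, "192.168.0.102", &sa.sin_addr);   /* the K64F */
    if (fd < 0 || connect(fd, (struct sockaddr *)&sa, sizeof sa) < 0)
        return 1;

    write(fd, "x\n", 2);       /* deliver the payload */
    shutdown(fd, SHUT_WR);     /* FIN the send side only */

    /* receive side still open: keep consuming the server's spam */
    char buf[64];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        fwrite(buf, 1, (size_t)n, stdout);

    close(fd);
    return 0;
}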

lws-team commented 8 years ago

Sorry, it's my fault: netcat needs some extra magic to stay up after it has delivered what it was writing:

$ cat <(echo x) - | nc -i 500s 192.168.2.205 80

lws-team commented 8 years ago

Aha, after a few minutes it crashed:

(gdb) bt
#0  HardFault_Handler () at /home/agreen/projects/mbed/test1/yotta_modules/mbed-hal-k64f/source/bootstrap_gcc/startup_MK64F12.S:259
#1  <signal handler called>
#2  0x00004776 in tcp_output (pcb=pcb@entry=0x1fffb1c0 <memp_memory+762>)
    at /home/agreen/projects/mbed/test1/yotta_modules/sal-stack-lwip/source/lwip/core/tcp_out.c:1022
#3  0x0000daea in lwipv4_socket_send (socket=<optimized out>, buf=<optimized out>, len=<optimized out>)
    at /home/agreen/projects/mbed/test1/yotta_modules/sal-stack-lwip/source/asynch_socket.c:626
#4  0x00001c6c in connection::send_some (this=0x200085e8) at /home/agreen/projects/mbed/test1/source/app.cpp:29
#5  0x0000cf12 in connection::onSent (this=<optimized out>, s=<optimized out>, len=<optimized out>) at /home/agreen/projects/mbed/test1/source/app.cpp:62
#6  0x000081d8 in call (arg=<optimized out>, this=<optimized out>) at /home/agreen/projects/mbed/test1/yotta_modules/core-util/core-util/FunctionPointerBase.h:80
#7  call (this=<optimized out>) at /home/agreen/projects/mbed/test1/yotta_modules/core-util/core-util/FunctionPointerBind.h:44
#8  operator() (this=<optimized out>) at /home/agreen/projects/mbed/test1/yotta_modules/core-util/core-util/FunctionPointerBind.h:104
#9  minar::SchedulerData::start (this=0x20006a50) at /home/agreen/projects/mbed/test1/yotta_modules/minar/source/minar.cpp:471
#10 0x00008200 in minar::Scheduler::start () at /home/agreen/projects/mbed/test1/yotta_modules/minar/source/minar.cpp:295
#11 0x00002462 in main () at /home/agreen/projects/mbed/test1/yotta_modules/mbed-drivers/source/retarget.cpp:458

I added > /dev/null so it could spew packets quickly, and watched it with tcpdump.

bremoran commented 8 years ago

Maybe our builds of netcat are different. When I launch netcat with the arguments you provide, netcat doesn't send anything. If I lower the interval to 1, it does send.

I ran the following command for about an hour:

cat <(echo x) - | nc -i 1 192.168.0.102 80 > /dev/null

I was still able to capture packets after an hour.

lws-team commented 8 years ago

Thanks... can you send me your .bin file? I will test it with everything else the same here.

bremoran commented 8 years ago

test1.bin.zip

lws-team commented 8 years ago

Your binary also stopped sending packets here, after ~7 minutes... the .bin isn't enough to get a backtrace, but it's completely consistent with the behaviour of my builds.

I guess this problem requires specific network conditions to reproduce, then... the K64F is plugged into an ethernet switch that also has a very fast x86_64 box connected, and that box is the peer for the tests.

My K64F being physically flaky doesn't seem likely, because where it fails is highly specific and repeatable.

Given where it fails, can we add some diagnostic patch to try to uncover the mechanism, or is there some other idea?

lws-team commented 8 years ago

I retried it (also with your binary) and got this after ~30 s:

Starting on port 80...
Server IP Address is 192.168.2.205:80
Socket Error: Memory allocation failed (7)

lws-team commented 8 years ago

... and I left the tcpdump going, so I captured what happened just before it died.

(Edit: to be clear, this is another run, where it seems to crash on the K64F with no log on serial.)

04:14:30.962815 IP 192.168.2.233.54895 > 192.168.2.205.80: Flags [.], ack 702609, win 29200, length 0
04:14:30.963327 IP 192.168.2.205.80 > 192.168.2.233.54895: Flags [P.], seq 702609:702617, ack 3, win 2918, length 8: HTTP
04:14:30.963349 IP 192.168.2.233.54895 > 192.168.2.205.80: Flags [.], ack 702617, win 29200, length 0
04:14:30.963866 IP 192.168.2.205.80 > 192.168.2.233.54895: Flags [P.], seq 702617:702625, ack 3, win 2918, length 8: HTTP
04:14:30.963900 IP 192.168.2.233.54895 > 192.168.2.205.80: Flags [.], ack 702625, win 29200, length 0
04:14:30.964416 IP 192.168.2.205.80 > 192.168.2.233.54895: Flags [P.], seq 702625:702633, ack 3, win 2918, length 8: HTTP
04:14:30.964449 IP 192.168.2.233.54895 > 192.168.2.205.80: Flags [.], ack 702633, win 29200, length 0
04:14:30.964962 IP 192.168.2.205.80 > 192.168.2.233.54895: Flags [P.], seq 702633:702641, ack 3, win 2918, length 8: HTTP
04:14:30.964978 IP 192.168.2.233.54895 > 192.168.2.205.80: Flags [.], ack 702641, win 29200, length 0
04:14:30.965490 IP 192.168.2.205.80 > 192.168.2.233.54895: Flags [P.], seq 702641:702649, ack 3, win 2918, length 8: HTTP
04:14:30.965531 IP 192.168.2.233.54895 > 192.168.2.205.80: Flags [.], ack 702649, win 29200, length 0
04:14:30.966046 IP 192.168.2.205.80 > 192.168.2.233.54895: Flags [P.], seq 702649:702657, ack 3, win 2918, length 8: HTTP
04:14:30.966085 IP 192.168.2.233.54895 > 192.168.2.205.80: Flags [.], ack 702657, win 29200, length 0
04:14:30.966581 IP 192.168.2.205.80 > 192.168.2.233.54895: Flags [P.], seq 702657:702665, ack 3, win 2918, length 8: HTTP
04:14:30.966586 IP 192.168.2.205.80 > 192.168.2.233.54895: Flags [P.], seq 702657:702665, ack 3, win 2918, length 8: HTTP
04:14:30.966592 IP 192.168.2.233.54895 > 192.168.2.205.80: Flags [.], ack 702665, win 29200, length 0

It seems it repeated the last sent packet within 5 µs...

lws-team commented 8 years ago

Again with your .bin, I retried the same test but with a different peer: an x86_64 laptop on the same 192.168.2.x network, but attached over WLAN via an AP. Unlike the other machine on an ethernet link, which can do ~3000 packets/sec, it manages around 350 packets/sec.

It dies after some minutes in exactly the same way, and again its "last words" are a duplicated packet.

05:04:41.617727 IP 192.168.2.205.80 > 192.168.2.213.34822: Flags [P.], seq 416401:416409, ack 3, win 2918, length 8: HTTP
05:04:41.617800 IP 192.168.2.213.34822 > 192.168.2.205.80: Flags [.], ack 416409, win 29200, length 0
05:04:41.619601 IP 192.168.2.205.80 > 192.168.2.213.34822: Flags [P.], seq 416409:416417, ack 3, win 2918, length 8: HTTP
05:04:41.619675 IP 192.168.2.213.34822 > 192.168.2.205.80: Flags [.], ack 416417, win 29200, length 0
05:04:41.621685 IP 192.168.2.205.80 > 192.168.2.213.34822: Flags [P.], seq 416417:416425, ack 3, win 2918, length 8: HTTP
05:04:41.621757 IP 192.168.2.213.34822 > 192.168.2.205.80: Flags [.], ack 416425, win 29200, length 0
05:04:41.623478 IP 192.168.2.205.80 > 192.168.2.213.34822: Flags [P.], seq 416425:416433, ack 3, win 2918, length 8: HTTP
05:04:41.623549 IP 192.168.2.213.34822 > 192.168.2.205.80: Flags [.], ack 416433, win 29200, length 0
05:04:41.625359 IP 192.168.2.205.80 > 192.168.2.213.34822: Flags [P.], seq 416433:416441, ack 3, win 2918, length 8: HTTP
05:04:41.625424 IP 192.168.2.213.34822 > 192.168.2.205.80: Flags [.], ack 416441, win 29200, length 0
05:04:41.627513 IP 192.168.2.205.80 > 192.168.2.213.34822: Flags [P.], seq 416441:416449, ack 3, win 2918, length 8: HTTP
05:04:41.627541 IP 192.168.2.205.80 > 192.168.2.213.34822: Flags [P.], seq 416441:416449, ack 3, win 2918, length 8: HTTP
05:04:41.627559 IP 192.168.2.213.34822 > 192.168.2.205.80: Flags [.], ack 416449, win 29200, length 0

bremoran commented 8 years ago

I don't know if this will help, but this is the ELF that generated that bin.

test1.zip

lws-team commented 8 years ago

I normally use yt debug... do you know how to start gdb by hand with that ELF?

I guess we'll find it stopped in the same place, after sending the packet the second time.

bremoran commented 8 years ago

You should be able to start it by doing:

In one terminal

$ pyocd-gdbserver

In a second:

$ arm-none-eabi-gdb -ex "target remote localhost:3333" $elf

lws-team commented 8 years ago

It died here:

#0  HardFault_Handler () at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/mbed-hal-k64f/source/bootstrap_gcc/startup_MK64F12.S:259
#1  <signal handler called>
#2  0x0000d338 in tcp_output (pcb=0x1fffafb0 <memp_memory+760>) at /Users/bremor01/dev/yotta/import/sal-stack-lwip/source/lwip/core/tcp_out.c:1015
#3  0x00007152 in lwipv4_socket_send (socket=0x200073fc, buf=0x2186c, len=8) at /Users/bremor01/dev/yotta/import/sal-stack-lwip/source/asynch_socket.c:626
#4  0x00005696 in mbed::Sockets::v0::Socket::send (this=0x20007368, buf=0x2186c, len=8) at /Users/bremor01/dev/yotta/import/sockets/source/v0/Socket.cpp:206
#5  0x00001e42 in connection::send_some (this=0x20007358) at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/source/app.cpp:29
#6  0x00001f20 in connection::onSent (this=0x20007358, len=8) at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/source/app.cpp:62
#7  0x00002c8e in mbed::util::FunctionPointer2<void, mbed::Sockets::v0::Socket*, unsigned short>::membercaller<connection> (object=0x20007358, member=0x2002f0e4, 
    arg=0x2002f0f8) at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/core-util/core-util/FunctionPointer.h:321
#8  0x000036c2 in mbed::util::FunctionPointerBase<void>::call (this=0x2002f0d8, arg=0x2002f0f8)
    at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/core-util/core-util/FunctionPointerBase.h:80
#9  0x00015366 in mbed::util::FunctionPointerBind<void>::call (this=0x2002f0d8)
    at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/core-util/core-util/FunctionPointerBind.h:44
#10 0x00014dc2 in mbed::util::FunctionPointerBind<void>::operator() (this=0x2002f0d8)
    at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/core-util/core-util/FunctionPointerBind.h:104
#11 0x00014962 in minar::SchedulerData::start (this=0x200068f8) at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/minar/source/minar.cpp:471
#12 0x0001460c in minar::Scheduler::start () at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/minar/source/minar.cpp:295
#13 0x00003ddc in main () at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/mbed-drivers/source/retarget.cpp:458

bremoran commented 8 years ago

I'm still having trouble reproducing this. When I run the command you suggested:

cat <(echo x) - | nc -i 500 192.168.0.2 80 > /dev/null

Wireshark shows only the session establishment. I don't see any data packets being sent.

lws-team commented 8 years ago

It seems nc has two different pedigrees with very different code.

http://superuser.com/questions/324812/versions-of-netcat

I'll try one of the alternatives in the next 20 mins.

lws-team commented 8 years ago

OK, on Fedora 'nc' is a symlink to 'ncat'.

Please try

cat <(echo x) - | ncat -i 500 192.168.0.2 80 > /dev/null

bremoran commented 8 years ago

I have managed to reproduce the halt condition in sockets, but not the crash. See here: https://github.com/ARMmbed/sal-stack-lwip/issues/37 I think the two are related.

As a workaround to test an idea, please try the following change:

In app.cpp, add the following lines:

extern "C" void sys_check_timeouts();
void app_start(int argc, char *argv[])
{
    minar::Scheduler::postCallback(sys_check_timeouts).period(minar::milliseconds(250));

In yotta_modules/sal-driver-lwip-k64f-eth/source/k64f_emac.c, comment out line 843:

        if (k64f_phy_state.connected == STATE_UNKNOWN) {
            k64f_phy_state.connected = 1;
            netif_set_link_up(k64f_enetdata.netif);
        }
        emac_timer_fired = 0;
        // sys_check_timeouts();
     }

I'm not able to test this right now; I'd be happy to test it later today. I believe it's causing some related problems, so it might fix the crash you're seeing.

lws-team commented 8 years ago

OK, it's started running.

lws-team commented 8 years ago

It died the same way 2 minutes later, with:

07:46:14.232396 IP 192.168.2.205.80 > 192.168.2.233.60253: Flags [P.], seq 2512129:2512137, ack 3, win 2918, length 8: HTTP
07:46:14.232404 IP 192.168.2.205.80 > 192.168.2.233.60253: Flags [P.], seq 2512129:2512137, ack 3, win 2918, length 8: HTTP
07:46:14.232409 IP 192.168.2.233.60253 > 192.168.2.205.80: Flags [.], ack 2512137, win 29200, length 0

bremoran commented 8 years ago

Thanks for checking. I'll continue looking into it.

lws-team commented 8 years ago

The characteristic of this crash problem is that it repeats a packet when it dies... it repeats it exactly, and it repeats it right after the first send.

The crash is probably fallout from the same packet-send action being run twice on the same struct. The actual problem is why and how it gets triggered to send the same packet twice; a sketch of the fallout half follows below.
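
One plausible shape for that fallout, consistent with what turns up later in this thread (pcb->unacked, and later the timer list's t->next, ending up self-linked): re-appending a node that is already the tail of a singly-linked list makes it point at itself, so a later traversal never exits (or walks into freed memory once the segment is acked and freed). A runnable demonstration with illustrative types, not lwIP's:

#include <stdio.h>

struct seg { int seqno; struct seg *next; };

static void append(struct seg **head, struct seg *s)
{
    s->next = NULL;
    if (*head == NULL) {
        *head = s;
        return;
    }
    struct seg *t = *head;
    while (t->next != NULL)
        t = t->next;
    t->next = s;             /* if t == s, this makes s->next == s */
}

int main(void)
{
    struct seg a = { 1, NULL };
    struct seg *unacked = NULL;

    append(&unacked, &a);    /* first send: fine */
    append(&unacked, &a);    /* duplicate send: a is both head and tail */

    printf("a.next == &a? %s\n", a.next == &a ? "yes (self-loop)" : "no");
    return 0;
}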

bremoran commented 8 years ago

Yes, this sounds plausible. It sounds like a race condition, but I'm not sure quite where that would be.

bremoran commented 8 years ago

This issue may also be relevant: https://github.com/ARMmbed/sal-driver-lwip-k64f-eth/issues/8

lws-team commented 8 years ago

OK, I'll try to understand it.

lws-team commented 8 years ago

Wuh, I'll leave that to you guys to implement... it's possible this is related, since it comes through here for ACK processing. In that case the race it's waiting to "win" is the exact timing of the ACK arriving vs. where the send processing is; it's credible.

bremoran commented 8 years ago

If you try out the PR I've just added, I suggest adding back the comment I mentioned above for sys_check_timeouts. I left it out of that PR, since it is a separate issue.

lws-team commented 8 years ago

I still have that change, but it just kills even ARP atm.

bremoran commented 8 years ago

I think that is probably due to EthernetInterface::init(), which is blocking. I tried static config instead, but there's something that's still broken.

bremoran commented 8 years ago

With deferred input packet processing, this crash appears to be fixed. I had to use static configuration of the IP address, since the current implementation of DHCP is blocking. I'm working on a nonblocking DHCP.

bremoran commented 8 years ago

I am seeing a related problem where the blinky stops, and it looks like sys_check_timeouts is in an endless loop.

lws-team commented 8 years ago

I started looking at this from the perspective of the symptom, and I have a bit of new info.

Where gdb lands is misleading; it reports the next instruction after the problem. In our case that's the end of some nested closing braces, so it's ambiguous.

In fact we are on the path to dying because on the last send, unlike any of the others, pcb->unacked is not empty... that's because we already sent this exact same segment, and it has correctly added itself to the pcb's unacked list.

   if (TCP_TCPLEN(seg) > 0) {
      seg->next = NULL;
      /* unacked list is empty? */
      if (pcb->unacked == NULL) {
        pcb->unacked = seg;
        useg = seg;
      /* unacked list is not empty? */
      } else {
<<<<------- we come here when we are just about to crash, because segment already processed

I confirmed I can work around this crash symptom (it's not a fix for the root cause) by checking whether it's our own segment that's already on the list, and skipping the unacked processing if so:

   if (TCP_TCPLEN(seg) > 0) {
      seg->next = NULL;
      /* unacked list is empty? */
      if (pcb->unacked == NULL) {
        pcb->unacked = seg;
        useg = seg;
      /* unacked list is not empty? */
      } else {
        if (pcb->unacked == seg) {
                printf("well, shit\r\n");
                goto more;
        }

...
more:
    seg = pcb->unsent;
  }

This doesn't deal with WHY the segment was resent, but it lets the test become (somewhat?) immune to the fallout from it:

Starting on port 80...
statuscb: lilinkcb: link up, if up
nup, if up
up
linkup 1, ifup 1
linkup 0, ifup 1
well, shit
well, shit
well, shit
well, shit

That's after running for 20 minutes, which is a world record. Maybe it's a clue that the resends come roughly once every 5 minutes.

However, after 22 minutes it died somewhere new:

#0  HardFault_Handler () at /home/agreen/projects/mbed/test1/yotta_modules/mbed-hal-k64f/source/bootstrap_gcc/startup_MK64F12.S:259
#1  <signal handler called>
#2  __REV16 (value=<error reading variable: Cannot access memory at address 0xe48c000e>)
    at /home/agreen/projects/mbed/test1/yotta_modules/cmsis-core/cmsis-core/core_cmInstr.h:428
#3  k64f_enetif_input (netif=0x1fffa2fc <eth>, idx=3) at /home/agreen/projects/mbed/test1/yotta_modules/sal-driver-lwip-k64f-eth/source/k64f_emac.c:480
#4  0x000078e2 in enet_mac_rx_isr (enetIfPtr=<optimized out>) at /home/agreen/projects/mbed/test1/yotta_modules/sal-driver-lwip-k64f-eth/source/k64f_emac.c:786
#5  0x000079f8 in ENET_Receive_IRQHandler () at /home/agreen/projects/mbed/test1/yotta_modules/sal-driver-lwip-k64f-eth/source/k64f_emac.c:829
#6  <signal handler called>
#7  minar::SchedulerData::start (this=0x20006a40) at /home/agreen/projects/mbed/test1/yotta_modules/minar/source/minar.cpp:357
#8  0x00008438 in minar::Scheduler::start () at /home/agreen/projects/mbed/test1/yotta_modules/minar/source/minar.cpp:295
#9  0x00002526 in main () at /home/agreen/projects/mbed/test1/yotta_modules/mbed-drivers/source/retarget.cpp:458

Since this is in the new code, I'll back out all the pending changes and try it again. However, it may just be the same problem appearing with a different symptom.

lws-team commented 8 years ago

Whoa... downloading the mbed3-dumb-http-test app from a few days ago and building from scratch builds, but it no longer works (I assume due to changes in stuff in the "public registry"). This is without any patches on top at all, just restarting with yt build on freshly cloned sources (which certainly worked a few days ago and have no changes).

It can't acquire DHCP; if I force it to use a static IP it takes it and responds to ARP, but it won't accept any connection.

lws-team commented 8 years ago

This is broken:

┣━ sockets 1.1.0
┣━ sal 1.1.0
┗━ sal-stack-lwip 1.1.0
  ┣━ sal-driver-lwip-k64f-eth 1.0.2 yotta_modules/sal-driver-lwip-k64f-eth
  ┗━ sal-iface-eth 1.0.1 yotta_modules/sal-iface-eth

Those are the oldest versions with support for setNagle().

The versions listed in my working tree

┗━ sockets 1.0.2
  ┗━ sal 1.0.2 yotta_modules/sal
    ┗━ sal-stack-lwip 1.0.4 yotta_modules/sal-stack-lwip
      ┣━ sal-driver-lwip-k64f-eth 1.0.2 yotta_modules/sal-driver-lwip-k64f-eth
      ┗━ sal-iface-eth 1.0.1 yotta_modules/sal-iface-eth

do not have the fixes needed for listening sockets to work.

So there's no tagged package set that works, as far as I can see, if you take the packages from cold. My old versions with the manual patches work, with static IP at least.

lws-team commented 8 years ago

I took the approach of backing out the last two patches on my working tree; these were bremoran's patches for serializing RX handling into the event loop.

It lasted 38 minutes (surviving 17 double sends) spamming the 8-byte packets, and then stopped; when I looked with gdb it was here:

sys_timeout (msecs=<optimized out>, handler=0x446d <...>, arg=0x0 <__isr_vector>)
    at /home/agreen/projects/mbed/test1/yotta_modules/sal-stack-lwip/source/lwip/core/timers.c:296
296       if (t->next == NULL || t->next->time > timeout->time) {
(gdb) bt
#0  sys_timeout (msecs=<optimized out>, handler=0x446d <...>, arg=0x0 <__isr_vector>)
    at /home/agreen/projects/mbed/test1/yotta_modules/sal-stack-lwip/source/lwip/core/timers.c:296
#1  0x0000451e in sys_check_timeouts () at /home/agreen/projects/mbed/test1/yotta_modules/sal-stack-lwip/source/lwip/core/timers.c:389
#2  <signal handler called>
#3  sys_timeout (msecs=<optimized out>, handler=0x4439 <...>, arg=0x0 <__isr_vector>)
    at /home/agreen/projects/mbed/test1/yotta_modules/sal-stack-lwip/source/lwip/core/timers.c:296
#4  0x0000451e in sys_check_timeouts () at /home/agreen/projects/mbed/test1/yotta_modules/sal-stack-lwip/source/lwip/core/timers.c:389
#5  0x00008264 in call (arg=<optimized out>, this=<optimized out>) at /home/agreen/projects/mbed/test1/yotta_modules/core-util/core-util/FunctionPointerBase.h:80
#6  call (this=<optimized out>) at /home/agreen/projects/mbed/test1/yotta_modules/core-util/core-util/FunctionPointerBind.h:44
#7  operator() (this=<optimized out>) at /home/agreen/projects/mbed/test1/yotta_modules/core-util/core-util/FunctionPointerBind.h:104
#8  minar::SchedulerData::start (this=0x20006a50) at /home/agreen/projects/mbed/test1/yotta_modules/minar/source/minar.cpp:471
#9  0x0000828c in minar::Scheduler::start () at /home/agreen/projects/mbed/test1/yotta_modules/minar/source/minar.cpp:295
#10 0x000024b6 in main () at /home/agreen/projects/mbed/test1/yotta_modules/mbed-drivers/source/retarget.cpp:458

It's infinitely looping in here:

    for(t = next_timeout; t != NULL; t = t->next) {
      timeout->time -= t->time;
      if (t->next == NULL || t->next->time > timeout->time) {
... this clause was false.....
      }
    }

t->next seems to have ended up pointing at t itself, currently 0x1fffb7c0.

In the backtrace, it has re-entered sys_timeout, maybe after an exception in the DNS timer.

I tried again this afternoon, after hacking the loop to this:

for(t = next_timeout; t != NULL && t->next != t; t = t->next) {

It stayed up nearly 4 hours before stopping, without obviously crashing; that session had dozens of hits on the "double send" workaround.

lws-team commented 8 years ago

I left it up last night; it kept going for just over 4 hours and then died in the same way. It hasn't crashed: afterwards it shows "Socket Error: Connection aborted (20)", but how it died on the network is a bit more involved:

...
01:33:17.469470 IP 192.168.2.205.80 > 192.168.2.233.34694: Flags [P.], seq 397437585:397437593, ack 3, win 2918, length 8: HTTP
01:33:17.469508 IP 192.168.2.233.34694 > 192.168.2.205.80: Flags [.], ack 397437593, win 29200, length 0
01:33:23.469326 IP 192.168.2.205.80 > 192.168.2.233.34694: Flags [P.], seq 397437585:397437593, ack 3, win 2918, length 8: HTTP
01:33:23.469366 IP 192.168.2.233.34694 > 192.168.2.205.80: Flags [.], ack 397437593, win 29200, length 0
01:33:35.469053 IP 192.168.2.205.80 > 192.168.2.233.34694: Flags [P.], seq 397437585:397437593, ack 3, win 2918, length 8: HTTP
01:33:35.469089 IP 192.168.2.233.34694 > 192.168.2.205.80: Flags [.], ack 397437593, win 29200, length 0
01:33:59.468484 IP 192.168.2.205.80 > 192.168.2.233.34694: Flags [P.], seq 397437585:397437593, ack 3, win 2918, length 8: HTTP
01:33:59.468528 IP 192.168.2.233.34694 > 192.168.2.205.80: Flags [.], ack 397437593, win 29200, length 0
01:34:04.487358 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:34:05.489358 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:34:06.491351 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:34:47.467363 IP 192.168.2.205.80 > 192.168.2.233.34694: Flags [P.], seq 397437585:397437593, ack 3, win 2918, length 8: HTTP
01:34:47.467412 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:34:48.469361 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:34:49.471357 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:36:23.465136 IP 192.168.2.205.80 > 192.168.2.233.34694: Flags [P.], seq 397437585:397437593, ack 3, win 2918, length 8: HTTP
01:36:23.465178 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:36:24.467361 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:36:25.469357 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:39:35.460639 IP 192.168.2.205.80 > 192.168.2.233.34694: Flags [P.], seq 397437585:397437593, ack 3, win 2918, length 8: HTTP
01:39:35.460681 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:39:36.463351 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:39:37.465360 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:42:47.456111 IP 192.168.2.205.80 > 192.168.2.233.34694: Flags [P.], seq 397437585:397437593, ack 3, win 2918, length 8: HTTP
01:42:47.456153 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:42:48.457349 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:42:49.459357 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:45:59.451608 IP 192.168.2.205.80 > 192.168.2.233.34694: Flags [P.], seq 397437585:397437593, ack 3, win 2918, length 8: HTTP
01:45:59.451649 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:46:00.453348 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:46:01.455357 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:49:11.447106 IP 192.168.2.205.80 > 192.168.2.233.34694: Flags [P.], seq 397437585:397437593, ack 3, win 2918, length 8: HTTP
01:49:11.447153 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:49:12.447365 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:49:13.449359 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:52:23.442610 IP 192.168.2.205.80 > 192.168.2.233.34694: Flags [P.], seq 397437585:397437593, ack 3, win 2918, length 8: HTTP
01:52:23.442650 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:52:24.443355 ARP, Request who-has 192.168.2.205 tell 192.168.2.233, length 28
01:52:24.443599 ARP, Reply 192.168.2.205 is-at 00:02:f7:f0:00:00, length 50
01:52:24.443609 IP 192.168.2.233.34694 > 192.168.2.205.80: Flags [.], ack 397437593, win 29200, length 0
01:52:24.443859 IP 192.168.2.205.80 > 192.168.2.233.34694: Flags [R.], seq 397437594, ack 3, win 2920, length 0
01:52:24.443886 IP 192.168.2.233.34694 > 192.168.2.205.80: Flags [.], ack 397437593, win 29200, length 0
01:52:24.444136 IP 192.168.2.205.80 > 192.168.2.233.34694: Flags [R.], seq 397437594, ack 3, win 2920, length 0

It believed its last packet did not get ACKed (it did), and the peer cannot get an ARP response from it in order to re-ACK when it sees the dupe. After a while the stalemate ends with the connection being terminated.

Basically it looks like the network stack on the K64F side stops receiving input.

This may be related to the problem bremoran found with locking on the K64F side over the weekend, but his fix dies quicker than this underlying problem occurs atm.

bremoran commented 8 years ago

I have a slightly different failure pattern now. This is with my serialization changes:

18:02:39.028308 IP 192.168.0.106.80 > 192.168.0.105.59059: Flags [P.], seq 2170169:2170177, ack 3, win 2918, length 8
18:02:39.028337 IP 192.168.0.105.59059 > 192.168.0.106.80: Flags [.], ack 2170177, win 65535, length 0
18:02:39.044275 IP 192.168.0.106.80 > 192.168.0.105.59059: Flags [P.], seq 2170177:2170185, ack 3, win 2918, length 8
18:02:39.044304 IP 192.168.0.105.59059 > 192.168.0.106.80: Flags [.], ack 2170185, win 65535, length 0
18:02:39.060504 IP 192.168.0.106.80 > 192.168.0.105.59059: Flags [P.], seq 2170185:2170193, ack 3, win 2918, length 8
18:02:39.060532 IP 192.168.0.105.59059 > 192.168.0.106.80: Flags [.], ack 2170193, win 65535, length 0
18:02:39.183972 IP 192.168.0.106.80 > 192.168.0.105.59059: Flags [P.], seq 2170193:2170201, ack 3, win 2918, length 8
18:02:39.183975 IP 192.168.0.106.80 > 192.168.0.105.59059: Flags [P.], seq 2170193:2170201, ack 3, win 2918, length 8
18:02:39.184017 IP 192.168.0.105.59059 > 192.168.0.106.80: Flags [.], ack 2170201, win 65535, length 0
18:02:39.184027 IP 192.168.0.105.59059 > 192.168.0.106.80: Flags [.], ack 2170201, win 65535, length 0
18:02:39.188220 IP 192.168.0.106.80 > 192.168.0.105.59059: Flags [P.], seq 2170193:2170201, ack 3, win 2918, length 8
18:02:39.188245 IP 192.168.0.105.59059 > 192.168.0.106.80: Flags [.], ack 2170201, win 65535, length 0
18:02:40.784329 ARP, Request who-has 192.168.0.1 tell 192.168.0.106, length 46
18:02:40.784677 ARP, Request who-has 192.168.0.1 tell 192.168.0.106, length 46
18:02:40.784841 ARP, Request who-has 192.168.0.1 tell 192.168.0.106, length 46
18:02:40.785131 ARP, Request who-has 192.168.0.1 tell 192.168.0.106, length 46

After a large number of ARP requests, I get this:

18:03:10.585766 IP 192.168.0.106.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:02:f7:f0:00:00, length 308
18:03:10.586354 IP 192.168.0.106.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:02:f7:f0:00:00, length 308
18:03:10.586883 IP 192.168.0.106.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:02:f7:f0:00:00, length 308
18:03:10.587446 IP 192.168.0.106.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:02:f7:f0:00:00, length 308
18:03:10.588216 IP 192.168.0.106.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:02:f7:f0:00:00, length 308
18:03:10.588855 IP 192.168.0.106.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:02:f7:f0:00:00, length 308
18:03:10.589401 IP 192.168.0.106.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:02:f7:f0:00:00, length 308
18:03:10.590063 IP 192.168.0.106.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:02:f7:f0:00:00, length 308
18:03:10.590763 IP 192.168.0.106.68 > 255.255.255.255.67: BOOTP/DHCP, Request from 00:02:f7:f0:00:00, length 308
18:03:10.591338 IP 192.168.0.106.68 > 0.0.0.0.67: BOOTP/DHCP, Request from 00:02:f7:f0:00:00, length 308
18:10:59.231676 ARP, Request who-has 192.168.0.106 (00:02:f7:f0:00:00) tell 192.168.0.105, length 28
18:10:59.902023 ARP, Request who-has 192.168.0.106 (00:02:f7:f0:00:00) tell 192.168.0.105, length 28
18:11:01.007683 ARP, Request who-has 192.168.0.106 (00:02:f7:f0:00:00) tell 192.168.0.105, length 28
18:11:02.914816 ARP, Request who-has 192.168.0.106 (00:02:f7:f0:00:00) tell 192.168.0.105, length 28
18:11:06.397176 ARP, Request who-has 192.168.0.106 (00:02:f7:f0:00:00) tell 192.168.0.105, length 28
18:11:13.212464 ARP, Request who-has 192.168.0.106 tell 192.168.0.105, length 28
18:11:23.867181 ARP, Request who-has 192.168.0.106 tell 192.168.0.105, length 28
18:11:44.955585 ARP, Request who-has 192.168.0.106 tell 192.168.0.105, length 28
18:12:06.118853 ARP, Request who-has 192.168.0.106 tell 192.168.0.105, length 28
18:12:27.214227 ARP, Request who-has 192.168.0.106 tell 192.168.0.105, length 28
18:13:10.376958 ARP, Request who-has 192.168.0.106 tell 192.168.0.105, length 28
18:13:31.871648 ARP, Request who-has 192.168.0.106 tell 192.168.0.105, length 28
18:13:53.006823 ARP, Request who-has 192.168.0.106 tell 192.168.0.105, length 28

At the end of this, the k64f backtrace is:

#0  HardFault_Handler ()
    at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/mbed-hal-k64f/source/bootstrap_gcc/startup_MK64F12.S:259
#1  <signal handler called>
#2  0x0000a716 in dhcp_create_msg (netif=0x1fffa2fc <eth>, dhcp=0x20003a20 <ram_heap+25032>, message_type=1 '\001')
    at /Users/bremor01/dev/yotta/import/sal-stack-lwip/source/lwip/core/dhcp.c:1692
#3  0x00009856 in dhcp_discover (netif=0x1fffa2fc <eth>)
    at /Users/bremor01/dev/yotta/import/sal-stack-lwip/source/lwip/core/dhcp.c:877
#4  0x00009388 in dhcp_timeout (netif=0x1fffa2fc <eth>)
    at /Users/bremor01/dev/yotta/import/sal-stack-lwip/source/lwip/core/dhcp.c:405
#5  0x0000934e in dhcp_fine_tmr () at /Users/bremor01/dev/yotta/import/sal-stack-lwip/source/lwip/core/dhcp.c:381
#6  0x00007a78 in dhcp_timer_fine (arg=0x0 <__isr_vector>)
    at /Users/bremor01/dev/yotta/import/sal-stack-lwip/source/lwip/core/timers.c:167
#7  0x00007c92 in sys_check_timeouts () at /Users/bremor01/dev/yotta/import/sal-stack-lwip/source/lwip/core/timers.c:389
#8  0x000028b8 in mbed::util::FunctionPointer0<void>::staticcaller (object=0x7c25 <sys_check_timeouts>, member=0x2002f098, 
    arg=0x2002f0ac)
    at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/core-util/core-util/FunctionPointer.h:113
#9  0x00003896 in mbed::util::FunctionPointerBase<void>::call (this=0x2002f08c, arg=0x2002f0ac)
    at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/core-util/core-util/FunctionPointerBase.h:80
#10 0x000157ce in mbed::util::FunctionPointerBind<void>::call (this=0x2002f08c)
    at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/core-util/core-util/FunctionPointerBind.h:44
#11 0x0001522a in mbed::util::FunctionPointerBind<void>::operator() (this=0x2002f08c)
    at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/core-util/core-util/FunctionPointerBind.h:104
#12 0x00014dca in minar::SchedulerData::start (this=0x200068b0)
    at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/minar/source/minar.cpp:471
#13 0x00014a74 in minar::Scheduler::start ()
    at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/minar/source/minar.cpp:295
#14 0x00003fb0 in main ()
    at /Users/bremor01/dev/experiments/mbed3-dumb-http-test/yotta_modules/mbed-drivers/source/retarget.cpp:458

bremoran commented 8 years ago

I believe you are correct that something in the LwIP stack is breaking and we're no longer getting input.

There may still be some code that calls the network stack in a re-entrant way, but I think I've now eliminated the bulk of it. It seems we have subtly different network conditions, which change the timing of the failure.