ARMmbed / mbed-client

DEPRECATED: C++ API for mbed-client
https://cloud.mbed.com/docs/current
Apache License 2.0

Long runs fail #392

Closed betzw closed 7 years ago

betzw commented 7 years ago

I am running the following setup:

Both are configured for a 6LoWPAN setup with PSK security and INFO tracing enabled. This setup runs for quite long periods (more than one day), but sooner or later something goes wrong and the client is no longer listed on the Connector under "Connected Devices". Looking at the standard output traces of the two programs, I cannot see anything suspicious on the border router side, while the client typically shows the following trace. Note that this is obviously only a piece of the trace from somewhere in the middle of the execution.

Any idea what's going wrong and why?

ciarmcom commented 7 years ago

ARM Internal Ref: IOTCLT-1365

teetak01 commented 7 years ago

Hi @betzw, this looks like a connectivity issue. Why is the client showing a link-local address there? That does not look like a correct IP. Could the border router be giving the client a wrong IP at some point?

teetak01 commented 7 years ago

Well, I'm not a mesh specialist; maybe @SeppoTakalo can comment on that.

yogpan01 commented 7 years ago

Hi, [ERROR:] M2MInterface::SecureConnectionFailed is the error that I can spot in your traces. We have noticed that error when the client attempts reconnection during long runs. One of the findings was that the client was leaking memory on every reconnection sequence, which cumulatively caused the TLS handshake to eventually fail, as mbedTLS could no longer allocate memory for further handshakes. We have a PR https://github.com/ARMmbed/mbed-client-c/pull/117, but we need to merge it into mbed-os for the fix to take effect.
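
If you want to check on your own build whether memory is really leaking across reconnections, one option (a minimal sketch only, assuming mbed OS 5 with heap statistics enabled; the helper below is illustrative and not part of the fix) is to dump the mbed OS heap statistics around every reconnection attempt:

    // Heap-monitoring sketch, assuming MBED_HEAP_STATS_ENABLED is defined
    // (e.g. via the "macros" section of mbed_app.json).
    #include "mbed.h"
    #include "mbed_stats.h"

    // Call this e.g. from the client's error/reconnect callback.
    static void print_heap_stats(const char *tag)
    {
        mbed_stats_heap_t heap;
        mbed_stats_heap_get(&heap);
        printf("[HEAP][%s] current=%lu max=%lu allocs=%lu failed=%lu\r\n",
               tag,
               (unsigned long)heap.current_size,
               (unsigned long)heap.max_size,
               (unsigned long)heap.alloc_cnt,
               (unsigned long)heap.alloc_fail_cnt);
    }

A steadily growing current_size across reconnections would match the leak described above, while a non-zero alloc_fail_cnt would match mbedTLS running out of memory during the handshake.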

SeppoTakalo commented 7 years ago

Would it be possible to have ping tests done for this kind of network? From the client logs you can see the following line:

[INFO][6Bo ]: Received MLE Advertisement from fe80::f2f1:f2f3:f4f5:f6f7

That means that the client still talks to the border router.

However, that is no guarantee that the Ethernet connectivity still works.

So I would suggest that, when starting up, both the node and the client capture their public IPv6 addresses.

Then occasionally send ping requests to both (as sketched after the list below), just to see whether they are still reachable from the outside network.

When debugging these, it's important to know whether:

  1. Gateway providing IPv6 access is reachable from global IPv6 network.
  2. Border router is reachable from global IPv6 network.
  3. 6LoWPAN/Thread nodes are reachable from global IPv6 network.
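
As a rough illustration of the kind of check I mean (a sketch only, meant to run on a Linux host on the outside network; the addresses are placeholders to be replaced with the IPv6 addresses captured at start-up):

    // Periodic reachability logger for a Linux host outside the mesh.
    // Placeholder addresses; substitute the public IPv6 addresses of the
    // gateway, the border router and the 6LoWPAN node.
    #include <cstdio>
    #include <cstdlib>
    #include <ctime>
    #include <unistd.h>

    int main()
    {
        const char *targets[] = {
            "2001:db8::1",   // gateway providing IPv6 access (placeholder)
            "2001:db8::2",   // border router (placeholder)
            "2001:db8::3",   // 6LoWPAN/Thread node (placeholder)
        };

        for (;;) {
            for (const char *addr : targets) {
                char cmd[128];
                // "ping6" instead of "ping -6" on older distributions
                std::snprintf(cmd, sizeof(cmd),
                              "ping -6 -c 1 -W 5 %s > /dev/null 2>&1", addr);
                int rc = std::system(cmd);
                std::printf("%ld %s %s\n", (long)std::time(nullptr), addr,
                            rc == 0 ? "reachable" : "UNREACHABLE");
            }
            std::fflush(stdout);
            sleep(600);   // repeat every 10 minutes
        }
    }

Correlating the first UNREACHABLE entries with the client trace would show whether the device dropped off the network before or after the client-side errors appear.
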
betzw commented 7 years ago

Thanks @yogpan01, so if I understand correctly, this behavior is an already known issue?!? If so, when do you think this PR will be merged into mbed-client-c and, eventually, also into mbed-os?

betzw commented 7 years ago

cc @screamerbg

JanneKiiskila commented 7 years ago

@betzw - if you're willing to trial it, you can replace the mbed-client-c in mbed OS with the originating mbed-client-c from GitHub - https://github.com/ARMmbed/mbed-client-c: delete that folder and replace it with the latest contents of that repository. You would then be able to rule that issue out.

betzw commented 7 years ago

Well, this is exactly what I am doing these days (not the latest version on master, but v4.0.0 of mbed-client-c). Currently the counter has reached 18173.

JanneKiiskila commented 7 years ago

Hi @betzw - excellent, please keep us posted, and thank you for the support!

betzw commented 7 years ago

Well, this morning the example client device was unfortunately disconnected again. The tail of the trace reported:

simulate button_click, new value of counter is 28352
[INFO][6Bo ]: Received MLE Advertisement from fe80::f2f1:f2f3:f4f5:f6f7
simulate button_click, new value of counter is 28353
simulate button_click, new value of counter is 28354
simulate button_click, new value of counter is 28355
simulate button_click, new value of counter is 28356
simulate button_click, new value of counter is 28357
[ERR ][buff]: buffer_get failed: alloc(zd)
[ERR ][nsif]: socket_sendto: error=-2

In contrast to previous runs, the client application at this point did not continue to work at the network connection level (i.e. no further trace output), although the green LED was still blinking. On the border router side I couldn't see anything suspicious in the trace. The latest counter values indicate that the application ran for about 118 hours (~5 days).

JanneKiiskila commented 7 years ago

Potential memory fragmentation? Or running out of memory due to a memory leak? Do we know who is trying to allocate at "buffer_get failed: alloc(zd)" and how much? Valgrind would be our friend here.
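
Valgrind won't run on the target itself; one on-device alternative (just a suggestion, and note it only traces the standard C heap, not any separate Nanostack heap) is mbed OS's malloc/free tracing hook, which reports the size and caller of every heap operation on the serial console:

    // Allocation-tracing sketch, assuming mbed OS 5 built with the
    // MBED_MEM_TRACING_ENABLED macro (e.g. added via mbed_app.json).
    #include "mbed.h"
    #include "mbed_mem_trace.h"

    int main()
    {
        // The default callback prints one "#m:"/"#f:"/... record per
        // malloc/realloc/calloc/free, including size and caller address,
        // so the allocation pattern right before the failure is visible.
        mbed_mem_trace_set_callback(mbed_mem_trace_default_callback);

        // ... start the client application as usual ...
    }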

yogpan01 commented 7 years ago

@SeppoTakalo I think Nanostack or the socket adaptation layer might be hitting the allocation failure here. Can you check where we are getting these logs from, and whether such a case will result in the failure of the whole thread on which it is running?

SeppoTakalo commented 7 years ago

buffer_get is an internal function of Nanostack, so we have run out of the heap space allocated for Nanostack and its event loop.

It's not always fatal. It becomes fatal if events start dropping.

I wonder if there is a possibility to ping that device during this kind of test? Did it get dropped from the network, or did the client just abandon the connection because it could not send?
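
For reference (my own note; the exact key name and limits should be checked against the mbed OS version in use), the size of the heap that the mesh API reserves for Nanostack is configurable from the application, so one experiment would be to grow it in mbed_app.json and see whether the buffer_get failures disappear or merely occur later:

    {
        "target_overrides": {
            "*": {
                "mbed-mesh-api.heap-size": 65535
            }
        }
    }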

betzw commented 7 years ago

Unfortunately, I have already switched off the boards. From what I remember, neither trace (i.e. client & BR) reported any advertisement information anymore after the allocation failure. Not sure if this helps?

SeppoTakalo commented 7 years ago

After these changes to the driver: https://github.com/ARMmbed/stm-spirit1-rf-driver/commit/ac7a4f4772228d856b9c88709ad54066f396c5d2, I would propose that this test be repeated, as the driver was known to be unstable before that.

betzw commented 7 years ago

Correct! I am currently performing exactly this test and am close to closing this issue, as - as of today - we have an uptime of ~19 days :smile:!

JanneKiiskila commented 7 years ago

Excellent, thanks!

betzw commented 7 years ago

Uptime >20 days => closing issue :smiley:

JanneKiiskila commented 7 years ago

Yes!-)