Closed betzw closed 7 years ago
ARM Internal Ref: IOTCLT-1365
Hi @betzw looks like to be connectivity issue. Why is the Client showing a link local address there? That does not look to be correct IP. Could the border router be giving the Client wrong IP at some point?
Well, I'm not mesh-specialist, maybe @SeppoTakalo can comment on that.
Hi, [ERROR:] M2MInterface::SecureConnectionFailed is the error that I can spot from your traces. We have noticed that error when client is trying reconnection during long runs. One of the findings was client was leaking memory for every reconnection sequence on client side which cumulatively caused the TLS handshake to eventually fail as mbedTLS fails to allocate memory for further handshakes. We have a PR https://github.com/ARMmbed/mbed-client-c/pull/117 but we need to merge this to mbed-os for the fix to be in.
Would it be possible to have ping tests done for this kind of networks? From client logs you can see following lines:
[INFO][6Bo ]: Received MLE Advertisement from fe80::f2f1:f2f3:f4f5:f6f7
That means that it still talks to border router.
However, that is not guarantee that the ethernet connectivity still works.
So I would suggest that when starting up, both node and client, capture their public IPv6 addreses.
Then start ocassionally sending ping requests to both, just to see if they still are reachable from outside network.
When debugging these, its important to know whether:
Thanks @yogpan01,
so if I understand correctly, this behavior is a already known issue?!?
If yes, when do you think this PR will be merged into mbed-client-c
and in the end also in mbed-os
?
cc @screamerbg
@betzw - if you're willing to trial, you can replace the mbed-client-c in mbedOS with the actual originating mbed-client-c from Github - https://github.com/ARMmbed/mbed-client-c, delete that folder and replace it with the latest contents of that repository. You would be able to rule that issue out then.
Well, this is exactly what I am doing in these days (not the latest version on master
but v4.0.0
of mbed-client-c
).
Currently the counter has arrived at 18173
.
Hi @betzw - excellent, please keep us posted and thank You for the support!
Well, this morning the example client device unfortunately was disconnected again. The tail of the trace reported:
simulate button_click, new value of counter is 28352
[INFO][6Bo ]: Received MLE Advertisement from fe80::f2f1:f2f3:f4f5:f6f7
simulate button_click, new value of counter is 28353
simulate button_click, new value of counter is 28354
simulate button_click, new value of counter is 28355
simulate button_click, new value of counter is 28356
simulate button_click, new value of counter is 28357
[ERR ][buff]: buffer_get failed: alloc(zd)
[ERR ][nsif]: socket_sendto: error=-2
In contrast to previous runs, the client application at this point didn't continue to work at network connection level (i.e. no further trace output), while the green LED was still blinking. On the BR side I couldn't see anything suspicious in the trace. The latest counter values tell that the application ran for about 118 hours (~5 days).
Potential memory fragmentation? Or running out of memory due to memory leak? Do we know who is trying to allocate at "buffer_get failed: alloc(zd)" and how much? Valgrind would be our friend here.
@SeppoTakalo I think the nanostack or the socket adaption layer might be having the allocation failure here. Can you check where we are getting this logs from and will such a case result in the failure of the whole thread on which its running ?
buffet_get
is internal function of Nanostack.
So we have ran out of heap space allocated for Nanostack and its event-loop.
Its not always fatal. It became fatal if events starts dropping.
I wonder if there is possibility to ping that device during that test? Did it got dropped from network? Or was it just client abandoned the connection because it could not send?
Unfortunately, I have already switched off the boards. From what I remember, both traces (i.e. client & br) didn't report any advertisement information anymore after the allocation failure. Not sure if this helps?
After these changes to driver: https://github.com/ARMmbed/stm-spirit1-rf-driver/commit/ac7a4f4772228d856b9c88709ad54066f396c5d2 I would propose that this test is to be repeated as the driver was known to be unstable before that.
Correct! I am currently performing exactly this test and am close to closing this issue as - as of today - we have an up time of ~19 days :smile:!
Excellent, thanks!
Uptime >20 days => closing issue :smiley:
Yes!-)
I am running the following setup:
mbed-os-client-example
running on top of _NUCLEOF429ZI & Spirit1 Sub-GHz RFk64f-border-router
adapted to run on top of another _NUCLEOF429ZI & Spirit1 Sub-GHz RFBoth are configured for a 6LoWPAN setup with PSK security and INFO tracing enabled. This setup runs for quite long periods (more than one day) but it seems that prior or later something goes wrong and the client is not anymore listed on the Connector under "Connected Devices". Looking at the standard output traces of the two programs, I cannot see nothing suspicious on the border router side, while the client shows typically the following trace. Note, this is obviously only a piece of the trace somewhere in the middle of the execution.
Any idea what's going wrong and why?