contiki-os / contiki

The official git repository for Contiki, the open source OS for the Internet of Things
http://www.contiki-os.org/

Issue with IPv6 ND for non-RPL nodes. #2018

Open tscheffl opened 7 years ago

tscheffl commented 7 years ago

I am experiencing a bug in the current Contiki IPv6 ND code, where I can no longer reach a Contiki minimal-net process running on Linux (RPL not enabled).

The NA sent from Contiki in response to the NS from Linux goes to the wrong link-layer address and appears to be truncated.

As you can see in the screenshot:

[Screenshot: bildschirmfoto 2016-12-22 um 14 10 08]

The following settings have been applied to `platform/minimal-net/contiki-conf.h`:

```c
#undef UIP_CONF_IPV6_RPL
#define UIP_CONF_IPV6_RPL 0
#define RPL_BORDER_ROUTER 0
#define HARD_CODED_ADDRESS "aaaa::10"
```

I have traced the bug back to the following commit: https://github.com/contiki-os/contiki/commit/b3ea124958d9f3a63aa8e77a89919cb95f6625b5

However, this may not be the root cause of the bug, because I also see the MAC address truncation in versions prior to this commit. Previously, though, the asking host simply sent another NS request and Contiki answered correctly on the second try. Now it consistently sends the NA to the wrong MAC address. :-(

joakimeriksson commented 7 years ago

Hello. Yes, I think this might be due to the fact that we have not used Contiki a lot with Ethernet as the main interface, so things might have regressed. Are you using the plain minimal-net config as of upstream today? I have recently worked a bit with mixing 6-byte and 8-byte link-layer addresses in the tables, and that gave a lot of interesting experience... So I think we should be able to fix this too - somehow.

tscheffl commented 7 years ago

Hi Joakim,

I tried it yesterday (Dec 22) with a fresh clone from the GitHub repository and then slowly progressed backwards by checking out older and older versions until hitting the commit where the issue first manifested itself: https://github.com/contiki-os/contiki/commit/b3ea124958d9f3a63aa8e77a89919cb95f6625b5

It seems to me, however, that the issue can be seen before then. Even when ND seems to be working, Contiki sends the very first NA to the wrong MAC address, but answers correctly on the second try.

It would be very good if this could be fixed. I use the minimal-net config very often for service development and in classes to get my students up to speed. The turn-around is just so much better...

Regards, Thomas

joakimeriksson commented 7 years ago

Ok, good - then I should easily be able to reproduce the issue just by running minimal-net on a Linux machine.

tscheffl commented 7 years ago

Yes, it should be easy to replicate.

I just compiled the hello-world example and tried to ping the address. I could not get through on either the link-local or the global address.

joakimeriksson commented 7 years ago

Ok, just replicated the issue - it is actually storing a 2-byte address, so this has been broken since we moved neighbor MAC addresses into the nbr-table, as it only stores either 2 or 8 bytes. I will try to see if I can figure out a "trivial" fix - but now we know what the issue likely is.

tscheffl commented 7 years ago

A short while ago I had a student building a 6LoWPAN border router based on Contiki. He extended the routing to make multi-interface routing possible. I think he simply stored the EUI-64 extended address in the nbr-table. However, this would require 'knowing' the type of the interface and translating addresses as necessary...
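For reference, the EUI-64 approach described above could be sketched with the standard IPv6-over-Ethernet mapping (RFC 2464): insert `FF:FE` in the middle of the MAC-48 and flip the universal/local bit. This is a minimal illustration, not code from the Contiki sources - the function names are made up:

```c
#include <stdint.h>
#include <string.h>

/* Sketch: map a 6-byte Ethernet MAC to the 8-byte EUI-64 form that an
 * 8-byte-per-entry neighbor table could store, following the standard
 * IPv6-over-Ethernet mapping (RFC 2464): insert 0xFF 0xFE in the middle
 * and flip the universal/local bit of the first byte. */
static void
mac48_to_eui64(const uint8_t mac[6], uint8_t eui64[8])
{
  eui64[0] = mac[0] ^ 0x02;   /* flip the U/L bit */
  eui64[1] = mac[1];
  eui64[2] = mac[2];
  eui64[3] = 0xFF;            /* fixed filler bytes */
  eui64[4] = 0xFE;
  eui64[5] = mac[3];
  eui64[6] = mac[4];
  eui64[7] = mac[5];
}

/* Reverse direction: recover the MAC when sending on the Ethernet
 * interface; only valid if the entry really came from a MAC-48. */
static int
eui64_to_mac48(const uint8_t eui64[8], uint8_t mac[6])
{
  if(eui64[3] != 0xFF || eui64[4] != 0xFE) {
    return -1;                /* not a MAC-derived EUI-64 */
  }
  mac[0] = eui64[0] ^ 0x02;
  mac[1] = eui64[1];
  mac[2] = eui64[2];
  mac[3] = eui64[5];
  mac[4] = eui64[6];
  mac[5] = eui64[7];
  return 0;
}
```

The interface-type knowledge mentioned above would then amount to deciding, per neighbor, whether `eui64_to_mac48` is applicable before transmitting.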

You can find the paper here: https://www.researchgate.net/publication/269319668_Development_of_a_Contiki_border_router_for_the_interconnection_of_6LoWPAN_and_Ethernet - and I may also have the code lying around somewhere, if that helps.

joakimeriksson commented 7 years ago

Thanks for the reference - enabling multiple interfaces is on the long-term to-do list! Will take a look.

BTW: Just got some pings through on minimal-net, so NS/NA now works:

```
user@instant-contiki:~$ ping6 fd00::ff:fe00:10
PING fd00::ff:fe00:10(fd00::ff:fe00:10) 56 data bytes
64 bytes from fd00::ff:fe00:10: icmp_seq=1 ttl=64 time=0.070 ms
64 bytes from fd00::ff:fe00:10: icmp_seq=2 ttl=64 time=0.069 ms
64 bytes from fd00::ff:fe00:10: icmp_seq=3 ttl=64 time=0.091 ms
64 bytes from fd00::ff:fe00:10: icmp_seq=4 ttl=64 time=0.091 ms
```

So there is hope. Will not be a PR before Christmas - but soon!

joakimeriksson commented 7 years ago

I did a PR now - so if @tscheffl could test that it solves the issue, that would be great! See #2028.

tscheffl commented 7 years ago

@joakimeriksson I tested the code and it solves the reported problem. The NA from Contiki now goes out to the correct Ethernet address on minimal-net.

There are however two things I noticed. The first is very minor:

  1. The comment on line 57 in core/net/linkaddr.c should be changed as follows: `#endif /*LINKADDR_SIZE == 6*/`

  2. I also tested the code with a Contiki MQTT client that initiates an outgoing TCP connection and posts regular messages to a broker (one MQTT message every 30 seconds). The program starts fine, but after a while Contiki seems to time out the broker's entry in the Neighbor Cache and is unable to acquire a new one. NS/NA messages are sent; however, it seems as if Contiki is unable to use the advertised address to refresh the Neighbor Cache and send messages (see the attached screenshot). I can work around the issue by constantly pinging the Contiki client, in order to keep the entry in the Neighbor Cache fresh. As soon as I stop pinging, the described behaviour sets in.

It seems to me that problem 2 is an altogether different issue. Do you recommend posting it in the issue tracker?

[Screenshot: bildschirmfoto 2016-12-29 um 22 16 01]

joakimeriksson commented 7 years ago

Ok, that is probably due to a short neighbor lifetime and possibly due to other configuration issues. The default is 30 seconds if the platform does not set anything. Also, the neighbor timeout is no longer updated when an NS/NA is received, since it was assumed that any unicast would update it. But many of these get sent to the solicited-node multicast address, which will not update the cache. Not 100% sure it is due to that, but it could be. I would rather keep the update in both places - I will try adding that to my PR as well and notify you when I have fixed it.
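If the 30-second default mentioned above is the trigger, one quick diagnostic is to lengthen the ND6 reachable time in the platform configuration. `UIP_CONF_ND6_REACHABLE_TIME` (milliseconds) is the override hook that uip-nd6.h checks; the exact placement in a project's conf header is an assumption here:

```c
/* In project-conf.h (or the platform's contiki-conf.h):
 * lengthen the ND6 reachable time from the 30 s default to 5 minutes,
 * purely as a diagnostic - if the stall disappears, the short neighbor
 * lifetime is confirmed as the trigger. */
#undef UIP_CONF_ND6_REACHABLE_TIME
#define UIP_CONF_ND6_REACHABLE_TIME (5 * 60 * 1000)  /* in ms */
```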

tscheffl commented 7 years ago

It is not the timeout that is bothering me. It is expected that NC entries time out. When this happens, a new NS/NA message exchange should take place.

However, Contiki seems not to update the NC with the 'new' information from the received NA and therefore cannot continue sending data over the TCP connection. There clearly is some error here. It keeps asking for aaaa::1.

The NA for aaaa::1 goes to a unicast address. The only thing slightly odd is that both Contiki and Linux set the Router and Override flags in their NAs...

joakimeriksson commented 7 years ago

Yes, and looking at the code I am not really sure - from what I see there are unicast NAs coming in, and they should update the entry. But is the flow correct in Wireshark? E.g., does Contiki ask for aaaa::1?

I guess I have to set up the same thing myself - a ping from the outside will send off a unicast message and trigger an update, while a TCP transmission from the Contiki node will in fact send an NS instead of the TCP packet and then get the NA back (and add the entry). Contiki will drop the first TCP packet, since the NS will overwrite the TCP packet in the buffer. But that should only cause a retransmission and not a complete loss of data.

tscheffl commented 7 years ago

Yes, aaaa::1 is correct. It is the tap0 interface and was happy sending packets before (earlier in the trace).

I am currently doing some more tests, and all of a sudden it looks as if the network code is fine. It must be something in the application that blocks and keeps the network code from updating the NC. It has been running now for almost 15 minutes, and I see TCP messages and NS/NA exchanges happen all in good order.

I will have a closer look at the application tomorrow. It is a modified MQTT app, based on the code from TI that some people had issues with before (https://github.com/contiki-os/contiki/issues/1858).

joakimeriksson commented 7 years ago

Ok, that would be good. I did not see any cases where the NC should not be updated on incoming unicast messages, so that would explain it, I guess. Keep us informed on the results of your tests!

tscheffl commented 7 years ago

@joakimeriksson

I did some hunting around in the MQTT code from TI and could not detect anything wrong. In order to narrow the problem down, I decided to write a simple TCP client using the new TCP API (the MQTT code is using it as well).

After some lengthy testing, it now looks very much like the TCP API is causing the problems. I can reproduce the behavior that I see with the MQTT code. I can start a connection, but later on the entry in the NC expires and things simply come to a halt.

You can find my code here, if you want to give it a try: https://github.com/tscheffl/Contiki-Examples/tree/master/Examples/TCP-client

I must say that I find the new API very poorly documented, and I am not very sure that I have used it correctly. It also seems that very few projects make outgoing TCP connections...
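For readers hitting the same documentation gap: a minimal outgoing connection with the tcp-socket API, as declared in core/net/ip/tcp-socket.h, looks roughly like the sketch below. The buffer sizes, process name, and the broker address aaaa::1 (taken from the trace discussed above) are illustrative assumptions, and this fragment only compiles inside a Contiki build:

```c
#include "contiki.h"
#include "net/ip/tcp-socket.h"
#include "net/ip/uip.h"

#define BUFSIZE 128
static struct tcp_socket sock;
static uint8_t inbuf[BUFSIZE], outbuf[BUFSIZE];

/* Called with received data; return 0 to consume everything,
   or a positive count to keep that many trailing bytes buffered. */
static int
input(struct tcp_socket *s, void *ptr, const uint8_t *data, int len)
{
  return 0;
}

/* Called on connection state changes. */
static void
event(struct tcp_socket *s, void *ptr, tcp_socket_event_t ev)
{
  if(ev == TCP_SOCKET_CONNECTED) {
    tcp_socket_send_str(s, "hello\n");
  } else if(ev == TCP_SOCKET_CLOSED ||
            ev == TCP_SOCKET_TIMEDOUT ||
            ev == TCP_SOCKET_ABORTED) {
    /* Reconnect logic would go here. */
  }
}

PROCESS(tcp_client_process, "TCP client sketch");
AUTOSTART_PROCESSES(&tcp_client_process);

PROCESS_THREAD(tcp_client_process, ev, data)
{
  static uip_ipaddr_t broker;
  PROCESS_BEGIN();

  uip_ip6addr(&broker, 0xaaaa, 0, 0, 0, 0, 0, 0, 1); /* aaaa::1 */
  tcp_socket_register(&sock, NULL,
                      inbuf, sizeof(inbuf),
                      outbuf, sizeof(outbuf),
                      input, event);
  tcp_socket_connect(&sock, &broker, 1883);

  while(1) {
    PROCESS_WAIT_EVENT();
  }
  PROCESS_END();
}
```

If the NC-expiry stall reproduces with a skeleton this small, it would point at the stack rather than the MQTT application logic.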

joakimeriksson commented 7 years ago

Have you seen anything that gives you an idea why the entry is not updated? I find it very odd - I will have to try your client soon, so that I can see what happens.

tscheffl commented 7 years ago

I looked briefly into contiki/core/net/ip/tcp-socket.c. It seems to be built around tcpip_event, but I did not see anything suspicious.

Next in line would be contiki/core/net/ip/tcpip.c. There appears to be some code there to handle IPv6 ND. I have not had the time to make sense of it. Sorry.

I stumbled upon the following message from the Zephyr Project indicating trouble with client TCP connections. They appear to be using the Contiki IP stack: https://lists.zephyrproject.org/archives/list/devel@lists.zephyrproject.org/message/WM4VFMD7SWNPTK6N2J2ST7GIODGIJ7XQ/

PureEngineering commented 7 years ago

We are running into the same issue on our end. Everything works for about 10 minutes, then the node disconnects. Is there anything we can look at to solve the issue?