contiki-os / contiki

The official git repository for Contiki, the open source OS for the Internet of Things
http://www.contiki-os.org/

New NBRs are not considered (even if table full of non-reachable NBRs) #1380

Open pablocorbalan opened 8 years ago

pablocorbalan commented 8 years ago

When the Neighbor Cache table is full and new neighbours are detected, these new NBRs are dropped because they cannot be added to the NBR table. This happened even if the state of the entries available in the table is not REACHABLE, for instance, DELAY or STALE. With the default configuration with NDP disabled, this could happen (in a high network density scenario) as there is no mechanism in the whole RPL code that may remove a neighbor entry from the cache table. Also, neighbours that are neither parents nor children may sit in the table in a state other than REACHABLE and take the space of a possible decent parent or child. Notice that a neighbor with a state that is not REACHABLE may actually be reachable, but the entry is not updated.

When this happens, we could check the state of the current neighbors and perhaps replace the most outdated neighbor with the new neighbor detected. I do not know whether this would violate IPv6 NDP or 6LoWPAN-ND, but I think we could consider something like this. On the other hand, this could lead to the removal of neighbor entries that for some reason are not refreshed even when used (this could happen with the preferred RPL parent if its DIO has been suppressed and we do not consider link-layer ACKs valid to refresh neighbor cache entries).

Also, a probing mechanism (similar to the RPL probing) to do NUD could be considered, but instead of doing this perhaps we should just use the NDP protocol available.

I could implement the mechanism to replace an outdated neighbor with a new one, but I would prefer to hear some comments first, in case I am missing something or I am mistaken.

Notice that similar issues are discussed and explained in #1063.

joakimeriksson commented 8 years ago

Quick question: do you know if this is due to locking or to the default deletion policy of the nbr-table module? From what I remember, I have usually seen the opposite problem: perfectly valid neighbors being thrown out when new, possibly less interesting neighbors were detected. I know that the routing module locks any neighbor that is a nexthop, and that the preferred parent is locked as well. So I guess this may be a case where all the neighbors are either the default parent or a nexthop? Should they then be thrown out? (If they get into a bad state, that should be fixed separately, by better interaction between NDP and RPL, as you mention in the other issue #1379.)

simonduq commented 8 years ago

> This happened even if the state of the entries available in the table is not REACHABLE, for instance, DELAY or STALE. With the default configuration with NDP disabled, this could happen

Right. I think DELAY and STALE don't really make sense with NDP disabled, as no NS/NA will be sent to go back to REACHABLE. I would be in favor of always being REACHABLE when NDP is off (neighbors will be removed according to the nbr-table replacement policy anyway).

> there is no mechanism in the whole RPL code that may remove a neighbor entry from the cache table.

Right, RPL does not remove neighbors from rpl_parent_t. But deletion will happen anyway, when the table is full, by the nbr-table module. It currently picks the least recently used, unlocked entry. https://github.com/contiki-os/contiki/pull/1213 defines configurable policies.
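To make the replacement policy described above concrete, here is a minimal sketch of "evict the least recently used, unlocked entry". It is a toy model, not the nbr-table code: the `nbr_entry` layout, `find_slot`, `nbr_add` and `TABLE_SIZE` are all hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_SIZE 4

/* Hypothetical entry layout; not Contiki's real nbr-table structures. */
struct nbr_entry {
  int      in_use;     /* slot currently holds a neighbor */
  int      locked;     /* e.g. preferred parent or route next hop */
  uint32_t last_used;  /* logical timestamp of the last use */
  uint16_t id;         /* stand-in for the link-layer address */
};

/* Return a free slot, else the LRU unlocked slot, else -1. */
static int
find_slot(const struct nbr_entry *table, size_t n)
{
  int victim = -1;
  size_t i;

  assert(table != NULL);
  for(i = 0; i < n; i++) {
    if(!table[i].in_use) {
      return (int)i;              /* free slot: no eviction needed */
    }
    if(table[i].locked) {
      continue;                   /* locked entries are never evicted */
    }
    if(victim < 0 || table[i].last_used < table[victim].last_used) {
      victim = (int)i;            /* oldest unlocked entry seen so far */
    }
  }
  return victim;
}

/* Store a new neighbor, evicting the LRU unlocked entry if the table is full.
 * Returns the slot index, or -1 if every entry is locked. */
static int
nbr_add(struct nbr_entry *table, size_t n, uint16_t id, uint32_t now)
{
  int slot = find_slot(table, n);

  if(slot < 0) {
    return -1;                    /* only here is the new neighbor dropped */
  }
  table[slot].in_use = 1;
  table[slot].locked = 0;
  table[slot].last_used = now;
  table[slot].id = id;
  return slot;
}
```

In this model a new neighbor is only dropped when every entry is locked, which matches the case raised earlier in the thread where all neighbors are either the preferred parent or a next hop.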

> Notice that a neighbor with a state that is not REACHABLE may actually be reachable, but the entry is not updated.

Exactly, that's why I think we shouldn't use DELAY and STALE when ND is disabled anyway. We should instead simply leave it to RPL. Nodes leaving the network will become unused and therefore unlocked by RPL eventually. This happens when a parent stops responding and we switch away, or when a child does not send a DAO and its route expires.

> When this happens, we could check the state of the current neighbors and perhaps replace the most outdated neighbor with the new neighbor detected.

Right, I also feel RPL should more actively select which neighbors can be deleted. https://github.com/contiki-os/contiki/pull/1213 enables this.

> Also, a probing mechanism (similar to the RPL probing) to do NUD could be considered, but instead of doing this perhaps we should just use the NDP protocol available.

AFAICT ND in Contiki includes NUD (probing via NS/NA)

pablocorbalan commented 8 years ago

@joakimeriksson @simonduq I was mistaken: as you were saying, the deletion policy within the nbr-table module will try to find space (by removing a neighbor if the table is full) to store a new neighbor. I am sorry, I was not aware of this deletion policy and was mainly focused on when uip_ds6_nbr_rm is called. This explains the network behavior when the possible maximum number of neighbors is greater than the number we can store.

Regarding the usage of STALE and DELAY with NDP disabled, I agree that with the current configuration and functionality it perhaps does not make much sense to support these neighbor states. Also, functions like uip_ds6_neighbor_periodic do not do much either in the current implementation/configuration, and could perhaps be enabled only when NDP is enabled.

On the other hand, as I feel we all agree, it would be nicer if RPL interacted more with the neighbor cache management. A starting point would be to use a received DAO ACK as a clear indication of the reachability of a parent (this is compliant with both RFC 4861 & 6775). Then, after receiving the ACK, the child could (re)set its parent as REACHABLE, avoiding sending "extra" NS/NA messages with NDP enabled. As far as I know, this is neither included in the current code nor supported by #1213. If I am not wrong, this could be a great addition to #1213. Also, the RPL Probing mechanism could be updated to perform similar NUD functionality. After these changes, maybe the Neighbor Deletion Policy could be updated to take the neighbor state into account as well.

Also, if we keep using STALE and DELAY (with or without NDP), I think neither DIO nor DAO inputs should be used to mark a neighbor as REACHABLE: receiving them only verifies that a node is able to receive messages from a certain neighbor, not that the node is able to send anything to that neighbor. That would be (I think) compliant with the mentioned RFCs, but perhaps would increase the overhead of NDP.

What do you think of these proposed changes?

Thanks for the clarifications regarding the neighbor deletion policy and neighbor cache management. It is really helpful for me.

simonduq commented 8 years ago

I like your proposed additions to have RPL update the REACHABLE state. I also agree that the ability to receive from a nbr does not mean the nbr is reachable.

Maybe we could simply set the state to REACHABLE whenever an outgoing unicast gets ACKed? (independent of whether this is a RPL DAO-ACK or something else)

I'm wondering more and more if disabling ND by default (https://github.com/contiki-os/contiki/pull/1063) was a good move... what's your take on this?

joakimeriksson commented 8 years ago

I think that ND is ok - it will keep "probing" all neighbours that are worth having in the table. The only problem with ND is that we lose one packet each time we send the first NS to an IPv6 address that is not in REACHABLE state. But this should probably be fairly easy to fix. Other than that I do not see any very significant problems with having ND enabled.

simonduq commented 8 years ago

But the question is, what does ND give us that RPL+probing doesn't?

RPL takes care of selecting links that are usable. It does it with finer granularity than ND, using link metrics rather than binary reachability. It probes more efficiently because it knows which links are used in the topology, and it does proactive probing rather than reactive.

simonduq commented 8 years ago

One more note: keeping link information about bad neighbors is also desirable (when space allows). Otherwise, you have to re-learn the link metric the hard way (packet drops) whenever you re-add the neighbor to your table. This happens when, e.g., you have the Root in reach but with a very poor link (you receive a DIO every now and then from it).
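A small sketch of why forgetting a bad neighbor is costly, assuming the link metric is an ETX estimate smoothed with an exponentially weighted moving average. The constants (`ETX_DIVISOR`, `ETX_INIT`, `EWMA_ALPHA`) and the function `etx_update` are hypothetical values and names chosen for illustration, not taken from the Contiki sources.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-point scale: 128 represents an ETX of 1.0. */
#define ETX_DIVISOR 128
/* Hypothetical estimate given to a neighbor we know nothing about (ETX 2.0). */
#define ETX_INIT    (2 * ETX_DIVISOR)
/* Hypothetical weight of the history, in percent. */
#define EWMA_ALPHA  90

/* Fold one delivery attempt (tx_count transmissions were needed) into the
 * running estimate: new = alpha * old + (1 - alpha) * sample. */
static uint16_t
etx_update(uint16_t etx, unsigned tx_count)
{
  uint32_t sample;

  assert(tx_count > 0);
  sample = (uint32_t)tx_count * ETX_DIVISOR;
  return (uint16_t)((EWMA_ALPHA * (uint32_t)etx
                     + (100 - EWMA_ALPHA) * sample) / 100);
}
```

With these numbers, a neighbor that is evicted and later re-added restarts at `ETX_INIT`, and one sample needing 12 transmissions only moves the estimate from 256 to 384 (ETX 3.0). Several lossy exchanges are needed before the estimate reflects the poor link again, which is the re-learning cost described above.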

sdawans commented 8 years ago

@simonduq your last comment is especially important when the root uses high TX power and links become asymmetric. You might even consider protecting the "rpl root" entry in neighbor tables, which you can infer from the rank.

simonduq commented 8 years ago

Welcome back @sdawans! Very good point. Locking the root makes sense (at least as a compile-time feature)
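A minimal sketch of the root-locking idea as a compile-time feature. It relies on one fact from RFC 6550: the root advertises ROOT_RANK, which equals MinHopRankIncrease (default 256). Everything else is hypothetical: the `NBR_LOCK_ROOT` switch, the `nbr` struct and the function names are illustrative only and are not existing Contiki options or APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* RFC 6550: ROOT_RANK equals MinHopRankIncrease, whose default is 256. */
#define MIN_HOP_RANK_INCREASE 256
#define ROOT_RANK             MIN_HOP_RANK_INCREASE

/* Hypothetical compile-time switch; not an existing Contiki option. */
#ifndef NBR_LOCK_ROOT
#define NBR_LOCK_ROOT 1
#endif

/* Hypothetical per-neighbor state for this sketch. */
struct nbr {
  uint16_t rank;            /* last rank advertised in a DIO */
  int      locked_as_root;  /* protected from eviction because it is the root */
};

/* A neighbor advertising ROOT_RANK is the DODAG root. */
static int
rank_is_root(uint16_t rank)
{
  return rank == ROOT_RANK;
}

/* Called for every DIO received from neighbor n. */
static void
nbr_on_dio(struct nbr *n, uint16_t advertised_rank)
{
  assert(n != NULL);
  n->rank = advertised_rank;
  /* The lock follows the advertised rank, so it is released again
   * if the neighbor stops being the root. */
  n->locked_as_root = NBR_LOCK_ROOT && rank_is_root(advertised_rank);
}
```

A real implementation would have to take MinHopRankIncrease from the DODAG configuration rather than hard-coding the default.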

mcr commented 8 years ago

Simon Duquennoy notifications@github.com wrote:

> But the question is, what does ND give us that RPL+probing doesn't?

1) at present, RPL doesn't give one 6CO, but there is a proposal in the ROLL WG to change that. If you want that to happen, saying so on the WG mailing list is important.

2) no communication with 6lowpan-only nodes, which technically at this point may violate IETF specifications, but is really an open debate. ND-only communication may be valuable for really really stupid sensors like battery operated things. (I think of window sensors that are part of an alarm system)

] Never tell me the odds! | ipv6 mesh networks [
] Michael Richardson, Sandelman Software Works | network architect [
] mcr@sandelman.ca http://www.sandelman.ca/ | ruby on rails [

simonduq commented 8 years ago

Thanks for the heads up @mcr, I need to catch up on the latest discussions in the WG.

pablocorbalan commented 8 years ago

@simonduq I do agree we could set the state to REACHABLE whenever an outgoing unicast gets ACKed. I think this would be compliant with both RFC 4861 & 6775. According to RFC 4861:

> A neighbor is considered reachable if the node has recently received a confirmation that packets sent recently to the neighbor were received by its IP layer. Positive confirmation can be gathered in two ways: hints from upper-layer protocols that indicate a connection is making "forward progress", or receipt of a Neighbor Advertisement message that is a response to a Neighbor Solicitation message.

Regarding the decision about disabling NDP, I think it was the right decision. At the moment, the interaction between RPL and NDP is really poor, which could create a lot of unneeded overhead in stable networks for no valuable reason. For instance, nodes with 30 or 40 neighbours could be sending/receiving several NA/NS messages every 10 min only to change the state of the neighbours, without that having any effect (for instance, on routing decisions or on other parameters for data applications). However, in scenarios in which neighbor information is really necessary (for whichever reason) NDP could provide a lot of interesting information and could support the RPL decisions, for instance by improving the link metrics stored. For the moment, as long as there is no good interaction between RPL and NDP, I think we can leave it disabled, because there is no good reason for the overhead.

My main problem with disabling NDP is that in certain situations/scenarios/applications you could believe you still have a neighbor that is no longer there, and "never" find out with RPL alone.

simonduq commented 8 years ago

Right, here is my proposal:

Anything else?

I'll look into the two RPL-only items shortly. Let's avoid duplicate effort and announce here whatever we plan to work on.

mcr commented 8 years ago

Pablo: Your excellent points would be very welcome in the ROLL WG. May I forward?

simonduq commented 8 years ago

OK, one problem with using ACKs to update reachability information: the ACK is only link-layer, and does not verify connectivity between IP stacks (as mandated by the RFC you're quoting :/)

simonduq commented 8 years ago

From the same RFC :/

> In some cases, link-specific information may indicate that a path to a neighbor has failed (e.g., the resetting of a virtual circuit). In such cases, link-specific information may be used to purge Neighbor Cache entries before the Neighbor Unreachability Detection would do so. However, link-specific information MUST NOT be used to confirm the reachability of a neighbor; such information does not provide end-to-end confirmation between neighboring IP layers.
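The asymmetry in the quoted text (link-layer information may purge an entry but must not confirm it) can be sketched as a small hint handler. The state names follow RFC 4861; the `hint_source` and `hint_kind` enums, the `nbr` struct and `nbr_apply_hint` are hypothetical names for illustration and do not correspond to the uip-ds6 API.

```c
#include <assert.h>
#include <stddef.h>

/* Neighbor cache states as named in RFC 4861. */
enum nbr_state { NBR_INCOMPLETE, NBR_REACHABLE, NBR_STALE, NBR_DELAY, NBR_PROBE };

/* Hypothetical classification of where a hint comes from. */
enum hint_source {
  HINT_LINK_LAYER,    /* e.g. link-layer ACK or TX failure */
  HINT_UPPER_LAYER,   /* e.g. forward progress seen by a transport protocol */
  HINT_SOLICITED_NA   /* Neighbor Advertisement answering our solicitation */
};

enum hint_kind { HINT_POSITIVE, HINT_NEGATIVE };

/* Hypothetical cache entry for this sketch. */
struct nbr {
  int            in_use;
  enum nbr_state state;
};

static void
nbr_apply_hint(struct nbr *n, enum hint_source src, enum hint_kind kind)
{
  assert(n != NULL);
  if(kind == HINT_NEGATIVE) {
    if(src == HINT_LINK_LAYER) {
      n->in_use = 0;          /* allowed: purge before NUD would do so */
    }
    return;                   /* other negative hints are left to NUD here */
  }
  if(src == HINT_LINK_LAYER) {
    return;                   /* MUST NOT confirm reachability */
  }
  n->state = NBR_REACHABLE;   /* IP-level or upper-layer confirmation */
}
```

Under this reading, a link-layer ACK leaves a STALE entry STALE, while a solicited NA or an upper-layer hint moves it to REACHABLE.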

joakimeriksson commented 8 years ago

@simonduq yes, it is not fully ok to rely only on the link-layer ACKs. But I guess DAO <-> DAO-ACK can be used to keep neighbors updated - and unicast DIS probes could replace NS/NA since they will be RPL and will also give valuable information back. What about increasing the neighbor lifetime to something like an hour and doing "RPL probing" to keep neighbors in the cache (if there was no DAO/DAO ACK, etc)? And then combining that with some way to move nodes that seem to be hard to reach (many lost packets over a specific time) out of the cache? (Which seems more ok with the RFC - and a 60-minute lifetime does not violate the RFC even if the default is 30 seconds - but that is for Ethernet.)
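A rough sketch of the design proposed above: IP-level confirmations (a DAO-ACK or an answered unicast probe) refresh a one-hour lifetime, a probe is due only once that lifetime has expired, and a neighbor with too many failed transmissions is moved out of the cache. All names and the loss threshold are hypothetical, and counting consecutive losses is a simplification of "many lost packets over a specific time".

```c
#include <assert.h>
#include <stddef.h>

/* One-hour neighbor lifetime, in seconds, as suggested in the comment above. */
#define NBR_LIFETIME_S  (60UL * 60UL)
/* Hypothetical threshold of consecutive failed transmissions. */
#define NBR_MAX_LOSSES  8

/* Hypothetical cache entry for this sketch. */
struct nbr {
  int           in_use;
  unsigned long confirmed_at;  /* time of the last IP-level confirmation (s) */
  unsigned      losses;        /* consecutive failed transmissions */
};

/* Call when a DAO-ACK or a reply to a unicast probe arrives from n. */
static void
nbr_confirm(struct nbr *n, unsigned long now)
{
  assert(n != NULL);
  n->confirmed_at = now;
  n->losses = 0;
}

/* Call when a unicast transmission to n fails after all retries. */
static void
nbr_tx_failed(struct nbr *n)
{
  assert(n != NULL);
  if(++n->losses >= NBR_MAX_LOSSES) {
    n->in_use = 0;             /* hard to reach: move it out of the cache */
  }
}

/* True when the lifetime has expired and a probe should be sent. */
static int
nbr_needs_probe(const struct nbr *n, unsigned long now)
{
  assert(n != NULL);
  return n->in_use && (now - n->confirmed_at) >= NBR_LIFETIME_S;
}
```

In a stable network where DAO/DAO-ACK exchanges happen more often than once per hour, `nbr_needs_probe` would never fire for active parents, so the extra probing traffic stays limited to otherwise silent neighbors.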

simonduq commented 8 years ago

@joakimeriksson that sounds like a clever design to me. We can already enable DIS probing in the codebase, so getting there shouldn't be too much work :) Most of the probing would be done by RPL, which is (1) better targeted and (2) more efficient as packets also piggyback rank updates -- ND would kick in whenever needed and maintain compatibility with any other ND-enabled device.

pablocorbalan commented 8 years ago

@mcr you can forward this conversation to the ROLL WG if you consider it appropriate.

@simonduq @joakimeriksson Your proposals seem good to me and are definitely a good starting point to work on the RPL and NDP interaction. I also read that part of the RFC you mentioned, but I somewhat disagree with the RFC in this matter (but that's my opinion). After all, most of the LL ACKs (if not all) are sent due to received IP packets and we are also using constrained networks...

In any case, I ran a series of simulations with different network sizes and densities: with UIP_DS6_LL_NUD enabled, the amount of traffic sent by NDP is insignificant, if any. Without accepting LL ACKs to confirm the reachability of a neighbor, the NDP traffic is quite high in similar experiments. These results may vary with the experiment settings, I guess, but in any case they suggest that with a proper interaction the NDP traffic could be reduced to a minimum.

I think that by using DAO ACKs and also RPL Probing to confirm the reachability of the NBRs we will be highly reducing the traffic sent by NDP and as you guys are suggesting getting more valuable information like rank updates and better link metrics that RPL can also use for routing decisions. We could even check if TCP ACKs or Echo Reply messages are coming from neighbours to confirm reachability in some applications that may use these types of messages.

simonduq commented 8 years ago

Yes but look, if we use link-layer ACK to refresh the ND state, and a neighbor changes IPv6 address, we might never notice and keep the outdated MAC<->IP mapping forever. That being said I fully agree with what you're saying from a performance point of view.

joakimeriksson commented 8 years ago

Another thing that can happen is that a node enters a buggy state in which the radio still ACKs incoming link-layer frames but the packets never reach the IP layer. In that case it is also good to probe over IP now and then, since the node is no longer doing anything useful (very bad to route via it, for example).