espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0

BLE Mesh Nodes should not relay messages multiple times (IDFGH-13180) #14119

Open jtmullen opened 3 months ago

jtmullen commented 3 months ago

Is your feature request related to a problem?

I have a BLE Mesh network where a high TTL is needed due to the large physical size, but there are often groups of nodes relatively close together. In this case, I find that these bunches of nodes relay the same messages over and over again to the point where they are unable to actually operate reliably. They are completely overwhelmed by the number of messages, but most of the messages are ones they have received before. Each node ends up seeing the message multiple times at each TTL level - potentially hundreds of times.

I am still trying to fully understand the mesh code, but I see that the relay function is implemented at the network layer and the replay check is at the transport layer. So a message will already have been queued to relay independent of the replay check. Additionally, the nodes only seem to store sequence numbers for devices that send messages directly to their address or an address they subscribe to, not all nodes. So they couldn't be doing replay checks for messages from nodes that haven't sent anything destined for them (yet).

Describe the solution you'd like.

In their BLE Mesh Networking Overview for developers, I see that the Bluetooth SIG specifies where relay should be implemented (Network Layer) but not where replay protection should be implemented. Maybe there is other documentation I have not found that has details on that. But this application note from Silicon Labs about Network Optimization notes that sequence numbers should be checked before relaying to help prevent the issues I am seeing. This means there would need to be a way to do replay checks in the network layer instead of the transport layer. I would like this to be possible with the ESP-IDF: maybe not as the default behavior, but definitely as an option.

Describe alternatives you've considered.

The application note above also offers other optimizations that are possible. We are tuning the TTL based on network size as a first step, which has somewhat reduced but not eliminated the problem.

Our product is installed in the field by third parties in widely varied environments, so the network topology is unique to every install and difficult to capture in enough detail for some of the other optimizations. We cannot expect these installers to be networking experts. We are looking into ways to map the network to reduce the number of relay nodes and/or further reduce TTL. However, this mapping may increase the cost of installation, and the network may not be static for the life of the product, meaning we would need a way to monitor it continuously. These are far more difficult problems to solve, and less reliable solutions, than simply not relaying each message effectively [TTL][Node Count] times by checking the sequence number before relaying.

Additional context.

No response

forx157 commented 2 months ago

Hi @jtmullen, thank you very much for your research. Not relaying replayed messages is already supported in IDF 4.4 and later versions. It can be enabled in menuconfig via (Top) → Component config → ESP BLE Mesh Support → Make BLE Mesh experimental features visible and (Top) → Component config → ESP BLE Mesh Support → Not relay replayed messages in a mesh network.

Additionally, when too many messages are being forwarded by the network, the impact of message flooding can be reduced by increasing the RPL list size ((Top) → Component config → ESP BLE Mesh Support → Network message cache size).
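For reference, the menuconfig entries above would correspond to sdkconfig lines roughly like the following. The symbol names are assumptions inferred from the menu labels and should be checked against the ESP BLE Mesh Kconfig in your IDF version. The cache size value is purely illustrative.

```
# Assumed symbol names: verify against the ESP BLE Mesh Kconfig before use.
CONFIG_BLE_MESH_EXPERIMENTAL=y
CONFIG_BLE_MESH_NOT_RELAY_REPLAY_MSG=y
# Illustrative value only, not a recommendation.
CONFIG_BLE_MESH_MSG_CACHE_SIZE=20
```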

chegewara commented 2 months ago

Hi, this may not be related, and it may even be useless, but I hope it helps:

jtmullen commented 2 months ago

Hi @forx157 thanks for the information and sorry for the slow response. This seems to be what I was looking for. Not sure why it does not come up when searching in the documentation.

Is there a reason this is experimental? Any limitations or known issues we should be aware of?