bluerange-io / bluerange-mesh

BlueRange Mesh (formerly FruityMesh) - The first completely connection-based open source mesh on top of Bluetooth Low Energy (4.1/5.0 or higher)
https://bluerange.io/
Other
287 stars 109 forks source link

Intermittent node disconnect issue #108

Closed rmn388 closed 5 years ago

rmn388 commented 5 years ago

I have an issue with nodes intermittently dropping their connections, causing the mesh to have to reform resulting in an unstable. This becomes increasingly problematic as the number of nodes in the mesh increases.

Here is the cluster size of a 21 node network over a 24 hour period. The nodes are all together in a box. So signal strength should not be an issue. Cluster over time I tested the same scenario with the firmware from the github-beta branch, and there were no disconnects in a 24 hour period.

Here is the debug output during one of these disconnect events, in this case 6 nodes get dropped, from the 21 node network and then reconnect one by one.

{"type":"live_report","nodeId":10027,"module":3,"code":107,"extra":10024,"extra2":2}
{"type":"cluster_update","size":15,"newId":572270383,"masterBit":0}
{"type":"update_joinme","clusterId":572270383,"clusterSize":15}
{"type":"live_report","nodeId":29007,"module":3,"code":102,"extra":10021,"extra2":1139429078}
{"type":"live_report","nodeId":10037,"module":3,"code":106,"extra":10024,"extra2":1}
{"type":"cluster_update","size":16,"newId":572270383,"masterBit":0}
{"type":"update_joinme","clusterId":572270383,"clusterSize":16}
{"type":"live_report","nodeId":10036,"module":3,"code":106,"extra":10039,"extra2":1}
{"type":"cluster_update","size":17,"newId":572270383,"masterBit":0}
{"type":"update_joinme","clusterId":572270383,"clusterSize":17}
{"type":"update_joinme","clusterId":572270383,"clusterSize":17}
{"type":"live_report","nodeId":29007,"module":3,"code":102,"extra":10022,"extra2":1996984368}
{"type":"live_report","nodeId":10039,"module":3,"code":106,"extra":10030,"extra2":1}
{"type":"cluster_update","size":18,"newId":572270383,"masterBit":0}
{"type":"update_joinme","clusterId":572270383,"clusterSize":18}
{"type":"live_report","nodeId":29007,"module":3,"code":103,"extra":3,"extra2":1996984368}
{"type":"live_report","nodeId":29007,"module":3,"code":102,"extra":10022,"extra2":1996984368}
{"type":"live_report","nodeId":29007,"module":3,"code":103,"extra":4,"extra2":1996984368}
{"type":"update_joinme","clusterId":572270383,"clusterSize":18}
{"type":"live_report","nodeId":10023,"module":3,"code":106,"extra":10021,"extra2":1}
{"type":"live_report","nodeId":29007,"module":3,"code":104,"extra":4,"extra2":22}
{"type":"cluster_update","size":19,"newId":572270383,"masterBit":0}
{"type":"update_joinme","clusterId":572270383,"clusterSize":19}
{"type":"live_report","nodeId":29007,"module":3,"code":106,"extra":10022,"extra2":1}
{"type":"cluster_handshake","winner":1,"size":20}
{"type":"mesh_connect","partnerId":10022}
{"type":"update_joinme","clusterId":572270383,"clusterSize":20}
{"type":"live_report","nodeId":10036,"module":3,"code":106,"extra":10038,"extra2":1}
{"type":"cluster_update","size":21,"newId":572270383,"masterBit":0}
{"type":"update_joinme","clusterId":572270383,"clusterSize":21}
mariusheil commented 5 years ago

Hi, which version is the problematic one? The github-beta branch is actually the older version, correct? You posted the log where they reconnect again, but the interesting part would be where the connection is lost. The live_reports should indicate why the disconnect happened, code 104 is e.g. the GAP_DISCONNECT. There is also a command "action 0 status get_errors" to get the stored error log from all nodes after they have reconnected. This should contain an error code for why the disconnect happened. Can you check if connection reestablishment is enabled? This is a feature to reconnect a GAP connection while still maintaining the Mesh Connection and it should prevent situations like these in most cases.

rmn388 commented 5 years ago

Thanks for the quick response. Yes, the most recent release 0.8.8 is the one with the issue, and I used the github-beta branch as a control to make sure this behavior was new.

I have mad some changes, but nothing that should affect the meshing or connection code. To make sure I plan on re-running this test today with the 0.8.8 release and no modifications. I'll report back.

I see the code for reestablishing a connection and have not touched it. As for whether it's enabled, I don't see a configuration setting to enable or disable it.

I caught a disconnect and ran the get_errors command after the mesh was reconnected.

{"type":"live_report","nodeId":10033,"module":3,"code":107,"extra":10024,"extra2":2}
{"type":"cluster_update","size":20,"newId":572270383,"masterBit":0}
{"type":"update_joinme","clusterId":572270383,"clusterSize":20}
{"type":"live_report","nodeId":10022,"module":3,"code":106,"extra":10024,"extra2":1}
{"type":"cluster_update","size":21,"newId":572270383,"masterBit":0}
{"type":"update_joinme","clusterId":572270383,"clusterSize":21}
action 0 status get_errors
{"type":"error_log_entry","nodeId":29007,"module":3,"errType":2,"code":19,"extra":4,"time":650}
{"type":"error_log_entry","nodeId":29007,"module":3,"errType":2,"code":22,"extra":27,"time":761}
{"type":"error_log_entry","nodeId":29007,"module":3,"errType":2,"code":20,"extra":0,"time":833}
{"type":"error","code":0,"text":"OK"}
{"type":"error_log_entry","nodeId":10021,"module":3,"errType":2,"code":19,"extra":146,"time":53489}
{"type":"error_log_entry","nodeId":10021,"module":3,"errType":2,"code":17,"extra":10,"time":53490}
{"type":"error_log_entry","nodeId":10021,"module":3,"errType":2,"code":22,"extra":34,"time":53547}
{"type":"error_log_entry","nodeId":10021,"module":3,"errType":2,"code":4,"extra":1,"time":53657}
{"type":"error_log_entry","nodeId":10021,"module":3,"errType":2,"code":18,"extra":1,"time":53659}
{"type":"error_log_entry","nodeId":10021,"module":3,"errType":1,"code":19,"extra":1,"time":53660}
{"type":"error_log_entry","nodeId":10021,"module":3,"errType":2,"code":20,"extra":0,"time":53672}
{"type":"error_log_entry","nodeId":10037,"module":3,"errType":2,"code":19,"extra":217,"time":226615}
{"type":"error_log_entry","nodeId":10037,"module":3,"errType":2,"code":17,"extra":15,"time":226617}
{"type":"error_log_entry","nodeId":10037,"module":3,"errType":2,"code":22,"extra":32,"time":226777}
{"type":"error_log_entry","nodeId":10037,"module":3,"errType":2,"code":20,"extra":0,"time":226799}
{"type":"error_log_entry","nodeId":10040,"module":3,"errType":2,"code":19,"extra":3,"time":226383}
{"type":"error_log_entry","nodeId":10033,"module":3,"errType":2,"code":19,"extra":70,"time":226960}
{"type":"error_log_entry","nodeId":10022,"module":3,"errType":2,"code":19,"extra":81,"time":229270}
{"type":"error_log_entry","nodeId":10022,"module":3,"errType":2,"code":22,"extra":22,"time":229296}
{"type":"error_log_entry","nodeId":10022,"module":3,"errType":2,"code":4,"extra":2,"time":229434}
{"type":"error_log_entry","nodeId":10033,"module":3,"errType":1,"code":8,"extra":1,"time":227111}
{"type":"error_log_entry","nodeId":10033,"module":3,"errType":2,"code":2,"extra":10024,"time":227111}
{"type":"error_log_entry","nodeId":10040,"module":3,"errType":2,"code":22,"extra":42,"time":226545}
{"type":"error_log_entry","nodeId":10022,"module":3,"errType":2,"code":18,"extra":3,"time":229437}
{"type":"error_log_entry","nodeId":10022,"module":3,"errType":2,"code":5,"extra":1,"time":229437}
{"type":"error_log_entry","nodeId":10022,"module":3,"errType":1,"code":19,"extra":1,"time":229447}
{"type":"error_log_entry","nodeId":10040,"module":3,"errType":2,"code":20,"extra":0,"time":226566}
{"type":"error_log_entry","nodeId":10027,"module":3,"errType":2,"code":19,"extra":199,"time":227656}
{"type":"error_log_entry","nodeId":10027,"module":3,"errType":2,"code":17,"extra":33,"time":227657}
{"type":"error_log_entry","nodeId":10022,"module":3,"errType":2,"code":20,"extra":0,"time":229453}
{"type":"error_log_entry","nodeId":10024,"module":3,"errType":2,"code":19,"extra":3,"time":229307}
{"type":"error_log_entry","nodeId":10026,"module":3,"errType":2,"code":19,"extra":159,"time":227724}
{"type":"error_log_entry","nodeId":10027,"module":3,"errType":2,"code":22,"extra":47,"time":227813}
{"type":"error_log_entry","nodeId":10033,"module":3,"errType":2,"code":3,"extra":10024,"time":227121}
{"type":"error_log_entry","nodeId":10033,"module":3,"errType":2,"code":4,"extra":1,"time":227125}
{"type":"error_log_entry","nodeId":10027,"module":3,"errType":2,"code":20,"extra":0,"time":227840}
{"type":"error_log_entry","nodeId":10023,"module":3,"errType":2,"code":19,"extra":3,"time":229349}
{"type":"error_log_entry","nodeId":10023,"module":3,"errType":2,"code":22,"extra":22,"time":229510}
{"type":"error_log_entry","nodeId":10033,"module":3,"errType":2,"code":18,"extra":1,"time":227127}
{"type":"error_log_entry","nodeId":10033,"module":3,"errType":1,"code":19,"extra":1,"time":227128}
{"type":"error_log_entry","nodeId":10030,"module":3,"errType":2,"code":17,"extra":2,"time":227275}
{"type":"error_log_entry","nodeId":10030,"module":3,"errType":2,"code":22,"extra":26,"time":227436}
{"type":"error_log_entry","nodeId":10028,"module":3,"errType":2,"code":19,"extra":66,"time":227563}
{"type":"error_log_entry","nodeId":10028,"module":3,"errType":2,"code":22,"extra":22,"time":227576}
{"type":"error_log_entry","nodeId":10028,"module":3,"errType":2,"code":20,"extra":0,"time":227746}
{"type":"error_log_entry","nodeId":10030,"module":3,"errType":2,"code":20,"extra":0,"time":227458}
{"type":"error_log_entry","nodeId":10038,"module":3,"errType":2,"code":19,"extra":3,"time":226594}
{"type":"error_log_entry","nodeId":10039,"module":3,"errType":2,"code":19,"extra":79,"time":226535}
{"type":"error_log_entry","nodeId":10031,"module":3,"errType":2,"code":19,"extra":3,"time":227230}
{"type":"error_log_entry","nodeId":10031,"module":3,"errType":2,"code":22,"extra":12,"time":227283}
{"type":"error_log_entry","nodeId":10031,"module":3,"errType":2,"code":20,"extra":0,"time":227413}
{"type":"error_log_entry","nodeId":10036,"module":3,"errType":2,"code":19,"extra":79,"time":226892}
{"type":"error_log_entry","nodeId":10038,"module":3,"errType":2,"code":22,"extra":20,"time":226756}
{"type":"error_log_entry","nodeId":10038,"module":3,"errType":2,"code":20,"extra":0,"time":226778}
{"type":"error_log_entry","nodeId":10039,"module":3,"errType":2,"code":22,"extra":17,"time":226696}
{"type":"error_log_entry","nodeId":10039,"module":3,"errType":2,"code":20,"extra":0,"time":226718}
{"type":"error_log_entry","nodeId":10036,"module":3,"errType":2,"code":20,"extra":0,"time":227075}
{"type":"error_log_entry","nodeId":10035,"module":3,"errType":2,"code":19,"extra":3,"time":226909}
{"type":"error_log_entry","nodeId":10034,"module":3,"errType":2,"code":22,"extra":18,"time":226933}
{"type":"error_log_entry","nodeId":10034,"module":3,"errType":2,"code":20,"extra":0,"time":227115}
{"type":"error_log_entry","nodeId":10029,"module":3,"errType":2,"code":19,"extra":3,"time":227337}
{"type":"error_log_entry","nodeId":10035,"module":3,"errType":2,"code":22,"extra":12,"time":226990}
{"type":"error_log_entry","nodeId":10035,"module":3,"errType":2,"code":20,"extra":0,"time":227092}
{"type":"error_log_entry","nodeId":10029,"module":3,"errType":2,"code":22,"extra":45,"time":227498}
{"type":"error_log_entry","nodeId":10029,"module":3,"errType":2,"code":20,"extra":0,"time":227521}

From digging into the codes, it seems node 10024 disconnected (107) occurred because of a reconnect timeout (extra2:2). This node shows up twice in the error logs, but I'm unsure of where to find the definitions of these error codes.

mariusheil commented 5 years ago

Hi,

thanks. If it was a reconnect timeout, then the feature is enabled but did not succeed in setting up the connection again for some reason. I have created a ticket in our system to try to reproduce this issue. One other important question: are you using nrf51 or nrf52?

rmn388 commented 5 years ago

It's on the nRF52832. Thanks, I'll follow up with any new information I find.

rmn388 commented 5 years ago

Tested this on unmodified 0.8.8 firmware and the disconnect issue was not present over a 24 hour period. So it must have been introduced through my configuration or code changes.

I'll continue investigating, and comment back if the issue is caused by something others may run into.

mariusheil commented 5 years ago

Thanks for telling. I'll close our internal ticket then 🙂