NordicPlayground / nRF51-ble-bcast-mesh


Issue with nodes not propagating data when writing from certain modules #146

Open thedjnK opened 7 years ago

thedjnK commented 7 years ago

I've discovered a strange issue with the mesh network. I can communicate fine from Android to Android and from Android to iOS. However, if I try from a Laird nRF5* module (with smartbasic) to either Android or another Laird module, the data does not propagate to the other device. I connect, enable notifications for the mesh value characteristic, and then write to this characteristic. Any time I write data, e.g. 0000000122, I get a command success response on that module (110080), but the other module doesn't get the data. I've done a sniff and can see that the data isn't sent to the module (so it is not a case of the module getting the data but not handling or displaying it). I have been comparing it to an Android sniff but cannot see any differences aside from the GATT table listing performed by the Android device. Both devices write the same data to the same handle IDs.

The sniffs below show that both nodes connect and enable notifications, then one writes and gets the success message, but the other node never receives the data. In addition, the node that doesn't receive the data cannot write to the first node; it instead gets an error back saying that the node is busy.

Node A (First to send): https://www.dropbox.com/s/vvji56hniawp1l7/Node_A_openmesh.pcapng?dl=0

Node B (Second to send which fails): https://www.dropbox.com/s/p8hf1hygrv6me3h/Node_B_openmesh.pcapng?dl=0

trond-snekvik commented 7 years ago

We've seen this behavior when the connection interval goes below 10ms. This prevents the mesh from running, as the Softdevice is spending too much time on the radio. For the HW configuration on our devkits (with regard to the LFCLK drift), this problem doesn't happen in testing anymore, but the Softdevice may apply different margins based on the drift of your timers, which causes it to reject the mesh timeslot between connection events. I'm not familiar with smartbasic or the role it plays here. Could it affect your timeslot usage too?

Are you able to increase the connection intervals you're using, or otherwise confirm this theory, so that we can try to mitigate the problem? Which Softdevice versions and chips have you tested with?

thedjnK commented 7 years ago

I was thinking today about the connection interval. It's set to 7.5ms, so it sounds like this is the problem. I'll try a larger interval sometime to see if it fixes it and report back. As this scenario can lead to an accidental or purposeful denial of service, would a better idea be to not allow connection intervals below a certain value? When the slave connects, it can negotiate and use e.g. 15ms, thus avoiding this problem.

trond-snekvik commented 7 years ago

That would be the best, yes. Unfortunately, we can't really enforce usage of the Softdevice from this framework, as the mesh doesn't interfere with those calls. The solution up until now has been to accept whatever time the Softdevice gives us, but as proven by your issue, this is error-prone and subject to change with different hardware configurations. The way I see it, there are three options for the framework:

  1. Reduce the amount of time we request to fit all possible hardware configurations. This means that we'll run sub-optimal timeslots on some devices, but at least it'll work for everyone.
  2. Wrap or intercept the Softdevice interface somehow, to prevent the application from initiating short-interval connections, and renegotiate the cases where short intervals are forced upon us. This is out of scope, and quite intrusive from a framework that aims to integrate nicely with existing projects. I'm also not sure if it'll work very well for the multi-connection case in S130.
  3. Keep the framework the way it is, and document the error. This will generate fewer tickets, but the problem won't ever go away. I also prefer not telling people to RTFM when it's a problem that should have been solved automatically.

Perhaps a combination of the first and last option - we can't really test all kits out there, but we have accurate numbers on what sort of drift these devices can operate with according to the Bluetooth specification, and should be able to tune accordingly. Documenting the problem is never a bad idea either.
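
To give a feel for the drift numbers mentioned above, here is a minimal sketch of the window widening formula from the Bluetooth Core Specification: windowWidening = ((centralSCA + peripheralSCA) / 1000000) * timeSinceLastAnchor. The function name and the choice to round up are assumptions made for illustration. This is not how the Softdevice computes its margins, which isn't described in this thread.

```c
#include <stdint.h>

/*
 * Window widening per the Bluetooth Core Specification:
 *   ((centralSCA + peripheralSCA) / 1000000) * timeSinceLastAnchor
 * SCA values are sleep clock accuracies in ppm. The function name and the
 * round-up behaviour are illustrative assumptions, not part of the spec
 * or of the Softdevice API.
 */
static inline uint32_t window_widening_us(uint32_t central_sca_ppm,
                                          uint32_t peripheral_sca_ppm,
                                          uint32_t time_since_anchor_us)
{
    /* Use 64-bit arithmetic so the ppm * microseconds product cannot overflow. */
    uint64_t drift = (uint64_t)(central_sca_ppm + peripheral_sca_ppm)
                     * time_since_anchor_us;

    /* Round up to the next whole microsecond. */
    return (uint32_t)((drift + 999999u) / 1000000u);
}
```

For example, `window_widening_us(500, 500, 30000)` gives the widening for a 30ms interval when both sides declare 500 ppm accuracy. The widening grows with the time since the last anchor point, so it matters more when connection events are skipped.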

thedjnK commented 7 years ago

It looks like the connection interval was the problem, but it seems that even a 20ms interval isn't enough: data still won't propagate. It does work with a connection interval of 30ms. I agree it's not an easy problem to solve.
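
For anyone wanting to guard against this in their application, here is a rough sketch of the check discussed above. BLE connection intervals are expressed in units of 1.25ms, so 7.5ms is 6 units and 30ms is 24 units. All function names here are hypothetical helpers, not part of the mesh framework or the Softdevice API, and the floor is passed in as a parameter because the working value (30ms on my modules) will probably differ between hardware configurations.

```c
#include <stdbool.h>
#include <stdint.h>

/* BLE connection intervals are specified in units of 1.25 ms (1250 us). */
#define CONN_INTERVAL_UNIT_US 1250u

/* Convert an interval in microseconds to 1.25 ms units, rounding up.
 * Hypothetical helper for illustration. */
static inline uint16_t conn_interval_us_to_units(uint32_t interval_us)
{
    return (uint16_t)((interval_us + CONN_INTERVAL_UNIT_US - 1u)
                      / CONN_INTERVAL_UNIT_US);
}

/* True if the negotiated interval is below the floor the mesh needs.
 * The floor is an assumption supplied by the caller. */
static inline bool conn_interval_too_short(uint16_t interval_units,
                                           uint16_t floor_units)
{
    return interval_units < floor_units;
}

/* Interval to ask for in a connection parameter update: the current
 * interval if it is long enough, otherwise the floor. */
static inline uint16_t conn_interval_to_request(uint16_t interval_units,
                                                uint16_t floor_units)
{
    return conn_interval_too_short(interval_units, floor_units)
               ? floor_units
               : interval_units;
}
```

With a floor of `conn_interval_us_to_units(30000)`, both the 7.5ms and the 20ms intervals I tried would be flagged as too short. The application would then pass the returned value to whatever connection parameter update call its stack provides.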