NordicPlayground / nRF51-ble-bcast-mesh

rbc_mesh_stop sometimes crashes my program #194

Open asoftplus opened 6 years ago

asoftplus commented 6 years ago

Dear all,

I am working on a battery-powered application. To save power, my program runs on a duty cycle, so it has to call rbc_mesh_stop and rbc_mesh_start regularly.

However, with the duty cycle enabled, my program keeps crashing after an unpredictable amount of time. If I turn off the duty cycle and leave the mesh radio on all the time, the problem does not appear. So I suspect that rbc_mesh_stop causes the problem.

I would like to ask: could rbc_mesh_stop have a bug that crashes the chip at random?

If so, is there any way for me to fix it?

Thanks for your attention.

trond-snekvik commented 6 years ago

Hi,

We're not aware of any current problems with this command, but if it happens infrequently, you might have come across some bug we haven't seen. What's the nature of the crash? Does it happen exactly when stop is called? Is it a hardfault, a softdevice error or an app error? Do you call the rbc mesh from a single interrupt level, or could there be race conditions induced by the api usage?

asoftplus commented 6 years ago

> We're not aware of any current problems with this command, but if it happens infrequently, you might have come across some bug we haven't seen.

I left my nodes running for a long time. All of the duty-cycled nodes had their mesh function stop within 2 days. However, my Android phone could still see them using nRF Connect, so I thought that only the mesh part of the code had stopped working and that the overall program had not crashed.

> What's the nature of the crash? Does it happen exactly when stop is called?

I connected a node to UART and used printf to print the return code of every rbc_mesh function call in my code, so that when the mesh function stopped, I would know which line had caused the problem.
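The return-code logging described here can be sketched roughly as below. This is only an illustration: `LOG_RC`, `log_rc` and `fake_mesh_stop` are hypothetical names that are not part of the rbc_mesh API, and `fake_mesh_stop` merely stands in for a real call such as rbc_mesh_stop() so that the snippet compiles on its own. It assumes printf is already routed to the UART.

```c
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical helper: prints the source line, the call text and its
 * return code, then passes the code through so the caller can still
 * act on it. */
uint32_t log_rc(const char *expr, uint32_t rc, int line)
{
    printf("line %d: %s -> 0x%08" PRIx32 "\n", line, expr, rc);
    return rc;
}

/* Wrap any call that returns a uint32_t error code. */
#define LOG_RC(call) log_rc(#call, (call), __LINE__)

/* Stand-in for a mesh API call; always reports success (0) here. */
uint32_t fake_mesh_stop(void)
{
    return 0u;
}
```

In application code the wrapper would be used as `uint32_t rc = LOG_RC(fake_mesh_stop());`, with the stand-in replaced by the real function. Note that a log line only appears if the call returns, so a crash inside the call itself shows up as a missing line rather than as an error code.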

However, when the mesh function stopped, the UART part had also crashed, which prevented me from seeing any logs. So I guess that other parts of my program had crashed as well, not only the mesh part.

Also, since my mesh-always-on nodes have no such problem, and the only code difference between them and the duty-cycle nodes is the use of rbc_mesh_start and rbc_mesh_stop, I suspect that one of these two calls causes the problem.

> Is it a hardfault, a softdevice error or an app error?

Sorry, due to my limited experience, I am not sure. Is there a way to find out?

> Do you call the rbc mesh from a single interrupt level, or could there be race conditions induced by the api usage?

I only use the API functions rbc_mesh_start and rbc_mesh_stop.

If you can give me some hints on how to pinpoint the real cause, I would highly appreciate it.

Thanks a lot for your time.

matheusrfdesign commented 6 years ago

Hi @asoftplus,

I use a very similar concept in my mesh; I have one node always on while all others operate the mesh for only 500ms every 13s. I use a duty cycle to reduce power consumption. And I have also been noticing some nodes crashing now and then, and I am trying to understand why.
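As a rough illustration of this timing (mesh on for 500 ms in every 13 s period), the on/off decision could be sketched as below. The function name and the free-running millisecond counter are assumptions made for the example and are not part of the rbc_mesh API; a real application would call rbc_mesh_start and rbc_mesh_stop when the result changes.

```c
#include <stdbool.h>
#include <stdint.h>

#define DUTY_PERIOD_MS 13000u /* full cycle: 13 s */
#define DUTY_ON_MS       500u /* mesh active for the first 500 ms */

/* Hypothetical helper: returns true while the mesh should be running.
 * now_ms is assumed to be a free-running millisecond counter. Because
 * 2^32 is not a multiple of 13000, the phase jumps once when the
 * counter wraps (after roughly 49.7 days). */
bool duty_mesh_should_be_on(uint32_t now_ms)
{
    return (now_ms % DUTY_PERIOD_MS) < DUTY_ON_MS;
}
```

With these numbers the radio is active for 500 / 13000 of the time, which is a little under 4%.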

I suspect there is an issue in rbc_mesh_stop that causes the node to become unresponsive. Maybe @trond-snekvik can help us here, so please correct me if any of the following assumptions are wrong.

When rbc_mesh_stop() is called, it sets m_mesh_state to MESH_STATE_STOPPED before checking whether a timeslot is currently in progress. This means you would be able to restart the mesh before it has properly stopped. So, if rbc_mesh_start() is called while a timeslot is still ongoing, timeslot_resume() would return NRF_ERROR_INVALID_STATE, but that error is currently not handled in rbc_mesh_start(). rbc_mesh_start() would then return NRF_SUCCESS even though timeslot_resume() has failed. As I understand it, this would leave the node unresponsive.
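To make the suspected sequence concrete, here is a small stand-alone model of it. This is not the library's source: every `sim_` name is hypothetical, the error values are local stand-ins, and the model only encodes the assumptions stated above (the state flips to stopped before the timeslot has ended, and the resume error is dropped). `sim_start_checked` shows one possible fix, which is to propagate the error so the caller can retry.

```c
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins for the SDK error codes (assumed values). */
#define SIM_SUCCESS             0u
#define SIM_ERROR_INVALID_STATE 8u

typedef enum { SIM_STATE_RUNNING, SIM_STATE_STOPPED } sim_state_t;

typedef struct {
    sim_state_t state;            /* what the API layer believes */
    bool timeslot_in_progress;    /* what the radio is actually doing */
} sim_mesh_t;

/* Resuming is refused while the previous timeslot is still active. */
uint32_t sim_timeslot_resume(sim_mesh_t *m)
{
    if (m->timeslot_in_progress) {
        return SIM_ERROR_INVALID_STATE;
    }
    m->timeslot_in_progress = true;
    return SIM_SUCCESS;
}

/* Stop: the state changes immediately; the timeslot ends later. */
void sim_stop(sim_mesh_t *m)
{
    m->state = SIM_STATE_STOPPED;
}

/* Called when the old timeslot finally ends. */
void sim_timeslot_ended(sim_mesh_t *m)
{
    m->timeslot_in_progress = false;
}

/* Start as described in the comment above: the resume error is ignored. */
uint32_t sim_start_unchecked(sim_mesh_t *m)
{
    if (m->state != SIM_STATE_STOPPED) {
        return SIM_ERROR_INVALID_STATE;
    }
    (void)sim_timeslot_resume(m);
    m->state = SIM_STATE_RUNNING;
    return SIM_SUCCESS;
}

/* Possible fix: report the failure and stay stopped so a retry works. */
uint32_t sim_start_checked(sim_mesh_t *m)
{
    if (m->state != SIM_STATE_STOPPED) {
        return SIM_ERROR_INVALID_STATE;
    }
    uint32_t rc = sim_timeslot_resume(m);
    if (rc != SIM_SUCCESS) {
        return rc;
    }
    m->state = SIM_STATE_RUNNING;
    return SIM_SUCCESS;
}
```

In this model, calling `sim_stop`, then `sim_start_unchecked`, then `sim_timeslot_ended` leaves the state at running with no active timeslot, which would match a node that reports success but no longer takes part in the mesh. With `sim_start_checked` the same sequence returns an error on the first start, and a second start after the timeslot has ended succeeds.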

Please let me know your thoughts. Maybe I am missing something.