espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0

BLE Notifications stop working after SPI init #1035

Closed lucashutch closed 7 years ago

lucashutch commented 7 years ago

Hi All,

I have a weird one for you.

I have a program that communicates over BLE (peripheral and central, connected as a peripheral) to a computer. The computer sends some commands and then sends a message to enable/initialise the SPI Master.

Before the SPI master is enabled, notifications work as expected. After it is initialised (by calling spi_bus_initialize and spi_bus_add_device), notifications stop working. I have found that esp_ble_gatts_send_indicate returns ESP_FAIL, and after a small modification to that function, the underlying error appears to be BT_STATUS_NOMEM.

I have checked that the free heap stays above 80K at all times. I have also tried increasing the BT task stack size to 8192, which didn't help.

The strange part that I mentioned earlier is that this only seems to happen when I build on Windows (with the latest pre-compiled toolchain). If I build the exact same source code (IDF at the same commit with no changes, and the project at the same commit with no changes) on Fedora (with the latest pre-compiled toolchain), the program works fine. All notifications are sent as expected, and no BT_STATUS_NOMEM occurs.

We have been trying to get to the bottom of this for a couple of days now, without success.

Is there anything that could be causing this? I am sure it is something small or stupid.

Please let me know if there is any more information that you would like.

@projectgus @igrr

negativekelvin commented 7 years ago

Suggest that you post the two sets of bin/elf files

projectgus commented 7 years ago

I've seen bugs that appear only on certain machines due to memory corruption of static data combined with the build order and layout (i.e. on different systems the object files are linked in a different order, which leads to a different order of static memory addresses in RAM). So for some builds the memory corruption (buffer overflow, etc.) corrupts something harmless or lands in padding, but for other builds it may break something critical.
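The layout dependence described above can be illustrated on a host machine with a small simulation. This is only a sketch: the two structs are hypothetical stand-ins for two link orders of the same static objects, and all names (layout_a, layout_b, buggy_write, critical) are invented for illustration. The overflow is simulated by writing through the bytes of the enclosing struct so that the demo itself stays well defined; a real overflow of a standalone static buffer is undefined behaviour.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define BUF_SIZE   8   /* declared size of the buffer */
#define WRITE_SIZE 12  /* the buggy write is 4 bytes too long */

/* Hypothetical layout A: unused bytes happen to follow the buffer. */
struct layout_a {
    char buf[BUF_SIZE];
    char unused[8];
    int  critical;
};

/* Hypothetical layout B: same objects, different order. */
struct layout_b {
    char buf[BUF_SIZE];
    int  critical;
    char unused[8];
};

/* Simulate the overflow: write WRITE_SIZE bytes starting at the buffer,
 * addressing the bytes of the enclosing region. */
static void buggy_write(void *region, size_t region_size, size_t buf_offset)
{
    assert(buf_offset + WRITE_SIZE <= region_size); /* stay inside the region */
    memset((unsigned char *)region + buf_offset, 0xFF, WRITE_SIZE);
}

/* Returns 1 if 'critical' still holds its value after the buggy write. */
int critical_survives_layout_a(void)
{
    struct layout_a s;
    memset(&s, 0, sizeof s);
    s.critical = 42;
    buggy_write(&s, sizeof s, offsetof(struct layout_a, buf));
    return s.critical == 42;
}

int critical_survives_layout_b(void)
{
    struct layout_b s;
    memset(&s, 0, sizeof s);
    s.critical = 42;
    buggy_write(&s, sizeof s, offsetof(struct layout_b, buf));
    return s.critical == 42;
}
```

With layout A the extra bytes land in the unused area and the bug stays hidden; with layout B the same write clobbers the critical value. The source code is identical in both cases, only the placement differs.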

There is an item in our ticketing system to make IDF builds more reproducible to avoid this kind of phantom problem, but there are some technical sticking points before we can achieve this.

Unfortunately the heap debugging features don't extend to static memory, so if this is indeed static memory being corrupted then they're not useful. But you could try enabling heap poisoning and calling heap_caps_check_integrity() anyhow, just in case: https://esp-idf.readthedocs.io/en/latest/api-reference/system/heap_debug.html#configuration

You can also manually look at the linker map files or symbol dumps (via objdump) from each of the ELF files, and look for anything which might stick out.

The best thing you can otherwise do is track down when in your firmware the corruption happens, and try to zero in on some particular pieces of code that are running at this time.

lucashutch commented 7 years ago

Hi all,

Thanks for your responses. It seems to have been related to a globally declared variable that was not declared as static. It had the same name (ret) as many variables in the BLE stack. Declaring this variable locally in a function, or adding static to the global declaration, fixed the issue.
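For readers unfamiliar with the two changes mentioned above, here is a minimal host-side sketch of what they look like in C. It does not explain the root cause of the issue; it only shows the language rules involved. The function names (do_work, file_scope_ret, call_do_work) are hypothetical and not taken from the project or from IDF.

```c
#include <assert.h>

/* File-scope variable. 'static' gives it internal linkage, so the symbol
 * is not visible to other translation units at link time. */
static int ret = -1;

static int do_work(void)
{
    /* A local 'ret' shadows the file-scope 'ret' inside this function.
     * Changes made here do not touch the file-scope variable. */
    int ret = 0;
    ret += 5;
    return ret;
}

/* Accessors so the two variables can be observed separately. */
int file_scope_ret(void)
{
    return ret;
}

int call_do_work(void)
{
    return do_work();
}

/* Self-check: the local result is 5 and the file-scope value is unchanged. */
void shadowing_demo(void)
{
    assert(call_do_work() == 5);
    assert(file_scope_ret() == -1);
}
```

Under standard C scoping, a local variable named ret never aliases a global of the same name, so either change mainly affects which symbols the linker sees and where they are placed in memory.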

projectgus commented 7 years ago

Hi @lucazader,

Glad you sorted this out.

> It seems to have been related to a globally declared variable that was not declared as static. It seems to have a name the same as found in lots of the ble stack (ret).

If there's a part of IDF that includes a globally declared variable with a generic name like "ret" then this is also a bug which we should fix. I had a quick grep of the BT stack code and can't see any global symbol named "ret" (lots of local variables using this name). If you think there may be such a bug here then please reopen the issue.

lucashutch commented 7 years ago

Hi @projectgus

It was a global variable called "ret" in my code; however, it seemed to conflict with the local ret variables, or at least one of them. I'm not 100% sure what was going on, but it definitely had to do with that variable.