espressif / esp-idf

Espressif IoT Development Framework. Official development framework for Espressif SoCs.
Apache License 2.0
13.89k stars 7.33k forks source link

NimBLE scan stops generating results, then crashes when resumed. (IDFGH-2001) #4196

Closed jimmo closed 4 years ago

jimmo commented 5 years ago

Environment

Problem Description

Using NimBLE, running a ble_gap_disc, the stack eventually stops generating new scan results (i.e. the ble_gap_event_fn passed to ble_gap_disc no longer gets called).

If I stop scanning (i.e. ble_gap_disc_cancel), then start again, it fails with:

ASSERT_PARAM(64 0), in rwble.c at line 235
Guru Meditation Error: Core  0 panic'ed (Interrupt wdt timeout on CPU0)
Core 0 register dump:
PC      : 0x40087aad  PS      : 0x00060034  A0      : 0x8008b964  A1      : 0x3ffbe4d0  
A2      : 0x00000001  A3      : 0x00000000  A4      : 0x00000000  A5      : 0x60008054  
A6      : 0x3ffbdb60  A7      : 0x60008050  A8      : 0x80087aad  A9      : 0x3ffbe4b0  
A10     : 0x00000004  A11     : 0x00000000  A12     : 0x6000804c  A13     : 0xffffffff  
A14     : 0x00000000  A15     : 0xfffffffc  SAR     : 0x00000004  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x400879e5  LEND    : 0x400879ec  LCOUNT  : 0x00000000  
Core 0 was running in ISR context:
EPC1    : 0x401c2722  EPC2    : 0x00000000  EPC3    : 0x00000000  EPC4    : 0x40087aad

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x40087aaa:0x3ffbe4d0 0x4008b961:0x3ffbe4f0 0x4008c54f:0x3ffbe510 0x40085ea5:0x3ffbe530 0x401c271f:0x3ffbbcb0 0x40152d2f:0x3ffbbcd0 0x40095495:0x3ffbbcf0 0x40093fe9:0x3ffbbd20

Core 1 register dump:
PC      : 0x401c2722  PS      : 0x00060f34  A0      : 0x80152d32  A1      : 0x3ffbc430  
A2      : 0x3ffc289d  A3      : 0x00060f20  A4      : 0x800941df  A5      : 0x3ffcfee0  
A6      : 0x00000003  A7      : 0x00060023  A8      : 0x8010f168  A9      : 0x3ffbc420  
A10     : 0x00000003  A11     : 0x00060f23  A12     : 0x00060f20  A13     : 0x3ffcfec0  
A14     : 0x00000003  A15     : 0x00060023  SAR     : 0x0000001d  EXCCAUSE: 0x00000005  
EXCVADDR: 0x00000000  LBEG    : 0x00000000  LEND    : 0x00000000  LCOUNT  : 0x00000000  

ELF file SHA256: 0000000000000000000000000000000000000000000000000000000000000000

Backtrace: 0x401c271f:0x3ffbc430 0x40152d2f:0x3ffbc450 0x40095495:0x3ffbc470 0x40093fe9:0x3ffbc4a0

I have been investigating this issue that was reported by MicroPython users. The typical time to failure was several hours, but I was able to repro it much faster with some devices sending advertisements at a high rate (50 per second). Usually takes about 10-15 mins for the issue to occur.

The most important detail about my configuration is that it's running inside MicroPython, but I can repro this if I turn gap_scan_cb into a no-op, so that should hopefully exclude any issues with what we're doing in the callback. No other BLE operations were active. It's a passive scan, with interval=30ms, window=30ms, duration=forever. Another detail is that NimBLE is pinned to core 1 (same core as the MicroPython task), but I saw this problem when it was on the default core (0). The main MicroPython task is blocked on the UART (for the REPL) via ulTaskNotifyTake.

MicroPython doesn't use IDF's makefiles or cmake, but it does use sdkconfig. I've attached the input and generated sdkconfig.h below.

I wonder if this might be related to https://github.com/espressif/esp-idf/issues/4001 (but that was on Bluedroid, not NimBLE). I can repro with both v4.0-beta1 and v4.1-dev.

Here's a disassembly of r_rwble_isr which seems to be raising the assert above https://gist.github.com/jimmo/43ba1da440fcbebfab89c70e345f368b (I've marked the relevant line with ****) I've attached the elf below. Built from https://github.com/jimmo/micropython/tree/ble-fixes

Other items if possible

application.elf.zip sdkconfig.zip

copercini commented 5 years ago

I'm not 100% sure if this is related, but it occurs in bluedroid too, if you have many results, the scan stuck after a while, so a workaround is to scans a short period of time (1 minute), stop some time (10 secs) and back to scan again instead try scan it forever

prasad-alatkar commented 5 years ago

Hi @jimmo I am trying to reproduce this issue with the above mentioned settings using blecent example from $IDF_PATH/examples/bluetooth/nimble/blecent, unfortunately even after stressing it for >48Hrs I have not yet been able to see the crash.

Using NimBLE, running a ble_gap_disc, the stack eventually stops generating new scan results (i.e. the ble_gap_event_fn passed to ble_gap_disc no longer gets called).

Can you please confirm if you have not set disc_params.filter_duplicates = 1?

If I stop scanning (i.e. ble_gap_disc_cancel), then start again, it fails with:

Can you please tell me where exactly you are calling ble_gap_dic_cancel?

The main MicroPython task is blocked on the UART (for the REPL) via ulTaskNotifyTake.

I guess this might be differentiating factor between our setup, I will try to replicate this at my end and update you.

jimmo commented 5 years ago

@prasad-alatkar Thanks for looking into this!

Can you please confirm if you have not set disc_params.filter_duplicates = 1?

We set filter_duplicates to 0.

Can you please tell me where exactly you are calling ble_gap_dic_cancel?

We call cancel from the main MicroPython task.

To elaborate -- the MicroPython firmware gives me a REPL prompt on the device, from which I call a Python function which internally calls ble_gap_disc, with interval and window both set to 30ms.

When the disc callback is invoked, it copies the data into a ringbuffer and notifies the MicroPython task, which then schedules a Python callback that prints out the scan data. (When I referred to making the callback a no-op, this is the code I'm talking about)

After some time interval (usually 10-15 mins in my testing at ~50 results/s, but I've had reports from users that this can take many hours at a slower rate), I stop seeing the print output from my Python callback.

So that's where I interactively stop the scan from the REPL, then re-start it, at which point I see the crash described in rwble.c.

The main MicroPython task is created from app_main().

    xTaskCreatePinnedToCore(mp_task, "mp_task", MP_TASK_STACK_LEN, NULL, MP_TASK_PRIORITY, &mp_main_task_handle, MP_TASK_COREID);

MP_TASK_COREID is 1. We also pin NimBLE to core 1.

prasad-alatkar commented 5 years ago

Hi @jimmo, apologies for late reply. FYI, I was able to reproduce the issue with high advertisement rate. Though I am still looking into the root-cause, can you try disabling Bluetooth Modem Sleep ? From menuconfig just follow this path : Component config > Bluetooth > Bluetooth controller > MODEM SLEEP Options. Alternatively you can comment out CONFIG_BTDM_MODEM_SLEEP and CONFIG_BTDM_MODEM_SLEEP_MODE_ORIG from your sdkconfig.

Please share your valuable observations.

jimmo commented 5 years ago

@prasad-alatkar Great that you were able to repro the issue, and thanks again for looking into it!

As you can probably see in our sdkconfig.h, we're using the default value of the sleep params (FYI MicroPython still uses tools/kconfig_new/confgen.py, so I just added CONFIG_BTDM_MODEM_SLEEP=n to our input config https://github.com/micropython/micropython/blob/master/ports/esp32/boards/sdkconfig.ble). Confirmed that this removed those two lines from the generated sdkconfig.h

Unfortunately I was still able to repro the issue. I actually got it to happen after a very short amount of time (less than a minute), and then again after about 15 minutes.

0neblock commented 5 years ago

hi @prasad-alatkar, as @jimmo mentioned, this seems very similar to my issue #4001 which includes a coredump of the esp_32 when the error occurs which may give you greater insight into the operation of the BT system when it fails. If you need any further information please let me know.

prasad-alatkar commented 5 years ago

Hi @jimmo @0neblock apologies for delayed response. @csushantk is debugging this issue internally. I will update you once the issue is fixed.

0neblock commented 4 years ago

Hi @prasad-alatkar I noticed that a few commits were pushed to the Github repo recently, can you please confirm if your fix for this issue was included?

I am tracking release/v3.3 for my IDF toolchain. I am testing this latest release now. Thanks.

prasad-alatkar commented 4 years ago

Hi @0neblock the issue is not yet fixed, I will update you as soon as it is fixed.

pschlang commented 4 years ago

@prasad-alatkar Apologies for pushing again, but is there any news here? Any workaround is highly appreciated.

prasad-alatkar commented 4 years ago

Hi @pschlang , @csushantk is working on fixing the issue in controller. I am also working on workarounds in host, however it may take some time to fix this issue. I will update you as soon as it is fixed.

pschlang commented 4 years ago

Hi @prasad-alatkar, we also observe this issue in 4.0-rc1. We urgently need a workaround for this. Can you give some more details on how to avoid this issue or how we could detect and recover from it?

ikalogic commented 4 years ago

+1 if a workaround exists, that would be great until we have a solid fix.

dmartauz commented 4 years ago

@prasad-alatkar Any news regarding the fix?

prasad-alatkar commented 4 years ago

Hi @dmartauz Can you please try and test on latest master ? commits 48bd2d74b4209bbad7c29d2d856810ee43a1a92c and 74c7db3f49e8605d7ec3736a3ba508a9f8d3fd9a are expected to resolve the crash issue. Please let me know if you still face the issue .

dmartauz commented 4 years ago

@prasad-alatkar Great, thanks. I will test in a week or two.

0neblock commented 4 years ago

@prasad-alatkar is this excepted to fix bluedroid as well?

jimmo commented 4 years ago

@prasad-alatkar thanks for this. When you say "are expected to resolve the crash issue." does that mean that your previously-failing test case now works?

Any chance you'd be able to back-port to v4.0? I don't have the time right now to move MicroPython to v4.1-beta1 (and we'd rather stick to the release version anyway). A few people have contacted us asking about this issue.

prasad-alatkar commented 4 years ago

Hi @jimmo , the changes are backported to v4.0 as well.

When you say "are expected to resolve the crash issue." does that mean that your previously-failing test case now works?

These fixes address the bugs that were found while debugging this issue. Please let me know if it resolves your issue.

@0neblock sorry for delayed response. As the fix is in controller, I believe it will be applicable to bluedroid as well (CC: @csushantk )

0neblock commented 4 years ago

Thanks @prasad-alatkar, Do you know when/if these changes would be backported to v3.3?

jimmo commented 4 years ago

@prasad-alatkar ah I was looking out for a new tag, but it seems that you have just moved the existing v4.0 tag? That's going to create some confusion. Can you do point releases in the future?

Thanks for the confirmation, I will try and get some people to test it ASAP. I can confirm that MicroPython builds cleanly against the new v4.0 revision tag, I will update here when I hear back from our users.

prasad-alatkar commented 4 years ago

@0neblock actually the changes are backported to release/v3.3 as well. @jimmo by v4.0 I was referring to release/v4.0, sorry if there was any confusions because of this.

0neblock commented 4 years ago

Hi @prasad-alatkar, the esp32-bt-lib On release/v3.3 hasn’t been updated for a month, is the fix definitely in there ?

prasad-alatkar commented 4 years ago

Hi @0neblock I think commit 72fb22ed has the corresponding fix (cc: @csushantk @chinmaychhajed )

jimmo commented 4 years ago

by v4.0 I was referring to release/v4.0, sorry if there was any confusions because of this.

The issue is that it's now not clear what IDF 4.0 is?

prasad-alatkar commented 4 years ago

@Alvin1Zhang Can you please comment on @jimmo 's concerns ? @jimmo as per my understanding, "Manually checking out the release/v4.0 branch" should be ok, however Alvin can better guide you first 2 points.

projectgus commented 4 years ago

Hi @jimmo,

That's right, we don't move the v4.0 release tag. When @prasad-alatkar talks about "backported to v4.0", he means that the fix was merged into the release/v4.0 branch and will be part of the v4.0.1 bugfix release once that's available.

We use these "release branches" to maintain bugfix updates for each major & minor version, so in this case release/v4.0 is the branch that contains the v4.0 tag, and future commits on this branch will be tagged for v4.0.1, v4.0.2, etc.

This page in the ESP-IDF docs has more details about our versioning scheme: https://docs.espressif.com/projects/esp-idf/en/stable/versions.html

At this stage, if you plan to stick with stable releases then the only option is to wait for the v4.0.1 release. However if you want to check out the latest commit on the release/v4.0 branch then you can test this fix before it's released.

jimmo commented 4 years ago

Thanks for the clarification @projectgus -- makes sense!

0neblock commented 4 years ago

Hi @prasad-alatkar, I'm not sure how everyone else has gone with the testing, but in the limited testing i've been able to do with the current tip of release/v3.3 I am experiencing the same issue. I am not able to check with other branches of the idf for now. Do you have any further confirmation that this should be fixed in v3.3?

prasad-alatkar commented 4 years ago

Hi @0neblock as per as NimBLE host is concerned, issue ASSERT_PARAM(64 0), in rwble.c is fixed. The controller has fix for ASSERT_PARAM(131072 0) (CC: @Sushant-Espressif ). I believe you are getting this issue with Bluedroid stack and is related to https://github.com/espressif/esp-idf/issues/4001.

I will try to reproduce issue(#4001) and revert back if I find anything. Just to be sure, Can you confirm that your application just does scanning and not advertising? Can you please also try to disable CONFIG_BLE_ADV_REPORT_FLOW_CONTROL_SUPPORTED and see if you still face the issue ?

prasad-alatkar commented 4 years ago

Hi @0neblock commit: 55877684 fixes ASSERT_PARAM(64 0) issue. Could you please check if the issue gets resolved with latest master branch ?

Edit: Sorry, the fix is not yet available on github. However Issue has been fixed and merged into internal repo, will update you once it is available on github.

0neblock commented 4 years ago

I will try to reproduce issue(#4001) and revert back if I find anything. Just to be sure, Can you confirm that your application just does scanning and not advertising? Can you please also try to disable CONFIG_BLE_ADV_REPORT_FLOW_CONTROL_SUPPORTED and see if you still face the issue ?

Sorry @prasad-alatkar for my delay, I can confirm that disabling 'CONFIG_BLE_ADV_REPORT_FLOW_CONTROL_SUPPORTED' does not fix the issue in #4001, anything else i can try?

Alvin1Zhang commented 4 years ago

Thanks for reporting and sorry for slow turnaround, the fix is available at https://github.com/espressif/esp-idf/commit/2f6f842d0f2a0e57541ab8a71a1ad5ba382a956c. Feel free to reopen if the issue still happens. Thanks.

GianlucaLoi commented 4 years ago

Hello @prasad-alatkar ,

I'm having the same problem of @0neblock (issue #4001 ) using ESP-IDF v4.3-dev-907-g6c17e3a64. There is any workaround or fix available for this esp-idf version?

Regards,

Gianluca.