adafruit / circuitpython

CircuitPython - a Python implementation for teaching coding with microcontrollers
https://circuitpython.org
Other
4.04k stars 1.2k forks source link

BLE start_scan on 5.3.0 dies with MemoryError: memory allocation failed, allocating 65536 bytes - also 42241 bytes #3119

Closed kevinjwalters closed 10 months ago

kevinjwalters commented 4 years ago

I had a spate of these a fortnight ago and I just got one now on a CLUE running 5.3.0 and fairly recent libs (20200625). I've touched nothing near this part of my application recently. This is with active=False if it makes any difference and I have been using buffer_size=1536 for some time.

Listening for (<class 'Advertisement'>,)
Traceback (most recent call last):
  File "code.py", line 1520, in <module>
  File "code.py", line 914, in broadcastAndReceive
  File "code.py", line 691, in startScan
  File "adafruit_ble/__init__.py", line 235, in start_scan
MemoryError: memory allocation failed, allocating 65536 bytes

That number is clearly suspicious for being so large and for being 2^16.

Minor possibility this is data driven, perhaps a rogue received BLE packet could cause the library to do some unwise size allocations? Three other devices running same code at same time weren't affected.

dhalbert commented 4 years ago

There are several allocations that might be happening at that line. I looked at the code and don't see anything obviously wrong.

If I set up two boards running your code and just let it run for a few hours, might I see this, or is it rarer than that? Could you supply some sample code? Thanks.

kevinjwalters commented 4 years ago

Wouldn't this be originating from one of the m_malloc calls buried in the C code?

It does not occur with long runs which is partly why I suspect it's not my app and could be either a peculiar buried bug in CP for size (like an absent initialisation) or a roguish random BLE packet in the ether, i.e. it could be when someone with a particular high power BLE device walks down my street broadcasting unusual Advertisement packets.

This might be worth running up briefly but I wouldn't spend more than a minute or two once it's running. Two CLUEs is a good combination.

The program:

Unrelated to this, it's currently blowing up with MemoryError: memory allocation failed, allocating 1784 bytes occasionally on an enumerate over a list of tuples at line 1230. I'm starting to wonder how efficient those are. A 1784 allocation does not seem reasonable given the small content!

kevinjwalters commented 4 years ago

Oops, there were a few sound samples missing, try this rps directory: https://github.com/kevinjwalters/circuitpython-examples/tree/master/clue/rps

kevinjwalters commented 4 years ago

Unrelated to all this, in the CP code I spotted a few variable length arrays are being allocated on the stack with a size based on values that were coming back from the bluetooth library. I wasn't sure if those were all sanity checked to make sure the size was reasonable/within expected range. Has this been discussed before?

kevinjwalters commented 4 years ago

I have six devices now running. I just got one of those ultra rare 65536 MemoryErrors.

I might be doing one atypical thing, I have concurrent start_advertising and start_scan in my code, btw. Oh, and I also have a looping, mono sound sample playing!

kevinjwalters commented 4 years ago

Probably unrelated but I've been plagued by frequent MemoryError exceptions of 1784 bytes too for a while.

These occur all over the place in my code after the BLE comms have completed. I'd initially assumed this was related to dict enlargement based on the peculiar value and seeing an occurrence of it in https://github.com/jczic/MicroWebSrv/issues/5. I've been rewriting bits of the code to pinpoint where it occurs and oddly it will get reported even on a variable = None line. For start_scan elsewhere I had previously used a buffer_size of 1536. I've bumped up this value to 1800 now and based on 20 mins of testing the 1784 errors have gone. Might just be fragmentation issues but still puzzling as to why 1784 allocation gets reported just for assigning None to a variable.

tannewt commented 4 years ago

It sounds like the error could either be generated by 1) code running the background between VM byte codes (audio runs during this time) or 2) setting a new local variable will increase the locals dictionary size and potentially cause a reallocation.

kevinjwalters commented 4 years ago

The MemoryError: memory allocation failed, allocating 1784 bytes problems sound like a good match for symbol table addition as it explains why they occur within a certain area of code and shuffle around with code changes.

The start_scan buffer of 1800 seems to offer some useful if werid protection here. I may also "declare" all the variables outside the main loop.

dhalbert commented 4 years ago

The normal way we would debug this is to set a breakpoint on the exception handler and look at the backtrace. That requires connecting a J-Link or similar to the board. That's why I was hoping for a simpler example that occurs often enough to catch this, but I'm not sure that's possible here. Do you think any of that is possible?

The other thing I am looking at is generating and printing a stack trace when these exceptions happen. It looks like this is possible, if we add -funwind-tables or the equivalent: https://www.google.com/search?q=backtrace+arm-none-eabi-gcc https://github.com/red-rocket-computing/backtrace etc.

kevinjwalters commented 4 years ago

I'd guess it's happening roughly once every thousand executions and it's not always the same device so it would take a lot of luck to catch it.

Another option here is to pass some or all of __FILE__ / __func__ / __LINE__ but maybe the overhead isn't worth it.

kevinjwalters commented 4 years ago

I've made a test version of the application which just calls the function where the 65536 MemoryError occurs in a loop. I'll leave it going for a few hours to see if this can recreate the issue.

kevinjwalters commented 4 years ago

1700 iterations so far and no problems so whatever the magic combination is it's being elusive...

I was having a browse of the BLE code and I was curious about this code:

https://github.com/adafruit/circuitpython/blob/c1bcc25b88ff76dcbd7e365b3e9be7a94806f66b/ports/nrf/common-hal/_bleio/Adapter.c#L472-L480

That's changing the value of self->scan_results before it passes it to ble_drv_remove_event_handler(). Will that work ok? ble_drv_remove_event_handler looks like it keys off both values?

On 9th May I did experience one of these exceptions while fiddling around on REPL (either 5.1.0 or 5.2.0) so they do happen.

>>> import adafruit_ble
>>> ble = adafruit_ble.BLERadio()
>>> from adafruit_ble.advertising.standard import Advertisement
>>> scan = ble.start_scan(Advertisement, timeout=5)
>>> print(scan)
<generator object 'start_scan' at 20016600>
>>> for adv in scan:
...     print(adv.complete_name)
...
...
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "adafruit_ble/__init__.py", line 235, in start_scan
KeyboardInterrupt:
>>>
>>> for adv in ble.start_scan(Advertisement, timeout=5):
...     print(adv.complete_name)
...
...
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "adafruit_ble/__init__.py", line 235, in start_scan
_bleio.BluetoothError: Scan already in progess. Stop with stop_scan.
>>> ble.stop_scan()
>>> for adv in ble.start_scan(Advertisement, timeout=5):
...     print(adv.complete_name)
...
...
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "adafruit_ble/__init__.py", line 235, in start_scan
_bleio.BluetoothError: Unknown soft device error: 0008

I don't think that has any relevance to my current code as I'm not catching exceptions around start_scan so the application would just terminate.

Ooops, just hit close by mistake!

kevinjwalters commented 4 years ago

For the raw_data m_malloc here:

https://github.com/adafruit/circuitpython/blob/c1bcc25b88ff76dcbd7e365b3e9be7a94806f66b/ports/nrf/common-hal/_bleio/Adapter.c#L449-L453

What is the GC approach that stops this being GC'ed? I've seen discussions about this in the past and I thought it scanned python objects, the stack, registers and some explicit mechanism which may be MP_STATE_VM/MP_STATE_PORT ?

dhalbert commented 4 years ago

Thanks for thinking about this. A pointer to raw_data is copied into the singleton _bleio.adapter object. The adapter object should be on the root pointers list, but it might not be (at first glance). I have to do something else for a while but I will check that later.

dhalbert commented 4 years ago

adapter is not on the root pointer list, but it is gc'd explicitly in main.c in gc_collect(), so the storage it holds should be marked as not garbage when gc happens.

kevinjwalters commented 4 years ago

My simple test program has run for 2367 rounds and 3917 rounds without a problem. I'll leave it running for rest of today. Added later: another 4847 error free.

Does the nordic software device code do any malloc'ing, btw?

dhalbert commented 4 years ago

Does the nordic software device code do any malloc'ing, btw?

No, the nrfx library is a wrapper for the SoftDevice, and does no allocation itself. (I double-checked.) The SoftDevice uses a fixed reserved area of RAM for its storage needs

kevinjwalters commented 4 years ago

Left test code running on one CLUE today using random buffer_size for each listening period to start_scan between 512 and 1800. That ran fine for 5207 runs.

kevinjwalters commented 4 years ago

I've just been doing a substantial amount of testing using 20200825 on 5.3.0. I've probably done about 200 tests of my game across all the devices. One CPB with TFT Gizmo just spat out the classic 65536 failure:

TXing <RpsRoundEndAdvertisement manufacturer_data=<ManufacturerData company_id=0822 data=03 43 fe 01 03 03 00 15 03 51 fe 14 > > interval 0.042
Listening for (<class 'RpsEncDataAdvertisement'>, <class 'RpsKeyDataAdvertisement'>, <class 'RpsRoundEndAdvertisement'>)
Traceback (most recent call last):
  File "code.py", line 448, in <module>
  File "rps_comms.py", line 340, in broadcastAndReceive
  File "rps_comms.py", line 115, in startScan
  File "adafruit_ble/__init__.py", line 263, in start_scan
MemoryError: memory allocation failed, allocating 65536 bytes

The buffer_size used with start_scan() is 1800 now.

kevinjwalters commented 4 years ago

A CLUE just blew up with another distinctive number 42241. I checked my notes and I've had two of these before on a CLUE.

TXing <RpsRoundEndAdvertisement manufacturer_data=<ManufacturerData company_id=0822 data=03 43 fe 03 03 03 00 1b 03 51 fe 1a > > interval 0.042
Listening for (<class 'RpsEncDataAdvertisement'>, <class 'RpsKeyDataAdvertisement'>, <class 'RpsRoundEndAdvertisement'>)
Traceback (most recent call last):
  File "code.py", line 448, in <module>
  File "rps_comms.py", line 340, in broadcastAndReceive
  File "rps_comms.py", line 115, in startScan
  File "adafruit_ble/__init__.py", line 263, in start_scan
MemoryError: memory allocation failed, allocating 42241 bytes

Press any key to enter the REPL. Use CTRL-D to reload.
kevinjwalters commented 4 years ago

That 42241 (0xa501) number is peculiar. It doesn't have many factors, 53 * 797. The last time I had a peculiar number it's highly likely it was related to dict additions for symbol table but I don't see how such a large quantity of whatever could suddently be needed.

I've left a small version of the essence of my game code running to see if it trips up on background BLE (someone owns a very promiscuous Cycliq Fly12 bike light near me) and some occasional 5 player game testing.

kevinjwalters commented 4 years ago

@dhalbert In practical terms, I think I need either the new stack trace functionality you mentioned further up in the comments to help pinpoint what's doing the memory allocation or crash dump functionality to catch this. It's been very infrequent recently on 5 devices, but perhaps more frequent when I use 6 together but I can't predict which of the 6 it will occur on and it takes maybe 200+ game rounds to occur on 6 so I'd have to be extremely lucky to catch it with a hardware debugger or get 6 of them!

Have crash dumps been discussed for hard faults and optionally for MemoryError? Those could be enabled with something in boot.py and a special contiguous pre-made crash dump file, perhaps? One mention in forums: Adafruit Forums: Any method to find out cause of last reboot?.

kevinjwalters commented 4 years ago

I upgraded from 5.3.0 to 5.3.1. I've been using 20200825 lib bundle for a few days now. I ran 20 games with 3 rounds per game on 6 devices, so that's 360 executions of the code in main loop where this has previously occurred - curiously it's always beenon the third and final transmission. All fault free, but I'm not reading too much into that as this is such a rarely triggered bug.

Same setup the next day with six devices, this errored on a CLUE within a few minutes with the 42241 bytes! I then ran 20 games across the day on four devices and it was fine.

I do test with a CPB only and it's never occurred on that so perhaps the display has some influence or the general memory usage and GCing has something to do with it. All very puzzling.

kevinjwalters commented 4 years ago

Could this be related to falling off the end of the stack? I've got two functions in CircuitPython with huge arg lists so perhaps I'm using a lot and then it goes off into the CircuitPython C code plus whatever the nordic library code gets up to.

Is there a way in a CircuitPython to check the MP_MAX_STACK_USAGE_SENTINEL_BYTE are intact?

Does the 52840 have a basic MMU?

dhalbert commented 4 years ago

CircuitPython checks for stack overflow in Python code, and there's 1k of extra stack space to handle exceptions, etc. The Nordic SoftDevice uses pre-allocated RAM in general instead of allocating things on the stack. Usually a stack overflow causes some other terrible problem. I think it's more likely this is just a bug somewhere.

There is some simple memory protection, but given that the writeable heap and stack are adjacent, we'd have to insert a guard region and would lose that RAM.

kevinjwalters commented 4 years ago

I spotted a few comments some different versions of soft device's use of stack: https://github.com/bbcmicrobit/micropython/issues/429#issuecomment-291836976 and https://github.com/bbcmicrobit/micropython/issues/71#issuecomment-194816801 and https://github.com/micropython/micropython/issues/4744#issuecomment-489381961

dhalbert commented 4 years ago

We are allocating a 24kB stack on the nRF52480. The micro:bit uses nRF51822, which has much less RAM, and they are trying to get away with a much smaller stack.

kevinjwalters commented 4 years ago

The comments like

consumes up to 2K more on the stack when any deep interrupts fire.

started me pondering this but 24k sounds like a lot to chew through.

BTW, perhaps this is not a fair test but I ran this with 6 args on a CLUE with 5.3.1 and got RuntimeError: maximum recursion depth exceeded which is the sort of thing I was expecting. I doubled it up to 12 and got 50 prints per function and a trip to safe mode.

>>> def stack_test_longer_arg(a, b, c, d, e, f,  g, h, i, j, k, l):
...     print("A")
...     stack_test_longer_kwarg(a, b, c, d, e, f,  g, h, i, j, k, l)
...
...
...
>>> def stack_test_longer_kwarg(aa=None, bb=None, cc=None, dd=None, ee=None, ff=None,  gggggg=None, hhhhhh=None, iiiiii=None, jjjjjj=None, kkkkkk=None, llllll=None):
...     print("KW")
...     stack_test_longer_arg(aa, bb, cc, dd, ee, ff,  gggggg, hhhhhh, iiiiii, jjjjjj, kkkkkk, llllll)
...
...
...
>>> stack_test_longer_kwarg()
KW
A
KW
A
...etc...
kevinjwalters commented 4 years ago

2885 is another unexplained and unexpected MemoryError issue but very different in simplicity and it's reproducible every time.

kevinjwalters commented 4 years ago

Adafruit Forums: Pulse Oximeter Wireless Data Logger Example Getting Crazy Re (third page) shows some 3 byte and one 4 byte corruption to complete_name data which was noticed because the high bit set can cause UnicodeError exceptions from str() conversion on the bytes. Small chance the sending device is generating this, of course.

dhalbert commented 10 months ago

Closing this since many versions have passed, and nRF BLE has been working. A new issue can be opened if necessary.