Cordio AppDb eventually runs out of space and then becomes unusable

analogdevicesinc / msdk

Software Development Kit for Analog Device's MAX-series microcontrollers

Apache License 2.0

60 stars 75 forks source link

Cordio AppDb eventually runs out of space and then becomes unusable #798

Closed sullivanmj closed 2 months ago

sullivanmj commented 7 months ago

When dealing with bonded Bluetooth devices, the AppDb that is part of Cordio can run out of space. This happens if AppDb records are written enough times, which can happen under typical use for a long enough time. Once this happens, a WSF_ASSERT will be hit during the next attempted read or write. This causes the core to halt. A WSF_ASSERT will also be hit during subsequent (post-reboot) calls to WsfNvmInit(), which will prevent the Bluetooth application from initializing.

Jake-Carter commented 7 months ago

@EdwinFairchild can you take a look?

EdwinFairchild commented 7 months ago

To clarify , are you establishing bonded connections with a lot of devices and running out of NVM space? If so then I agree we should handle running out of NVM gracefully.

sullivanmj commented 7 months ago

Yes, this can happen as a result of just repeatedly creating new bonds.

You can use all NVM space faster by it by reducing PAL_NVM_SIZE, or by increasing APP_DB_NUM_CCCD to increase the size of each record that gets stored in the DB.

sullivanmj commented 3 months ago

Can you please provide an ETA on a fix for this issue? We would like to avoid going into production with this bug in place.

Jake-Carter commented 3 months ago

Hi @sullivanmj, Eddie has transitioned out of ADI

I will reach out internally to re-assign this ASAP to get you an updated ETA

Jake-Carter commented 3 months ago

We've opened a preliminary fix in #975, where instead of an assertion error we will return an error code and log a trace message.

@EricB-ADI and @yc-adi have started to pick up where Eddie left off addressing the core memory issue.

@sullivanmj - do you have a timeline when you're targeting a move into production?

EricB-ADI commented 3 months ago

@Jake-Carter I have a PR open right now where the code will return an error if the flash fills up. I am currently working on trying to make some code to de-fragment the flash. If you delete a record, that space basically becomes unusable until you wipe all WSF allocated flash.

sullivanmj commented 3 months ago

Thanks, all. I will be happy to help test the MR when the time comes.

EricB-ADI commented 3 months ago

@sullivanmj The code is ready on #975. I have tested it and it all works, but feel free to test it in your use case. I am pulling it in to main, Friday 4/5, if you do not get a chance to take a peak.

sullivanmj commented 3 months ago

Thanks. I tested out the code that checks for space when writing to the NVM. I left a couple comments on the MR. I'm working my way through testing the defrag code now. I found a tengentially related issue while looking into that. PalNvmGetTotalSize returns the size in ints rather than the size in bytes, because of its pointer math and the fact that those symbols are defined as 32-bit integers. Not sure if we should file another issue for that or whether it could be fixed in your PR as well.

EDIT: this appears to be a problem for pal_flash.c in both MAX32655 and MAX32665 folders.

EricB-ADI commented 3 months ago

That is a very subtle and important bug. I am tacking it into this PR.

EricB-ADI commented 2 months ago

Fixed in #975