adafruit / Adafruit_nRF52_Arduino

Adafruit code for the Nordic nRF52 BLE SoC on Arduino

[Bug] InternalFS analysis: LFS requirements, coherency guarantees #350

Closed henrygab closed 4 years ago

henrygab commented 4 years ago
Original analysis -- useful, but may not be the cause

**Describe the bug**

[ARMmbed/littlefs](https://github.com/ARMmbed/littlefs) allows the `prog` and `erase` functions to return LFS_ERR_CORRUPT. However, it does not allow the underlying implementation to ignore errors. Unfortunately, it appears that there are multiple paths where Adafruit's InternalFS can hit an error but not report that error to LFS. This bug is opened to have a clear bug title corresponding to the underlying issue.

**Set up (please complete the following information)**

This thread is based on manual code review and analysis of **BSP v0.14.0**. Corruptions have been reported on both **nRF52832** and **nRF52840**. See linked issue #325 for an example serial debug log that shows the relevant code paths were hit just prior to a corruption being detected. Relevant part:

**11:07:35.692** -> [SOC ] **NRF_EVT_FLASH_OPERATION_ERROR**
11:07:36.308 -> [BLE ] BLE_GAP_EVT_DISCONNECTED : Conn Handle = 1
11:07:36.308 -> [GAP ] Disconnect Reason 0x08
11:07:36.308 -> [SOC ] NRF_EVT_FLASH_OPERATION_SUCCESS
11:07:36.475 -> [SOC ] NRF_EVT_FLASH_OPERATION_SUCCESS
**11:07:36.509** -> assertion "head >= 2 && head <= lfs->cfg->block_count" failed: file "C:\Users\cc\AppData\Local\Arduino15\packages\adafruit\hardware\nrf52\0.11.1\libraries\Adafruit_LittleFS\src\littlefs\lfs.c", line 1144, function: lfs_ctz_find

**To Reproduce**

Manually review the code, and note the multiple code paths where an error is ignored. Details follow for easy reference (also written at #325).

**Expected behavior**

When an error is reported while programming or erasing the flash, InternalFS must either retry the operation (e.g., up to 5x) or report that the write/erase/flush operation has failed. This allows LittleFS to avoid using that page of cache and maintain data consistency.

**Additional context**

There is a history of hard-to-reproduce file system corruptions (e.g., #325, #227, #226, #222). Not all have indications of this bug, but the current BSP has commented out the trace that is emitted when errors are reported.

See below for update

henrygab commented 4 years ago
NRF_EVT_FLASH_OPERATION_ERROR source

The above log line indicates some flash operation (likely a write) failed. The code that outputs that line is in the Bluefruit52Lib library: https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/libraries/Bluefruit52Lib/src/bluefruit.cpp#L708-L712

Nordic API for writing to flash

Here are some rules for using the Nordic API, based on their own header file documentation.

Nordic provides `sd_flash_write()` as the API that performs the actual write to flash:

https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/cores/nRF5/nordic/softdevice/s140_nrf52_6.1.1_API/include/nrf_soc.h#L913-L948

1. Exactly one write operation may be pending on the flash device at a time:

https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/cores/nRF5/nordic/softdevice/s140_nrf52_6.1.1_API/include/nrf_soc.h#L943

2. Must handle NRF_ERROR_BUSY response by retrying the operation

OK, technically it's not required to retry, but without a retry the write simply will not have happened. Retrying just makes sense, as NRF_ERROR_BUSY does not indicate an error writing to that area of flash. https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/cores/nRF5/nordic/softdevice/s140_nrf52_6.1.1_API/include/nrf_soc.h#L943

3. sd_flash_write() is asynchronous when softdevice is enabled, but synchronous otherwise

This adds some fun to the API. When the softdevice is enabled, a return value of success only means that the write was accepted / queued, not that the write succeeded. When the softdevice is disabled, the write occurs synchronously. https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/cores/nRF5/nordic/softdevice/s140_nrf52_6.1.1_API/include/nrf_soc.h#L946

4. On NRF_SUCCESS, when softdevice is enabled, the buffer cannot be modified until event NRF_EVT_FLASH_OPERATION_*

The buffer used for the write **_must remain unmodified_** until the write completes. Otherwise, the software is just asking for corrupt data. https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/cores/nRF5/nordic/softdevice/s140_nrf52_6.1.1_API/include/nrf_soc.h#L931-L932
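Taken together, rules 1-4 suggest a write helper shaped roughly like the following. This is only a hedged sketch to make the rules concrete, not the library's code; the completion semaphore `_flash_op_done` and the failure flag `_flash_op_failed` are illustrative names for state that the SOC event handler would have to maintain.

```cpp
#include "nrf_soc.h"
#include "nrf_sdm.h"     // sd_softdevice_is_enabled()
#include "FreeRTOS.h"
#include "semphr.h"
#include "task.h"

// Illustrative state: _flash_op_done is given (and _flash_op_failed is set) by
// whatever handles NRF_EVT_FLASH_OPERATION_SUCCESS / _ERROR; created during init.
static SemaphoreHandle_t _flash_op_done;
static volatile bool     _flash_op_failed;

bool flash_write_checked(uint32_t* dst, uint32_t const* src, uint32_t n_words)
{
  uint8_t sd_enabled = 0;
  (void) sd_softdevice_is_enabled(&sd_enabled);

  uint32_t err;
  do {
    err = sd_flash_write(dst, src, n_words);   // rule 1: only one pending operation
    if (err == NRF_ERROR_BUSY) vTaskDelay(1);  // rule 2: retry while the softdevice is busy
  } while (err == NRF_ERROR_BUSY);

  if (err != NRF_SUCCESS) return false;        // synchronous failure: report it to the caller

  if (sd_enabled) {
    // rules 3 & 4: the call only queued the write, and `src` must stay untouched
    // until NRF_EVT_FLASH_OPERATION_* arrives, so block here before returning.
    xSemaphoreTake(_flash_op_done, portMAX_DELAY);
    return !_flash_op_failed;                  // propagate the event's verdict
  }
  return true;                                 // softdevice disabled: write was synchronous
}
```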

How the Adafruit InternalFileSystem library writes to flash

write function is defined to be _internal_flash_prog

See line 106. `Prog` in this case means 'program the flash memory', aka write. https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/libraries/InternalFileSytem/src/InternalFileSystem.cpp#L101-L120

_internal_flash_prog() calls flash_nrf5x_write()

It just adds an offset so that callers "see" the space as starting at address zero, while supporting different chipsets with different start addresses for the flash memory. https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/libraries/InternalFileSytem/src/InternalFileSystem.cpp#L56-L68 https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/libraries/InternalFileSytem/src/InternalFileSystem.cpp#L28-L32

flash_nrf5x_write() calls flash_cache_write()

https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/libraries/InternalFileSytem/src/flash/flash_nrf5x.c#L68-L72

flash_cache_write() calls flash_cache_flush() to write previous cache page

A small aside. The cache layer caches all writes until the next write to a different page is requested. This is great for performance. However, it means that callers of this function cannot "see" if the write actually made it to the flash or not when using program. ~~ARMmbed LFS allows for caching, but requires that flush() report a failure if it cannot flush the data successfully.~~ Thus, the following analysis also applies when ARMmbed LFS calls flush()... https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/libraries/InternalFileSytem/src/flash/flash_cache.c#L44-L61

flash_cache_flush() calls fc->erase() and fc->program()

If a verification function is provided, it first checks whether the contents were actually altered. If they were (or if no verification function exists), it calls `erase()` to ensure the page can be written, and then calls `program()` to actually write the page with the (modified) data.
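As a rough sketch of that flow (simplified names and types for illustration only -- the real `flash_cache_t` and page size live in the library):

```cpp
#include <stdint.h>

// Simplified stand-in for the library's flash_cache_t, for illustration only.
typedef struct {
  uint32_t  cache_addr;                                   // page currently held in RAM
  uint8_t*  cache_buf;                                     // one page of buffered data
  void    (*erase)(uint32_t addr);
  void    (*program)(uint32_t addr, void const* src, uint32_t len);
  bool    (*verify)(void const* buf, void const* flash, uint32_t len);  // optional: true if different
} cache_sketch_t;

enum { PAGE_SIZE = 4096, INVALID_ADDR = 0xFFFFFFFFu };

void flush_sketch(cache_sketch_t* fc)
{
  if (fc->cache_addr == INVALID_ADDR) return;              // nothing buffered

  // If a verify callback exists, skip the erase/program cycle when the
  // buffered data already matches what is in flash.
  bool changed = true;
  if (fc->verify) changed = fc->verify(fc->cache_buf, (void const*) fc->cache_addr, PAGE_SIZE);

  if (changed) {
    fc->erase(fc->cache_addr);                             // make the page writable again
    fc->program(fc->cache_addr, fc->cache_buf, PAGE_SIZE); // write the (modified) page
  }
  fc->cache_addr = INVALID_ADDR;                           // cache is clean again
}
```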

fc->program() is fal_program()

https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/libraries/InternalFileSytem/src/flash/flash_nrf5x.c#L51-L58

fal_program() has much of the logic

https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/libraries/InternalFileSytem/src/flash/flash_nrf5x.c#L115-L151

Does Adafruit InternalFileSystem library follow the Nordic rules for reliably writing?

The rules (listed above) were:

  1. Exactly one write operation may be pending on the flash device at a time:
  2. Must handle NRF_ERROR_BUSY response by retrying the operation
  3. sd_flash_write() is asynchronous when softdevice is enabled, but synchronous otherwise
  4. On NRF_SUCCESS, when softdevice is enabled, the buffer cannot be modified until event NRF_EVT_FLASH_OPERATION_*

First, it's clear that this code handles NRF_ERROR_BUSY, and retries the operation after a delay. Thus, at least rules 1 and 2 appear to be handled, which is good.

Next, let's dig a little deeper to see what happens on errors

When softdevice disabled, and writing to flash fails

When the softdevice is disabled, no event is generated and the call to `sd_flash_write()` occurs synchronously. Therefore, only the code in the function needs to be reviewed. https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/cores/nRF5/nordic/softdevice/s140_nrf52_6.1.1_API/include/nrf_soc.h#L923-L924

#### BUG 1 -- Code does not report write failures

Here, if the call to `sd_flash_write()` fails with any status other than NRF_ERROR_BUSY, the only thing that occurs is that a debug log message is generated via VERIFY_STATUS(). Critically, the parameter `len` is still returned to the caller, and the return value of the function is defined to represent the number of bytes written. https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/libraries/InternalFileSytem/src/flash/flash_nrf5x.c#L127-L131

When softdevice enabled, and writing to flash fails

When the softdevice is enabled, the call to `sd_flash_write()` is asynchronous. Therefore, if the call to `sd_flash_write()` returns NRF_SUCCESS, that only means the command was queued, not that it actually succeeded. The code must then wait for the event to be generated to know if the write succeeded or failed. This means that there are two failure paths to consider: first, when the call to `sd_flash_write()` fails, and second, when that call succeeds but the event generated is `NRF_EVT_FLASH_OPERATION_ERROR`. **Both appear to have bugs.**

**BUG 2 -- Code does not report failures from `sd_flash_write()`**

Identical to bug 1, but with the softdevice enabled. It's also unclear from the documentation whether an event will be generated when the call to `sd_flash_write()` fails with an error. If not, then this could also cause a hard-lock when this function attempts to take the semaphore.

**BUG 3 -- Code does not detect or report failures from the event**

If the call to `sd_flash_write()` returned NRF_SUCCESS, then the command was queued, and success or failure is determined by the event generated (`NRF_EVT_FLASH_OPERATION_SUCCESS` or `NRF_EVT_FLASH_OPERATION_ERROR`). The overall event handler is `adafruit_soc_task()`: https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/libraries/Bluefruit52Lib/src/bluefruit.cpp#L686-L712

`adafruit_soc_task()` calls `flash_nrf5x_event_cb()` at line 711. That function is: https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/8dbe7d950363dc17ed6c571be8dd66f53507feca/libraries/InternalFileSytem/src/flash/flash_nrf5x.c#L37-L41

Critically, neither of those functions in any way stores an indication that the write failed. This prevents `fal_program()` from propagating the failure, and thus appears to be a bug.
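For what it's worth, the missing piece for BUG 3 could be as small as the following hedged sketch (illustrative only, not a tested patch): the event callback remembers whether the queued operation failed, and `fal_program()` consults that flag after taking the completion semaphore instead of unconditionally returning `len`.

```cpp
#include "nrf_soc.h"
#include "FreeRTOS.h"
#include "semphr.h"

static SemaphoreHandle_t _sem;                  // the completion semaphore fal_program() already waits on
static volatile bool     _flash_op_failed = false;

void flash_nrf5x_event_cb(uint32_t event)       // invoked from adafruit_soc_task()
{
  // Record the outcome of the queued operation before waking the writer.
  _flash_op_failed = (event != NRF_EVT_FLASH_OPERATION_SUCCESS);
  xSemaphoreGive(_sem);
}

// ...then, in fal_program(), after xSemaphoreTake(_sem, portMAX_DELAY):
//     if (_flash_op_failed) return 0;          // report failure instead of returning `len`
```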

Possible semaphore race condition?

OK, this is not as certain, as there may be other FreeRTOS restrictions that prevent this. However, it at least smells bad...

1. `fal_program()` only takes the semaphore **AFTER** the write command has been queued.
2. The soc event handler unconditionally calls xSemaphoreGive() for either `NRF_EVT_FLASH_OPERATION_SUCCESS` or `NRF_EVT_FLASH_OPERATION_ERROR` events.
3. A semaphore cannot be given unless it was first taken. From the [FreeRTOS documentation](https://www.freertos.org/a00123.html):

> An error can occur if there is no space on the queue to post a message – indicating that the semaphore was not first obtained correctly.

Therefore, the soc event handler presumes that the semaphore is already owned. However, the semaphore is not taken until **after** the command is queued. Therefore, it appears that the following might be possible:

1. A write is queued from `fal_program()`.
2. **_Callback for soc event interrupts..._**
3. The callback `flash_nrf5x_event_cb()` calls `xSemaphoreGive()`, which returns an error because there is no owner, but that error is silently ignored.
4. **_Callback for soc event completes..._**
5. `fal_program()` acquires the semaphore.
6. Deadlock?

Summary

It appears that InternalFileSystem currently completely ignores flash errors, both when reported synchronously and when reported via NRF_EVT_FLASH_OPERATION_ERROR. As a result, LFS believes that the prog / erase / flush operation succeeded, when in reality, it did not.

UPDATE: LFS does not rely on an error being reported by prog.

henrygab commented 4 years ago

GOOD NEWS (*). After further review, it appears that ARMmbed's LFS is designed to allow the underlying flash write operations to fail (even silently). LFS will read back data it has written, to ensure it's valid. Presumably, this will occur prior to freeing the data.

(*) Good in that my analysis was wrong, and the above problems may not be accurate. Bad, because it means the underlying problem may be more ... subtle.

I'm now wondering if there's a re-entrancy problem. Is it possible that, if two FreeRTOS tasks each attempt to write and one task gets pre-empted in the midst of the flash_cache_* functions, the internal state becomes corrupt?

TBD: Can this be brute-force validated by simply having a Semaphore (recursive mutex?) wrap all the entry points here?

hathach commented 4 years ago

@henrygab thank you again for putting lots of effort into analyzing and writing it down here. I don't even remember most of it. You are right that the prog failure status isn't returned to LFS now. I will update the codebase to do so; hopefully it will help to narrow down the issue a bit more.

henrygab commented 4 years ago

@hathach I do not think it is necessary to report the failure status to LFS. By design, LFS reads back data that was written, to verify it was written successfully. So, this is not (by itself) the problem.

I do have one possible cause in mind:

As you know, Adafruit has configured FreeRTOS to use the preemptive scheduler by defining configUSE_PREEMPTION as 1, so a task can be interrupted by other tasks. Therefore, it appears possible that two tasks can simultaneously call into the flash cache functions.

The flash cache uses and modifies global state (e.g., flash_cache_t* fc). Critically, I have not found any protection against multiple tasks writing to the flash simultaneously. While I have not found a scenario where LFS should end up in a corrupt state, I can describe a scenario where data is lost.

An example scenario:

1. At start, fc->cache_addr is the invalid page.
2. Task A causes a write to page X.
3. Task A completes line 61 (call to `flash_cache_flush(fc)`) when the scheduler pre-empts task A for task B.
4. Task B causes a write to page Y.
5. Task B completes line 61, when the scheduler pre-empts task B for task A.
6. Task A continues with line 62 (sets `fc->cache_addr` to page X).
7. Task A reads page X into the cache buffer.
8. Task A fills the cache buffer with page X's data.
9. Task A is pre-empted for Task B.
10. Task B continues with line 62 (sets `fc->cache_addr` to page Y).
11. Task B fills the cache buffer with page Y's data (**WITHOUT FLUSHING … loss of data written by Task A**).

https://github.com/adafruit/Adafruit_nRF52_Arduino/blob/b9d586b56c2cdf024cccb2a17ac4ec402227d135/libraries/InternalFileSytem/src/flash/flash_cache.c#L59-L68

However, even in the above scenario, I think LFS should still discover the lost data, because when it attempts to verify by reading the data back, the data will not have made it to the flash.

I have an early prototype that serializes the flash access (not clean, but it shows the intent) on a private branch. This compiles, but has not been tested. The commit of interest wraps all entry to / use of the flash cache with a recursive mutex. It will also log (at level 2) if there is ever an attempt to call into the functions simultaneously, by always first attempting to acquire the mutex with a zero timeout. Thus, with this logging enabled, if that message ever appears, it will confirm at least that the cache layer needs this protection.
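The shape of that serialization is roughly the following (illustrative sketch only -- see the branch for the actual patch; `LOG_LV2` is the BSP's existing logging macro, and the recursive mutex calls require configUSE_RECURSIVE_MUTEXES in the FreeRTOS configuration):

```cpp
#include "FreeRTOS.h"
#include "semphr.h"

// Created once during init, before any flash cache use.
static SemaphoreHandle_t _flash_cache_mutex;

void flash_cache_serialization_init(void)
{
  _flash_cache_mutex = xSemaphoreCreateRecursiveMutex();
}

static void take_flash_cache_serialization(void)
{
  // Probe with a zero timeout purely to detect (and log) a parallel caller.
  if (xSemaphoreTakeRecursive(_flash_cache_mutex, 0) != pdTRUE) {
    LOG_LV2("IFLASH", "Blocked parallel write attempt ... waiting for mutex");
    xSemaphoreTakeRecursive(_flash_cache_mutex, portMAX_DELAY);
  }
}

static void release_flash_cache_serialization(void)
{
  xSemaphoreGiveRecursive(_flash_cache_mutex);
}
```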

Do you think multiple tasks calling into flash writing is a potential cause of the corruption?

Do you think this type of change would be appropriate, given that the softdevice tasks clearly are writing to flash?

czuvich commented 4 years ago

Do you think multiple tasks calling into flash writing is a potential cause of the corruption?

As a general rule of thumb, disk I/O should be serialized. I submitted issue #227. I have since re-written my applications on the nRF52840 to use semaphores for my reads/writes (and a queue for all writes), and my flash issues went away. Your mutex solution will work as well.

I've been running my code for several months and haven't had any corruption issues.

henrygab commented 4 years ago

@czuvich -- Thanks for the confirmation that serializing your I/O solved your issues, which also seems to validate that the above analysis could resolve this tough issue for others.

Do you have a method to cause the corruption issue to occur at least somewhat reliably (e.g., within 24 hours of running a sketch)? If so, given the difficulty of reproducing this issue, it would be extremely helpful if you could either (a) share such a sketch, or (b) verify the corruption and confirm that the PR above avoids it.

Thank you again!

czuvich commented 4 years ago

@henrygab -- Unfortunately, I did not create a repro project for this particular issue. I just ended up designing around it, assuming it's a problem. I'm willing to bet that if you created a small sketch that wrote some large(ish) files while also running some higher-priority tasks in RTOS that try to read those files, you would see similar behavior.

henrygab commented 4 years ago

Update: Was contacted by someone from a Canadian University, who may have a repro scenario. Hoping to see something from them here soon (repro sketch & steps, etc.)

mattratto commented 4 years ago

Yes, that’s me! Will send testing sketch and steps ASAP. Want to do some experiments with henrygab’s branch also. Henry, can you confirm that your changes are confined to the flash_cache.c file?

henrygab commented 4 years ago

Hi @mattratto! Yes, relevant changes are confined. See PR #372, which includes a "Files Changed" tab at the top.

I've also taken a quick look to see where I'd want to add additional debugging information. I have some ideas, but want to better understand where the tasks end up getting stuck (stack traces per task).

Thanks!

mattratto commented 4 years ago

Hi all, further testing with my test software: I ran an InternalFS.format, and then ran a loop that wrote 100 files of 50 bytes each, over and over again, to the flash. This routine creates files called 'data0', 'data1', 'data2', etc., and saves 50 bytes to each. If the file already exists, it first runs an InternalFS.remove in order to prevent appends. This ran successfully for two days with no obvious issues.

But as soon as I stopped the loop and attempted to load the files I received the following errors:

lfs error:494: Corrupted dir pair at 54 53
lfs error:494: Corrupted dir pair at 54 53
assertion "pcache->block == 0xffffffff" failed: file "[directory]\0.14.6\libraries\Adafruit_LittleFS\src\littlefs\lfs.c", line 183, function: lfs_cache_prog

I will paste the code I used in a comment to follow. This appears to be the file corruption issue under discussion. Further testing revealed that I can pretty much reproduce the corrupted directory pair error simply by writing and overwriting files repeatedly, though the duration differs. The bad part of this is that the corruption is not detected on write, but only on read.
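For orientation, here is a much-reduced sketch of the shape of that loop (illustrative only; names and sizes are placeholders, and the full test code is in the gist below):

```cpp
#include <Adafruit_LittleFS.h>
#include <InternalFileSystem.h>
using namespace Adafruit_LittleFS_Namespace;

void setup()
{
  Serial.begin(115200);
  while (!Serial) delay(10);

  InternalFS.begin();

  uint8_t payload[50];
  memset(payload, 0xA5, sizeof(payload));        // 50 bytes of dummy data

  for (uint32_t pass = 0; ; pass++)
  {
    for (int i = 0; i < 100; i++)
    {
      char name[16];
      snprintf(name, sizeof(name), "/data%d", i);

      // Remove any existing file first so the write never appends.
      if ( InternalFS.exists(name) ) InternalFS.remove(name);

      File f(InternalFS);
      if ( f.open(name, FILE_O_WRITE) )
      {
        f.write(payload, sizeof(payload));
        f.close();
      }
      else
      {
        Serial.print("open failed: "); Serial.println(name);
      }
    }
    Serial.print("completed pass "); Serial.println(pass);
  }
}

void loop() { }
```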

mattratto commented 4 years ago

edit - removed inline code due to length. Replaced with gist created by henrygab - my thanks!

https://gist.github.com/henrygab/8fa35df56889d64b240e4f64d164f731

My apologies for the length of the code - I am trying to reproduce the conditions in the production software where I am seeing the issues.

mattratto commented 4 years ago

One last thing - I am going to replace the flash_cache.c file in the current Adafruit InternalFS library with the one created by HenryGab and see what I get. Will report back here

mattratto commented 4 years ago

When switched to henrygab's flash_cache.c and attempting to format the filesystem (still with the error listed above), I get the following error:

lfs error:494: Corrupted dir pair at 54 53
number of items: 2
current Number: 0
Formatting ...
[IFLASH] Blocked parallel write attempt ... waiting for mutex

mattratto commented 4 years ago

Henry, any thoughts on how I can resolve this?

mattratto commented 4 years ago

One other note - I receive the above 'waiting for mutex' error on any attempted file operation when using henrygab's flash_cache.c. I'm trying to parse what generates this, but I'm afraid my coding is not up to snuff!

henrygab commented 4 years ago

@mattratto --

Below, I ask for an edit, comment on your posts, answer your questions, and then suggest some additional actions that you might take in parallel with any time I can spend on this. Taking each in turn....

Request for an edit

Can you edit your earlier post to remove the code? Having such a long post makes reading the thread difficult. BTW, GitHub allows you to create "gists" ... just go to gist.github.com. I did this with the above, and created the following gist for that code: https://gist.github.com/henrygab/8fa35df56889d64b240e4f64d164f731

Useful info about repro

> when switched to henrygab's flash_cache.c, and attempt to format the filesystem (still with error listed above, I get the following error: lfs error:494: Corrupted dir pair at 54 53 number of items: 2 current Number: 0 Formatting ... [IFLASH] Blocked parallel write attempt ... waiting for mutex
>
> One other note - I receive the above 'waiting for mutex' error on any attempted file operation when using henrygab's flash_cache.c. I'm trying to parse what generates this, but I'm afraid my coding is not up to snuff!

That's actually VERY good information, and helpful in the extreme. If my branch breaks that rapidly, there's likely a simple error in my proposed fix. That should then be fixable quickly, or it will point to a deeper flaw in the code that highlights a more difficult problem.

Answer

> Henry, any thoughts on how I can resolve this?

No, as the issue isn't known yet.

I first need to repro locally, then figure out what might be going wrong, then add some tracing, hope it's not a timing-sensitive bug, and get a repro with tracing to either validate or invalidate my hypothesis. If we're lucky, and my real life allows the time...

Possible things to do in parallel

Also, the above has a lot going on within it, and requires an RTC library. Not having seen any repro before, this is better than what we have. However, having a repro scenario with less stuff in it is preferable.

  1. Does this sketch require any external hardware to be attached to the base Adafruit Feather nrf52840 Express board?
  2. Flash will eventually have failures when written to repeatedly. Is it possible that your flash has simply been written to beyond its failure point? (e.g., does this repro on a brand new Feather?)
  3. How are you powering the board while reproducing this? e.g., battery; computer front panel USB 3.0 port to 7ft. cable to feather; computer back-panel USB3.0 port to 18 inch cable to feather; 5V 2Amp wall wart to 5ft USB cable to feather; etc.
  4. What other hardware (if any) is attached to the feather during a repro?
mattratto commented 4 years ago

@henrygab - msg received and understood. My apologies.

I have replaced the inline code with the gist you provided - my thanks. I will use this mechanism in the future.

Answers to the questions posed:

> Also, the above has a lot going on within it, and requires an RTC library. Not having seen any repro before, this is better than what we have. However, having a repro scenario with less stuff in it is preferable.

I will edit the test code and reduce its complexity. When done I will post a new gist.

> Does this sketch require any external hardware to be attached to the base Adafruit Feather nRF52840 Express board?

No - but note that I am carrying out these tests on base Adafruit Feather nRF52832 boards as well as custom nRF52 boards. In both cases I am using the Adafruit 14.6 BSP.

> Flash will eventually have failures when written to repeatedly. Is it possible that your flash has simply been written to beyond its failure point? (e.g., does this repro on a brand new Feather?)

Perhaps, but then these failures are occurring much earlier in the flash lifecycle than expected. I have seen this error on new boards that have only had a few hundred files written to them.

> How are you powering the board while reproducing this? e.g., battery; computer front panel USB 3.0 port to 7ft. cable to feather; computer back-panel USB3.0 port to 18 inch cable to feather; 5V 2Amp wall wart to 5ft USB cable to feather; etc.

Three conditions: 1) an approx. 3 ft long USB cable attached to one of the two USB ports on my Dell XPS 15 laptop; 2) the above USB cable as well as a 380 mAh lipoly battery plugged into the JST connector on the feather; and 3) the battery by itself.

> What other hardware (if any) is attached to the feather during a repro?

None.

Will continue to troubleshoot and post any results here. Perhaps this is an nRF52840 vs nRF52832 issue? Or a BSP version mismatch?

mattratto commented 4 years ago

in an effort to troubleshoot the 'waiting for mutex' error, I added a number of additional logging statements to henrygab's flash_cache.c. I then ran the 'Internal_ReadWrite' code from the Adafruit examples (from 14.6). Here's my logging output:

---log start--- Internal Read Write File Example

Enter to any keys to continue:

[IFLASH] started takeflashcacheserialization [IFLASH] SemaphoreHandle_t newSemaphore = xSemaphoreCreateMutex(); [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization Open /adafruit.txt file to write ... [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] releaseflashcacheserialization [IFLASH] releaseflashcacheserialization [IFLASH] started takeflashcacheserialization [IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0) [IFLASH] started takeflashcacheserialization [IFLASH] Blocked parallel write attempt ... waiting for mutex

mattratto commented 4 years ago

FYI, in that code, the file system is initialized, then there is an attempt to read a file called "/adafruit.txt". If this fails, the code attempts to write a file of that name. From the log above, you can see that the file system appears to initialize, the read fails (without the system halting), and the write is started. There are then a series of sequences of 'takeflashcacheserialization', 'pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0)', and 'releaseflashcacheserialization'. But towards the end these appear to get out of sequence, with a double 'releaseflashcacheserialization' followed by two starts, and then the "Blocked parallel write attempt ... waiting for mutex" which locks up the system.

I'm not sure what this means, but hopefully it is useful in figuring out the 'waiting for mutex' issue.

mattratto commented 4 years ago

By the way, additional logs all show this same pattern at the end:

[IFLASH] started takeflashcacheserialization
[IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0)
[IFLASH] releaseflashcacheserialization
[IFLASH] releaseflashcacheserialization
[IFLASH] started takeflashcacheserialization
[IFLASH] pdTRUE == xSemaphoreTake(_serializeFlashAccess, 0)
[IFLASH] started takeflashcacheserialization
[IFLASH] Blocked parallel write attempt ... waiting for mutex

henrygab commented 4 years ago

Fascinating. I cannot see any code path where it should be possible to have a double-release, and thus a double output of [IFLASH] releaseflashcacheserialization.

@hathach, @mattratto --

  1. Can either of you see a code path where a double-release would exist?

If not, then this suggests that the problem could be something entirely different. For example:

mattratto commented 4 years ago

My only other thought is that it is not a 'real' duplicate, but is being caused by serial buffering or something in the serial port / Arduino IDE serial console... The last time I ran it I actually saw two duplicates - one duplicate of 'release' and one of 'take'. I am planning to add a counter to check this...

mattratto commented 4 years ago

hmmm. counter shows out of order release. Full log here: https://gist.github.com/mattratto/308afcad01081aa41a68da16f6579653

Relevant part:

[IFLASH] started takeflashcacheserialization= 45
[IFLASH] releaseflashcacheserialization = 45
[IFLASH] started takeflashcacheserialization= 46
[IFLASH] releaseflashcacheserialization = 46
[IFLASH] releaseflashcacheserialization = 47
[IFLASH] started takeflashcacheserialization= 47
[IFLASH] started takeflashcacheserialization= 48
[IFLASH] Blocked parallel write attempt ... waiting for mutex

up until 47 they were in sync...

mattratto commented 4 years ago

OK. Now I get the double release - they are from different activities: reading and writing. I added more logs, noting specifically which function was calling the takes and the releases. In the above scenario, the release 46 is from the take 46 (read), while the release 47 is the start of a write:

[IFLASH] started takeflashcacheserialization= 46
[IFLASH] take from top of read
[IFLASH] releaseflashcacheserialization = 46
[IFLASH] release from bottom of read
[IFLASH] releaseflashcacheserialization = 47
[IFLASH] release from top of write
[IFLASH] started takeflashcacheserialization= 47
[IFLASH] take from top of write
[IFLASH] started takeflashcacheserialization= 48
[IFLASH] Blocked parallel write attempt ... waiting for mutex

so what we are seeing here is a 'take' (47) that is never released before another 'take' (48)

mattratto commented 4 years ago

ah, I think I see it:

[IFLASH] releaseflashcacheserialization = 47
[IFLASH] release from top of write
[IFLASH] started takeflashcacheserialization= 47
[IFLASH] take from top of write
[IFLASH] reached flush
[IFLASH] started takeflashcacheserialization= 48
[IFLASH] Blocked parallel write attempt ... waiting for mutex

When the write starts, there is a release (47) followed by a take (47). Then, in the middle of the write, flush is run, which includes another take (48). This causes a mutex error because the write take (47) was not released prior to starting the flush take (48).
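In other words (simplified sketch, not the actual patch code), the self-deadlock looks like this:

```cpp
#include "FreeRTOS.h"
#include "semphr.h"

static SemaphoreHandle_t _flash_cache_mutex;     // a plain (non-recursive) mutex

static void cache_flush_sketch(void)
{
  // Take #48: the same task already holds the mutex from take #47, and a plain
  // FreeRTOS mutex cannot be taken twice by its owner -> "waiting for mutex" forever.
  xSemaphoreTake(_flash_cache_mutex, portMAX_DELAY);
  // ... erase + program ...
  xSemaphoreGive(_flash_cache_mutex);
}

static void cache_write_sketch(void)
{
  xSemaphoreTake(_flash_cache_mutex, portMAX_DELAY);   // take #47, "top of write"
  // ... page bookkeeping ...
  cache_flush_sketch();                                // nested entry into the cache layer
  xSemaphoreGive(_flash_cache_mutex);
}
```

A recursive mutex (xSemaphoreTakeRecursive / xSemaphoreGiveRecursive), or taking the lock only at the outermost entry point, avoids this without dropping the lock in the middle of a write.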

I added a ReleaseFlashCacheSerialization(); right before the flash_cache_flush(fc); in flash_cache_write - I am now getting successful file operations.

mattratto commented 4 years ago

Henry, please let me know what you think about the above. I am now moving on to testing the file corruption issues. Will update you with any news.

mattratto commented 4 years ago

Well, after about an hour of constant writes, deletes, and rewrites, I encountered the error again. Full log here: https://gist.github.com/mattratto/2b74c4d5317bf1c286b2c3c1f7bf0ba4

In it you see the corruption error:

lfs error:494: Corrupted dir pair at 154 153

and this other error that co-occurs:

assertion "pcache->block == 0xffffffff" failed: file "C:\Users\mattr\Documents\Arduino\hardware\Adafruit\Adafruit_nRF52_Arduino\libraries\Adafruit_LittleFS\src\littlefs\lfs.c", line 183, function: lfs_cache_prog

Full disclosure - this may have happened due to an unforeseen power interruption - I was trying to move the board and may have interrupted power in the middle of a file operation. I'll keep testing to see if that caused it.

mattratto commented 4 years ago

I ran my test software for about 8 hours - that is approx. 9,600 reads, writes, and removes. Following this, I restarted the software and checked the file system. I had no file corruption issues. I will continue to test, and am planning to implement this change in my production software to see whether a more varied set of software activities will influence this. The software includes BLE transfers and pairing, so that should be a more rigorous test.

henrygab commented 4 years ago

@mattratto I now see the recursive acquire. This is a bug in my patch, and unrelated to the cause of the corruption. Unfortunately, adding a release() before flushing the cache in the write path is not a safe operation. However, I've updated the branch (commit 89c46fe9) to fix this error.

While this improves the fix, it doesn't indicate the underlying problem. I recommend immediately updating to at least commit 89c46fe9.

If the assertion in lfs.c @ line 183 still occurs with commit 89c46fe9, then that will be interesting.

mattratto commented 4 years ago

I have updated to 89c46fe - the fix runs properly with my test code. I will continue to test to see if I see any errors. Henrygab, do you expect this fix will solve file corruption issues associated with unexpected power failures? I believe that was the cause of the lfs.c @ line 183 error.

henrygab commented 4 years ago

I do not think a power failure should ever cause corruption. The design of LFS is such that the file system does not trust that writes have made it to the media without doing more. Those "more" steps are:

  1. Send a flush operation (to ensure any cached data is written), and invalidate cache
  2. Send a read operation to validate the written data.

I understand that LFS does this for at least the file system metadata, which would be sufficient to avoid the metadata corruption that is typically discovered at boot in the error reports. I do not know LFS internals sufficiently to know if it also performs this validation for the user data.

In other words, being robust in the face of arbitrary power failures is a design goal of LFS. This is why I reviewed the cache implementation as the more likely source of any bug.

Data corruption scenario addressed by this patch

For the data corruption to occur, multiple tasks must be writing to the file system at the same time. It is known that the softdevice writes Bluetooth pairing information, and may do so via a high priority task (no source is available). Here's an example flow that would cause data corruption, without the serialization:

1. Task Omega is executing the `flash_cache_write()` function, and has just ensured the cached page == N, the page it wants to update.
2. Task Omega's next instructions to be executed correspond to the following line (which updates the cache's buffer): https://github.com/henrygab/Adafruit_nRF52_Arduino/blob/89c46fe9c1507cfcf7bd6413a342096fa5223c21/libraries/InternalFileSytem/src/flash/flash_cache.c#L122
3. For whatever reason, Task Omega is pre-empted by Task Alpha. One example reason is that Alpha is a higher-priority task that just became ready to execute (e.g., notified to write updated Bluetooth pairing information?).
4. Task Alpha needs to write this data to page M (which is a different page than page N). Without serialization, when the `flash_cache_write()` function is executed, it checks and determines that the cached page is not the one that Alpha is trying to write to. Therefore, it will flush that page and read page M into the cache buffer: https://github.com/henrygab/Adafruit_nRF52_Arduino/blob/89c46fe9c1507cfcf7bd6413a342096fa5223c21/libraries/InternalFileSytem/src/flash/flash_cache.c#L113-L120
5. At some future time, Task Omega resumes execution. Thus, it executes line 122 at last, which modifies the cache's buffered page. Unfortunately, Task Omega believes this contains page N, when in reality, it now contains page M. Thus, Task Omega corrupts a page entirely different from the one it intended to write to: https://github.com/henrygab/Adafruit_nRF52_Arduino/blob/89c46fe9c1507cfcf7bd6413a342096fa5223c21/libraries/InternalFileSytem/src/flash/flash_cache.c#L122

Of course, with the serialization patch, when Task Alpha gets to the mutex, it is blocked until Task Omega finishes its operations and releases the mutex, which prevents the corruption.

Thus, should any test of this branch (compiled at debug level 2 or greater) EVER print "Blocked parallel write attempt", the scenario described is shown to actually happen, which validates that this patch fixes an actual problem.

mattratto commented 4 years ago

Ah, I see - thanks Henry. I will do some more robust testing then. I will say that I have been running reads, writes, and removes all day with no errors. Will report back.

henrygab commented 4 years ago

@mattratto -- Good luck. For your repro scenario, may I recommend the following?

  1. Change the critical log entry to be entirely unconditional, as follows:
     Old line: `LOG_LV2("IFLASH", "Blocked parallel write attempt ... waiting for mutex");`
     New line: `Serial.print("IFLASH: Blocked parallel write attempt ... waiting for mutex");`
  2. Set debug level to 0 (release), as this is a bug that requires specific timing
  3. Rebuild your repro scenario sketch (the one with Bluetooth enabled)
  4. Test that build & monitor serial output

The hope is that, eventually, the above line will show up on the serial monitor. If it does, then this fixes at least one cause of the bug.

hathach commented 4 years ago

Thanks @henrygab and @mattratto for extensively fixing and testing this annoying issue. I am almost out of breath catching up with your conversation. Sorry for the huge delay, I was on vacation for the whole of last week.

I will also try to run the test sketch https://gist.github.com/henrygab/8fa35df56889d64b240e4f64d164f731 with PR #372 locally here on a Feather nRF52840 to see if there are any issues :)

hathach commented 4 years ago

@mattratto does your test sketch require user input 'W' to write each sample? I think I may need to write my own stress test for this issue.

mattratto commented 4 years ago

Yes, I am afraid so - I actually edited that code to just run the writes repeatedly for overnight runs. Otherwise, I enter long sequences of WWWWWWWWW when I am running these and watching the debug output.

hathach commented 4 years ago

@mattratto thanks for the sketch, I will make it write once every second and leave it running for a day or two for now. We could simplify it into one of the stock examples as Internal_StressTest later on :)

mattratto commented 4 years ago

great idea - I know my code is a bit too complicated for this - but I was trying to simulate my production environment a bit.

czuvich commented 4 years ago

I thought I'd chime in. I've been following this thread, and I added your mutex code to my own production software. I did not modify the flash code, but instead I replaced my serialized queue with the mutex. So far it's been working great. I've been wanting to move away from the queue for flash access, and I wasn't entirely sure of how to properly implement the mutex. Thanks for this update!

hathach commented 4 years ago

Thanks to @henrygab's patch, the result is very positive: I ran the sketch overnight and there were no issues whatsoever. As Henry pointed out, serializing access to flash is the solution to this issue. I will do more testing with more scenarios (write new, overwrite, move, remove) in a stress test sketch, and hopefully we can fix this once and for all. Thanks @czuvich for testing out the patch, things seem to be very promising now :)

henrygab commented 4 years ago

Thanks @czuvich for the testing and validation that it looks good! 👍

Thanks @hathach for writing a sketch. I was looking at doing just that, but haven't had time.

If I can recommend? Create many tasks at lower priority, each writing (different) files, and then one task at high priority that wakes on a timer and does a single small write to its own file once per wake (see the sketch below). With short timer periods, this should increase the chances of hitting this, because the many tasks at lower priority may end their timeslice in the critical area of code, and the high-priority task will definitely interrupt the lower-priority tasks....
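Something along these lines, as a hedged sketch (task names, priorities, stack sizes, and file names are all placeholders, and this is untested):

```cpp
#include <Adafruit_LittleFS.h>
#include <InternalFileSystem.h>
using namespace Adafruit_LittleFS_Namespace;

// Low-priority writers hammer their own files back-to-back; a high-priority task
// wakes on a short timer and does one small write, so it preempts the writers
// while they are inside the flash cache code.
static void writer_task(void* arg)
{
  const char* name = (const char*) arg;
  uint8_t buf[64];
  memset(buf, 0x5A, sizeof(buf));

  for (;;)
  {
    InternalFS.remove(name);
    File f(InternalFS);
    if ( f.open(name, FILE_O_WRITE) ) { f.write(buf, sizeof(buf)); f.close(); }
  }
}

static void high_prio_task(void* arg)
{
  (void) arg;
  uint8_t counter = 0;
  for (;;)
  {
    vTaskDelay(pdMS_TO_TICKS(25));               // short period -> frequent preemption
    InternalFS.remove("/hiprio.bin");
    File f(InternalFS);
    if ( f.open("/hiprio.bin", FILE_O_WRITE) ) { f.write(&counter, 1); f.close(); }
    counter++;
  }
}

void setup()
{
  InternalFS.begin();                             // initialize before any task touches the FS

  xTaskCreate(writer_task,    "w0", 1024, (void*) "/w0.bin", tskIDLE_PRIORITY + 1, NULL);
  xTaskCreate(writer_task,    "w1", 1024, (void*) "/w1.bin", tskIDLE_PRIORITY + 1, NULL);
  xTaskCreate(high_prio_task, "hi", 1024, NULL,              tskIDLE_PRIORITY + 3, NULL);
}

void loop() { }
```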

henrygab commented 4 years ago

@czuvich , I just wanted to note that I am recommending an update to use the PR referenced in #397.

It turns out that LFS is not safe to use from multiple contexts. Therefore, PR #397 serializes at the LFS wrapper API.

Wrapping all LFS APIs will prevent the LFS code itself from getting into internally inconsistent states, very similar to the types of issues as the cache layer serialization addressed (only LFS issues would be much more complex to debug).

I have run the stress test sketch in PR #397, and have not seen corruption ... even without the cache layer synchronization.

If you also test this PR, of course I'd appreciate knowing any results you get. I am quite hopeful that this will fix the remaining potential corruption issues.

czuvich commented 4 years ago

@henrygab Thanks for the update! I'll take a look at merging into my code. Just curious, what changed your mind to go from xSemaphoreCreateMutex to xSemaphoreCreateMutexStatic? I've been running xSemaphoreCreateMutex with great success so far.

henrygab commented 4 years ago

Thanks for the extra set of eyes.

Why static....

Use of statically allocated buffers ...

  1. avoids additional error paths, because the initialization cannot fail (as here)
  2. prevents runtime reliance on heap state
  3. avoids heap fragmentation (if allocating and freeing, not really applicable here)
  4. avoids bugs that require a multi-day repro (e.g., due to specific heap state required)

Thus, when possible and architecturally reasonable, the use of statically allocated resources can improve stability and reduce code complexity. Lower complexity means fewer bugs, and makes it easier to reason about the behavior of the system.
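As a small illustration of the difference (plain FreeRTOS API, not the PR's exact code; the static form needs configSUPPORT_STATIC_ALLOCATION enabled, which the PR's use of xSemaphoreCreateMutexStatic implies is the case here):

```cpp
#include "FreeRTOS.h"
#include "semphr.h"

static SemaphoreHandle_t _mutex;
static StaticSemaphore_t _mutex_storage;     // control block owned by us, not the heap

void init_serialization(void)
{
  // Dynamic: allocates the control block from the FreeRTOS heap and can return
  // NULL at runtime, which is one more error path to handle.
  //   _mutex = xSemaphoreCreateMutex();

  // Static: the storage is ours, so creation cannot fail for lack of heap and
  // the runtime behavior does not depend on heap state or fragmentation.
  _mutex = xSemaphoreCreateMutexStatic(&_mutex_storage);
}
```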

A good goal is to attempt to write IoT / embedded systems code that never allocates memory after startup. Anything that has to allocate memory, or otherwise relies upon a global heap, to enable forward progress is a failure point waiting to happen.

See also rule #3 in NASA JPL's Rules for Developing Safety Critical Code -- I find these guidelines useful for all projects, even when I relax / ignore them, as they give an ideal to strive for.

Hope that helps!

czuvich commented 4 years ago

Thanks for the detailed explanation and links! I’m new to hardware programming (but not programming), so this is really useful. I’ll try out the new code and let you know if I run into issues. Again.. thank you @henrygab.