eclipse-threadx / filex

Eclipse ThreadX - FileX is a high-performance, FAT-compatible file system that’s fully integrated with Eclipse ThreadX RTOS
https://github.com/eclipse-threadx/rtos-docs/blob/main/rtos-docs/filex/index.md
MIT License

Fault tolerance issue in FAT12 #14

Open smithBraun opened 2 years ago

smithBraun commented 2 years ago

Background: When a file is deleted, its FAT chain is deleted from end to start (in _fx_utility_FAT_flush, called by _fx_fault_tolerant_cleanup_FAT_chain). After a power-down, fault tolerance knows only the beginning of the FAT chain, so it walks the chain again to its end (which may now be shorter if deletion had already started before the power-down) and continues deleting from end to beginning.

The bug: in FAT12, a FAT entry may be split across two sectors. If the power-down occurs between writing one sector and writing the other, then when the chain is walked after the power-down this entry may point to the wrong place, which causes unrelated entries to be erased.
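As an illustration of the split described above, here is a minimal sketch that checks whether a FAT12 entry crosses a sector boundary. It assumes the standard FAT12 packing of 12 bits (1.5 bytes) per entry; the helper names `fat12_entry_offset` and `fat12_entry_straddles` are hypothetical and are not FileX functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Byte offset of a FAT12 entry from the start of the FAT.
   Each entry is 12 bits, so entry n starts at floor(n * 1.5). */
static uint32_t fat12_entry_offset(uint32_t cluster)
{
    return cluster + (cluster / 2u);
}

/* True when the entry's two bytes land in different sectors, i.e. its
   first byte is the last byte of a sector. */
static bool fat12_entry_straddles(uint32_t cluster, uint32_t bytes_per_sector)
{
    assert(bytes_per_sector > 0u);
    return (fat12_entry_offset(cluster) % bytes_per_sector) == (bytes_per_sector - 1u);
}
```

With 512-byte sectors, entry 0x155 starts at byte 511, so `fat12_entry_straddles(0x155, 512)` is true, while entry 0x014 sits entirely inside the first sector.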

smithBraun commented 2 years ago

Hi, I understand that investigating and solving this issue may take a long time. I would appreciate it if you could tell me whether you agree or disagree that it is a real issue, let me know when you have a possible solution, and give me an early drop of it.

TiejunMS commented 2 years ago

@smithBraun , thanks for reporting the issue. We are working on reproducing the issue and will keep you updated.

smithBraun commented 2 years ago

Hi @TiejunMS, thank you. I just want to mention that if one part of the FAT entry is 0 (no matter whether it is the part in the first sector or in the second), there won't be an issue.
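A small sketch of why a zero part is harmless, under stated assumptions: the entry is odd-numbered and straddles a sector boundary (so its low 4 bits are in the first sector and its high 8 bits in the second), the entry is being freed (new value 0x000), and only the first sector reaches the media. The function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Value read back for an odd straddling entry after only the first
   sector was rewritten while freeing it: the low nibble is cleared,
   the high 8 bits still hold the old contents. */
static uint16_t torn_value_odd_entry(uint16_t old_value)
{
    assert(old_value <= 0x0FFFu);
    return (uint16_t)(old_value & 0x0FF0u);
}

/* The torn state is harmless when it equals either the old value
   (nothing visibly changed) or the new value 0x000 (fully freed). */
static bool torn_free_is_harmless(uint16_t old_value)
{
    uint16_t torn = torn_value_odd_entry(old_value);
    return torn == old_value || torn == 0x000u;
}
```

Under these assumptions, an old value of 0x010 or 0x004 is harmless, while 0x014 leaves a torn value of 0x010, which is neither the old nor the new value.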

smithBraun commented 2 years ago

A similar issue can happen when the FAT chain is written, that is, when _fx_utility_FAT_flush is called by _fx_utility_FAT_entry_write (when FX_FAULT_TOLERANT_STATE_SET_FAT_CHAIN is set).

smithBraun commented 2 years ago

Hi @TiejunMS , Any success with reproducing?

TiejunMS commented 2 years ago

@smithBraun , did you encounter this issue by analysis, or did you run into it in an application? Here is my analysis of this issue.

Let's say the bytes per sector is 512 and the sectors per cluster is 1. On FAT12, each sector can hold about 341 FAT entries. The original FAT chain of the file is as below.

700(3)->400(2)->800(3)->END

The FAT entries of this file start in the third FAT sector, point to the second FAT sector, and then to the third sector.

When this file is deleted, in fx_fault_tolerant_cleanup_FAT_chain.c, all three FAT entries will be cached and deleted from back to front. FAT entry 800 will be deleted first. Because the sector of FAT entry 400 is different from that of 800, the change to FAT entry 800 (from 800->END to 800->FREE) will be flushed to disk. If the power-off happens before FAT entry 400 is deleted, the FAT chain will look like this.

700(3)->400(2)->800(3)->FREE

On the next power-on, we will do nothing to FAT entry 800 because it is already freed. Only FAT entries 700 and 400 will be deleted.
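The sector numbers in parentheses can be checked with a short sketch. It assumes 1-based FAT sector numbering (as in the chain above) and the standard FAT12 packing of 1.5 bytes per entry; `fat12_sector_of_entry` is a hypothetical helper, not a FileX API.

```c
#include <assert.h>
#include <stdint.h>

/* 1-based index of the FAT sector that holds the first byte of a
   FAT12 entry. */
static uint32_t fat12_sector_of_entry(uint32_t cluster, uint32_t bytes_per_sector)
{
    assert(bytes_per_sector > 0u);
    uint32_t byte_offset = cluster + (cluster / 2u);  /* floor(cluster * 1.5) */
    return (byte_offset / bytes_per_sector) + 1u;
}
```

For 512-byte sectors this gives sector 3 for entry 700, sector 2 for entry 400, and sector 3 for entry 800.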

"after the power down when looking on the chain this entry may point on wrong place"

I'm not sure about the entry pointing to the wrong place. Did you mean that FAT entry 400 still points to 800?

If this example is not suitable for the issue you described, could you share the FAT chain and where the power off happens during deleting the FAT chain?

smithBraun commented 2 years ago

@TiejunMS sorry for not being clear enough; I see you misunderstood the bug I described.

"did you encounter this issue by analysis or run into this issue in application"

I ran into this issue while running power-down tests on FileX.

"If this example is not suitable for the issue you described, could you share the FAT chain and where the power off happens during deleting the FAT chain?"

Sure. Let's take your example of 512 bytes per sector and 1 sector per cluster. I have two chains:

FAT(0x155) == 0x014 -> FAT(0x014) == 0xfff -> END
FAT(0x010) == 0xfff -> END

Look at the entry at 0x155. A 512-byte sector contains 0x155 + 1/3 FAT entries, so when the entries are mapped to sectors, this entry is split in two: the 0x004 part is in sector 1 and the 0x010 part is in sector 2:

(1,2) FAT(0x155) == 0x014 -> (1) FAT(0x014) == 0xfff -> END
(1) FAT(0x010) == 0xfff -> END

Now let's say deletion of the first chain begins, from back to front as you mentioned. Sector 1 is updated first, so FAT entry 0x014 is freed but entry 0x155 is only partially updated:

(1,2) FAT(0x155) == 0x010 -> (1) FAT(0x010) == 0xfff -> END
FAT(0x014) == 0x000

A power-down in this state causes corruption: the FAT chain cleanup restarts, FAT entry 0x155 now points to the wrong place, and so FAT entry 0x010 gets freed.

You can simulate the power-down at https://github.com/azure-rtos/filex/blob/89976978ff0ae62588e1871ea82fe05c67614c85/common/src/fx_utility_FAT_flush.c#L154-L155 where the code detects that a FAT entry is split across two sectors and the first part has already been written.
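The scenario above can be reproduced on an in-memory FAT image. This is a minimal sketch, not FileX code: it assumes the standard FAT12 bit packing, a two-sector FAT, and a flush that writes sector 1 and then loses power before sector 2. The names `fat12_read`, `fat12_write`, and `simulate_torn_delete` are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE 512u

/* Read a 12-bit entry from a FAT12 image. */
static uint16_t fat12_read(const uint8_t *fat, uint32_t cluster)
{
    uint32_t off = cluster + (cluster / 2u);
    uint16_t raw = (uint16_t)(fat[off] | (fat[off + 1u] << 8));
    return (cluster & 1u) ? (uint16_t)(raw >> 4) : (uint16_t)(raw & 0x0FFFu);
}

/* Write a 12-bit entry, preserving the neighbouring entry's nibble. */
static void fat12_write(uint8_t *fat, uint32_t cluster, uint16_t value)
{
    uint32_t off = cluster + (cluster / 2u);
    assert(value <= 0x0FFFu);
    if (cluster & 1u) {
        fat[off]      = (uint8_t)((fat[off] & 0x0Fu) | ((value & 0x0Fu) << 4));
        fat[off + 1u] = (uint8_t)(value >> 4);
    } else {
        fat[off]      = (uint8_t)(value & 0xFFu);
        fat[off + 1u] = (uint8_t)((fat[off + 1u] & 0xF0u) | ((value >> 8) & 0x0Fu));
    }
}

/* Build the two chains on "disk", free the first chain in a cached
   copy, then flush only the first sector to model the power loss. */
static void simulate_torn_delete(uint8_t disk[2u * SECTOR_SIZE])
{
    uint8_t cache[2u * SECTOR_SIZE];

    memset(disk, 0, 2u * SECTOR_SIZE);
    fat12_write(disk, 0x155u, 0x014u);   /* chain A: 0x155 -> 0x014 -> END */
    fat12_write(disk, 0x014u, 0xFFFu);
    fat12_write(disk, 0x010u, 0xFFFu);   /* chain B: 0x010 -> END */

    memcpy(cache, disk, sizeof cache);
    fat12_write(cache, 0x014u, 0x000u);  /* free chain A from back to front */
    fat12_write(cache, 0x155u, 0x000u);

    memcpy(disk, cache, SECTOR_SIZE);    /* sector 1 written, sector 2 lost */
}
```

After `simulate_torn_delete`, reading entry 0x155 from the image yields 0x010, so a restarted cleanup that follows the chain from 0x155 would reach, and free, the unrelated entry 0x010, which still holds 0xFFF.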

TiejunMS commented 2 years ago

@smithBraun , thanks for sharing the details! I confirm this is an issue and will come up with a solution. I will keep you posted.

smithBraun commented 2 years ago

Great, thanks @TiejunMS . I would be happy to get the fix as soon as it is implemented, rather than wait for an official release, so I can re-run my tests and make sure I can't find more corner cases.

smithBraun commented 2 years ago

Hi @TiejunMS , any updates on this issue?

TiejunMS commented 2 years ago

@smithBraun , the fix is a work in progress. Could you send an email to Azure RTOS support (azure-rtos-support@microsoft.com)? Once it is ready for testing, I can share the source code with you.