dlbeer / dhara

NAND flash translation layer for low-memory systems
Other
409 stars 120 forks source link

DHARA+FATFS: boot sector is remapped without a copy #23

Closed gedworker closed 2 years ago

gedworker commented 3 years ago

Hello,

I have faced with a problem which i can't solve for more than two weeks. I am using dhara with FatFS file system and a nand flash. I wrote a simple test which opens a file writes 64 bytes of data and closes it. The test spins for about 1000-3000 times. The flash is large enough to hold all written data. Everything works fine except that sometimes the fat fs function open file (f_open) crashes. After that I can't mount the file system with f_mount function. I have started digging into dhara and found more or less what is going on. The f_mount function fails because it can't find the boot sector for the fat file system. It is the zero sector which has 0xAA55 signature. I added several logging functions into my test to track what happens with zero sector. I have added an output log to dhara_nand_copy function.Also into dhara_nand_prog function to track all write to pages. Also i check the validity of the zero sector reading it content through dhara_map_find function. So my test sequence looks like the following.

  1. format flash
  2. open file
  3. write 64bytes
  4. close file
  5. read zero sector and check that it has 0xAA55 signature
  6. goto 2

So i see that on f_close function dhara starts copying pages. And sometimes i see that boot sector is copied also. After that i see that boot sector is mapped to a new location. Everything looks fine but when error happens the boot sector is mapped to a new location but I don't see that it was copied. And after that my boot sector check function fails and file system is crashed. But if i read the sector from using old location it is there. So dhara remaps the boot sector but does not copy it to that location. How that could happen? Maybe my configuration is incorrect. What should I check?

Here is a log example:

--- File closed --- BOOT SECTOR is mapped to page 31334 --- file opened ---write 64 bytes ---close file --- -copy from ---- to ---- (A LOT OF OTHER PAGES ARE COPYED) --- -copy from ---- to ---- --- -copy from ---- to ---- --- -copy from 31334 to 31484 (HERE BOOT SECTOR IS COPYED) --- -copy from ---- to ---- --- File closed --- BOOT SECTOR is mapped to page 31484 (HERE BOOT SECTOR IS MAPPED TO A NEW LOCATION) --- file opened ---write 64 bytes ---close file --- -copy from ---- to ---- (A LOT OF OTHER PAGES ARE COPYED) --- -copy from ---- to ---- --- -copy from ---- to ---- --- -copy from ---- to ---- (BOOTSECTOR is not copyed here) --- -copy from ---- to ---- --- -copy from ---- to ---- --- File closed --- BOOT SECTOR is mapped to page 31498 (HERE BOOT SECTOR IS MAPPED TO A NEW LOCATION) --- BOOT SECTOR is invalid

As you see from the last log the boot sector is at the new location but was not copied. Although if i read the previous sector (31484) the boot sector will be there. Why dhara maps a page but does not copy it? Thanks,

dlbeer commented 3 years ago

On Tue, Jun 22, 2021 at 06:28:35AM -0700, gedworker wrote:

I have faced with a problem which i can't solve for more than two weeks. I am using dhara with FatFS file system and a nand flash. I wrote a simple test which opens a file writes 64 bytes of data and closes it. The test spins for about 1000-3000 times. The flash is large enough to hold all written data. Everything works fine except that sometimes the fat fs function open file (f_open) crashes. After that I can't mount the file system with f_mount function. I have started digging into dhara and found more or less what is going on. The f_mount function fails because it can't find the boot sector for the fat file system. It is the zero sector which has 0xAA55 signature. I added several logging functions into my test to track what happens with zero sector. I have added an output log to dhara_nand_copy function.Also into dhara_nand_prog function to track all write to pages. Also i check the validity of the zero sector reading it content through dhara_map_find function. So my test sequence looks like the following.

  1. format flash
  2. open file
  3. write 64bytes
  4. close file
  5. read zero sector and check that it has 0xAA55 signature
  6. goto 2

So i see that on f_close function dhara starts copying pages. And sometimes i see that boot sector is copied also. After that i see that boot sector is mapped to a new location. Everything looks fine but when error happens the boot sector is mapped to a new location but I don't see that it was copied. And after that my boot sector check function fails and file system is crashed. But if i read the sector from using old location it is there. So dhara remaps the boot sector but does not copy it to that location. How that could happen? Maybe my configuration is incorrect. What should I check?

Here is a log example:

---<info> File closed ---<info> BOOT SECTOR is mapped to page 31334 ---<info> file opened ---<info>write 64 bytes ---<info>close file ---<info> -copy from ---- to ---- (A LOT OF OTHER PAGES ARE COPYED) ---<info> -copy from ---- to ---- ---<info> -copy from ---- to ---- ---<info> -copy from 31334 to 31484 (HERE BOOT SECTOR IS COPYED) ---<info> -copy from ---- to ---- ---<info> File closed ---<info> BOOT SECTOR is mapped to page 31484 (HERE BOOT SECTOR IS MAPPED TO A NEW LOCATION) ---<info> file opened ---<info>write 64 bytes ---<info>close file ---<info> -copy from ---- to ---- (A LOT OF OTHER PAGES ARE COPYED) ---<info> -copy from ---- to ---- ---<info> -copy from ---- to ---- ---<info> -copy from ---- to ---- (BOOTSECTOR is not copyed here) ---<info> -copy from ---- to ---- ---<info> -copy from ---- to ---- ---<info> File closed ---<info> BOOT SECTOR is mapped to page 31498 (HERE BOOT SECTOR IS MAPPED TO A NEW LOCATION) ---<info> BOOT SECTOR is invalid

As you see from the last log the boot sector is at the new location but was not copied. Although if i read the previous sector (31484) the boot sector will be there. Why dhara maps a page but does not copy it? Thanks,

Copying is not the only way that a sector can be remapped -- it could occur with it being rewritten, and in that case you wouldn't see a call to dhara_nand_copy.

The next thing I would try would be to log calls to dhara_nand_prog as well and see if anything turns up for the final location.

-- Daniel Beer @.***> http://dlbeer.co.nz/ PGP: BA6E 0B26 1F89 246C E3F3 C910 1E58 C43A 160A 553B

gedworker commented 3 years ago

Hello Daniel,

Thanks for reply. I forgot to mention that I have added a log to nand prog function to track writes to pages. Here is the log before error happen.

[00:34:21.724,609] file_srv: Write 64 bytes on file [123.bin] [00:34:21.724,609] file_srv: Open file: 123.bin [0x20031540] [00:34:21.745,117] file_srv: File opened [00:34:21.745,117] file_srv: Writing file [0x20031540] 64 bytes [00:34:21.745,117] file_srv: Write 64 bytes to file, bytes written 64, op time 0 mS [00:34:21.745,117] file_srv: Close file [0x20031540].. [00:34:21.746,093] FRL_NAND: DHARA writing to page: 41552 [00:34:21.754,882] FRL_NAND: DHARA writing to page: 41553 [00:34:21.761,718] FRL_NAND: DHARA copy from: 41261 to 41554 [00:34:21.783,203] FRL_NAND: DHARA copy from: 41262 to 41555 [00:34:21.804,687] FRL_NAND: DHARA copy from: 41266 to 41556 [00:34:21.826,171] FRL_NAND: DHARA copy from: 41267 to 41557 [00:34:21.847,656] FRL_NAND: DHARA copy from: 41268 to 41558 [00:34:21.868,164] FRL_NAND: DHARA copy from: 41269 to 41559 [00:34:21.889,648] FRL_NAND: DHARA copy from: 41270 to 41560 [00:34:21.911,132] FRL_NAND: DHARA copy from: 41271 to 41561 [00:34:21.931,640] FRL_NAND: DHARA copy from: 41272 to 41562 [00:34:21.952,148] FRL_NAND: DHARA copy from: 41273 to 41563 [00:34:21.972,656] FRL_NAND: DHARA copy from: 41274 to 41564 Logs dropped (1) [00:34:21.994,140] FRL_NAND: DHARA copy from: 41275 to 41565 Logs dropped (1) [00:34:22.015,625] FRL_NAND: DHARA copy from: 41276 to 41566 Logs dropped (1) [00:34:22.035,156] FRL_NAND: DHARA writing to page: 41567 [00:34:22.041,015] file_srv: File closed [00:34:22.041,992] MEM_TST: DHARA:journal root is: 41566 [00:34:22.042,968] MEM_TST: BOOTSECTOR IS MAPPED TO PAGE 41394

[00:34:22.044,921] file_srv: write string to file done, writen 64 bytes Logs dropped (1) [00:34:22.044,921] MEM_TST: 6354

[00:34:22.054,687] file_srv: Write 64 bytes on file [123.bin] [00:34:22.054,687] file_srv: Open file: 123.bin [0x20031540] [00:34:22.074,218] file_srv: File opened [00:34:22.074,218] file_srv: Writing file [0x20031540] 64 bytes [00:34:22.074,218] file_srv: Write 64 bytes to file, bytes written 64, op time 0 mS [00:34:22.074,218] file_srv: Close file [0x20031540].. [00:34:22.075,195] FRL_NAND: DHARA writing to page: 41568 [00:34:22.083,984] FRL_NAND: DHARA writing to page: 41569 [00:34:22.090,820] FRL_NAND: DHARA copy from: 41277 to 41570 [00:34:22.112,304] FRL_NAND: DHARA copy from: 41278 to 41571 [00:34:22.133,789] FRL_NAND: DHARA copy from: 41346 to 41572 [00:34:22.155,273] FRL_NAND: DHARA copy from: 41347 to 41573 [00:34:22.176,757] FRL_NAND: DHARA copy from: 41348 to 41574 [00:34:22.198,242] FRL_NAND: DHARA copy from: 41349 to 41575 [00:34:22.219,726] FRL_NAND: DHARA copy from: 41350 to 41576 [00:34:22.240,234] FRL_NAND: DHARA copy from: 41351 to 41577 [00:34:22.261,718] FRL_NAND: DHARA copy from: 41352 to 41578 [00:34:22.283,203] FRL_NAND: DHARA copy from: 41354 to 41579 [00:34:22.303,710] FRL_NAND: DHARA copy from: 41355 to 41580 [00:34:22.325,195] FRL_NAND: DHARA copy from: 41356 to 41581 [00:34:22.346,679] FRL_NAND: DHARA copy from: 41357 to 41582 [00:34:22.366,210] FRL_NAND: DHARA writing to page: 41583 [00:34:22.371,093] file_srv: File closed Logs dropped (1) [00:34:22.373,046] MEM_TST: DHARA:journal root is: 41582 Logs dropped (1) [00:34:22.374,023] MEM_TST: BOOTSECTOR IS MAPPED TO PAGE 41353

Logs dropped (1) [00:34:22.375,976] MEM_TST: *BOOT SECTOR IS INVALID***** [00:34:22.378,906] MEM_TST: *BOOT SECTOR IS INVALID REP***** Logs dropped (1)

So the boot sector was mapped to 41394. After file close it became mapped to page 41353. But there was no any writes and copy to that page. And obviously the boot sector is not valid because it was not rewritten and or copyied. What else should i check? Thanks,

dlbeer commented 3 years ago

On Wed, Jun 23, 2021 at 03:43:09AM -0700, gedworker wrote:

[00:34:22.041,992] MEM_TST: DHARA:journal root is: 41566 [00:34:22.042,968] MEM_TST: BOOTSECTOR IS MAPPED TO PAGE 41394 ... [00:34:22.373,046] MEM_TST: DHARA:journal root is: 41582 Logs dropped (1) [00:34:22.374,023] MEM_TST: BOOTSECTOR IS MAPPED TO PAGE 41353

This looks like a problem -- during a period in which the root changes from 41566 to 41582, and the journal doesn't wrap, there is no way that the root can change to a page which is not something in that range.

Next steps for debugging might be:

Cheers, Daniel

-- Daniel Beer @.***> http://dlbeer.co.nz/ PGP: BA6E 0B26 1F89 246C E3F3 C910 1E58 C43A 160A 553B

dlbeer commented 3 years ago

On Thu, Jun 24, 2021 at 06:48:17AM +1200, Daniel Beer wrote:

This looks like a problem -- during a period in which the root changes from 41566 to 41582, and the journal doesn't wrap, there is no way that the root can change to a page which is not something in that range.

Sorry, I meant to say "there is no way that the boot sector can change...". Any change to any mapping, if the journal hasn't wrapped, must point to a page in the range between the old and new roots.

-- Daniel Beer @.***> http://dlbeer.co.nz/ PGP: BA6E 0B26 1F89 246C E3F3 C910 1E58 C43A 160A 553B

gedworker commented 3 years ago

Hello, Thanks for help. I will try to check on a simulaor. I have noticed a strange thing. When i create several files so that almost 80% of the space is used i delete one file. So there is more that 50% free space available. After that i create a new file and write to it. But writing now takes much longer than it was after format. Is that relates to dhara gc or something else?

dlbeer commented 3 years ago

On Fri, Jun 25, 2021 at 07:53:15AM -0700, gedworker wrote:

I have noticed a strange thing. When i create several files so that almost 80% of the space is used i delete one file. So there is more that 50% free space available. After that i create a new file and write to it. But writing now takes much longer than it was after format. Is that relates to dhara gc or something else?

When you format, do you also reinitialize the Dhara map? If so, that may explain it. Dhara can avoid doing GC if there are sufficiently many unallocated sectors.

What do you have set for GC ratio? Is there any opportunity to invoke trim() on sectors that are known to be unused in your filesystem (e.g. the sectors occupied by a deleted file)?

-- Daniel Beer @.***> http://dlbeer.co.nz/ PGP: BA6E 0B26 1F89 246C E3F3 C910 1E58 C43A 160A 553B

gedworker commented 3 years ago

Hello, I have tested today with simulator on Visual studio. My test sequence looks the following

  1. format_flahs with DHARA initialization
  2. create_file_100MB --------------------takes 610 MS
  3. create_file_100MB --------------------takes 609 MS
  4. create_file_100MB --------------------takes 609 MS
  5. delete file 1 file using f_unlink
  6. create_file_100MB --------------------takes 34313 MS

The file creation after delete takes 34 seconds. I initialize Dhara map on format. The garbage ratio is set to 4. The size of the flash is 385 MB. So that happens because GC is working? Because when 3 files are created they occupy 300MB. And there 85MB left. I have checked that when i create only two files and delete one and create another. It is created fast enough. I will check whether trim can be used.

dlbeer commented 3 years ago

On Sun, Jun 27, 2021 at 12:53:15PM -0700, gedworker wrote:

I have tested today with simulator on Visual studio. My test sequence looks the following

  1. format_flahs with DHARA initialization
  2. create_file_100MB --------------------takes 610 MS
  3. create_file_100MB --------------------takes 609 MS
  4. create_file_100MB --------------------takes 609 MS
  5. delete file 1 file using f_unlink
  6. create_file_100MB --------------------takes 34313 MS

The file creation after delete takes 34 seconds. I initialize Dhara map on format. The garbage ratio is set to 4. The size of the flash is 385 MB. So that happens because GC is working? Because when 3 files are created they occupy 300MB. And there 85MB left. I have checked that when i create only two files and delete one and create another. It is created fast enough. I will check whether trim can be used.

Yes, that is likely a consequence of GC. You could also set GC_RATIO to something much lower, but then you will have a corresponding drop in usable space.

A read cache will also likely increase performance significantly (many partial reads are performed during both writes and GC).

Cheers, Daniel

-- Daniel Beer @.***> http://dlbeer.co.nz/ PGP: BA6E 0B26 1F89 246C E3F3 C910 1E58 C43A 160A 553B

gedworker commented 3 years ago

Hello,

You could also set GC_RATIO to something much lower

I see that when i set larger value for GC ratio (It is the last parameter in dhara_map_init function) the file is created much faster after delete. Why do you recommend to set lower ratio?

but then you will have a corresponding drop in usable space

Does that mean that on initialization more space will be reserved for dhara structures or more space will be consumed durint writing and deletion.

There is not any good detailed description on dhara internal mechanism so i am a bit confused how everything works under the hood.

dlbeer commented 3 years ago

On Mon, Jun 28, 2021 at 01:21:09PM -0700, gedworker wrote:

I see that when i set larger value for GC ratio (It is the last parameter in dhara_map_init function) the file is created much faster after delete. Why do you recommend to set lower ratio?

Then that's a bit strange, and I'm not sure what's going on there. Did the file creation actually succeed (remember that you will have less available space)?

The GC ratio determines the amount of GC work done per write when the map size is near capacity.

Does that mean that on initialization more space will be reserved for dhara structures or more space will be consumed durint writing and deletion.

It means that more of the raw flash is allowed to be garbage at any given time. This, combined with a lower number of usable sectors, means that less work is required to keep up.

There is not any good detailed description on dhara internal mechanism so i am a bit confused how everything works under the hood.

Have you read map_internals.txt in the repository?

Cheers, Daniel

-- Daniel Beer @.***> http://dlbeer.co.nz/ PGP: BA6E 0B26 1F89 246C E3F3 C910 1E58 C43A 160A 553B

bbrother commented 3 years ago

@gedworker did you ever figure out what was happening? I was struggling with what seems to be the exact same problem this week. I create a FAT32 partition, write a file close the file, unmount and remount. It works sometimes but eventually it fails to read the boot sector. If you found a solution that would be wonderful.

I added a bunch of debugging including verifying all writes/ and copies by reading back and from what I can tell my NAND driver is working fine.

Thanks,

-Brett

gedworker commented 3 years ago

Hello bbrother , That issue with invalid boot sector solved after i have rewritten the functions for dhara (copy,erase,mark bad,check bad). I can't say exactly what was the reason for that. I suggest for you to check your spi interface, check all functions which require dhara. For instance check if copy bock function works fine. If you are still getting invalid boot sector and really see that it is remapped without a copy let me know. I will check my old project. You can also use a simulator. I have set up it with visual studio. Works fine and helps to debug easily.