PaulStoffregen / LittleFS


Bad block management for NAND #54

Open davidefer opened 1 year ago

davidefer commented 1 year ago

Hi Paul, I'm a novice with NAND memories and have a question regarding your NAND driver. It seems that, with the look-up table, the Winbond memory lets one do without sophisticated bad block management, provided that the ratio of 20 bad blocks per Gbit (2%) is respected.
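To make the 2% figure concrete, here is a minimal arithmetic sketch. It assumes the 1 Gbit part is organized as 1024 erase blocks and that the look-up table holds 20 entries; the constant and function names are hypothetical and not taken from the driver.

```cpp
#include <cstdint>

// Assumed geometry: 1 Gbit device = 1024 erase blocks, LUT with 20 entries.
constexpr uint32_t kTotalBlocks = 1024;
constexpr uint32_t kLutEntries  = 20;

// Percentage of all blocks that can be remapped before the LUT is full.
constexpr double remappablePercent() {
  return 100.0 * kLutEntries / kTotalBlocks;
}
```

Under these assumptions the result is just under 2%, which matches the ratio quoted above.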

As I understand it, in your driver you mark the block as bad (addBBLUT) as soon as the ECC reports a correction.

As I understand it, NAND flashes sometimes need the data in a block to be refreshed; please refer to the chapter "Data scrubbing" at this link: https://www.segger.com/products/file-system/emfile/add-ons/device-driver-nand-flash/about-device-driver-nand-flash/. The process involves erasing the entire block and then writing the correct data back.
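The scrubbing sequence described above (read the corrected data, erase the block, write the data back) can be sketched against a simulated block. This is only an illustration: `SimBlock`, `scrubBlock` and the 64 x 2048-byte geometry are assumptions made for the example and are not part of the LittleFS driver.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical in-RAM model of one NAND erase block (not the library's API).
constexpr std::size_t kSimPageBytes     = 2048;
constexpr std::size_t kSimPagesPerBlock = 64;

struct SimBlock {
  std::vector<uint8_t> cells =
      std::vector<uint8_t>(kSimPageBytes * kSimPagesPerBlock, 0xFF);
  uint32_t eraseCount = 0;

  // Erasing sets every bit of the block back to 1.
  void erase() {
    std::fill(cells.begin(), cells.end(), 0xFF);
    ++eraseCount;
  }
  // Programming can only clear bits (1 -> 0), hence the AND.
  void program(std::size_t page, const uint8_t* data) {
    for (std::size_t i = 0; i < kSimPageBytes; ++i)
      cells[page * kSimPageBytes + i] &= data[i];
  }
  // In a real device this read would return the ECC-corrected data.
  void read(std::size_t page, uint8_t* out) const {
    std::copy_n(cells.begin() + page * kSimPageBytes, kSimPageBytes, out);
  }
};

// Scrub: buffer the whole block, erase it, then write the data back.
void scrubBlock(SimBlock& blk) {
  std::vector<uint8_t> copy(kSimPageBytes * kSimPagesPerBlock);
  for (std::size_t p = 0; p < kSimPagesPerBlock; ++p)
    blk.read(p, &copy[p * kSimPageBytes]);
  blk.erase();
  for (std::size_t p = 0; p < kSimPagesPerBlock; ++p)
    blk.program(p, &copy[p * kSimPageBytes]);
}
```

Note that this naive version buffers the complete block in RAM, which may be too much for a small microcontroller.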

In your driver you don't perform data scrubbing and, as said, the block is immediately marked as bad. I guess this is too restrictive, and the 20 bad blocks/Gbit ratio might then no longer hold. The block could still be valid; the data has simply suffered "normal" degradation. If the real 20 bad blocks show up later, your driver will already have filled the look-up table with false positives, thus reducing the useful lifetime of the flash chip.

What do you think about it? Many thanks in advance. Regards, Davide

CSC-Sendance commented 2 months ago

Hi, I also think this is very suboptimal.

I am currently trying to implement something that handles this in the recommended manner with the W25N01 and an nRF52840 platform.

A major issue here is, clearly, that only 128 kB block erases are available. This means that to "refresh/scrub" an "ECC-corrected" block, one would need to read and buffer the full 128 kB of the block, perform the erase operation, then reprogram it. Occupying 128 kB of RAM seems kinda infeasible given the 256 kB of RAM the nRF52840 has (in my case; other platforms probably have much less RAM).
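The RAM budget behind this concern works out as follows. This is a rough sketch that assumes a block consists of 64 pages of 2048 bytes (which gives the 128 kB erase size mentioned above) and treats the 256 kB of RAM as 256 * 1024 bytes; the names are hypothetical.

```cpp
#include <cstddef>

// Assumed geometry: 64 pages x 2048 bytes per erase block.
constexpr std::size_t kPageBytes     = 2048;
constexpr std::size_t kPagesPerBlock = 64;
constexpr std::size_t kBlockBytes    = kPageBytes * kPagesPerBlock;

// RAM size quoted in the comment above.
constexpr std::size_t kRamBytes = 256 * 1024;

// Share of total RAM that a full-block scrub buffer would occupy, in percent.
constexpr std::size_t scrubBufferPercentOfRam() {
  return 100 * kBlockBytes / kRamBytes;
}
```

A full-block buffer would therefore take half of the total RAM before the stack, heap and the rest of the application are accounted for.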

As such, I think the mentioned check for corrected ECC errors with immediate scrubbing can only realistically be performed under special circumstances (e.g. in advance/periodically in the bootloader before the OS is booted?) but not during normal operation. This may also be why scrubbing is not mentioned at all in the W25N01 documentation. Accordingly, I think the only practical way is to "skip scrubbing" BUT still not congest the BBLUT when the "bad" page was successfully corrected anyway.

For this lib, would this simply mean introducing a small break in the switch (case 1)?

// Check ECC
    uint8_t statReg = readStatusRegister(0xC0, false);
    uint8_t eccCode = (((statReg) & ((1 << 5)|(1 << 4))) >> 4);

    wait(progtime);

    switch (eccCode) {
    case 0: // Successful read, no ECC correction
      break;
    case 1: // Successful read with ECC correction
      //Serial.printf("Successful read with ECC correction (addr, code): %x, %x\n", addr, eccCode);
      break; // <--- add this to avoid congesting BBLUT with nicely corrected pages
    case 2: // Uncorrectable ECC in a single page
      //Serial.printf("Uncorrectable ECC in a single page (addr, code): %x, %x\n", addr, eccCode);
      // fall through: both uncorrectable cases are handled below
    case 3: // Uncorrectable ECC in multiple pages
      addBBLUT(LINEAR_TO_BLOCK(addr));
      //deviceReset();
      //Serial.printf("Uncorrectable ECC in multiple pages (addr, code): %x, %x\n", addr, eccCode);
      break;
    }
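The same policy can also be written as two small helpers that are easy to check in isolation. This is a sketch only: `EccResult`, `decodeEcc` and `shouldRetireBlock` are hypothetical names, and the bit positions (bits 5 and 4 of the status byte) are taken from the snippet above.

```cpp
#include <cstdint>

// The four ECC codes handled by the switch above.
enum class EccResult : uint8_t {
  Clean               = 0,  // successful read, no correction
  Corrected           = 1,  // successful read with ECC correction
  UncorrectableSingle = 2,  // uncorrectable ECC in a single page
  UncorrectableMulti  = 3   // uncorrectable ECC in multiple pages
};

// Extract the two ECC bits (bits 5 and 4) from the status register value.
constexpr EccResult decodeEcc(uint8_t statReg) {
  return static_cast<EccResult>((statReg >> 4) & 0x03);
}

// Proposed policy: only uncorrectable reads mark the block as bad.
constexpr bool shouldRetireBlock(EccResult r) {
  return r == EccResult::UncorrectableSingle ||
         r == EccResult::UncorrectableMulti;
}
```

With this split, a corrected read (code 1) never reaches the BBLUT, while codes 2 and 3 still do.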

Edit: addBBLUT currently also does nothing but check the existing entries; it does NOT add new ones. I guess @PaulStoffregen stopped implementing this for some reason, since the "write BBLUT entry" logic is wrapped in an #ifdef LATER, where "LATER" does not seem to be defined anywhere. The SPI transaction is also commented out, so nothing would be written even if LATER were defined somewhere.
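For readers less familiar with the preprocessor, the effect of such a guard can be illustrated with a tiny stand-in. This is not the library's code; `addEntrySketch` and `lutWrites` are hypothetical names used only to show that the guarded statement is not compiled while LATER is undefined.

```cpp
// Counts how many times the "write" path actually ran.
int lutWrites = 0;

// Stand-in for a function whose write logic sits behind #ifdef LATER.
void addEntrySketch(int block) {
  (void)block;   // unused while the guarded code is compiled out
#ifdef LATER
  ++lutWrites;   // only compiled in if LATER is defined at build time
#endif
}
```

Unless the build defines LATER (for example via a compiler flag), calling `addEntrySketch` leaves `lutWrites` at 0.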