analogdevicesinc / linux

Linux kernel variant from Analog Devices; see README.md for details
https://github.com/analogdevicesinc/linux
Other
405 stars 820 forks source link

Bricking of FL-S Flash S25FL512S due to a bug in its Linux driver #2519

Open ChrisMcGowanAu opened 2 weeks ago

ChrisMcGowanAu commented 2 weeks ago

We were using ADI Linux tag "2021_r1_release" on our project, rather than the Xilinx release, as were were using an ADI device for which there was no compatible driver for our Xilinx SOC, a Mercury XU1. We got it working OK and went through extensive testing. The Mercury XU1 has a S25FL512S FL-S Flash as part of the SOC. During a testing run, the power to the system was turned on then off quickly, this resulted in the S25FL512S getting "Bricked" on both SOCS which start at the same time. For us this is a very serious issue as replacing or repairing the SOC would be extremely expensive, as the installed device is very difficult and costly to get to to change. We narrowed down the issue to "linux/drivers/mtd/spi-nor/core.c"

> 3450 static int spi_nor_init(struct spi_nor *nor)
> 3451 {
> 3452         struct spi_nor_flash_parameter *params;
> 3453         int err, idx;
> 3454 
> 3455         if (nor->info->id[0] == CFI_MFR_ATMEL ||
> 3456             nor->info->id[0] == CFI_MFR_INTEL ||
> 3457             nor->info->id[0] == CFI_MFR_SST ||
> 3458             nor->info->id[0] & SNOR_F_HAS_LOCK) {
> 3459                 spi_nor_write_enable(nor);
> 3460                 nor->bouncebuf[0] = 0;
> 3461                 spi_nor_write_sr(nor, nor->bouncebuf, 1);
> 3462         }

In our case, this forced this driver code to execute the 3 lines 3459 to 3461 because our device has SNOR_F_HAS_LOCK set. The effect of the call to "spi_nor_write_sr(nor, nor->bouncebuf, 1);" on line 3461 was that it took between 440ms to around 500ms to complete. If we got a power off during this call, the S25FL512S FL-S Flash would brick itself.
Our fix was to change the '||' on line 3458 to and '&&' this bypassing this if statement. We did a lot of testing and proved that a power off at 12.2 seconds after power on would 'brick' the device each time we tried it. With the fix, it worked perfectly with no issues. I will raise a pull request later, and also mention this issue for the Xilinx Linux release.

ChrisMcGowanAu commented 2 weeks ago

Assign this issue to me, I have a fix.

nunojsa commented 2 weeks ago

Assign this issue to me, I have a fix.

Please do use git blame and try to understand if this is something that should go upstream or to the xilinx tree (patches also via mailing list). We don't tweak that driver much so odds are that it's not our bug.

ChrisMcGowanAu commented 1 week ago

Thanks nunojsa, Yes, this is likely to be an up upstream issue from Xilinx, not directly from Analog devices. I wanted to report it in case anyone else has this issue, and describe the fix we did. I will also report it to Xilinx when I can work out how to do this :-)