dlbeer / dhara

NAND flash translation layer for low-memory systems
Other
409 stars 120 forks source link

journal.c: cp_free does an illegal dhara_nand_is_free if on last group and none of the pages are free #22

Closed bjornwedell closed 2 years ago

bjornwedell commented 3 years ago

https://github.com/dlbeer/dhara/blob/6cc57366448ff5dd9b7e370843fc693f22605b64/dhara/journal.c#L396

We have seen a situation where we get failed READPAGE command with IpCommandSequenceErrorFlagSet. This is the log:

[ 3.350000] w25n_transfer: Spi command failed with error: 7002 [ 3.350000] w25n_load_nand_buffer: Failed to load page at 0x10000804: 3 [ 3.360000] w25n_nand_read: Failed to load nand buffer with data from 0x10000804. [ 3.360000] dhara_nand_is_free: Failed to check if page 65536 is free: 3

This basically means that we get an SPI error caused by a read outside the configured memory area. This is our dhara config: ` / W25N01GW (1Gb / 128 MB) memory capacity /

/ Log2 sizes defined for Dhara /

define W25N01GW_PAGE_SIZE_LOG2 (11)

define W25N01GW_PAGE_SIZE (1 << W25N01GW_PAGE_SIZE_LOG2)

define W25N01GW_BLOCK_COUNT (1024)

define W25N01GW_PAGES_PER_BLOCK_LOG2 (6)

define W25N01GW_PAGES_PER_BLOCK (1 << W25N01GW_PAGES_PER_BLOCK_LOG2)

define W25N01GW_BLOCK_SIZE (W25N01GW_PAGES_PER_BLOCK * W25N01GW_PAGE_SIZE)

`

To me it seems that if we for some reason end up in the last checkpoint group and that group have no free pages we will keep looking "over the edge", since we start by skipping one page to next user page before the call to cp_free.

Can you help me?

dlbeer commented 3 years ago

On Thu, May 06, 2021 at 07:49:07AM -0700, bjornwedell wrote:

https://github.com/dlbeer/dhara/blob/6cc57366448ff5dd9b7e370843fc693f22605b64/dhara/journal.c#L396

We have seen a situation where we get failed READPAGE command with IpCommandSequenceErrorFlagSet. This is the log:

[ 3.350000] w25n_transfer: Spi command failed with error: 7002 [ 3.350000] w25n_load_nand_buffer: Failed to load page at 0x10000804: 3 [ 3.360000] w25n_nand_read: Failed to load nand buffer with data from 0x10000804. [ 3.360000] dhara_nand_is_free: Failed to check if page 65536 is free: 3

To me it seems that if we for some reason end up in the last checkpoint group and that group have no free pages we will keep looking "over the edge", since we start by skipping one page to next user page before the call to cp_free.

Can you help me?

The increment in next_upage() wraps when it reaches the end of the chip.

I would first double-check the parameters you've filled in struct dhara_nand, and if you're sure they're correct, let me know what you have.

-- Daniel Beer @.***> http://dlbeer.co.nz/ PGP: BA6E 0B26 1F89 246C E3F3 C910 1E58 C43A 160A 553B

bjornwedell commented 3 years ago

Thank you for your quick response,

I agree with you that next_upage() will wrap, but in this particular situation the next user page will be one page after last checkpoint and from what I understand cp_free will check a full 1 << ppc without wrapping, i.e run outside the memory.

However, in my situation this should actually not be a problem since dhara_nand_is_free will, part from the error log, just act as if the memory is not free. It rather seems I had some actual problem with the memory and I will investigate it further in other directions.

dlbeer commented 3 years ago

On Fri, May 07, 2021 at 07:19:42AM -0700, bjornwedell wrote:

Thank you for your quick response,

I agree with you that next_upage() will wrap, but in this particular situation the next user page will be one page after last checkpoint and from what I understand cp_free will check a full 1 << ppc without wrapping, i.e run outside the memory.

Yes, you're right!

And actually, find_head shouldn't scan a user-page at a time, because we need to examine all the user-pages in the checkpoint group anyway.

I'll see what I can come up with for a fix.

-- Daniel Beer @.***> http://dlbeer.co.nz/ PGP: BA6E 0B26 1F89 246C E3F3 C910 1E58 C43A 160A 553B

dlbeer commented 3 years ago

On Fri, May 07, 2021 at 07:19:42AM -0700, bjornwedell wrote:

Thank you for your quick response,

I agree with you that next_upage() will wrap, but in this particular situation the next user page will be one page after last checkpoint and from what I understand cp_free will check a full 1 << ppc without wrapping, i.e run outside the memory.

However, in my situation this should actually not be a problem since dhara_nand_is_free will, part from the error log, just act as if the memory is not free. It rather seems I had some actual problem with the memory and I will investigate it further in other directions.

Would you mind trying the attached patch and seeing if it works for you?

-- Daniel Beer @.***> http://dlbeer.co.nz/ PGP: BA6E 0B26 1F89 246C E3F3 C910 1E58 C43A 160A 553B

diff --git a/dhara/journal.c b/dhara/journal.c index 3573037..87743c3 100644 --- a/dhara/journal.c +++ b/dhara/journal.c @@ -368,7 +368,9 @@ static int find_root(struct dhara_journal j, dhara_page_t start, static int find_head(struct dhara_journal j, dhara_page_t start, dhara_error_t *err) {

bjornwedell commented 3 years ago

Thank you Daniel and sorry for the slow reply. I am afraid I am not able to reproduce the situation we encountered earlier. It has only occurred once and I am not completely sure in what state the chip was in.

However, when running with your patch in my own tests forcing the head to the end of the memory I now see that the last checkpoint group is handled without attempts to read outside the memory, which is an improvement.

dlbeer commented 3 years ago

On Tue, May 18, 2021 at 04:51:00AM -0700, bjornwedell wrote:

Thank you Daniel and sorry for the slow reply. I am afraid I am not able to reproduce the situation we encountered earlier. It has only occurred once and I am not completely sure in what state the chip was in.

However, when running with your patch in my own tests forcing the head to the end of the memory I now see that the last checkpoint group is handled without attempts to read outside the memory, which is an improvement.

Hi Bjorn,

Ok, thanks for testing that. I've pushed the change now. I'm going to guess that your original issue may well have been related to requesting out-of-bounds page data due to this bug.

Cheers, Daniel

-- Daniel Beer @.***> http://dlbeer.co.nz/ PGP: BA6E 0B26 1F89 246C E3F3 C910 1E58 C43A 160A 553B