dlbeer / dhara

NAND flash translation layer for low-memory systems

Expected overheads / available capacity #38


pgreenland commented 1 year ago

Hi,

Hopefully a quick question on how available capacity is calculated and how it may be maximised.

I've got dhara working with a 2Gb Micron MT29F and it appears to be working well. The APIs were straightforward to implement, and the documentation got me going really quickly (great job). It stores and retrieves data; looking good.

Micron's flash has 64 x 2KB pages per block and 2048 total blocks. Its ECC is such that each page can be written 4 times. I've expressed this to dhara as 512-byte pages, with 256 pages per block.

GC ratio is set to the example value from the readme (4).
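For reference, the setup described above looks roughly like the sketch below against dhara's public API (the struct dhara_nand fields and map calls are as I read them in dhara's nand.h / map.h; the dhara_nand_* driver callbacks for the chip are implemented elsewhere, and ftl_start() is just a placeholder name):

    #include <stdint.h>
    #include <stdio.h>

    #include "dhara/map.h"   /* dhara_map_init(), dhara_map_resume(), dhara_map_capacity() */
    #include "dhara/nand.h"  /* struct dhara_nand */

    /* Geometry as presented to dhara: each 2KB physical page is treated as
     * four 512-byte partial-page programs, so a 64-page erase block becomes
     * 256 pages of 512 bytes. */
    static const struct dhara_nand nand = {
        .log2_page_size = 9,    /* 512-byte pages  */
        .log2_ppb       = 8,    /* 256 pages/block */
        .num_blocks     = 2048,
    };

    static struct dhara_map map;
    static uint8_t page_buf[512];   /* one page of scratch space for the map layer */

    void ftl_start(void)
    {
        dhara_error_t err;

        dhara_map_init(&map, &nand, page_buf, 4 /* gc_ratio, as in the readme */);
        dhara_map_resume(&map, &err);   /* on a blank chip this fails and the map just starts empty */

        printf("usable sectors: %lu\n", (unsigned long)dhara_map_capacity(&map));
    }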

After initialisation dhara_map_capacity reports 307456 sectors, which at 512 bytes per sector is 150.125 MiB.

Dhara is therefore using 41.3% of the raw 256 MiB NAND capacity for its own internal purposes.

Nosing around in the code I saw that I could increase the GC ratio, which would result in a reduction of reserved capacity.

Presumably this would result in an increase in the number of garbage collection cycles, slowing writes as GC is performed more often?

Is there anything else I could do to increase usable capacity? I couldn't see many other levers to pull.

Thanks,

Phil

mintisan commented 11 months ago

+1

Yaxit commented 2 months ago

This does not sound right, did you solve this issue?

Forty-Bot commented 2 months ago

The overhead can be calculated from the memory layout:

Memory layout
=============

This is a memory layout example with the following configuration:

  * log2_ppc = 2, i.e. a checkpoint group contains 4 pages
  * log2_ppb = 4, i.e. 16 pages per erase block

  A single erase block layout, each slot represents a physical flash page:

    +------+------+------+------+
    | data | data | data |  cp  |
    +------+------+------+------+
    | data | data | data |  cp  |
    +------+------+------+------+
    | data | data | data |  cp  |
    +------+------+------+------+
    | data | data | data |  cp  |
    +------+------+------+------+

  * data: User data (len = (1 << log2_page_size))
  * cp:   Checkpoint metadata for the preceding 3 pages, i.e. (1 << log2_ppc) - 1

<snip>

### Checkpoint metadata

This zone is 132 bytes long (DHARA_META_SIZE), and it is where the radix tree is
implemented, i.e. where the map between physical pages and logical sectors is
saved. It contains the alt pointer vector, {A0, A1, A2, ..., A(31)}, and the
logical sector number (id). There is one of these structures per mapped
physical page.

    |---------- 32 bits ------------|
    +-------------------------------+
    |              id               |
    +-------------------------------+
    |             ALT               |
    :           POINTERS            :
    |            32x4B              |
    +-------------------------------+

The important thing to note is that

There is one of these [132-byte] structures per mapped physical page

So the base overhead with 512-byte pages is about 20%. But we can only fit three of these metadata entries in a 512-byte checkpoint page, so the real overhead is 25% (512 bytes of overhead per 1536 bytes of data).

Then on top of this, some of the capacity is reserved for garbage collection: 1/(gc_ratio + 1) of the journal capacity is held back, along with a small safety margin and an up-front allowance for bad blocks.
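As a rough sanity check, the reported 307456 sectors can be reproduced with a back-of-the-envelope like the one below. This is a sketch of my reading of the journal/map code, not the code itself; in particular, the up-front bad-block allowance of num_blocks/64 and the 8-block safety margin are my interpretation of how the numbers line up.

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* Geometry as dhara sees it (512-byte pages). */
        const uint32_t num_blocks = 2048;
        const uint32_t ppb        = 256;    /* pages per erase block      */
        const uint32_t ppc        = 4;      /* pages per checkpoint group */
        const uint32_t gc_ratio   = 4;

        /* Blocks the journal doesn't count on: an up-front bad-block allowance
         * of roughly num_blocks/64, plus one block of headroom (my reading of
         * the journal code -- the exact bookkeeping may differ). */
        const uint32_t good_blocks = num_blocks - num_blocks / 64 - 1;      /* 2015   */

        /* Each checkpoint group of 4 pages carries 3 sectors of user data. */
        const uint32_t journal_cap = good_blocks * (ppb / ppc) * (ppc - 1); /* 386880 */

        /* gc_ratio = 4 holds back 1/(gc_ratio + 1) of that for garbage
         * collection, and a safety margin of 8 blocks' worth of pages comes
         * off the top as well. */
        const uint32_t capacity = journal_cap
                                - journal_cap / (gc_ratio + 1)              /* 77376  */
                                - 8 * ppb;                                  /* 2048   */

        const uint32_t raw = num_blocks * ppb;                              /* 524288 */
        printf("capacity: %u sectors\n", (unsigned)capacity);   /* 307456, as reported */
        printf("overhead: %.1f%%\n", 100.0 * (raw - capacity) / raw);       /* ~41.4% */
        return 0;
    }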

So that's where your ~41% overhead comes from (although note that some of that overhead is multiplicative rather than additive). You could reduce it in the following ways:

One good optimization for the code could be to reduce the size of the metadata to match the actual address space of the device. For example, your device only has 524288 512-byte pages, so a page address fits in 19 bits. The metadata size could be reduced to 80 bytes (a 32-bit id plus 19 alt pointers), allowing 6 metadata entries to be stored in a page. This would reduce the overhead to 12.5% (7 metadata pages per block, plus one unused page).

This could be reduced even further by packing the bits, reducing the size of the metadata to 48 bytes, which would allow storing 10 metadata entries in a page. This would reduce the overhead to 9% (6 metadata pages per block, one of them only holding metadata for 8 data pages).

Forty-Bot commented 2 months ago

This would reduce the overhead to 12.5% (7 metadata pages per block, plus one unused page).

Actually, I forgot the ppb is 256 with 512-byte pages, so the overhead would be 14% (37 metadata pages per block).

This would reduce the overhead to 9% (6 metadata pages per block, one of them only holding metadata for 8 data pages).

Similarly, the overhead here would be 9% still, but with 24 metadata pages per block.
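For reference, those corrected figures fall out of a quick back-of-the-envelope like the one below. overhead() is just a hypothetical helper, not dhara code, and it ignores both the current power-of-two restriction on the checkpoint group size and the small per-checkpoint header, so treat it as a best case.

    #include <stdio.h>

    /* For a given metadata entry size, how many checkpoint pages does each
     * 256-page erase block need, and what overhead does that represent? */
    static void overhead(unsigned page_size, unsigned ppb, unsigned meta_size)
    {
        unsigned per_cp = page_size / meta_size;     /* data pages covered per checkpoint page   */
        unsigned group  = per_cp + 1;                /* those data pages plus the checkpoint page */
        unsigned cps    = (ppb + group - 1) / group; /* checkpoint pages per block (rounded up)  */

        printf("%3u-byte metadata: %2u checkpoint pages per block, %.1f%% overhead\n",
               meta_size, cps, 100.0 * cps / ppb);
    }

    int main(void)
    {
        overhead(512, 256, 132);  /* today:               64 cp pages, 25.0% */
        overhead(512, 256, 80);   /* 19-bit alt pointers: 37 cp pages, 14.5% */
        overhead(512, 256, 48);   /* fully bit-packed:    24 cp pages,  9.4% */
        return 0;
    }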

Forty-Bot commented 2 months ago

OK, so I did some write amplification experiments. I used the flash parameters from above, simulating a 100 MiB area. I tested sequential and random writes. Here's what I found:

Increasing gc_ratio improves (decreases) write amplification. However, it makes the worst-case latency worse. Every time we write a page, we have to make an entry in the journal. This will eventually fill up the whole flash, so before writing a new page, we first have to free up some space if possible. If we keep writing to the same set of sectors, each time we write to a sector another sector will be freed (since it will be invalidated by our write). So on average, we need to run GC once every time we write. The way dhara implements this is that if the journal is larger than the map capacity (that is, if we are near the journal size limit), we run GC before writing, and we do this gc_ratio+1 times. So in the worst case, we have to make gc_ratio+2 writes to write a single page. This means if (for example) you choose a gc_ratio of 16, some writes will take 1 page program, while others will take 18 page programs.

This is also where the write amplification comes from. When we run GC, sometimes we will find an unused (garbage) page (which we don't have to rewrite), but sometimes we will find a page which is still in use (live). In that case, we have to copy it to the front of the journal in order to free up the page at the end of the journal. With larger values for gc_ratio, we run GC "later" which means we have a higher probability of encountering garbage, which means we have to copy fewer pages. This is why larger values of gc_ratio reduce write amplification.

The sync frequency also directly influences the write amplification ratio. This is because when we sync, we have to immediately write the current metadata page, which may not be fully used. This also creates extra overhead, since the unused slots in that group can't be used until we GC the metadata page. This is why we have to reserve some pages for GC, as otherwise we could run out of pages due to this overhead. Syncing after every write results in 4x write amplification (since there are 4 pages in a "slot"). On the other hand, never syncing results in 1.333x write amplification (since every third write we have to program a metadata page). This of course doesn't include GC amplification, so the worst case can be as much as 6.8x amplification (with a gc_ratio of 4 and completely sequential writes).
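To put rough numbers on the two effects above, here is a small model. worst_case_programs() and sync_amplification() are hypothetical helpers, not dhara code, and they assume the 4-page checkpoint group from the earlier layout (3 data pages plus one checkpoint page).

    #include <stdio.h>

    /* Worst-case page programs for a single sector write when we're near the
     * journal size limit: gc_ratio + 1 GC steps plus the write itself. */
    static unsigned worst_case_programs(unsigned gc_ratio)
    {
        return gc_ratio + 2;
    }

    /* Steady-state write amplification from the sync interval alone (no GC):
     * a checkpoint group of ppc pages carries at most ppc - 1 sectors, and a
     * sync after every k writes cuts the group short at k sectors. */
    static double sync_amplification(unsigned ppc, unsigned k)
    {
        unsigned sectors_per_group = (k < ppc - 1) ? k : ppc - 1;

        return (double)ppc / sectors_per_group;
    }

    int main(void)
    {
        printf("gc_ratio 4:  worst case %u programs per write\n", worst_case_programs(4));  /*  6 */
        printf("gc_ratio 16: worst case %u programs per write\n", worst_case_programs(16)); /* 18 */

        printf("sync after every write: %.3fx\n", sync_amplification(4, 1));    /* 4.000x */
        printf("never sync:             %.3fx\n", sync_amplification(4, 1000)); /* 1.333x */
        return 0;
    }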