dlbeer / dhara

NAND flash translation layer for low-memory systems

Performance Tuning #21

Open dereks-igor opened 3 years ago

dereks-igor commented 3 years ago

Hello! Thank you for publishing Dhara.

I have been playing with it and wanted to share some performance numbers. I am using a NAND Flash chip on an embedded system. When I use my raw NAND driver, I get the following numbers writing and reading full pages (this output is from my UART shell):

# nand_write_read

..............................
Write 3840.0 KB at 986.988 KB/s
..............................
Read 3840.0 KB at 1998.049 KB/s
Done. Marked 0 this run, 0 total
# 

However, when I run a Dhara write/read loop on a freshly erased NAND (with a brand new map/journal), I get these numbers:

# dhara_write_read

................................
Write 4000.0 KB at 620.230 KB/s
................................
Read 4000.0 KB at 271.348 KB/s
# 

I was surprised that reads are slower than writes. Dhara sector reads are 86% slower than raw NAND page reads. Dhara sector writes are 37% slower than raw NAND page writes.

Through logging, I can see that there are two reasons for this:

1. Dhara walks the radix tree from the root for every sector lookup

Here is the comment at the top of trace_path():

/* Trace the path from the root to the given sector, emitting
 * alt-pointers and alt-full bits in the given metadata buffer.
 */

Since the vast majority of sector accesses are sequential (the next target sector is usually sector+1, because file data is read or written in large streams), it seems like there could be an easy optimization that avoids searching the entire tree every time. Perhaps it could remember the last sector found and start searching at the sibling node of the radix tree. I looked at the loop but did not see an obvious way to implement that.

2. Dhara does a partial page read each time it walks a node of the tree

Here are some logs showing Dhara's partial page reads of metadata as it walks the tree. Notice that it does ~10 small metadata reads of len 132 bytes each, before it has the page address and can read the full page of len 2048 bytes (from offset 0) that the user is seeking.

E (11422) nand: nand_read_page: blk 18, pg 15, pg_addr 0x48F, pg_offset 548, p_data 0x20009354, len 132
E (11423) nand: nand_read_page: blk 9, pg 47, pg_addr 0x26F, pg_offset 284, p_data 0x20009354, len 132
E (11424) nand: nand_read_page: blk 5, pg 31, pg_addr 0x15F, pg_offset 152, p_data 0x20009354, len 132
E (11425) nand: nand_read_page: blk 3, pg 15, pg_addr 0xCF, pg_offset 1076, p_data 0x20009354, len 132
E (11426) nand: nand_read_page: blk 2, pg 15, pg_addr 0x8F, pg_offset 548, p_data 0x20009354, len 132
E (11427) nand: nand_read_page: blk 1, pg 47, pg_addr 0x6F, pg_offset 284, p_data 0x20009354, len 132
E (11428) nand: nand_read_page: blk 1, pg 31, pg_addr 0x5F, pg_offset 152, p_data 0x20009354, len 132
E (11429) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 1076, p_data 0x20009354, len 132
E (11430) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 548, p_data 0x20009354, len 132
E (11431) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 284, p_data 0x20009354, len 132
E (11432) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 152, p_data 0x20009354, len 132
E (11433) nand: nand_read_page: blk 1, pg 1, pg_addr 0x41, pg_offset 0, p_data 0x2000d1c0, len 2048
.E (11435) nand: nand_read_page: blk 18, pg 15, pg_addr 0x48F, pg_offset 548, p_data 0x20009354, len 132
E (11436) nand: nand_read_page: blk 9, pg 47, pg_addr 0x26F, pg_offset 284, p_data 0x20009354, len 132
E (11437) nand: nand_read_page: blk 5, pg 31, pg_addr 0x15F, pg_offset 152, p_data 0x20009354, len 132
E (11438) nand: nand_read_page: blk 3, pg 15, pg_addr 0xCF, pg_offset 1076, p_data 0x20009354, len 132
E (11439) nand: nand_read_page: blk 2, pg 15, pg_addr 0x8F, pg_offset 548, p_data 0x20009354, len 132
E (11440) nand: nand_read_page: blk 1, pg 47, pg_addr 0x6F, pg_offset 284, p_data 0x20009354, len 132
E (11441) nand: nand_read_page: blk 1, pg 31, pg_addr 0x5F, pg_offset 152, p_data 0x20009354, len 132
E (11442) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 1076, p_data 0x20009354, len 132
E (11443) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 548, p_data 0x20009354, len 132
E (11444) nand: nand_read_page: blk 1, pg 15, pg_addr 0x4F, pg_offset 284, p_data 0x20009354, len 132
E (11445) nand: nand_read_page: blk 1, pg 2, pg_addr 0x42, pg_offset 0, p_data 0x2000d1c0, len 2048
...etc...

The large number of semi-random accesses to different blocks and pages is the reason for Dhara's read slowness on my system. There is a considerable amount of overhead to kick off a transaction -- a 132-byte metadata read takes just about as long as a full 2048-byte page read, due to bus and DMA setup and kickoff.

It seems like there is an opportunity to cache and avoid all these NAND I/O calls, either by caching ten metadata entries (since sequential sector reads mostly follow the same tree path, and the entries are only 132 bytes each) or by having one or two full-page caches at the driver level (so that, for example, all those reads to block 1, page 15 could be reduced to just one full-page read, and the various metadata reads at different offsets could then come from RAM).
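
Roughly, I mean something like the sketch below for the second option: a one-page read cache living entirely in the dhara_nand_read() glue. PAGE_SIZE, my_nand_read_page() and the single static slot are placeholders for whatever the port actually provides, and this is untested illustration only; dhara itself stays untouched.

#include <stdint.h>
#include <string.h>

#include "dhara/error.h"
#include "dhara/nand.h"

#define PAGE_SIZE 2048  /* page size of this particular NAND chip */

/* Raw driver call that reads one whole physical page (placeholder). */
extern int my_nand_read_page(dhara_page_t page, uint8_t *buf);

static uint8_t cached_data[PAGE_SIZE];
static dhara_page_t cached_page = (dhara_page_t)0xffffffff; /* assumed never a valid page */

int dhara_nand_read(const struct dhara_nand *n, dhara_page_t p,
                    size_t offset, size_t length,
                    uint8_t *data, dhara_error_t *err)
{
    (void)n;

    if (p != cached_page) {
        /* Miss: pull in the whole page once, so the ~10 small metadata
         * reads that follow against the same page come from RAM. */
        if (my_nand_read_page(p, cached_data) < 0) {
            dhara_set_error(err, DHARA_E_ECC); /* simplification: report any failure as ECC */
            return -1;
        }
        cached_page = p;
    }

    memcpy(data, cached_data + offset, length);
    return 0;
}

A real version would also invalidate the cached page from the dhara_nand_erase() glue, since page addresses get reused once their block has been erased.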

This is just for your information. Dhara's performance already meets my system requirements, so I don't plan to implement any optimizations. But it seems like there is some low-hanging fruit that could result in significant performance improvements.

Thanks again!

dlbeer commented 3 years ago

On Thu, Mar 18, 2021 at 02:33:50PM -0700, Derek Simkowiak wrote:

Dhara does a partial page read each time it walks a node of the tree

The large number of semi-random accesses to different blocks and pages is the reason for Dhara's read slowness on my system. There is a considerable amount of overhead to kick off a transaction -- a 132-byte metadata read takes just about as long as a full 2048-byte page read, due to bus and DMA setup and kickoff.

It seems like there is an opportunity to cache and avoid all these NAND I/O calls, either by caching ten metadata entries (since sequential sector reads mostly follow the same tree path, and the entries are only 132 bytes each) or by having one or two full-page caches at the driver level (so that, for example, all those reads to block 1, page 15 could be reduced to just one full-page read, and the various metadata reads at different offsets could then come from RAM).

This is just for your information. Dhara's performance already meets my system requirements, so I don't plan to implement any optimizations. But it seems like there is some low-hanging fruit that could result in significant performance improvements.

Hi Derek,

I think an LRU write-through cache holding a few full pages at the driver level is a good idea and would definitely improve performance. I haven't implemented it here because I was intending for it to work on MCUs with only a few kB of memory.

If you implemented that, then you probably wouldn't need to worry so much about the number of radix-tree pointers followed, since they'd become very cheap.

Not sure if there's anything useful and universally applicable that I could include for that purpose, but I'm happy to hear suggestions.

Cheers, Daniel

dereks-igor commented 3 years ago

Daniel, Thank you for your response.

cache holding a few full pages at the driver level

I agree that it would be preferable to cache things at the NAND driver level. One of the most attractive features of Dhara is that it is clean and minimalist, just ~2000 lines of simple C code.

But looking at the above example, there are 11 (for page 1, because it's odd) or 10 (for page 2, because it's even) lookups spread across 8 different pages:

Block.Page.Offset
-----------------
18.15.548
9.47.284
5.31.152
3.15.1076
2.15.548
1.47.284
1.31.152
1.15.1076
1.15.548
1.15.284

To achieve a 100% cache hit rate between page 1 and page 2, it would take

8 x 2048 B = 16384 B

of page buffers to cache the full lookup.

However, if those metadata chunks were cached individually, it would only take

10 x 132 B = 1320 B

of metadata buffers to cache the full lookup. That's 92% less RAM to get the same performance increase.

I think it would still be possible to implement the cache at the NAND driver level, rather than in Dhara directly. For any read whose size happens to be DHARA_META_SIZE (132 B), add it to the cache, indexed by Block.Page.Offset.
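
In glue-code terms, I'm picturing something like the rough sketch below. my_nand_read(), the fixed entry count and the round-robin replacement are just placeholders, and only reads whose length is exactly DHARA_META_SIZE get cached; everything else passes straight through to the raw driver.

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#include "dhara/error.h"
#include "dhara/map.h"   /* DHARA_META_SIZE */
#include "dhara/nand.h"

#define META_CACHE_ENTRIES 11

struct meta_cache_entry {
    bool valid;
    dhara_page_t page;
    size_t offset;
    uint8_t data[DHARA_META_SIZE];
};

static struct meta_cache_entry cache[META_CACHE_ENTRIES];
static unsigned next_victim; /* simple round-robin replacement */

/* Raw driver call doing the actual partial-page read (placeholder). */
extern int my_nand_read(dhara_page_t page, size_t offset, size_t length,
                        uint8_t *data);

int dhara_nand_read(const struct dhara_nand *n, dhara_page_t p,
                    size_t offset, size_t length,
                    uint8_t *data, dhara_error_t *err)
{
    (void)n;

    if (length == DHARA_META_SIZE) {
        /* Brute-force search over the handful of cached chunks. */
        for (unsigned i = 0; i < META_CACHE_ENTRIES; i++) {
            if (cache[i].valid && cache[i].page == p &&
                cache[i].offset == offset) {
                memcpy(data, cache[i].data, DHARA_META_SIZE);
                return 0;
            }
        }
    }

    if (my_nand_read(p, offset, length, data) < 0) {
        dhara_set_error(err, DHARA_E_ECC); /* simplification */
        return -1;
    }

    if (length == DHARA_META_SIZE) {
        /* Remember the metadata chunk we just fetched. */
        cache[next_victim].valid = true;
        cache[next_victim].page = p;
        cache[next_victim].offset = offset;
        memcpy(cache[next_victim].data, data, DHARA_META_SIZE);
        next_victim = (next_victim + 1) % META_CACHE_ENTRIES;
    }

    return 0;
}

One thing a real version would need is invalidation: page addresses get reused after a block erase, so the dhara_nand_erase() glue should clear any cached entries for that block (or simply flush the whole cache, given how small it is).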

dereks-igor commented 3 years ago

Quick update: I wrote a simple metadata cache that holds the 11 most recently requested 132 B metadata chunks. It's just an array, searched with brute-force for loops for the requested page address and offset. Reads are now about 2.5x faster. Not bad for 100 lines of code!

# dhara_write_read
................................
Write 4000.0 KB at 625.439 KB/s
................................
Read 4000.0 KB at 683.122 KB/s
# 
dlbeer commented 3 years ago

On Fri, Mar 19, 2021 at 02:30:38PM -0700, Derek Simkowiak wrote:

Quick update: I wrote a simple metadata cache that holds the 11 most recently requested 132 B metadata chunks. It's just an array, searched with brute-force for loops for the requested page address and offset. Reads are now about 2.5x faster. Not bad for 100 lines of code!

# dhara_write_read
................................
Write 4000.0 KB at 625.439 KB/s
................................
Read 4000.0 KB at 683.122 KB/s
# 

Well, that's fairly convincing! I'm happy to accept a simple cache of this kind as a patch, as long as it's possible to disable it in situations where the memory really can't be spared, and as long as it can be used without heap allocation.

Perhaps if the user were to supply a block of memory to be used for metadata caching and indicate its size?
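
Something roughly like the sketch below, purely as an interface idea (none of these names exist in dhara or in your driver; it only illustrates a caller-owned, statically allocated buffer with the option to opt out):

#include <stddef.h>
#include <stdint.h>

#include "dhara/map.h"   /* DHARA_META_SIZE */

struct meta_cache_cfg {
    uint8_t *buf;    /* caller-owned storage, NULL means caching disabled */
    size_t entries;  /* number of DHARA_META_SIZE slots it holds */
};

static struct meta_cache_cfg cache_cfg;

/* Hypothetical call: the user hands over a block of memory and its size.
 * Too little memory (or NULL) simply turns the cache off. */
void meta_cache_configure(uint8_t *buf, size_t size)
{
    if (buf == NULL || size < DHARA_META_SIZE) {
        cache_cfg.buf = NULL;
        cache_cfg.entries = 0;
        return;
    }

    cache_cfg.buf = buf;
    cache_cfg.entries = size / DHARA_META_SIZE;
}

/* Usage from the port's init code -- no heap allocation anywhere: */
static uint8_t cache_storage[11 * DHARA_META_SIZE];

void storage_init(void)
{
    meta_cache_configure(cache_storage, sizeof(cache_storage));
}

The cache code would then index into the supplied buffer in DHARA_META_SIZE strides rather than declaring its own static array, and a zero-entry configuration would make every lookup fall straight through to the raw driver.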

jjungo commented 3 years ago

I would definitely love this improvement!

pgreenland commented 2 months ago

@dereks-igor Appreciate this has been open for a while. Any chance you could share a patch for your caching implementation, even if it's rough? I'm likely about to implement something similar, and it would be good to have a more or less working starting point.

Yaxit commented 4 days ago

Following as well!