risc-v/bl808: MMU causes CPU slowdown

henryrov commented 2 months ago

A while ago, I noticed that benchmark results on the BL808 were significantly lower on NuttX than on Linux. In coremark, NuttX scores ~18.3 while Linux scores 1141, which is over 60x faster. Messing with build settings changed the results a bit, but the highest I was able to get from NuttX was about 20. Recently, I decided to do some lower level testing to find the root of the issue, and I think the cause is related to MMU configuration. I used a nop loop to test:

up_putc('0');
for (uint32_t i = 0; i < 400000000; i++)
  {
    asm("nop");
  }
up_putc('1');

Running at any point before mmu_enable is called, it takes under 2.5 seconds. After mmu_enable, it takes about 100 seconds. I've verified using objdump that the compiler output is equivalent before or after:

    50200c0a:   03000513            li  a0,48
    50200c0e:   6b6000ef            jal 502012c4 <up_putc>
    50200c12:   17d787b7            lui a5,0x17d78
    50200c16:   40078793            add a5,a5,1024 # 17d78400 <__pgheap_size+0x16978400>
    50200c1a:   0001                    nop
    50200c1c:   37fd                    addw    a5,a5,-1
    50200c1e:   fff5                    bnez    a5,50200c1a <bl808_mm_init+0x1a>
    50200c20:   03100513            li  a0,49
    50200c24:   6a0000ef            jal 502012c4 <up_putc>

As I see it, this leaves the MMU as the most likely reason for the slowdown, but I don't know enough about the subject to debug this any further (or maybe this is expected behavior?). I was hoping someone who knows more about this would be able to help look into it.

acassis commented 2 months ago

Hey @henryrov really interesting discovery! In fact I was expecting NuttX to be faster than Linux.

"Houston we have a problem!!!!"

@lupyuen did you notice it before?

@xiaoxiang781216 @raiden00pl @pkarashchenko @masayuki2009 @patacongo any idea?

pkarashchenko commented 2 months ago

This description sounds to me like a 10 year old flashback to one of the projects that I was working on. At that time it was with an AM335x based device. I will omit most of that wonderful investigation, but back then we figured out that our embedded system didn't enable caching while configuring MMU regions, so I would suggest looking at that as the first point of investigation.

acassis commented 2 months ago

Hi @pkarashchenko makes sense! I remember when disabling cache support on Linux kernel the boot process was really slow.

Normally we don't pay much attention to it on NuttX because it always boots in milliseconds, even when the cache is disabled. So this kind of benchmark comparison is very important. I think we need to have HW CI that runs benchmarks to catch regressions.

patacongo commented 2 months ago

Hey @henryrov really interesting discovery! In fact I was expecting NuttX to be faster than Linux.

"Houston we have a problem!!!!"

@lupyuen did you notice it before?

@xiaoxiang781216 @raiden00pl @pkarashchenko @masayuki2009 @patacongo any idea?

Possibly related to Issue #3355

A lot has changed since #3355 but re-assessing the system call utilization would also be a good starting point.

patacongo commented 2 months ago

The realtime scheduler could also be a cause of reduced performance in a comparison with Linux benchmarks.

Linux defaults to SCHED_OTHER which is tuned for data throughput and Linux has some of the best throughput times available. It minimizes context switching and "ages" threads to assure that each gets a shot at the CPU (after a delay). So everything makes good progress with minimum context switching overhead.

SCHED_OTHER will not support real-time behavior.

Realtime RTOSs, on the other hand, do not typically support SCHED_OTHER. Several other schedulers are available for real time behavior. SCHED_FIFO is the only one specified by POSIX and can be used, for example, to support Minimum Latency Scheduling. That behavior depends on the strict priority scheduling of SCHED_FIFO. SCHED_FIFO is super responsive to the point of being "goosey". It can easily lose throughput due to context switch "storms". The result is better response at the expense of reduced overall throughput and higher rates of context switches.

Low priority threads can also be blocked indefinitely.

Issue #3355 is a more likely cause of a performance issue.

pkarashchenko commented 2 months ago

@henryrov what is the NuttX score with MMU disabled?

henryrov commented 2 months ago

The realtime scheduler could also be a cause of reduced performance in a comparison with Linux benchmarks.

I agree that this could impact benchmark results somewhat, but I don't think it fully explains the difference here, especially since the difference I saw in the for loop tests was before the scheduler was initialized (assuming the issues are related), and running the loop again after starting the scheduler performs the same as it does immediately after mmu_enable.

@henryrov what is the NuttX score with MMU disabled?

I wanted to test this, but booting without the MMU isn't currently supported on the BL808.

patacongo commented 2 months ago

I wanted to test this, but booting without the MMU isn't currently supported on the BL808.

And if it is like the ARM MMU, which I am more familiar with, disabling the MMU also requires disabling the caches, since the MMU controls the cacheable properties of each mapped region.

lupyuen commented 2 months ago

@henryrov It's possible that we're flushing the MMU Cache too often: "MMU Cache for T-Head C906". Sorry the docs for BL808 SoC and T-Head C906 CPU are lacking, I have trouble guessing the correct MMU Cache settings, we might need to tweak them.

BL808 SoC is not officially supported by Linux / Debian Mainline, so it might be hard to figure out how Linux handles the MMU. Maybe that's why the SBC Makers (Sipeed, Pine64) are moving away from Bouffalo Lab BL808 to Sophgo SG2000 / SG2002, which has Mainline Linux Support.

(BTW: I'm not sure about Bouffalo Lab's future plans for BL808? It seems to have disappeared from their website)

UPDATE: We have enabled Strong Ordering in the MMU, which might cause performance issues. We might need to tweak it: T-Head C906 Strong Ordering

henryrov commented 2 months ago

A few findings:

It's possible that we're flushing the MMU Cache too often

I tested this by removing the call to mmu_flush_cache, but this didn't seem to affect coremark or the for loop at all.

We have enabled Strong Ordering in the MMU, which might cause performance issues.

I timed the for loop with different combinations of the shareable and strong order flags, but again this didn't seem to make a difference.

Maybe that's why the SBC Makers (Sipeed, Pine64) are moving away from Bouffalo Lab BL808 to Sophgo SG2000 / SG2002, which has Mainline Linux Support.

In that case, maybe we could learn something from testing the SG2000? Since it also uses the C906, it might be worth checking if it behaves similarly to the BL808 in NuttX, and if the performance difference is as large compared to Linux.

lupyuen commented 2 months ago

Since it also uses the C906, it might be worth checking if it behaves similarly to the BL808 in NuttX, and if the performance difference is as large compared to Linux

@henryrov Yep sure! I'll run the NOP Loop before and after initing the MMU on SG2000. How do I run the benchmark for NuttX vs Linux?

henryrov commented 2 months ago

@henryrov Yep sure! I'll run the NOP Loop before and after initing the MMU on SG2000. How do I run the benchmark for NuttX vs Linux?

That's great! You can enable coremark on NuttX through menuconfig under Application Configuration -> Benchmark Applications. I don't know much about Linux on the SG2000, but what I ended up doing for the BL808 was cross compiling coremark with the buildroot toolchain externally and moving the compiled binary to my SD card. Maybe if there's enough hardware support it might be easier to get the source code and then compile directly on the board?

lupyuen commented 2 months ago

@henryrov Here are the CoreMark Results for SG2000 (Milk-V Duo S), NuttX vs Debian. Yep the results look similar to Ox64 BL808 NuttX, since SG2000 NuttX is nearly identical to Ox64 NuttX:

SG2000 NuttX CoreMark -Os: 16

SG2000 NuttX CoreMark -O2: 21

Only CoreMark was compiled with -O2. Kernel won't boot with -O2 (why?)

SG2000 Debian CoreMark -O2: 2,470

I'll do more analysis of the NOP Loop before and after initing SG2000 MMU. Thanks!

(FYI: I thought it might be due to the OpenSBI System Timer Interrupt triggered too often, but nope it makes no difference when I disabled the interrupt)

UPDATE: We have a fix for the MMU Delay, we need to tell the MMU that the Kernel Text, Data and Heap are Cacheable. Otherwise the MMU won't cache them!

CoreMark is now 17, up slightly from 16 earlier. NuttX Apps are also having the same MMU Delay, I'll check the MMU Flags for NuttX Apps:

Thanks Henry for tracking down the MMU Delay! I'll upstream the Kernel Fix to Ox64 and SG2000 real soon.

henryrov commented 2 months ago

We have a fix for the MMU Delay, we need to tell the MMU that the Kernel Text, Data and Heap are Cacheable. Otherwise the MMU won't cache them!

Nice! I tested a similar change on the BL808 and it also fixed the NOP loop difference. It also increased coremark slightly, from 18 to 19.

Thanks Henry for tracking down the MMU Delay! I'll upstream the Kernel Fix to Ox64 and SG2000 real soon.

No problem, thank you for your help!

lupyuen commented 2 months ago

Thanks @henryrov for testing on Ox64! I configured the MMU to cache User Text and Data (for NuttX Apps):

Now NuttX CoreMark is really close to Debian CoreMark!

SG2000 NuttX CoreMark -O2: 2,422
SG2000 Debian CoreMark -O2: 2,470

I'll upstream the fixes. Thanks again :-)

FYI: SG2000 NuttX CoreMark is 1,758 with default settings -Os and -g. So -O2 really makes a difference! How I compiled CoreMark for -O2:

rm ../apps/benchmarks/coremark/*.o
## Edit arch/risc-v/src/common/Toolchain.defs
## Change `ARCHOPTIMIZATION += -Os` to `ARCHOPTIMIZATION += -O2`
## Change `ARCHOPTIMIZATION += -g`  to `ARCHOPTIMIZATION +=`
## Note: NuttX Kernel won't boot with `-O2` (why?)

pkarashchenko commented 2 months ago

@lupyuen could you please file an issue for the -O2 kernel compilation problem that you mentioned above?

henryrov commented 2 months ago

Now NuttX CoreMark is really close to Debian CoreMark!

I tested the same fixes on Ox64, and now it's also really close to Buildroot (1104 with -O2 vs 1141). I'll go ahead and close this issue now. Thanks everyone for your help!

pkarashchenko commented 2 months ago

I'm not sure which compiler optimization flags were used for Debian. -O3 or -Ofast may still give some extra execution speed at the cost of code size.

lupyuen commented 1 month ago

Hi @henryrov I'm very sorry for the delay, the Performance Fix for Ox64 has just been upstreamed to NuttX Mainline. Thanks for waiting :-)

henryrov commented 4 weeks ago

Hi @henryrov I'm very sorry for the delay, the Performance Fix for Ox64 has just been upstreamed to NuttX Mainline. Thanks for waiting :-)

risc-v/mmu: Configure T-Head MMU to cache User Text, Data and Heap

risc-v/bl808: Configure MMU to cache User Text, Data and Heap

No problem at all! Thanks for making the fix!

apache / nuttx

risc-v/bl808: MMU causes CPU slowdown #12696