Closed: henryrov closed this issue 2 months ago
Hey @henryrov really interesting discovery! In fact I was expecting NuttX to be faster than Linux.
"Houston we have a problem!!!!"
@lupyuen did you notice this before?
@xiaoxiang781216 @raiden00pl @pkarashchenko @masayuki2009 @patacongo any idea?
This description gives me a flashback to a project I worked on about 10 years ago, with an AM335x-based device. I'll omit most of that wonderful investigation, but back then we figured out that our embedded system didn't enable caching when configuring the MMU regions, so I would suggest looking at that as the first point of investigation.
Hi @pkarashchenko makes sense! I remember that when I disabled cache support in the Linux kernel, the boot process was really slow.
Normally we don't pay much attention to this on NuttX because it always boots in milliseconds, even when the cache is disabled. So this kind of benchmark comparison is very important. I think we need a HW CI that runs benchmarks to catch regressions.
Possibly related to Issue #3355
A lot has changed since #3355 but re-assessing the system call utilization would also be a good starting point.
The realtime scheduler could also be a cause of reduced performance in a comparison with Linux benchmarks.
Linux defaults to SCHED_OTHER which is tuned for data throughput and Linux has some of the best throughput times available. It minimizes context switching and "ages" threads to assure that each gets a shot at the CPU (after a delay). So everything makes good progress with minimum context switching overhead.
SCHED_OTHER will not support real-time behavior.
Realtime RTOSs, on the other hand, do not typically support SCHED_OTHER. Several other schedulers are available for real-time behavior. SCHED_FIFO is the only one specified by POSIX and can be used, for example, to support Minimum Latency Scheduling. That behavior depends on the strict priority scheduling of SCHED_FIFO. SCHED_FIFO is super responsive, to the point of being "goosey". It can easily lose throughput due to context switch "storms": better response at the expense of reduced overall throughput and higher rates of context switches.
Low priority threads can also be blocked indefinitely.
Issue #3355 is a more likely cause of a performance issue.
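To make the policy difference concrete, a thread's scheduling policy can be requested through the standard POSIX attribute calls. The following is a minimal sketch, not code from NuttX or from this issue; `make_fifo_attr` is a hypothetical helper name, and on Linux actually creating a SCHED_FIFO thread from the resulting attribute usually needs extra privileges.

```c
#include <assert.h>
#include <pthread.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical helper: prepare a thread attribute that asks for
 * SCHED_FIFO at the given priority. Returns 0 on success, or the
 * error number reported by the failing pthread call.
 */
int make_fifo_attr(pthread_attr_t *attr, int priority)
{
  struct sched_param param = { .sched_priority = priority };
  int ret;

  assert(attr != NULL);

  ret = pthread_attr_init(attr);
  if (ret != 0)
    {
      return ret;
    }

  /* Use the policy stored in attr instead of inheriting the creator's. */
  ret = pthread_attr_setinheritsched(attr, PTHREAD_EXPLICIT_SCHED);
  if (ret == 0)
    {
      /* Set the policy first so the priority is checked against it. */
      ret = pthread_attr_setschedpolicy(attr, SCHED_FIFO);
    }

  if (ret == 0)
    {
      ret = pthread_attr_setschedparam(attr, &param);
    }

  if (ret != 0)
    {
      pthread_attr_destroy(attr);
    }

  return ret;
}
```

A caller would pass the attribute to `pthread_create` and pick a priority between `sched_get_priority_min(SCHED_FIFO)` and `sched_get_priority_max(SCHED_FIFO)`.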
@henryrov what is the NuttX score with MMU disabled?
The realtime scheduler could also be a cause of reduced performance in a comparison with Linux benchmarks.
I agree that this could impact benchmark results somewhat, but I don't think it fully explains the difference here, especially since the difference I saw in the for loop tests was before the scheduler was initialized (assuming the issues are related), and running the loop again after starting the scheduler performs the same as it does immediately after mmu_enable.
@henryrov what is the NuttX score with MMU disabled?
I wanted to test this, but booting without the MMU isn't currently supported on the BL808.
And if it is like the ARM MMU, which I am more familiar with, disabling the MMU also requires disabling the caches, since the MMU controls the cacheable properties of each mapped region.
@henryrov It's possible that we're flushing the MMU Cache too often: "MMU Cache for T-Head C906". Sorry the docs for BL808 SoC and T-Head C906 CPU are lacking, I have trouble guessing the correct MMU Cache settings, we might need to tweak them.
BL808 SoC is not officially supported by Linux / Debian Mainline, so it might be hard to figure out how Linux handles the MMU. Maybe that's why the SBC Makers (Sipeed, Pine64) are moving away from Bouffalo Lab BL808 to Sophgo SG2000 / SG2002, which has Mainline Linux Support.
(BTW: I'm not sure about Bouffalo Lab's future plans for BL808? It seems to have disappeared from their website)
UPDATE: We have enabled Strong Ordering in the MMU, which might cause performance issues. We might need to tweak it: T-Head C906 Strong Ordering
A few findings:
It's possible that we're flushing the MMU Cache too often
I tested this by removing the call to mmu_flush_cache, but this didn't seem to affect coremark or the for loop at all.
We have enabled Strong Ordering in the MMU, which might cause performance issues.
I timed the for loop with different combinations of the shareable and strong order flags, but again this didn't seem to make a difference.
Maybe that's why the SBC Makers (Sipeed, Pine64) are moving away from Bouffalo Lab BL808 to Sophgo SG2000 / SG2002, which has Mainline Linux Support.
In that case, maybe we could learn something from testing the SG2000? Since it also uses the C906, it might be worth checking if it behaves similarly to the BL808 in NuttX, and if the performance difference is as large compared to Linux.
@henryrov Yep sure! I'll run the NOP Loop before and after initing the MMU on SG2000. How do I run the benchmark for NuttX vs Linux?
That's great! You can enable coremark on NuttX through menuconfig under Application Configuration -> Benchmark Applications. I don't know much about Linux on the SG2000, but what I ended up doing for the BL808 was cross compiling coremark with the buildroot toolchain externally and moving the compiled binary to my SD card. Maybe if there's enough hardware support it might be easier to get the source code and then compile directly on the board?
@henryrov Here are the CoreMark Results for SG2000 (Milk-V Duo S), NuttX vs Debian. Yep the results look similar to Ox64 BL808 NuttX, since SG2000 NuttX is nearly identical to Ox64 NuttX:
-O2. Kernel won't boot with -O2 (why?)
SG2000 Debian CoreMark -O2: 2,470
I'll do more analysis of the NOP Loop before and after initing SG2000 MMU. Thanks!
(FYI: I thought it might be due to the OpenSBI System Timer Interrupt triggered too often, but nope it makes no difference when I disabled the interrupt)
UPDATE: We have a fix for the MMU Delay, we need to tell the MMU that the Kernel Text, Data and Heap are Cacheable. Otherwise the MMU won't cache them!
CoreMark is now 17, up slightly from 16 earlier. NuttX Apps are also having the same MMU Delay, I'll check the MMU Flags for NuttX Apps:
Thanks Henry for tracking down the MMU Delay! I'll upstream the Kernel Fix to Ox64 and SG2000 real soon.
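For readers following along, the idea of the fix can be sketched as page table flags. This is not the upstreamed NuttX code: the macro and function names are made up for illustration, and the T-Head attribute bit positions (bits 60 to 63) are an assumption that should be checked against the C906 manual. Only the low permission bits come from the standard RISC-V Sv39 layout.

```c
#include <assert.h>
#include <stdint.h>

/* Standard RISC-V Sv39 PTE bits. */
#define PTE_VALID   (UINT64_C(1) << 0)
#define PTE_READ    (UINT64_C(1) << 1)
#define PTE_WRITE   (UINT64_C(1) << 2)
#define PTE_EXEC    (UINT64_C(1) << 3)
#define PTE_GLOBAL  (UINT64_C(1) << 5)

/* ASSUMPTION: T-Head C906 extended memory attribute bits. */
#define THEAD_SHAREABLE     (UINT64_C(1) << 60)
#define THEAD_BUFFERABLE    (UINT64_C(1) << 61)
#define THEAD_CACHEABLE     (UINT64_C(1) << 62)
#define THEAD_STRONG_ORDER  (UINT64_C(1) << 63)

/* Hypothetical attribute sets: normal RAM is cached, I/O is not. */
#define THEAD_MEMORY_ATTRS \
  (THEAD_SHAREABLE | THEAD_BUFFERABLE | THEAD_CACHEABLE)
#define THEAD_IO_ATTRS \
  (THEAD_SHAREABLE | THEAD_STRONG_ORDER)

/* Kernel text: readable, executable, cacheable. */
uint64_t kernel_text_flags(void)
{
  return PTE_VALID | PTE_READ | PTE_EXEC | PTE_GLOBAL | THEAD_MEMORY_ATTRS;
}

/* Kernel data and heap: readable, writable, cacheable. */
uint64_t kernel_data_flags(void)
{
  return PTE_VALID | PTE_READ | PTE_WRITE | PTE_GLOBAL | THEAD_MEMORY_ATTRS;
}

/* Memory-mapped I/O: strongly ordered and never cacheable. */
uint64_t io_flags(void)
{
  uint64_t flags = PTE_VALID | PTE_READ | PTE_WRITE | PTE_GLOBAL |
                   THEAD_IO_ATTRS;

  assert((flags & THEAD_CACHEABLE) == 0);
  return flags;
}
```

If the memory regions are mapped without the cacheable attribute, every instruction fetch and data access would go to RAM, which is consistent with the NOP loop slowdown described in this issue.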
We have a fix for the MMU Delay, we need to tell the MMU that the Kernel Text, Data and Heap are Cacheable. Otherwise the MMU won't cache them!
Nice! I tested a similar change on the BL808 and it also fixed the NOP loop difference. It also increased coremark slightly, from 18 to 19.
Thanks Henry for tracking down the MMU Delay! I'll upstream the Kernel Fix to Ox64 and SG2000 real soon.
No problem, thank you for your help!
Thanks @henryrov for testing on Ox64! I configured the MMU to cache User Text and Data (for NuttX Apps):
Now NuttX CoreMark is really close to Debian CoreMark!
SG2000 NuttX CoreMark -O2: 2,422
SG2000 Debian CoreMark -O2: 2,470
I'll upstream the fixes. Thanks again :-)
FYI: SG2000 NuttX CoreMark is 1,758 with default settings -Os and -g. So -O2 really makes a difference! How I compiled CoreMark for -O2:
rm ../apps/benchmarks/coremark/*.o
## Edit arch/risc-v/src/common/Toolchain.defs
## Change `ARCHOPTIMIZATION += -Os` to `ARCHOPTIMIZATION += -O2`
## Change `ARCHOPTIMIZATION += -g` to `ARCHOPTIMIZATION +=`
## Note: NuttX Kernel won't boot with `-O2` (why?)
@lupyuen could you please file an issue for the -O2 kernel compilation problem that you mentioned above?
Now NuttX CoreMark is really close to Debian CoreMark!
I tested the same fixes on Ox64, and now it's also really close to Buildroot (1104 with -O2 vs 1141). I'll go ahead and close this issue now. Thanks everyone for your help!
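For reference, the ratios behind "really close" can be checked with a few lines of C. The helper names `percent_of` and `speedup` are made up for this sketch; the scores are the ones quoted in this thread.

```c
#include <assert.h>

/* Express one benchmark score as a percentage of a reference score. */
double percent_of(double score, double reference)
{
  assert(reference > 0.0);
  return 100.0 * score / reference;
}

/* How many times larger the faster score is than the slower one. */
double speedup(double faster, double slower)
{
  assert(slower > 0.0);
  return faster / slower;
}
```

With these, `percent_of(1104, 1141)` is about 96.8 for Ox64 and `percent_of(2422, 2470)` is about 98.1 for SG2000, while `speedup(1141, 18.3)` gives roughly 62 for the gap originally reported on the BL808.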
I'm not sure which compile optimization flags Debian used. -O3 or -Ofast may still give some extra execution speed at the cost of space.
Hi @henryrov I'm very sorry for the delay, the Performance Fix for Ox64 has just been upstreamed to NuttX Mainline. Thanks for waiting :-)
No problem at all! Thanks for making the fix!
A while ago, I noticed that benchmark results on the BL808 were significantly lower on NuttX than on Linux. In coremark, NuttX scores ~18.3 while Linux scores 1141, which is over 60x faster. Changing build settings moved the results a bit, but the highest I was able to get from NuttX was about 20. Recently, I decided to do some lower-level testing to find the root of the issue, and I think the cause is related to MMU configuration. I used a nop loop to test:
Run at any point before mmu_enable is called, it takes under 2.5 seconds. After mmu_enable, it takes about 100 seconds. I've verified using objdump that the compiler output is equivalent before and after:
As I see it, this leaves the MMU as the most likely reason for the slowdown, but I don't know enough about the subject to debug this any further (or maybe this is expected behavior?). I was hoping someone who knows more about this would be able to help look into it.
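The loop used for this test is not reproduced in this thread. For anyone who wants to try a comparable measurement, a rough sketch of the same idea in hosted C might look like this. The function names are hypothetical, and `clock()` is an assumption that only holds once a C library timer is available; in early boot code, before or right after `mmu_enable`, a cycle counter or a stopwatch would be needed instead.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Execute the given number of NOP instructions and return how many
 * iterations actually ran. The volatile asm keeps the compiler from
 * removing the loop body.
 */
uint64_t run_nop_loop(uint64_t iterations)
{
  uint64_t done = 0;
  uint64_t i;

  for (i = 0; i < iterations; i++)
    {
      __asm__ volatile ("nop");
      done++;
    }

  return done;
}

/* Time the loop in seconds of CPU time as reported by clock(). */
double time_nop_loop(uint64_t iterations)
{
  clock_t start;
  clock_t end;

  assert(iterations > 0);

  start = clock();
  run_nop_loop(iterations);
  end = clock();

  return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Running the same iteration count at two points in the boot sequence and comparing the times gives the kind of before/after figure reported above.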