Closed shazz closed 5 years ago
This doesn't surprise me because performance hasn't been our focus. I'll probably take a glance for simple improvements in the next couple weeks but we'd totally take PRs that improve things.
I did some research on this. I looked at the stm32 port settings in MicroPython. In our ports, we use MICROPY_VM_HOOK_LOOP
and MICROPY_VM_HOOK_RETURN
to call run_background_tasks()
very frequently. This happens after every jump opcode and every return. We do this to service USB and other background tasks that need to use polling instead of interrupts. The stm32 port does not do this.
If do not define the HOOK
macros to call run_background_tasks()
, your timing test and a similar one done with explicit increment and test code run run 3-4 times faster.
I could only test this in CircuitPython 3.1.2. It works at least for a short while without calling background tasks frequently, but eventually the REPL connection goes away. CircuitPython 4.0.1 doesn't work at all if I eliminate the background task call.
That accounts for a large fraction of your performance difference, but not all, and further research would uncover more.
Thanks Dan! Interesting findings! "This happens after every jump opcode and every return" looks pretty scary :)
so everything happens here, ans especially usb_background(), right?
#define MICROPY_VM_HOOK_LOOP run_background_tasks();
#define MICROPY_VM_HOOK_RETURN run_background_tasks();
void run_background_tasks(void) {
// Don't call ourselves recursively.
if (running_background_tasks) {
return;
}
assert_heap_ok();
running_background_tasks = true;
#if (defined(SAMD21) && defined(PIN_PA02)) || defined(SAMD51)
audio_dma_background();
#endif
#if CIRCUITPY_DISPLAYIO
displayio_refresh_displays();
#endif
#if CIRCUITPY_NETWORK
network_module_background();
#endif
filesystem_background();
usb_background();
running_background_tasks = false;
assert_heap_ok();
last_finished_tick = ticks_ms;
}
And yep MICROPY_VMHOOK* are called in a lot of places...
Further testing: I have turned on MICROPY_OPT_COMPUTED_GOTO
which is a performance enhancement in the VM interpreter (adds about 1700 bytes ti firmware). This further speeds up the basic timing loop by a factor of 5. Together these seem to account for pretty much all the speed differences.
The basic test I am using (to avoid any issues about how range()
is implemented:
import time
def test():
i = 100000
startTime = time.monotonic()
while i > 0:
i -= 1
endTime = time.monotonic()
return endTime - startTime
print(test())
Interesting ! I'll try tonight on my pygamer to enable MICROPY_OPT_COMPUTED_GOTO and disable CIRCUITPY_DISPLAYIO. I'll post my results here.
@dhalbert, with the last CircuitPython source code, when I add #define MICROPY_OPT_COMPUTED_GOTO (1)
I got this error at compilation time:
In file included from ../../py/vm.c:129:0:
../../py/vmentrytable.h:27:5: error: "__clang__" is not defined, evaluates to 0 [-Werror=undef]
#if __clang__
^~~~~~~~~
../../py/vmentrytable.h:116:5: error: "__clang__" is not defined, evaluates to 0 [-Werror=undef]
#if __clang__
^~~~~~~~~
cc1: all warnings being treated as errors
make: *** [../../py/mkrules.mk:82: build-pygamer/genhdr/qstr.i.last] Error 123
Anything I should do ?
Change both #if __clang__
to #ifdef __clang__
. The original is a bit sloppy and we turn on warnings on that.
Note the PR: https://github.com/adafruit/circuitpython/pull/1934
Thanks!
So I ran again the same test with MICROPY_OPT_COMPUTED_GOTO
enabled and CIRCUITPY_DISPLAYIO
disabled:
Results: 35.3ms
So a huge improvement but still some room for optimization to reach the ~10ms on an equivalent 120MHz STM32F4 board with MicroPython.
Good job @dhalbert !
Thanks! The next step is more work, which is to figure out how to call the background tasks only as often as necessary. That will get the next 2-3x speedup, I think.
We really appreciate your taking the time to do this comparison and point out the big disparity.
Just tried at 200Mhz: 21.2 ms.
Ah ah, no issue :) It forces me to go into the details to get a better understanding of CircuitPython, I like that. But a beer in the Walking City downtown would be appreciated too :)
Closing in favor of #2142.
trying with
Adafruit CircuitPython 5.0.0-beta.0-109-g15886b150 on 2019-12-03; Adafruit Feather M4 Express with samd51j19
got 18.765
seconds @ 120 MHz, so a 7.4x improvement over original 138.39
:)
Hi,
I did a simple loop test in python to compare with my STM32F4 based MicroPython board and the results are quite unexpected. CircuitPython 4.x is 16x slower than MicroPython for an equivalent CPU frequency.
Details available here: https://forums.adafruit.com/viewtopic.php?f=60&t=152372&p=753830#p753830
Test code:
Results:
C module reference:
Results:
Real issue or I did something wrong ?