adafruit / circuitpython

CircuitPython - a Python implementation for teaching coding with microcontrollers
https://circuitpython.org
Other
4.07k stars 1.2k forks source link

CircuitPython 4.x Performance issue with the pygamer vs. MicroPython #1933

Closed shazz closed 5 years ago

shazz commented 5 years ago

Hi,

I did a simple loop test in python to compare with my STM32F4 based MicroPython board and the results are quite unexpected. CircuitPython 4.x is 16x slower than MicroPython for an equivalent CPU frequency.

Details available here: https://forums.adafruit.com/viewtopic.php?f=60&t=152372&p=753830#p753830

Test code:

def test():
    s = time.monotonic
    count = 0
    r = range(0, 10000000)

    startTime = s()
    for count in r:
        pass
    endTime = s()

    duration = endTime - startTime

    return duration

Results:

- 18 766ms on Meowbit/MP at 56 MHz (using pyb.millis)
- 12 506ms on my Meowbit/MP at 84 MHz (using pyb.millis)
- 138 390ms on my Pygamer/CP at 120 MHz (using time.monotonic)
- 86 000ms on my Pygamer/CP at 200 MHz (using time.monotonic)

C module reference:

#define NOP asm volatile(" nop \n\t")

    uint64_t start = ticks_ms;  
    uint32_t count = 0;  
    uint64_t end = 0; 

    for(count=0; count<10000000; count++) { NOP; }

    end = ticks_ms;  

    uint32_t delta = end - start;
    printf("Delta: %lu\n", delta);

Results:

- 715ms at 56MHz on Meowbit/MP
- 477ms at 84MHz on Meowbit/MP
- 334ms at 120MHz on pygamer/CP
- 200ms at 200MHz on pygamer/CP

Real issue or I did something wrong ?

tannewt commented 5 years ago

This doesn't surprise me because performance hasn't been our focus. I'll probably take a glance for simple improvements in the next couple weeks but we'd totally take PRs that improve things.

dhalbert commented 5 years ago

I did some research on this. I looked at the stm32 port settings in MicroPython. In our ports, we use MICROPY_VM_HOOK_LOOP and MICROPY_VM_HOOK_RETURN to call run_background_tasks() very frequently. This happens after every jump opcode and every return. We do this to service USB and other background tasks that need to use polling instead of interrupts. The stm32 port does not do this.

If do not define the HOOK macros to call run_background_tasks() , your timing test and a similar one done with explicit increment and test code run run 3-4 times faster.

I could only test this in CircuitPython 3.1.2. It works at least for a short while without calling background tasks frequently, but eventually the REPL connection goes away. CircuitPython 4.0.1 doesn't work at all if I eliminate the background task call.

That accounts for a large fraction of your performance difference, but not all, and further research would uncover more.

shazz commented 5 years ago

Thanks Dan! Interesting findings! "This happens after every jump opcode and every return" looks pretty scary :)

so everything happens here, ans especially usb_background(), right?

#define MICROPY_VM_HOOK_LOOP run_background_tasks();
#define MICROPY_VM_HOOK_RETURN run_background_tasks();

void run_background_tasks(void) {
    // Don't call ourselves recursively.
    if (running_background_tasks) {
        return;
    }
    assert_heap_ok();
    running_background_tasks = true;

    #if (defined(SAMD21) && defined(PIN_PA02)) || defined(SAMD51)
    audio_dma_background();
    #endif
    #if CIRCUITPY_DISPLAYIO
    displayio_refresh_displays();
    #endif

    #if CIRCUITPY_NETWORK
    network_module_background();
    #endif
    filesystem_background();
    usb_background();
    running_background_tasks = false;
    assert_heap_ok();

    last_finished_tick = ticks_ms;
}

And yep MICROPY_VMHOOK* are called in a lot of places...

dhalbert commented 5 years ago

Further testing: I have turned on MICROPY_OPT_COMPUTED_GOTO which is a performance enhancement in the VM interpreter (adds about 1700 bytes ti firmware). This further speeds up the basic timing loop by a factor of 5. Together these seem to account for pretty much all the speed differences.

The basic test I am using (to avoid any issues about how range() is implemented:

import time
def test():
    i = 100000
    startTime = time.monotonic()
    while i > 0:
        i -= 1
    endTime = time.monotonic()
    return endTime - startTime

print(test())
shazz commented 5 years ago

Interesting ! I'll try tonight on my pygamer to enable MICROPY_OPT_COMPUTED_GOTO and disable CIRCUITPY_DISPLAYIO. I'll post my results here.

shazz commented 5 years ago

@dhalbert, with the last CircuitPython source code, when I add #define MICROPY_OPT_COMPUTED_GOTO (1) I got this error at compilation time:

In file included from ../../py/vm.c:129:0:
../../py/vmentrytable.h:27:5: error: "__clang__" is not defined, evaluates to 0 [-Werror=undef]
 #if __clang__
     ^~~~~~~~~
../../py/vmentrytable.h:116:5: error: "__clang__" is not defined, evaluates to 0 [-Werror=undef]
 #if __clang__
     ^~~~~~~~~
cc1: all warnings being treated as errors

make: *** [../../py/mkrules.mk:82: build-pygamer/genhdr/qstr.i.last] Error 123

Anything I should do ?

dhalbert commented 5 years ago

Change both #if __clang__ to #ifdef __clang__. The original is a bit sloppy and we turn on warnings on that.

dhalbert commented 5 years ago

Note the PR: https://github.com/adafruit/circuitpython/pull/1934

shazz commented 5 years ago

Thanks!

So I ran again the same test with MICROPY_OPT_COMPUTED_GOTO enabled and CIRCUITPY_DISPLAYIO disabled:

Results: 35.3ms

So a huge improvement but still some room for optimization to reach the ~10ms on an equivalent 120MHz STM32F4 board with MicroPython.

Good job @dhalbert !

dhalbert commented 5 years ago

Thanks! The next step is more work, which is to figure out how to call the background tasks only as often as necessary. That will get the next 2-3x speedup, I think.

dhalbert commented 5 years ago

We really appreciate your taking the time to do this comparison and point out the big disparity.

shazz commented 5 years ago

Just tried at 200Mhz: 21.2 ms.

Ah ah, no issue :) It forces me to go into the details to get a better understanding of CircuitPython, I like that. But a beer in the Walking City downtown would be appreciated too :)

dhalbert commented 5 years ago

Closing in favor of #2142.

ladyada commented 4 years ago

trying with Adafruit CircuitPython 5.0.0-beta.0-109-g15886b150 on 2019-12-03; Adafruit Feather M4 Express with samd51j19

got 18.765 seconds @ 120 MHz, so a 7.4x improvement over original 138.39 :)