flipperdevices / flipperzero-firmware

Flipper Zero firmware source code
https://flipperzero.one
GNU General Public License v3.0
12.77k stars 2.72k forks source link

Canvas: Fast blit function(s) #3426

Open CookiePLMonster opened 8 months ago

CookiePLMonster commented 8 months ago

Describe the enhancement you're suggesting.

Since @skotopes mentioned in another issue that suggestions for Canvas API expansions can be considered, an addition to allow for fast blit of a buffer on-screen would be appreciated.

Consider this demo example, based upon the Direct Draw API debug app. In this sample, I allocate a single 128x64px buffer, rasterize on it manually (in this case, a simple checkerboard, but in a real app it'd be substituted by more non-trivial workload, like in e.g. my Mode 7 demo), then I blit the buffer on screen in a single full-screen draw:

Sample app ```cpp #include #include #include #define BUFFER_SIZE (32U) // This is simplified compared to the "traditional" bit band alias access, since we only occupy the bottom 256KB of SRAM anyway #define BIT_BAND_ALIAS(var) ((uint32_t*)(((uintptr_t)(var) << 5) | 0x22000000)) [[maybe_unused]] static void canvas_blit_fast(Canvas* canvas, const uint8_t* bitmap) { uint8_t* buf = canvas_get_buffer(canvas); const uint32_t* bitmap_buffer = BIT_BAND_ALIAS(bitmap); uint32_t* canvas_buffer = BIT_BAND_ALIAS(buf); for(uint32_t y = 0; y < 64; ++y) { const uint32_t tile_y = ((y & 0xFFFFFFF8) << 7) | (y & 7); for(uint32_t x = 0; x < 128; ++x) { const uint32_t column = tile_y + (x * 8); canvas_buffer[column] = *bitmap_buffer++; } } } typedef struct { FuriPubSub* input; FuriPubSubSubscription* input_subscription; Gui* gui; Canvas* canvas; bool stop; uint8_t screen_buffer[16 * 64]; } DirectDraw; static void gui_input_events_callback(const void* value, void* ctx) { furi_assert(value); furi_assert(ctx); DirectDraw* instance = ctx; const InputEvent* event = value; if(event->key == InputKeyBack && event->type == InputTypeShort) { instance->stop = true; } } static DirectDraw* direct_draw_alloc() { DirectDraw* instance = malloc(sizeof(DirectDraw)); instance->input = furi_record_open(RECORD_INPUT_EVENTS); instance->gui = furi_record_open(RECORD_GUI); instance->canvas = gui_direct_draw_acquire(instance->gui); instance->input_subscription = furi_pubsub_subscribe(instance->input, gui_input_events_callback, instance); return instance; } static void direct_draw_free(DirectDraw* instance) { furi_pubsub_unsubscribe(instance->input, instance->input_subscription); instance->canvas = NULL; gui_direct_draw_release(instance->gui); furi_record_close(RECORD_GUI); furi_record_close(RECORD_INPUT_EVENTS); } static void direct_draw_run(DirectDraw* instance) { size_t start = DWT->CYCCNT; size_t counter = 0; float fps = 0; furi_thread_set_current_priority(FuriThreadPriorityIdle); do { size_t elapsed = DWT->CYCCNT - start; char buffer[BUFFER_SIZE] = {0}; if(elapsed >= SystemCoreClock) { fps = (float)counter / ((float)elapsed / SystemCoreClock); start = DWT->CYCCNT; counter = 0; } snprintf(buffer, BUFFER_SIZE, "FPS: %.1f", (double)fps); // Here we would normally be doing some more complex rendering. // For this example, fill the bottom 3/4 of the screen with a thick checkerboard pattern // just to have -some- processing going on. { uint8_t* back_buffer = instance->screen_buffer + 2 * 128; for(uint32_t y = 16; y < 64; ++y) { if((y % 32) == 16) { back_buffer += 2; } else if((y % 32) == 0) { back_buffer -= 2; } for(uint32_t x = 0; x < 128; x += 32) { *back_buffer++ = 0xFF; *back_buffer++ = 0xFF; back_buffer += 2; } } } canvas_blit_fast(instance->canvas, instance->screen_buffer); //canvas_draw_xbm(instance->canvas, 0, 0, 128, 64, instance->screen_buffer); canvas_draw_str(instance->canvas, 10, 10, buffer); canvas_commit(instance->canvas); counter++; furi_thread_yield(); } while(!instance->stop); } int32_t fast_blit_demo_app(void* p) { UNUSED(p); DirectDraw* instance = direct_draw_alloc(); direct_draw_run(instance); direct_draw_free(instance); return 0; } ```

Currently, canvas_draw_xbm is used for tasks like this. However, u8g2 proves itself quite inefficient for a task like this that should be a glorified memcpy - instead, u8g2 appears to be drawing bit by bit.

For my proof-of-concept, I exposed canvas_get_buffer() to the public API. However, the memory returned by this function is tiled and therefore not user-friendly, so exposing this function is not a solution to the problem. Instead, I suggest introducing a new blit function that does the swizzle and assignment internally, nothing else.

I benchmarked both solutions on this app, and the results (on a COMPACT=1 build) are as follows:

In a real app, the gain of blit_fast would likely be slightly smaller, as the call got inlined in my sample app and I have not profiled it with noinline. Regardless, it should still be substantial, as the function would still be subject to loop unrolling/other optimizations when part of the firmware.

Anything else?

I'm not pressed on this being the only "correct" solution. A few other things come to mind:

CookiePLMonster commented 8 months ago

Note: Virtual Display RPC can benefit from an addition like this, since it also blits a fullscreen buffer to the screen directly. https://github.com/flipperdevices/flipperzero-firmware/blob/26da5f564b7f72f71b61f4a74f4638be8d435323/applications/services/rpc/rpc_gui.c#L256

This kind of plays into an already-planned refactor referenced further in this file:

    // TODO FL-3511: consider refactoring
    // Using display framebuffer size as an XBM buffer size is like comparing apples and oranges
    // Glad they both are 1024 for now
CookiePLMonster commented 8 months ago

I played around with this demo more and got even better performance by slightly unrolling the loop to do tile columns (8px) in one sweep:

static void canvas_blit_fast(Canvas* canvas, const uint8_t* bitmap) {
    uint8_t* buf = canvas_get_buffer(canvas);
    uint32_t* canvas_buffer = BIT_BAND_ALIAS(buf);

    const uint32_t* bitmap_buffer = BIT_BAND_ALIAS(bitmap);
    for(uint32_t y = 0; y < 8192; y += 1024, bitmap_buffer += 1024) {
        const uint32_t* bitmap_cur_tile = bitmap_buffer;
        for(uint32_t x = 0; x < 1024; x += 8) {
            // Unroll the copy of a single tile column
            canvas_buffer[x + y + 0] = bitmap_cur_tile[0 * 128];
            canvas_buffer[x + y + 1] = bitmap_cur_tile[1 * 128];
            canvas_buffer[x + y + 2] = bitmap_cur_tile[2 * 128];
            canvas_buffer[x + y + 3] = bitmap_cur_tile[3 * 128];
            canvas_buffer[x + y + 4] = bitmap_cur_tile[4 * 128];
            canvas_buffer[x + y + 5] = bitmap_cur_tile[5 * 128];
            canvas_buffer[x + y + 6] = bitmap_cur_tile[6 * 128];
            canvas_buffer[x + y + 7] = bitmap_cur_tile[7 * 128];

            bitmap_cur_tile++;
        }
    }
}

This version of fast blit gets around 222FPS, which is more or less 35FPS more than my first attempt. At this point it's so fast that my demo sample is bottlenecked by the FPS counter itself.

Size shouldn't be a concern either - with -Os the inner loop is very lightweight and the additional ~20 bytes used by the code are worth the performance gain: image

EDIT: I experimented with moving stuff to an internal function with both buffers marked as __restrict in hopes of the compiler generating code that would avoid a potential stall on R6, but I couldn't. I'm not sure if these stores will stall waiting for loads to finish, either, but to me it looks like it might.

EDIT2:

[b] Neighboring load and store single instructions can pipeline their address and data phases. This enables these instructions to complete in a single execution cycle.

I was able to confirm that these operations pipeline well by implementing a variation where I first read all 8 samples, then write them. This approach was marginally (but measurably) slower, around -2FPS on this test. Thus, simpler is better here.

skotopes commented 8 months ago

looks very promising

CookiePLMonster commented 8 months ago

Note to self: This code ignores the canvas orientation, so it currently breaks the Lefty mode. It most likely needs splitting into 4 separate functions/code paths per canvas orientation to do copies the other way around.

On an upside, vertical orientation (or vertical flip, not sure at the moment) can then "degenerate" to a single memcpy because the pixels should be ordered linearly! So in practice this only needs 3 distinct functions, as memcpy can be thought of as shared.

CookiePLMonster commented 8 months ago

@skotopes I updated the proof of concept to handle all 4 canvas orientations. Notice that the vertical orientations do not require the bit band and can operate on full bytes, which makes the copy even faster - my example checkerboard render is rendering at 280FPS vertically!

static void canvas_blit_fast_int_r0(uint32_t* __restrict dest, const uint32_t* __restrict src) {
    for(uint32_t y = 0; y < 8192; y += 1024, src += 1024) {
        const uint32_t* bitmap_cur_tile = src;
        for(uint32_t x = 0; x < 1024; x += 8) {
            // Unroll the copy of a single tile column
            dest[x + y + 0] = bitmap_cur_tile[0 * 128];
            dest[x + y + 1] = bitmap_cur_tile[1 * 128];
            dest[x + y + 2] = bitmap_cur_tile[2 * 128];
            dest[x + y + 3] = bitmap_cur_tile[3 * 128];
            dest[x + y + 4] = bitmap_cur_tile[4 * 128];
            dest[x + y + 5] = bitmap_cur_tile[5 * 128];
            dest[x + y + 6] = bitmap_cur_tile[6 * 128];
            dest[x + y + 7] = bitmap_cur_tile[7 * 128];

            bitmap_cur_tile++;
        }
    }
}

static void canvas_blit_fast_int_r2(uint32_t* __restrict dest, const uint32_t* __restrict src) {
    for(uint32_t y = 8192; y > 0; y -= 1024, src += 1024) {
        const uint32_t* bitmap_cur_tile = src;
        for(uint32_t x = 0; x < 1024; x += 8) {
            // Unroll the copy of a single tile column
            const uint32_t cur_y = y - 1024;
            dest[x + cur_y + 0] = bitmap_cur_tile[0 * 128];
            dest[x + cur_y + 1] = bitmap_cur_tile[1 * 128];
            dest[x + cur_y + 2] = bitmap_cur_tile[2 * 128];
            dest[x + cur_y + 3] = bitmap_cur_tile[3 * 128];
            dest[x + cur_y + 4] = bitmap_cur_tile[4 * 128];
            dest[x + cur_y + 5] = bitmap_cur_tile[5 * 128];
            dest[x + cur_y + 6] = bitmap_cur_tile[6 * 128];
            dest[x + cur_y + 7] = bitmap_cur_tile[7 * 128];

            bitmap_cur_tile++;
        }
    }
}

static void canvas_blit_fast_int_r1(uint8_t* __restrict dest, const uint8_t* __restrict src) {
    for(uint32_t x = 1024; x > 0; x -= 128, ++src) {
        const uint32_t cur_x = x - 128;
        const uint8_t* bitmap_cur_tile = src;
        for(uint32_t y = 0; y < 128; ++y) {
            dest[cur_x + y] = *bitmap_cur_tile;
            bitmap_cur_tile += 8;
        }
    }
}

static void canvas_blit_fast_int_r3(uint8_t* __restrict dest, const uint8_t* __restrict src) {
    for(uint32_t x = 0; x < 1024; x += 128, ++src) {
        const uint8_t* bitmap_cur_tile = src;
        for(uint32_t y = 128; y > 0; --y) {
            const uint32_t cur_y = y - 1;
            dest[x + cur_y] = __RBIT(*bitmap_cur_tile) >> 24;
            bitmap_cur_tile += 8;
        }
    }
}

static void canvas_blit_fast(Canvas* canvas, const uint8_t* bitmap) {
    uint8_t* buf = canvas_get_buffer(canvas);

    switch(canvas_get_orientation(canvas)) {
    case CanvasOrientationHorizontal:
        canvas_blit_fast_int_r0(BIT_BAND_ALIAS(buf), BIT_BAND_ALIAS(bitmap));
        break;
    case CanvasOrientationHorizontalFlip:
        canvas_blit_fast_int_r2(BIT_BAND_ALIAS(buf), BIT_BAND_ALIAS(bitmap));
        break;
    case CanvasOrientationVertical:
        canvas_blit_fast_int_r1(buf, bitmap);
        break;
    case CanvasOrientationVerticalFlip:
        canvas_blit_fast_int_r3(buf, bitmap);
        break;
    default:
        furi_crash();
        break;
    }
}

Sidenote - apparently it is impossible to rotate the canvas in Direct Draw mode. Should canvas_get_orientation and canvas_set_orientation be moved to the public API to accommodate for this use case? Canvas orientation does not remap inputs, but it does change the coordinate origins, so it could potentially still be useful.