Open CookiePLMonster opened 8 months ago
Note: Virtual Display RPC can benefit from an addition like this, since it also blits a fullscreen buffer to the screen directly. https://github.com/flipperdevices/flipperzero-firmware/blob/26da5f564b7f72f71b61f4a74f4638be8d435323/applications/services/rpc/rpc_gui.c#L256
This kind of plays into an already-planned refactor referenced further in this file:
// TODO FL-3511: consider refactoring
// Using display framebuffer size as an XBM buffer size is like comparing apples and oranges
// Glad they both are 1024 for now
I played around with this demo more and got even better performance by slightly unrolling the loop to do tile columns (8px) in one sweep:
static void canvas_blit_fast(Canvas* canvas, const uint8_t* bitmap) {
uint8_t* buf = canvas_get_buffer(canvas);
uint32_t* canvas_buffer = BIT_BAND_ALIAS(buf);
const uint32_t* bitmap_buffer = BIT_BAND_ALIAS(bitmap);
for(uint32_t y = 0; y < 8192; y += 1024, bitmap_buffer += 1024) {
const uint32_t* bitmap_cur_tile = bitmap_buffer;
for(uint32_t x = 0; x < 1024; x += 8) {
// Unroll the copy of a single tile column
canvas_buffer[x + y + 0] = bitmap_cur_tile[0 * 128];
canvas_buffer[x + y + 1] = bitmap_cur_tile[1 * 128];
canvas_buffer[x + y + 2] = bitmap_cur_tile[2 * 128];
canvas_buffer[x + y + 3] = bitmap_cur_tile[3 * 128];
canvas_buffer[x + y + 4] = bitmap_cur_tile[4 * 128];
canvas_buffer[x + y + 5] = bitmap_cur_tile[5 * 128];
canvas_buffer[x + y + 6] = bitmap_cur_tile[6 * 128];
canvas_buffer[x + y + 7] = bitmap_cur_tile[7 * 128];
bitmap_cur_tile++;
}
}
}
This version of fast blit gets around 222FPS, which is more or less 35FPS more than my first attempt. At this point it's so fast that my demo sample is bottlenecked by the FPS counter itself.
Size shouldn't be a concern either - with -Os
the inner loop is very lightweight and the additional ~20 bytes used by the code are worth the performance gain:
EDIT:
I experimented with moving stuff to an internal function with both buffers marked as __restrict
in hopes of the compiler generating code that would avoid a potential stall on R6
, but I couldn't. I'm not sure if these stores will stall waiting for loads to finish, either, but to me it looks like it might.
EDIT2:
[b] Neighboring load and store single instructions can pipeline their address and data phases. This enables these instructions to complete in a single execution cycle.
I was able to confirm that these operations pipeline well by implementing a variation where I first read all 8 samples, then write them. This approach was marginally (but measurably) slower, around -2FPS on this test. Thus, simpler is better here.
looks very promising
Note to self: This code ignores the canvas orientation, so it currently breaks the Lefty mode. It most likely needs splitting into 4 separate functions/code paths per canvas orientation to do copies the other way around.
On an upside, vertical orientation (or vertical flip, not sure at the moment) can then "degenerate" to a single memcpy because the pixels should be ordered linearly! So in practice this only needs 3 distinct functions, as memcpy can be thought of as shared.
@skotopes I updated the proof of concept to handle all 4 canvas orientations. Notice that the vertical orientations do not require the bit band and can operate on full bytes, which makes the copy even faster - my example checkerboard render is rendering at 280FPS vertically!
static void canvas_blit_fast_int_r0(uint32_t* __restrict dest, const uint32_t* __restrict src) {
for(uint32_t y = 0; y < 8192; y += 1024, src += 1024) {
const uint32_t* bitmap_cur_tile = src;
for(uint32_t x = 0; x < 1024; x += 8) {
// Unroll the copy of a single tile column
dest[x + y + 0] = bitmap_cur_tile[0 * 128];
dest[x + y + 1] = bitmap_cur_tile[1 * 128];
dest[x + y + 2] = bitmap_cur_tile[2 * 128];
dest[x + y + 3] = bitmap_cur_tile[3 * 128];
dest[x + y + 4] = bitmap_cur_tile[4 * 128];
dest[x + y + 5] = bitmap_cur_tile[5 * 128];
dest[x + y + 6] = bitmap_cur_tile[6 * 128];
dest[x + y + 7] = bitmap_cur_tile[7 * 128];
bitmap_cur_tile++;
}
}
}
static void canvas_blit_fast_int_r2(uint32_t* __restrict dest, const uint32_t* __restrict src) {
for(uint32_t y = 8192; y > 0; y -= 1024, src += 1024) {
const uint32_t* bitmap_cur_tile = src;
for(uint32_t x = 0; x < 1024; x += 8) {
// Unroll the copy of a single tile column
const uint32_t cur_y = y - 1024;
dest[x + cur_y + 0] = bitmap_cur_tile[0 * 128];
dest[x + cur_y + 1] = bitmap_cur_tile[1 * 128];
dest[x + cur_y + 2] = bitmap_cur_tile[2 * 128];
dest[x + cur_y + 3] = bitmap_cur_tile[3 * 128];
dest[x + cur_y + 4] = bitmap_cur_tile[4 * 128];
dest[x + cur_y + 5] = bitmap_cur_tile[5 * 128];
dest[x + cur_y + 6] = bitmap_cur_tile[6 * 128];
dest[x + cur_y + 7] = bitmap_cur_tile[7 * 128];
bitmap_cur_tile++;
}
}
}
static void canvas_blit_fast_int_r1(uint8_t* __restrict dest, const uint8_t* __restrict src) {
for(uint32_t x = 1024; x > 0; x -= 128, ++src) {
const uint32_t cur_x = x - 128;
const uint8_t* bitmap_cur_tile = src;
for(uint32_t y = 0; y < 128; ++y) {
dest[cur_x + y] = *bitmap_cur_tile;
bitmap_cur_tile += 8;
}
}
}
static void canvas_blit_fast_int_r3(uint8_t* __restrict dest, const uint8_t* __restrict src) {
for(uint32_t x = 0; x < 1024; x += 128, ++src) {
const uint8_t* bitmap_cur_tile = src;
for(uint32_t y = 128; y > 0; --y) {
const uint32_t cur_y = y - 1;
dest[x + cur_y] = __RBIT(*bitmap_cur_tile) >> 24;
bitmap_cur_tile += 8;
}
}
}
static void canvas_blit_fast(Canvas* canvas, const uint8_t* bitmap) {
uint8_t* buf = canvas_get_buffer(canvas);
switch(canvas_get_orientation(canvas)) {
case CanvasOrientationHorizontal:
canvas_blit_fast_int_r0(BIT_BAND_ALIAS(buf), BIT_BAND_ALIAS(bitmap));
break;
case CanvasOrientationHorizontalFlip:
canvas_blit_fast_int_r2(BIT_BAND_ALIAS(buf), BIT_BAND_ALIAS(bitmap));
break;
case CanvasOrientationVertical:
canvas_blit_fast_int_r1(buf, bitmap);
break;
case CanvasOrientationVerticalFlip:
canvas_blit_fast_int_r3(buf, bitmap);
break;
default:
furi_crash();
break;
}
}
Sidenote - apparently it is impossible to rotate the canvas in Direct Draw mode. Should canvas_get_orientation
and canvas_set_orientation
be moved to the public API to accommodate for this use case? Canvas orientation does not remap inputs, but it does change the coordinate origins, so it could potentially still be useful.
Describe the enhancement you're suggesting.
Since @skotopes mentioned in another issue that suggestions for Canvas API expansions can be considered, an addition to allow for fast blit of a buffer on-screen would be appreciated.
Consider this demo example, based upon the Direct Draw API debug app. In this sample, I allocate a single 128x64px buffer, rasterize on it manually (in this case, a simple checkerboard, but in a real app it'd be substituted by more non-trivial workload, like in e.g. my Mode 7 demo), then I blit the buffer on screen in a single full-screen draw:
Sample app
```cpp #includeCurrently,
canvas_draw_xbm
is used for tasks like this. However, u8g2 proves itself quite inefficient for a task like this that should be a glorified memcpy - instead, u8g2 appears to be drawing bit by bit.For my proof-of-concept, I exposed
canvas_get_buffer()
to the public API. However, the memory returned by this function is tiled and therefore not user-friendly, so exposing this function is not a solution to the problem. Instead, I suggest introducing a newblit
function that does the swizzle and assignment internally, nothing else.I benchmarked both solutions on this app, and the results (on a
COMPACT=1
build) are as follows:canvas_draw_xbm
- 39.0FPScanvas_blit_fast
- ~194FPSIn a real app, the gain of
blit_fast
would likely be slightly smaller, as the call got inlined in my sample app and I have not profiled it withnoinline
. Regardless, it should still be substantial, as the function would still be subject to loop unrolling/other optimizations when part of the firmware.Anything else?
I'm not pressed on this being the only "correct" solution. A few other things come to mind:
canvas_draw_xbm
could be special-cased for fullscreen draws to do something like this.=
to^=
or|=
, depending on the combination of those modes. That said, I think it would be beneficial to have a function that guarantees a copy as-is, hence the idea to differentiatedraw
functions (respecting the u8g2 state) andblit
functions (ignoring that state by design).