emoon / rust_minifb

Cross platfrom window and framebuffer crate for Rust
MIT License
1.04k stars 99 forks source link

Scaling not working on XWayland. #51

Closed cedws closed 6 years ago

cedws commented 6 years ago

Hi. Currently on an X11 system but I'll try to get more details for you later.

The minifb::Scale option appears to have no effect on Wayland. I also seem to remember seeing an info message saying that only minifb::Scale::X8 and lower is supported. The window starts extremely small.

I believe the system I had the issue on was running Fedora 27, which uses XWayland.

cedws commented 6 years ago

I'm testing on X11 and managed to get out: Scale above 4 not support currently, defaults to 4

emoon commented 6 years ago

Hi,

If you set scaling to 4 do you still see no effect?

cedws commented 6 years ago

@emoon There's a difference between 2 and 4. Those work fine. As the info message suggests, scale factors above 4 don't work/are not supported. Can we do anything about this?

emoon commented 6 years ago

Yeah so the problem is that X11 doesn't have any support for scaling by itself which macOS and Windows has so I implemented them manually like this

https://github.com/emoon/rust_minifb/blob/master/src/native/x11/X11MiniFB.c#L490

It would easy to implement x8 or more here. I have manually unrolled it to keep it as fast as possible but the best option would likely to use SIMD here to have good performance.

An other alternative would be to have optional support for OpenGL and have the GPU do that work but I haven't looked into that.

emoon commented 6 years ago

I have fixed this now. 0.10.5 has been released with this added.

cedws commented 6 years ago

@emoon Just gave this a try, X8 seems to be working, but X16 and X32 are still unsupported. Any plans for this in the future?

emoon commented 6 years ago

I actually didn't realize I support up to X32. I can try to add support for it. You must have a pretty small window and a fairly large screen res if you need that though :)

cedws commented 6 years ago

I would quite like to use X16, as I'm building a CHIP8 emulator which has a 64x32 resolution. Thus, my window is currently only 512x256. I could probably alter my emulator to duplicate a pixel value to do some scaling myself but it would be nice to have in minifb. Please do if you have the time, thank you :+1:

emoon commented 6 years ago

Alright. I will try to get it sorted during the weekend :)

cedws commented 6 years ago

@emoon I've just noticed some performance differences between scale factors. You said that you were having to implement this scaling manually as X11 doesn't support it, so I assume this what's causing it. Is it possible to optimise this function in any way? Maybe a memcpy? Thanks.

emoon commented 6 years ago

The problem is that you can't do a memcpy because you need to "expand" a single pixel to multiple pixels so it means that you need to write it to the same buffer several times. Like this https://github.com/emoon/rust_minifb/blob/master/src/native/x11/X11MiniFB.c#L490

When scaling up there will of course be more pixels to write making it taking CPU time. In order to reduce the time I can look at the generated assembly and make sure it's generated correct, it's also possible to use SSE on x86 machines but I don't think it will help much as you are almost only bound writing to memory here.

In terms of perf it should be faster to use the release version. Another thing you can do is to run your update in a thread so it runs at the same time as the update_with_buffer call is running.

There is a another solution to this and it's to use OpenGL to do the update instead. This has several issues such as requiring OpenGL which is a big downside as it isn't needed today. I'm not sure if it's possible to only do it optional but it's quite a bit of work to get there.

emoon commented 6 years ago

I have investigated this now. For x8 the compiler generates this code

.L131:
        add     rsi, 4
        movd    xmm1, DWORD PTR -4[rsi]
        mov     r15d, ecx
        add     ecx, r8d
        pshufd  xmm0, xmm1, 0
        movups  XMMWORD PTR [rdi], xmm0
        movups  XMMWORD PTR [rax], xmm0
        movups  XMMWORD PTR [rdi+r14], xmm0
        movups  XMMWORD PTR [rax+r14], xmm0
        movups  XMMWORD PTR [rdi+r13], xmm0
        movups  XMMWORD PTR [rax+r13], xmm0
        movups  XMMWORD PTR [rdi+r12], xmm0
        movups  XMMWORD PTR [rax+r12], xmm0
        movups  XMMWORD PTR [rdi+rbp], xmm0
        movups  XMMWORD PTR [rax+rbp], xmm0
        movups  XMMWORD PTR [rdi+r10], xmm0
        movups  XMMWORD PTR [rax+r10], xmm0
        movups  XMMWORD PTR [rdi+r9], xmm0
        movups  XMMWORD PTR [rax+r9], xmm0
        movups  XMMWORD PTR [rdi+r11], xmm0
        add     rdi, rbx
        movups  XMMWORD PTR [rax+r11], xmm0
        add     rax, rbx
        cmp     edx, r15d
        jg      .L131
        add     DWORD PTR -24[rsp], r8d
        add     rdi, QWORD PTR -16[rsp]
        mov     eax, DWORD PTR -24[rsp]
        cmp     DWORD PTR -20[rsp], eax
        jg      .L130

Which is pretty good as I write u32 values but it figures out to use SSE instead.

On my machine when running a 64x32 buffer and x8 scaling the update_with_buffer code takes 0.18-0.2 ms and I think it's not that easy to improve much more as the compile does a fairly good job here.

emoon commented 6 years ago

I have added x16 and x32 now in 0.10.6 (x32 doesn't seem to behave correct for me but you can try it out, x16 works fine)

cedws commented 6 years ago

They work brilliantly, thank you very much for your hard work :+1:. As for the performance issue, I'm not really sure what the problem is. I am confident that it's due to scaling, as the difference in render speed between X2 and X16 is significant. As you say, I don't think there's much that can be done there though. It's just a little hobby project anyway, performance isn't a big issue. Thanks again!

emoon commented 6 years ago

Cool np :) Yeah. As I do the scaling on the CPU there is not much to do to get it faster in higher res this is really a task for the GPU but (like I wrote above) I would need to depend on HW OpenGL acceleration which may or may not be present in the system.