audetto / AppleWin

Apple II emulator for Linux
GNU General Public License v2.0

New features (SDL) #12

Closed audetto closed 2 years ago

audetto commented 4 years ago

Qt app: quit from menu, Ctrl-Q.
SDL2: F6 full screen and some command line options.

audetto commented 4 years ago

SDL2: F2 quit, Left / Right Alt: Open / Solid Apple.
--qt-ini will reuse the Qt config file.

audetto commented 4 years ago

Ok keyboard works. But I do not like it. I was not sure if I should take a physical view of the keyboard or an ASCII one, and the result is a bit of both. It is impossible to do CTRL-ASCII. Need to sort it out.

webspacecreations commented 4 years ago

This all looks very promising and it's great to see the SDL build included with Qt build. A quick informal test (Mario Bros) and I definitely see some input delays for Qt. SDL appears more responsive, but although CPU is only about 30% on Pi4, emulation appears to be running slower than normal (enhanced speed is unchecked). The --qt-ini flag makes changing configurations very easy. Exciting to see the rapid progress!

audetto commented 4 years ago

Added audio. Currently there is a 200 ms delay. I need to see how much it can be reduced while avoiding underruns.

Enhanced speed only affects the emulator when the disk is spinning. CPU utilisation is not always obvious. Are you looking at 30% of 1 CPU, or 30% of all CPUs? On the Pi3 it uses 1 entire CPU, and it is easy to fall behind.

I need to find some tradeoff quality / speed.

webspacecreations commented 4 years ago

Would it be possible to allocate audio processing to another core? I can imagine synchronization might be an issue... what about dedicating a core to the 6502 CPU responsibilities and another core to everything else (or developing the model of add-on cards using separate threads / cores)?

Have you reached out to the main AppleWin group to inquire about your code repository being wrapped in as a first class citizen?

audetto commented 4 years ago

The problem with what you suggest is that it requires a departure from AppleWin, which would make merging 10x more complicated. The only thing I could try to put on a different thread is the non-AppleWin audio processing, but I doubt it will make any difference. At the moment it is just copying the buffer to SDL.

What might be more effective is a decrease in video quality where most of the time is actually spent.

As for merging the code with AW, you could try to mention it there and see what they answer: https://github.com/AppleWin/AppleWin/issues/538

I periodically create PRs to fix compiler issues in the files that are shared, but have never tried to unify the code completely. AW would require some work to become more modular, which I don't think they are interested in.

webspacecreations commented 4 years ago

I stumbled across the following value in the applelin.conf file from https://github.com/linappleii/linapple: Singlethreaded = 0

The comment surrounding this config value reads "By default the emulator's draw code, a large share of the processing, is performed in a separate thread, probably on a different core." I can confirm that on Pi two cores are being consumed for smooth sound & video.

I understand not wanting to deviate from the AppleWin core, but since the video component deviates anyway, offloading video to a separate core may be reasonably straightforward and reconcile issues on Pi4.

audetto commented 4 years ago

Could you please do me a favour and check this

https://github.com/audetto/AppleWin/blob/master/source/frontends/sa2/emulator.cpp#L197

Using top, check the %CPU with master, then comment out lines 197-205 and check again.

The emulator will not run and just render a black screen. For me on a Pi3, it still goes at 77% CPU which means the SDL2 rendering is the bottleneck.

No idea how to improve it though, short of reducing FPS.

webspacecreations commented 4 years ago

With top, CPU utilization is at 97-98%. Looks like pretty close to a full core being utilized based on overall CPU utilization of around 33%. Don't see a latency / performance problem with the normal screen.

Switching to fullscreen, I now see lag (guessing top would show 100% utilization). Commenting out lines dropped overall utilization by 5-8%, but obviously isn't displaying anything.

The above referenced linapple repository successfully runs SDL on a separate core, but it's SDL 1.2. I'm attaching A2 emulator code I've been working on from James Hammons that pulls some of the SDL2 initialization routines from GSPlus as a possible point of reference. apple2.cpp.txt

The code may or may not be useful... the emulator itself only uses ~2/3 of a CPU core, so I have no idea whether it would span an additional core if needed.

An additional note... SDL seems to be very sensitive to how it's initialized. It's still early to say definitively, but I'm seeing that initializing with the wrong audio values can sometimes sap performance.

audetto commented 4 years ago

I am still puzzled by the results. In Windows AW runs at < 1% CPU, on linux (same machine) at about 20%.

What definitely takes a long time is the video update, which you can switch off to check.

If you disable this

https://github.com/audetto/AppleWin/blob/master/source/frontends/sa2/emulator.cpp#L199

it will still run the CPU and draw the screen (black), but will not update the Apple bitmap. The bitmap update is what takes a very long time.

I suspect that other emulators have a less precise video generation and so run quicker. I originally wrote my own video update (non precise) but dropped as it was a lot of extra work.

You are definitely right about SDL and I still have to find a quick way to just paint a black screen at 60FPS on a Pi using SDL2.

audetto commented 4 years ago

Try this, I get a 10% CPU improvement:

https://github.com/audetto/AppleWin/tree/pi

On a Pi3 with FakeKMS: ~77% in 1x size, ~99% in 2x size. What do you get on Pi4?

audetto commented 4 years ago

I have tried https://github.com/robmcmullen/apple2

and it does run very much at the same speed as my SDL port. I tweaked the makefile for optimisation, but it runs at 98% of CPU with very bad audio.

My version runs at 78% or 99% as mentioned above and in 1x size audio is ok.

I think they all suffer the same problem: SDL screen drawing.

If you find any other emulator that runs quick on a Pi at 60FPS, I am happy to copy their video rendering.

webspacecreations commented 4 years ago

I've really been surprised by the performance differential between SheepShaver (PPC Mac emulator) versus several different Apple II emulators. I'm not sure whether the build I'm running is using SDL1.2 or SDL2, but resource utilization is only about 10% of CPU. My guess is the A2 emulators dedicate more cycles to ensure proper 6502 timing (whereas PPC Mac emulators mainly don't seem to care about CPU timing). It looks like SheepShaver can be compiled against either SDL1.2 or SDL2, which may be a solid test to identify whether SDL2 is the performance culprit.

For GSPlus (and several other A2 emulators), increasing audio sample buffer initialization up to 4096, (e.g. wanted.samples = 4096;) addressed both sound problems AND reduced CPU load for SDL1.2 AND SDL2. A key advantage of SDL2 is that you get "free" scaling and texture overlays. It's allowed me to make the screen resizeable (up to 1080p) without significantly affecting performance as well as simulating scanlines. HOWEVER, in order to do this, I had to disable where the emulator was relying on code routines to double the image size (i.e. switched emulator FROM running at 560x384 with software scaling BACK to 280x192 relying on SDL2 hardware scaling for arbitrary resolution). This emulator is running at 60FPS along with some video improvements at about 66% core utilization. It seems that increasing the video buffer from 280x192 to 560x384 is pushing too much data to the SDL video buffer.
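
A rough sense of scale for that last point, as a minimal sketch. It assumes 32-bit pixels and a full-frame push at 60 FPS, neither of which is confirmed for GSPlus here, and `bytes_per_second` is a made-up helper name.

```cpp
#include <cstdint>

// Hypothetical helper: bytes copied to the video buffer per second,
// assuming every frame is pushed in full.
constexpr std::uint64_t bytes_per_second(std::uint64_t width, std::uint64_t height,
                                         std::uint64_t bytes_per_pixel, std::uint64_t fps)
{
    return width * height * bytes_per_pixel * fps;
}

// Assumed for illustration: 4 bytes per pixel, 60 frames per second.
constexpr std::uint64_t native_rate  = bytes_per_second(280, 192, 4, 60);
constexpr std::uint64_t doubled_rate = bytes_per_second(560, 384, 4, 60);

// Doubling both axes quadruples the amount of data moved.
static_assert(doubled_rate == 4 * native_rate, "560x384 moves 4x the data of 280x192");
```

Under those assumptions the 560x384 buffer moves about 51.6 MB per second against about 12.9 MB for 280x192.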

I haven't taken a close look at how you're rendering video yet (i.e. a whole frame at a time or only refreshed regions). I don't know of any easy way to calculate a "buffer checksum," but for some emulators I think CPU utilization would go WAY down if it were possible to quickly calculate the video buffer checksum (something like an array_sum function) and only push video updates to SDL1.2/2 video buffer if the checksum changes.

I'm not really surprised that DirectX on Windows is rendering faster than SDL2 on Linux. Unless you're using OpenGL within SDL, it's probably not taking full advantage of the GPU.

webspacecreations commented 4 years ago

GSPlus idles at about 10% CPU utilization, which is much more inline w/ SheepShaver (PPC Mac emulator). It uses SDL2 and is probably the best reference (check the Issues section for tips on proper compilation). Linapple happily runs over 100% (top), meaning it appears to be truly multi-core, but it's SDL1.2. The emulator I've optimized by James Hammons runs at 66% of a CPU core. Happy to drop ARM compiled binaries if you want to check for yourself. Alternatively, happy to supply whatever source you'd like (if you can't find it on GitHub).

audetto commented 4 years ago

One thing at a time.

  1. We are only talking about Pi (3/4). On my PC it runs at 20% CPU maximum, no matter what the window size is.
  2. I've tried https://github.com/robmcmullen/apple2 and it behaves exactly like my code on a Pi3. If you believe it is faster, then please fork it, make all necessary changes and I will compare it.
  3. I thought that this was your code, I will try the cpp file directly
  4. SheepShaver, GSPlus: please post some github links I can try (again, please fork and modify if they need tweaks)
  5. without running it, vice has exactly the same drawing code: https://github.com/hpingel/vice-emu-mirror/blob/9842c45458aea54a05cbf081636cb013fa4d2de5/vice/src/arch/sdl/video_sdl2.c#L686
  6. I already asked the Pi forums about fast screen drawing and did not get any useful ideas: https://www.raspberrypi.org/forums/viewtopic.php?f=67&t=259450&p=1581132#p1581073
  7. I have an idea about splitting SDL to a separate thread which I will try today

webspacecreations commented 4 years ago

I sat down and took a look at your rendering code. I think you're spending a lot of time on the memcpy operation inside emulator.cpp (~40% of CPU time on a Pi4) to basically copy your video buffer for SDL (I believe instead of using SDL's own version). My own tests with memcpy in the past weren't very good for a block of data as large as what you're generating (I think you mentioned 560 x 384 in your SDL forum post).

Here's relevant (working) code that renders without a memcpy operation:

    SDL_LockTexture(sdlTexture, NULL, (void **)&scrBuffer, &scrPitch);
    ... operations that draw graphics / text to scrBuffer ...
    SDL_UnlockTexture(sdlTexture);
    SDL_RenderCopy(renderer, sdlTexture, NULL, NULL);

Here's a really simple example that renders a black screen:

    SDL_LockTexture(sdlTexture, NULL, (void **)&scrBuffer, &scrPitch);
    memset(scrBuffer, 0, VIRTUAL_SCREEN_WIDTH * VIRTUAL_SCREEN_HEIGHT * sizeof(uint32_t));
    SDL_UnlockTexture(sdlTexture);
    SDL_RenderCopy(renderer, sdlTexture, NULL, NULL);
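
One caveat with the memset in the black-screen example: it assumes rows are tightly packed, while SDL_LockTexture reports a pitch (bytes per row) that may be larger than width * 4. Below is a minimal sketch of a row-by-row clear that respects the pitch. It has no SDL dependency, `clear_rows` is a hypothetical helper, and 32-bit pixels are assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: zero `height` rows of `width` 32-bit pixels in a buffer
// whose rows start `pitch` bytes apart. Padding bytes after each row are left
// untouched.
inline void clear_rows(void* pixels, int pitch, int width, int height)
{
    const std::size_t row_bytes = static_cast<std::size_t>(width) * sizeof(std::uint32_t);
    assert(pitch >= 0 && static_cast<std::size_t>(pitch) >= row_bytes);

    auto* row = static_cast<std::uint8_t*>(pixels);
    for (int y = 0; y < height; ++y)
    {
        std::memset(row, 0, row_bytes);
        row += pitch; // advance by the pitch, not by the visible row size
    }
}
```

With SDL2 this would sit between SDL_LockTexture and SDL_UnlockTexture, using the scrBuffer and scrPitch values that the lock call filled in.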

You've got a similar set of operations within your refreshTexture method, but are returning a rectangle and then pursuing subsequent rendering operations. I tried a relatively straightforward code swap, but am getting a black screen. Top, however, shows %CPU right at 60%, so if you can get rid of memcpy I believe it'll run with breathing room on a Pi 4.

If you do manage to engage the Broadcom GPU blob mentioned within the SDL forums (perhaps using OpenGL ES), combined with eliminating memcpy, I think you'll be able to get this running with desired performance characteristics on a Pi 3.

audetto commented 4 years ago

Yes,

this is something I changed here

https://github.com/audetto/AppleWin/tree/pi

It seems that SDL_UpdateTexture is faster even if the doc says no. So it requires no memcpy.

The problem with your code is that AW manages its own memory buffer for the video, so it would require deeper changes.

webspacecreations commented 4 years ago

I think your best bet is to really investigate how GSPlus is doing things: https://github.com/digarok/gsplus (v015 branch)

The only change I've made locally is documented here (to get sound working): https://github.com/digarok/gsplus/issues/106

When I start this up, the emulator indicates that OpenGL is being used. Running at 2.8 MHz, top indicates about 10% CPU utilization. At 8 MHz it's up to 25%. At "unlimited" I'm hitting 100% CPU, but that's no surprise. Looks like the key is to get OpenGL involved. Based on what I'm seeing with GSPlus, that should fix the problems on Pi3 too.

audetto commented 4 years ago

Try this

https://github.com/audetto/AppleWin/tree/threads

I've moved AW CPU to a separate thread. CPU utilisation overall has gone up, but I can hear audio without glitches both at 2x size and full screen.

webspacecreations commented 4 years ago

Tested threads branch. I do see that CPU is exceeding 100% (multi-threading) and emulator is playable at full-screen with a minor increase in load. Sound, for me, is still significantly delayed (by at least a few seconds).

webspacecreations commented 4 years ago

As another point of reference, GSPort (https://github.com/david-schmidt/gsport/) is running at under 5% utilization (aoss ./gsportx for sound). Pretty sure it's not SDL2 though.

Could this be useful: https://github.com/digarok/gsplus/issues/58?

audetto commented 4 years ago

https://github.com/digarok/gsplus : this one is fast because it does not redraw at a constant speed. If the screen changes, CPU goes up. It is a good idea, but needs cooperation from AW. The code was more complicated than I was able to understand at a quick glance. I put some counters around texture update and render copy.

https://github.com/david-schmidt/gsport/ : this one uses xlib and maybe the same trick as gsplus. I don't want to try xlib. If you can make up a simple loop refreshing at 60Hz, and it is fast, then we can move from SDL to xlib (a very sad decision).

https://github.com/digarok/gsplus/issues/58 : it does not say much.

sicklittlemonkey commented 4 years ago

Have you run a profiler on it? Seems like gprof and Gperftools work on Pi.

I actually rewrote the core Windows BitBlt/StretchBlt in GSport back in 2015 to add full-screen integer scaling. I'm pretty sure some partial redraws were done, but I completely forget the redraws, sorry. I did test other GSport features I was developing on a Pi 2 and performance seemed fine.

audetto commented 4 years ago

Short of profiling SDL2 and the Pi kernel, I don't know what else to do.

In order to increase SNR, I've created a small project that does exactly the same as this SDL2 port of AW so we can experiment and find the best configuration.

https://github.com/audetto/SDL_Demo

Compiled in Release, on a Pi3 I get 66% CPU just to redraw the screen. This leaves very little for the emulator (which could be profiled on ARM anyway).

If anybody can do better than this, I'd be happy to know.

Doing a smart update of the screen requires invasive changes to AW video update routines, which will not happen anytime soon. Running at 30Hz is another possibility.

webspacecreations commented 4 years ago

This is a good approach. I took a few stabs at it and don't see a way to significantly reduce the CPU load. It DOES look like the problem relates to SDL and there are special ways to build SDL for Pi that interface directly with the Broadcom GPU (which then has to be statically linked). A discussion here that looks particularly relevant: https://github.com/grimfang4/sdl-gpu/issues/87

...and this: https://sourceforge.net/projects/raspberry-pi-cross-compilers/

audetto commented 3 years ago

In that discussion they were suggesting opengles and if one uses the opengles2 driver in SDL, CPU usage drops to 42% in the demo. Good.

I've added a few options to ./sa2 (see --help):

At the end of the run, it will print stats about timings:

Video refresh rate: 60 Hz, 16.67 ms
Global:  [. .], total =    7789.16 ms, mean =    7789.16 ms, std =       0.00 ms, n =      1
Events:  [0 M], total =      22.42 ms, mean =       0.05 ms, std =       0.17 ms, n =    471
Texture: [0 M], total =     113.32 ms, mean =       0.24 ms, std =       0.06 ms, n =    471
Screen:  [0 .], total =    7624.87 ms, mean =      16.19 ms, std =       1.66 ms, n =    471
CPU:     [1 M], total =     647.21 ms, mean =       1.34 ms, std =       0.48 ms, n =    484
Expected clock: 1020484.45 Hz, 7.74 s
Actual clock:   1014560.11 Hz, 7.79 s

The meaning of [0 M] is: 0/1 which thread and M if it is in the mutex protected area.

They do not include time spent in locking.

The clock shows expected vs actual speed (crucial for correct audio play).
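
For reference, the total / mean / std / n columns can be accumulated in one pass without storing every sample. The sketch below uses Welford's online algorithm. It is not taken from sa2, the `TimingStats` name is made up, and since the output does not say whether std is the population or the sample form, the population form is assumed.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical accumulator for per-section timings in milliseconds.
struct TimingStats
{
    std::size_t n = 0;
    double total = 0.0;
    double mean = 0.0;
    double m2 = 0.0; // running sum of squared deviations from the mean

    void add(double ms)
    {
        assert(ms >= 0.0); // durations cannot be negative
        ++n;
        total += ms;
        const double delta = ms - mean;
        mean += delta / static_cast<double>(n);
        m2 += delta * (ms - mean); // Welford update
    }

    // Population standard deviation; 0 when fewer than two samples were added.
    double stddev() const
    {
        return n > 1 ? std::sqrt(m2 / static_cast<double>(n)) : 0.0;
    }
};
```

A single sample gives a standard deviation of 0, which is consistent with the Global row above (n = 1, std = 0.00 ms).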

webspacecreations commented 3 years ago

FYI, the changes drop CPU from ~50% (top) on Pi4 to 27%. Just about cuts resource usage in half.

sicklittlemonkey commented 3 years ago

Hi guys. If the lack of hardware acceleration is the problem, the only recourse will be to try updating frames (to SDL) less often. A couple of approaches come to mind.

  1. I think GSport does this: track performance, and if it's less than desired then increase a variable which controls frame skipping. So first you would skip every other frame (30 fps) etc. Of course, AppleWin will still be updating the bitmap, but you just don't push it to SDL.
  2. If it can be done efficiently, try to detect whether a new frame to be pushed is the same as the last frame pushed. If it is, don't push to SDL. Ideally this would be done inside AppleWin, but support for this kind of feature was removed as Tom described. The cell-based method AppleWin used was a common way to do this. Also line by line methods are used. Another simple way is to keep the video data for the last frame and compare byte by byte as the new frame is generated. This could be done externally to AppleWin. A quicker method (rather than comparing) would be to generate a checksum of the frame and just keep that. Compare it to the checksum for the new frame and skip it if it is the same. Obviously this risks skipping frames that are different but have the same checksum. The safest way is to keep the entire last frame and compare every byte to the new frame. Of course, if they differ early you can immediately start the SDL push. In most cases they won't differ, which means you check the entire buffer, but the upside is you avoid the SDL push, so it might be faster.
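
The "keep the entire last frame and compare" variant from point 2 could be sketched as below, done outside AppleWin. `FrameCache` is a hypothetical name, and the frame is treated as a flat byte buffer. memcmp implementations typically stop at the first difference, which gives the early exit described above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper: remembers the last frame pushed and reports whether a
// new frame differs from it.
class FrameCache
{
public:
    // Returns true if `frame` should be pushed (first frame, size change, or
    // any byte differs) and stores it; returns false if it is identical to
    // the previously stored frame.
    bool update(const std::uint8_t* frame, std::size_t size)
    {
        assert(frame != nullptr || size == 0);
        if (size != 0 && last_.size() == size &&
            std::memcmp(last_.data(), frame, size) == 0)
        {
            return false; // identical: the SDL push can be skipped
        }
        last_.assign(frame, frame + size);
        return true;
    }

private:
    std::vector<std::uint8_t> last_;
};
```

The caller would only update the texture and present when `update` returns true. The cost is one full compare for an unchanged frame, or a partial compare plus a full copy for a changed one.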

Cheers, Nick.

audetto commented 3 years ago

This is what implicitly happens with the 2 threads version (I think). If the main thread lags behind, it will skip to the next vsync without affecting the CPU thread too much (the vsync is not mutex protected). It will delay audio by 16 ms in the worst case (just about missed the vsync), but I am now running with a 200 ms buffer, which is about 12 frames at 60 Hz (probably excessive anyway).
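
For comparison, the explicit frame skipping suggested earlier (track performance and raise a skip counter when falling behind) could be sketched like this. The class name, the thresholds and the "half the budget" rule for recovering are all made up for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical controller deciding which emulated frames get presented.
// skip == 0 presents every frame, skip == 1 every other frame, and so on.
class FrameSkipper
{
public:
    explicit FrameSkipper(int max_skip) : max_skip_(max_skip)
    {
        assert(max_skip >= 0);
    }

    // Call once per emulated frame; true means "push this frame to the screen".
    bool should_present()
    {
        if (counter_ >= skip_)
        {
            counter_ = 0;
            return true;
        }
        ++counter_;
        return false;
    }

    // Call after each presented frame with the time it took and the budget.
    void report(double frame_ms, double budget_ms)
    {
        if (frame_ms > budget_ms)
            skip_ = std::min(skip_ + 1, max_skip_); // falling behind: skip more
        else if (frame_ms < 0.5 * budget_ms)
            skip_ = std::max(skip_ - 1, 0);         // plenty of headroom: skip less
    }

    int skip() const { return skip_; }

private:
    int max_skip_;
    int skip_ = 0;
    int counter_ = 0;
};
```

AppleWin would still update its bitmap every frame; only the push to SDL is skipped.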

What I was toying with is exactly what you said, trying to be smart about detecting duplicate frames. Unfortunately, doing it "outside" AppleWin bitmap presents some challenges

  1. check sum: one must scan the whole buffer always, and this seems to take forever (is there a super fast CRC for ARM? we do not need "cryptographically" secure hash, just a quick check). This is the code I tried and it does not perform well at all:

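    #include <cstddef>
    #include <functional>

    // Boost-style hash_combine: mixes the hash of v into seed.
    template <class T>
    inline void hash_combine(std::size_t& seed, T const& v)
    {
        seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }

    // data points to the frame buffer (width * height pixels, 4 bytes each);
    // the loop hashes it one byte at a time.
    std::size_t seed = 0;
    for (std::size_t i = 0; i < width * height * 4; ++i)
    {
        hash_combine(seed, *(data + i));
    }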
  2. detect the updated "rectangle": in this case I need to scan up to the "first" diff (from both ends ideally). It is incredible how slow memcpy can be. It is true that you only memcpy between the first and last diff, but in total one still has to either check or copy every byte of the buffer each time.
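
The scan in point 2 could look roughly like the sketch below, working on whole rows instead of a full rectangle. `DirtyRows` and `find_dirty_rows` are hypothetical names, and both frames are assumed to be tightly packed with the same dimensions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical result type: rows [first, last] differ.
// first > last means the two frames are identical.
struct DirtyRows
{
    int first;
    int last;
};

// Compare two frames of `height` rows with `row_bytes` bytes each, scanning
// from the top and then from the bottom until a differing row is found.
inline DirtyRows find_dirty_rows(const std::uint8_t* prev, const std::uint8_t* next,
                                 int height, std::size_t row_bytes)
{
    assert(height >= 0);

    int first = 0;
    while (first < height &&
           std::memcmp(prev + first * row_bytes, next + first * row_bytes, row_bytes) == 0)
    {
        ++first;
    }
    if (first == height)
    {
        return {height, -1}; // no differences at all
    }

    int last = height - 1;
    while (last > first &&
           std::memcmp(prev + last * row_bytes, next + last * row_bytes, row_bytes) == 0)
    {
        --last;
    }
    return {first, last};
}
```

Rows outside the range are only compared and rows inside it are only copied, which matches the observation above that every byte is still touched once per frame.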

With the latest findings about opengles2, none of them are urgent, but I find the problem challenging and interesting, I will try to see what can be done.

webspacecreations commented 3 years ago

Here are a few compiler options compatible w/ cmake: set(CMAKE_CXX_FLAGS "-Wall -Wextra -fomit-frame-pointer -mcpu=cortex-a72 -mfloat-abi=hard -mfpu=crypto-neon-fp-armv8 -mneon-for-64bits")

the 'crypto' option for fpu might help with those hashes. It's designed for coin mining, so you may need to choose your hash algorithms carefully for it to kick in.

sicklittlemonkey commented 3 years ago

"1. check sum: one must scan the whole buffer always, and this seems to take forever (is there a super fast CRC for ARM? we do not need "cryptographically" secure hash, just a quick check). This is the code I tried and it does not perform well at all: seed ^= std::hash<T>()(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);"

That is very complicated considering we are trying to fix a performance problem. ;-) Using an 8-bit (size_t) hash is also a bad idea because you only have 256 values, so a high chance of collision.

I would just XOR or ADD the data to get the simplest checksum possible. Because we want the fastest result, and also the smallest chance of hash (checksum) collision, we should choose the longest data unit available. If you can use 64-bit then do that, otherwise just a running 32-bit XOR or ADD should be good enough.
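
A minimal sketch of that running XOR over 64-bit units follows. `xor_checksum` is a hypothetical name, and the memcpy loads are there so that unaligned buffers are handled safely (compilers normally turn them into plain loads).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: XOR of the buffer taken 64 bits at a time.
// This is a cheap change detector, not a collision-resistant hash.
inline std::uint64_t xor_checksum(const std::uint8_t* data, std::size_t size)
{
    assert(data != nullptr || size == 0);

    std::uint64_t result = 0;
    std::size_t i = 0;
    for (; i + sizeof(std::uint64_t) <= size; i += sizeof(std::uint64_t))
    {
        std::uint64_t word;
        std::memcpy(&word, data + i, sizeof(word)); // alignment-safe 64-bit load
        result ^= word;
    }

    // Fold in any trailing bytes (fewer than 8), zero-padded.
    if (i < size)
    {
        std::uint64_t tail = 0;
        std::memcpy(&tail, data + i, size - i);
        result ^= tail;
    }
    return result;
}
```

The weakness is the one already mentioned: changes that cancel out (for example the same 64-bit pattern appearing in two places) leave the checksum unchanged, so a changed frame can occasionally be missed.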

It's probably worth looking at the generated assembly language and using that to try to optimize your C++ loop. For instance, it might be like 6502 assembly in that counting backwards to 0 is more efficient than counting from 0 up to a constant.

Cheers, Nick.

audetto commented 3 years ago

All you say is true except: size_t is 32 / 64 bits depending on architecture.

You are right as well in the data size, the loop should be over 32 / 64 bits at least, now it does byte by byte.

sicklittlemonkey commented 3 years ago

Of course, you're right. I've been working in C#, JavaScript, and PowerShell for weeks, so my C++ personality is paged out. ;-)

I thought I would have a quick look in VS 2017, and got a surprise.

Counting down, I was pleased that it unrolled the loop, and this version took 20 microseconds:

    int const length = 560 * 192;
    int data[length] = {};
00B4109C  mov         esi,4600h  
    auto t1 = std::chrono::high_resolution_clock::now();
00B410A1  adc         dword ptr [esp+10h],edx  
    size_t result = 0;
00B410A5  xor         edx,edx  
    for (int i = length - 1; i >= 0 ; i -= 1)
    {
        result ^= (data[i]);
00B410A7  mov         ecx,dword ptr [eax+10h]  
00B410AA  lea         eax,[eax-18h]  
00B410AD  xor         ecx,dword ptr [eax+24h]  
00B410B0  xor         ecx,dword ptr [eax+20h]  
00B410B3  xor         ecx,dword ptr [eax+14h]  
00B410B6  xor         ecx,dword ptr [eax+1Ch]  
00B410B9  xor         ecx,dword ptr [eax+18h]  
00B410BC  xor         edx,ecx  
00B410BE  sub         esi,1  
00B410C1  jne         main+0A7h (0B410A7h)  
00B410C3  mov         dword ptr [esp+24h],edx  
    }

If your compiler doesn't do this you could manually unroll the loop - which didn't change the time here of course:

    for (int i = length - 1; i >= 0 ; i -= 4)
    {
        result ^= (data[i]);
        result ^= (data[i - 1]);
        result ^= (data[i - 2]);
        result ^= (data[i - 3]);
    }

But then I tried counting up, and VS unleashed the SIMD magic. This code ran in 11 microseconds:

    for (int i = 0; i < length; i += 1)
    {
        result ^= (data[i]);
00B810A1  movups      xmm0,xmmword ptr [esp+eax*4+28h]  
00B810A6  pxor        xmm1,xmm0  
00B810AA  movups      xmm0,xmmword ptr [esp+eax*4+38h]  
00B810AF  add         eax,8  
00B810B2  pxor        xmm2,xmm0  
00B810B6  cmp         eax,1A400h  
00B810BB  jl          main+0A1h (0B810A1h)  
    int const length = 560 * 192;
    int data[length] = {};
00B810BD  pxor        xmm1,xmm2  
00B810C1  movaps      xmm0,xmm1  
00B810C4  psrldq      xmm0,8  
00B810C9  pxor        xmm1,xmm0  
00B810CD  movups      xmm0,xmm1  
00B810D0  psrldq      xmm0,4  
00B810D5  pxor        xmm1,xmm0  
00B810D9  movd        dword ptr [esp+24h],xmm1  
    }

I would have to look those instructions up(!) but I know ARM has SIMD these days too.

Cheers, Nick.

audetto commented 3 years ago

Audio: made the emulator speed stick to wall clock. Removed some hacks around audio speed to leave AW adaptive algorithm to decide. Press F1 during emulation and it will print what it thinks the audio buffer size / queue is:

Channels: 1, buffer: 32768, SDL:  8804, queue: 0.47 s
Channels: 2, buffer: 45000, SDL: 65536, queue: 0.63 s

Channel 1 is the Speaker, 2 is the Mboard. The rest is the actual number of bytes to be played in the internal and SDL buffers, and queue is the total lag in seconds. It is probably twice as bad as AW in Windows but should be stable at least.
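
The queue figure appears to follow directly from the byte counts. The sketch below assumes a 44100 Hz sample rate, 16-bit samples, and that the Channels value doubles as the channel count (mono Speaker, stereo Mboard). None of these is stated here, and `queue_seconds` is a made-up helper.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: seconds of audio represented by the bytes waiting in
// the internal buffer plus the SDL buffer.
inline double queue_seconds(std::size_t buffer_bytes, std::size_t sdl_bytes,
                            int channels, int sample_rate, int bytes_per_sample)
{
    assert(channels > 0 && sample_rate > 0 && bytes_per_sample > 0);
    const double bytes_per_second =
        static_cast<double>(sample_rate) * channels * bytes_per_sample;
    return static_cast<double>(buffer_bytes + sdl_bytes) / bytes_per_second;
}
```

Under those assumptions, `queue_seconds(32768, 8804, 1, 44100, 2)` gives about 0.47 s and `queue_seconds(45000, 65536, 2, 44100, 2)` about 0.63 s, matching the two lines printed above.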

webspacecreations commented 3 years ago

Awesome to see you making headway! I've been investigating ways to integrate a proper UI (Gtk3 or Qt5) with SDL2 in order to be able to provide the responsiveness along with a full-fledged interface. SDL2 provides the SDL_CreateWindowFrom(window_id); call that I'm able to attach to a window I create with Gtk3. It should, in theory, work with something like Qt's WId QWidget::winId() const (https://doc.qt.io/qt-5/qwidget.html#winId).

There's not a whole lot of documentation available for mixing SDL with UI libraries, but it looks like Bsnes (https://github.com/bsnes-emu/bsnes) is using SDL2 with Gtk2, Gtk3, Qt4, and Qt5 (apparently selectable at compile time). It might be possible to come full circle back to the Qt interface you built along with the optimized SDL2 rendering code.

audetto commented 3 years ago

Can you post an example of how you display a gtk dialog in sdl?

webspacecreations commented 3 years ago

Unfortunately I don't have Gtk code that goes further than attaching SDL2 to a Gtk3 window. For my own needs, I think ImGui (https://github.com/ocornut/imgui) is the more straightforward approach. It provides GUI elements with native SDL2 support. It's pretty lightweight w/ extensive demo code. Since it uses the native rendering engine (e.g. SDL2), it doesn't require the "window hack" (for Gtk/Qt), which probably isn't portable to Windows, and ImGui itself appears reasonably cross-platform.

Since you're rewriting the display mechanism, I assume you have access to it, but the caveat with ImGui is that it ties into the main rendering loop. If you're interested in taking the ImGui route, I'll try to throw together sample dialog code.

audetto commented 3 years ago

My first reaction was: not another GUI toolkit! GTK or QT are stable, supported, and available everywhere.

But, but, but ....

Their front page looks really impressive. It is still a one man effort though.

Are packages available in the main distros: Fedora, Ubuntu, Raspbian? That would definitely help.

audetto commented 3 years ago

I've learned what ImGui is and made a 2nd SDL version using it. No dialogs yet, but they seem easy to add.

It uses OpenGL2, but they say one should jump to OpenGL3, and I need to see how they both work on a Pi.

https://github.com/audetto/AppleWin/tree/imgui

One needs to pass -DIMGUI_PATH=/path/to/imgui to cmake.

Most of the SDL code can be reused.

webspacecreations commented 3 years ago

I think this is a positive development! Keeping the GUI elements managed by SDL reduces the dependencies and, I suspect, will provide greater longevity.

XGS (https://github.com/jmthompson/xgs) also uses ImGui + OpenGL3 and this has been tested on a Pi. In private exchanges with the author, he had this to say about rendering: "The VideoCore stuff isn't supported on the Pi 4 or 400; there is now a fully open source OpenGL driver for the Pi 4/400 as well as the Pi 3. But, I had to custom compile SDL to enable it (the KMSDRM driver). If you don't do this, then on the Pi 4/400 Mesa (and by extension, SDL) will fall back to the llvmpipe software rendering pipeline..and that would certainly spike your CPU because it really will be copying textures around manually. Why Raspberry Pi OS still ships without KMSDRM enabled in SDL baffles me." A new commit should be coming out soon that includes ImGui 1.80 and some further speed optimizations.

Looking forward to trying out the ImGui version of AppleWin and will post back if I encounter any problems.

audetto commented 3 years ago

Good to hear. It would be good to understand why the good version is not shipped with the Pi. Has he tried to open an issue or write to the forums?

webspacecreations commented 3 years ago

Apologies, haven't had a chance to compile the latest code on my build machine over the last few weeks. Attempting the following: cmake -DIMGUI_PATH=../imgui ..

Returns CMake Warning: Manually-specified variables were not used by the project: IMGUI_PATH

I git cloned the imgui library into the imgui path within the AppleWin project directory. I assume if there were a path issue or something that cmake would complain a little more loudly. The only other error message cmake displays is "Bad LIBRETRO_COMMON_PATH=NONE, skipping libretro code." The inclusion of imgui sounds like a promising development, but I'm not sure how to test it.

audetto commented 3 years ago

Have you used the imgui branch? It is currently very much behind, but I have been busy integrating all the changes from AW that make x-platform a lot easier.

if the path is not found, you should get a warning https://github.com/audetto/AppleWin/blob/imgui/source/frontends/imgui/CMakeLists.txt#L6

webspacecreations commented 3 years ago

"have you used the imgui branch?"

Well, that explains it :-/ No problems with the imgui compile :-)

The imgui version runs very well on the Pi4 (very responsive). Not sure why, but the Qt build seems to run slow, even though the CPU core isn't hitting 100%. Scaling on imgui didn't seem to noticeably affect performance.

audetto commented 3 years ago

Try this

https://github.com/audetto/AppleWin/tree/imgui3

It uses SDL2+OpenGLES2 which should be a better option. The other branch is OpenGL2, which according to imgui is not a good choice.

You need libgles-dev.

webspacecreations commented 3 years ago

I tried the imgui3 branch, but don't see a performance difference (on Pi4).

I believe that the ImGui note on "OpenGL2 being a non-ideal choice" is related to a couple of factors: 1.) I think the OpenGL2 code example in ImGui uses older style initialization syntax whereas the OpenGLES code uses newer syntax 2.) OpenGL2 has technically been deprecated by OpenGL3

I don't think there's a specific reason to opt for OpenGLES 2 over OpenGL 2 in ImGui if you use the newer style syntax to initialize your OpenGL2 rendering engine. By comparison, OpenGL ES support was more recently added, so the examples provide "modern" syntax.

Hope this provides some clarity. It's how I understand the situation with ImGui.

audetto commented 3 years ago

Let's use a separate Issue for ImGui related opinions: https://github.com/audetto/AppleWin/issues/22

audetto commented 2 years ago

Ability to use OpenGL2 or OpenGLES has been added https://github.com/audetto/AppleWin/blob/9ec45b1dab7d8a298967b5a46f240bbcc2fefb0c/source/frontends/sdl/CMakeLists.txt#L16