commanderx16 / x16-emulator

Emulator for the Commander X16 8-bit computer

Linux CPU 100% usage #51

Open mlongval opened 4 years ago

mlongval commented 4 years ago

Great work, gentlemen. I was testing the latest r30 build on Linux. Works great, but it grabs one CPU core and runs it at 100%, even when the emulator is idle. Otherwise lots of fun.

FYI: I'm running Ubuntu 19.04 on a ThinkPad T520 (quad-core i7) with 16GB RAM.

indigodarkwolf commented 4 years ago

Not seeing this on Windows with an i7-9700K @ 3.6GHz. The emulator seems to favor a single core, and I can eventually get it up to 100% usage if I enable a bunch of local bells and whistles and debugging display stuff that I modified into a local build, but if I just run the official r30 release I waffle between 50% and 60% of one core. I don't foresee much opportunity to parallelize anything, but maybe some profiling would reveal some inefficiencies to address.

ghost commented 4 years ago

I'm seeing high (but not quite 100%) usage of a single core on Windows as soon as I launch the application. i7-4710 @ 2.50 GHz. Release r30.

mobluse commented 4 years ago

I have the same problem in Raspbian Buster on a Raspberry Pi 4 B (4 cores) with 4 GB RAM (of which 256 MB is GPU memory). I see it in htop, but the core that is at 100% switches during the run: it can be 1, 2, 3, or 4. x16emu is just sitting idle. The C64 emulator x64 from VICE 3.3 runs at about 88% in total, but sometimes at a maximum of 92% on one core. The ZX Spectrum emulator fuse-sdl (installed via apt) runs at about 8% in total.

Maybe one could optimize screen updating so that only the changed part of the screen is updated, but maybe this is already done.

indigodarkwolf commented 4 years ago

@mobluse Maybe. That seems worth investigating.

I did take a brief crack at reducing the amount of math that was needlessly recomputed in get_pixel() (because honestly there's a fair bit) and got some wins, but invalidated rectangles would probably be a great macro-level win. The tricky part is knowing when it's safe to mark a given rectangle as "valid", since the emulator is, by design, allowing mid-frame, even mid-scanline, updates to Vera data. So any write to the Vera at any point potentially invalidates a portion of the screen until the renderer has had a chance to cycle around.

mist64 commented 4 years ago

First of all, there is no such thing as "idle" for the X16 emulator. Unless the screen is completely turned off, it always does the same amount of work.

And I agree with @indigodarkwolf: it's the VERA emulation that is expensive, and it's far from optimized. It's quite a straightforward implementation, and it does lots of redundant work.

It should work more like real hardware, in that it counts up addresses as opposed to calculating the current address based on X and Y.
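For illustration, a minimal sketch of the difference (all names here are invented, not the emulator's actual ones):

```c
#include <stdint.h>

/* Current style: recompute the VRAM address from X/Y for every pixel. */
static inline uint32_t addr_from_xy(uint32_t map_base, uint32_t x,
                                    uint32_t y, uint32_t stride) {
    return map_base + y * stride + x;   /* a multiply per pixel */
}

/* Hardware style: keep a running address and just increment it. */
typedef struct { uint32_t addr; } vera_counter;

static inline uint8_t next_byte(vera_counter *c, const uint8_t *vram) {
    return vram[c->addr++];             /* one add per pixel, no multiply */
}
```

The counter then only needs to be reloaded at line and frame boundaries, which is where the bookkeeping lives on real hardware.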

blackknight36 commented 4 years ago

I'm noticing about 20% CPU use on Windows as well.

[Task Manager screenshot: taskman]

indigodarkwolf commented 4 years ago

@mist64 Question for the sake of accurate emulation:

Suppose someone, whether by accident or while intentionally probing the limits of the Vera chip, tries to access the largest numerically possible address in VRAM. I believe this is 0x00020fe0 for sprites (set a base address of 0x00fff for a 64x64x8bpp sprite) and 0x0004fffc for layers (set a tile base of 0xffff, and subsequently write 0xff at whatever map address, for 16x16x8bpp tiles). In Vera 0.8 these addresses are unmapped, so no problem. But if the mappings were to change later and place something at those addresses, would it be possible that the Vera would mistakenly read from the data mapped there? Or would the Vera clamp the value? Or else, how would the Vera react?

Asking because the emulator calls "video_space_read(map_addr)" to find the graphics data in VRAM, when it would be faster to go directly to "video_ram[map_addr & (ADDR_VRAM_END-1)]", but I don't want to sacrifice accuracy in a quest for speed.
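For reference, the fast path would look something like this, using video_ram and ADDR_VRAM_END as named above; whether the wrap-around masking matches real hardware is exactly the open question:

```c
#include <stdint.h>

extern uint8_t video_ram[];   /* the emulator's VRAM array */

/* Assumes ADDR_VRAM_END is a power of two; the mask silently wraps
   out-of-range addresses back into VRAM. */
static inline uint8_t vram_read_fast(uint32_t map_addr) {
    return video_ram[map_addr & (ADDR_VRAM_END - 1)];
}
```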

tomxp411 commented 4 years ago

I'm going to suggest an optimization for this.

Either use an opcode that is known to have no effect on the 65C02 (such as $42, aka WDM on the 65816), or treat a write to ROM as an emulator command. For example, if the CPU writes to $FF00, which is ROM on the hardware, this could tell the emulator to release the thread until the next timer cycle (which generates an IRQ on the virtual CPU and runs the keyboard, joystick, and I/O cycle).
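A rough sketch of what such a trap could look like in the memory-write handler; write6502 is the conventional fake6502-style entry point, and the sleep helper is hypothetical:

```c
#include <stdint.h>

extern uint8_t RAM[0x10000];
extern void emulator_sleep_until_tick(void);   /* hypothetical helper */

void write6502(uint16_t address, uint8_t value) {
    if (address == 0xFF00) {
        /* ROM on the real machine, so the write is harmless there;
           the emulator repurposes it to release the host thread. */
        emulator_sleep_until_tick();
        return;
    }
    if (address >= 0xC000)
        return;                /* other writes into ROM: ignored */
    RAM[address] = value;      /* (I/O ranges omitted for brevity) */
}
```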

I don't see the need for many commands like this, but there are a few that would be useful, such as the OP's request (which I've nicknamed "Slow down the CPU fan so it stops blowing like a jet engine.") and maybe some debugger commands.

indigodarkwolf commented 4 years ago

@tomxp411 I can say very conclusively that the major pain point in the emulator's runtime is its drawing code, far and away, accounting for some 97.5% of its runtime. (Or at least, it was before my first pass on optimization; it's less painful now, but still an awfully sore hotspot.)

The reason for this is very simple: It is calculating 307,200 individual color values for the 640x480 display. To calculate these values, it checks 2 layers and 16 sprites, so some 5,529,600 color calculations. Every frame, trying to maintain 60fps. That's pretty rough.

My first pass on optimizing this was to reduce a bunch of math that was being repeated for every single color calculation. That appears to have been a big win. Next, I'm mulling over ideas to reduce the number of color calculations we need, or at least to spend them more intelligently: it makes no sense to check 16 sprites on a row where no sprites exist, and ditto for columns.
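To illustrate the kind of hoisting meant here, a sketch with invented names (layer_t, fetch_pixel): everything that depends only on the line is computed once, so the per-pixel loop only does work that depends on x.

```c
#include <stdint.h>

typedef struct {
    uint32_t map_base;
    int vscroll, tile_h, stride;
} layer_t;

extern uint8_t fetch_pixel(uint32_t row_base, int fine_y, int x);  /* hypothetical */

void render_layer_line_sketch(const layer_t *l, int screen_y,
                              uint8_t *out, int width) {
    /* Per-line invariants, computed once instead of once per pixel: */
    int y             = screen_y + l->vscroll;
    int tile_row      = y / l->tile_h;
    int fine_y        = y % l->tile_h;
    uint32_t row_base = l->map_base + (uint32_t)tile_row * (uint32_t)l->stride;

    for (int x = 0; x < width; x++)
        out[x] = fetch_pixel(row_base, fine_y, x);
}
```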

I actually had something that looked promising, using arrays of bitmasks to quickly pare down the list of candidate sprites and avoid any potentially large loops, though it had degenerate cases when a large number of sprites overlapped or were otherwise co-located on top of each other.
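A minimal sketch of that bitmask idea, assuming VERA's 128 sprites (the helper names are invented, and clearing a moved sprite's old lines is omitted for brevity):

```c
#include <stdint.h>

#define NUM_SPRITES 128
#define SCREEN_H    480

extern void draw_sprite_on_line(int sprite, int line);   /* hypothetical */

static uint64_t line_mask[SCREEN_H][NUM_SPRITES / 64];

/* Set the bit for every line a sprite covers; call when a sprite moves. */
void mark_sprite(int idx, int y, int h) {
    for (int line = (y < 0 ? 0 : y); line < y + h && line < SCREEN_H; line++)
        line_mask[line][idx >> 6] |= 1ULL << (idx & 63);
}

/* While rendering a line, visit only sprites whose bit is set. */
void render_sprites_on_line(int line) {
    for (int w = 0; w < NUM_SPRITES / 64; w++) {
        uint64_t bits = line_mask[line][w];
        while (bits) {
            int idx = w * 64 + __builtin_ctzll(bits);  /* lowest set bit */
            bits &= bits - 1;                          /* clear it */
            draw_sprite_on_line(idx, line);
        }
    }
}
```

The degenerate case mentioned above shows up when many bits are set on the same line; the loop then devolves into visiting nearly every sprite anyway.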

I'm still thinking about other options, but would be more than happy if other folks wanted to dive into the drawing code and optimize.

tomxp411 commented 4 years ago

Good point. I was thinking of the situations where the screen is not being updated. If all that is happening is the cursor blinking, there’s no need to draw the screen at all, for up to 1/3 of a second.

Is there currently a “dirty” flag in video memory on the emulator? That’s another way to bypass redrawing redundant frames.

indigodarkwolf commented 4 years ago

There is no "dirty" flag right now. Feel free to take a stab at one. <3 I haven't looked at validating/invalidating screen regions or pixels yet, because the way the emulator works is to literally calculate "There have been X clock cycles since the last time I ran, therefore I need to draw VGA steps Y0 through Y1." So the first thing is does, then, is "for each VGA step to draw, start by calculating its effective X and Y coordinate" (because X and Y could be scaled by the composer).

So I could see one managing a "dirty" flag with a uint32_t saying "this is how many VGA steps are dirty", which gets reset to an appropriate value and is then decremented with each VGA step until it hits 0, after which we can skip whole frames as long as nobody touches the Vera.
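In sketch form (VGA_STEPS_PER_FRAME and the hook names are stand-ins, not the emulator's real symbols):

```c
#include <stdbool.h>
#include <stdint.h>

#define VGA_STEPS_PER_FRAME (800 * 525)   /* assumption: one step per VGA pixel clock */

static uint32_t dirty_steps = 0;

/* Call from every VERA register/VRAM write: anything on screen may be
   stale until the renderer has swept past it once more. */
void vera_mark_dirty(void) {
    dirty_steps = VGA_STEPS_PER_FRAME;
}

/* Returns false once a full clean sweep has happened, at which point
   whole frames can be skipped until the next VERA write. */
bool video_step_needed(void) {
    if (dirty_steps == 0)
        return false;
    dirty_steps--;
    return true;
}
```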

indigodarkwolf commented 4 years ago

Hrm... in fact, it may be possible to extend that idea into a binary tree, or something, of "this step begins a dirty section, plus this many steps", and start each video update by searching the tree for the next dirty section that'll be hit. The hard part (?) will be merging tree sections when something like multiple sprite movements invalidate overlapping portions of the screen.

tomxp411 commented 4 years ago

Probably simpler just to do something like “has there been any write in the last cursor cycle” and just switch to low FPS mode.

itoshkov commented 4 years ago

@tomxp411 I can say very conclusively that the major pain point in the emulator's runtime is its drawing code, far and away, accounting for some 97.5% of its runtime. (Or at least, it was before my first pass on optimization; it's less painful now, but still an awfully sore hotspot.)

Would offloading this to the GPU be an option? Vulkan is pretty well supported for Intel, NVidia and AMD chips on Windows and Linux, and is also available on macOS through MoltenVK.

vk2gpu commented 4 years ago

Been having a bit of a mess about; managed to get it down from around 800ms for 60 frames to 600ms for 60 frames.

The changes can be found here; I'm not putting a pull request up yet as this may not be the best/preferred approach (dirty lines/frame skipping may be more practical and simpler to maintain), but here it is if anyone is interested: https://github.com/neilogd/x16-emulator/commit/495ada26373ae88ba56e948d1ecbb52a857758de

vk2gpu commented 4 years ago

@itoshkov Vulkan may add a bit much complexity. I reckon we can get the performance improved significantly without resorting to the GPU just yet. Maybe the color lookup and NTSC overscan bit could move there after building up a full frame; then it can at least run on the GPU whilst the next frame is being set up.

vk2gpu commented 4 years ago

Done a little more work building on those previous changes, improving the performance of render_layer_line by implementing a video_space_read_range to skip checks when the whole read is within VRAM. I'll need to check what the maximum amount that can be read in would be, and it could likely be made better by returning a pointer rather than copying memory, but I seem to be getting some small wins: https://github.com/neilogd/x16-emulator/blob/video_optimization/video.c
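The pointer-returning variant could look roughly like this; the function name is invented, and 0x20000 (128KB of VRAM) is an assumption:

```c
#include <stddef.h>
#include <stdint.h>

#ifndef ADDR_VRAM_END
#define ADDR_VRAM_END 0x20000   /* assumption: 128KB of VRAM */
#endif

extern uint8_t video_ram[];

/* Fast path: if [start, start+len) lies entirely inside VRAM, hand back
   a pointer with no copy and no per-byte checks. Returning NULL lets the
   caller fall back to the checked video_space_read() path. */
const uint8_t *video_space_read_range_ptr(uint32_t start, uint32_t len) {
    if (start < ADDR_VRAM_END && len <= ADDR_VRAM_END - start)
        return &video_ram[start];
    return NULL;
}
```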

Just timing video_step over 300 frames (158025 calls):

Baseline:

Plus optimized render_line:

Plus optimized render_layer_line:

It could be worth putting some basic timing in (I've got some set up locally, but it isn't particularly great) to more easily track this kind of stuff and make it easier for others to run these tests (and verify things aren't broken), especially since different programs may behave slightly differently. For example, in bitmap mode the render_layer_line changes are unlikely to have much of an impact.

EDIT: the render_layer_line optimization appears to be broken. The calculation of the base address to read from was incorrect, so when scrolling/wrapping it will read in the wrong data.

Follow-up edit: pushed some fixes for that last problem. Still needs some more testing, but I may look at throwing something together to help with testing this kind of stuff.

indigodarkwolf commented 4 years ago

In the sprite budget calculations in render_sprite_line(), is that specific to the emulator, or does that approximate real hardware? I notice that my parallax demo hits the sprite budget limit with the way it uses 11 64x64 sprites to fake having a "3rd layer" for a parallax effect.

mist64 commented 4 years ago

In the sprite budget calculations in render_sprite_line(), is that specific to the emulator, or does that approximate real hardware?

It approximates real hardware.

vk2gpu commented 4 years ago

Cleaned up my optimization branch, pull request is here: https://github.com/commanderx16/x16-emulator/pull/185

I gave it a test with @indigodarkwolf's x16-racer as well since that utilizes the VERA a bit more.

klsgrtx commented 4 years ago

As of R33, I still see high CPU utilization on a linux box with the emulator doing nothing.

Here are my system specs:

System: Kernel: 4.15.0-65-generic x86_64 bits: 64 compiler: gcc v: 7.4.0 Desktop: Cinnamon 4.2.4 wm: muffin dm: LightDM Distro: Linux Mint 19.2 Tina base: Ubuntu 18.04 bionic
Machine: Type: Desktop System: Dell product: Precision WorkStation T3500 v: N/A Chassis: type: 7 Mobo: Dell model: 09KPNV v: A00 serial: BIOS: Dell v: A17 date: 05/28/2013
CPU: Topology: Quad Core model: Intel Xeon W3530 bits: 64 type: MCP arch: Nehalem rev: 5 L2 cache: 8192 KiB flags: lm nx pae sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx bogomips: 22402
Speed: 1612 MHz min/max: N/A Core speeds (MHz): 1: 1612 2: 2857 3: 2062 4: 2855

Here are some screenshots I've taken showing before, during and after runs of the emulator:

[Screenshots: CPU_Spike_LinuxMint19_minimized, CPU_Spike_LinuxMint19, CPU_Spike_LinuxMint19_stopping]

mist64 commented 4 years ago

@klsgrtx: Some clarification: there is no such thing as the emulator "doing nothing". It always does exactly the same work: emulate instructions and draw pixels. Computers from the 80s did not have the concept of CPU "load" or a CPU being "idle". The machine always did the same thing, and if it did "nothing", it was just sitting in a tight loop.

klsgrtx commented 4 years ago

@mist64 Understood. Some of the other posters on this thread, however, were referring to applications running and rendering being done. I just wanted to clarify that my emulator was in one of two states: active with no programs running, or minimized to the taskbar with no programs running.

mobluse commented 4 years ago

Maybe VERA could run on its own core.

vk2gpu commented 4 years ago

It could, and I think that is probably a nice idea after some of the low-hanging fruit for single-threaded optimization is done; that way the Web version gains benefits too. I've had a few thoughts on how to do it well while putting my optimizations together, even moving entirely to the GPU as someone else has suggested, but that would impact maintainability (maybe a later thing, though).

oziphantom commented 4 years ago

The emulator needs to look up colours and sort and arrange sprites exactly how VERA does it. The issue with putting VERA on another thread is that you lose sync; they need to remain lockstepped. You win some by having it on another thread, but the amount you will lose by syncing every 65C02 cycle is probably going to be a net loss. Then when you have to sync the audio cores it will get worse. Since there is no idle time, does the exe never yield, or does it work out a whole frame of the simulation, sleep until the next 60th of a second, flip, then go again?

tomxp411 commented 4 years ago

The issue with putting VERA on another thread is that you lose sync; they need to remain lockstepped. You win some by having it on another thread, but the amount you will lose by syncing every 65C02 cycle is probably going to be a net loss.

I'm just thinking out loud here, so bear with me....

On the real system, VERA and the CPU run on their own memory; VERA drives the system clock and shares 8 bytes of I/O space and (I think) an interrupt. The CPU runs at 8MHz, and the system clock is (I believe) generated by VERA now.

However, I would probably not do that in the emulator. Instead, I would tick a clock every 125ns and use that to drive the CPU thread. The virtual VERA only needs to update 4 I/O bytes per clock tick and generate a 60Hz interrupt. Otherwise, it can free-run and sleep during the refresh interval.

The only thing I can compare this to right now is VICE, which will peg one core at 11 MHz when locked to a single core on my little NUC (3.5GHz i7). X16 emu will use about 80% of one core on this same computer.

I think parallelization will help, but mostly in terms of CPU temps and allowing the system to sleep. I'd also suggest adding a low power mode that sleeps the process in the keyboard loop. That helped CPU utilization a ton when I was writing my 65816 emulator.

indigodarkwolf commented 4 years ago

The emulator does idle, if it finishes drawing a frame in less than 1/60th of a second: it idles for the remainder of that 60th of a second. If the frame takes longer than 1/60th of a second, there's no idling, and the emulator starts showing a % in the window's title bar indicating how slow it's running relative to 60fps.

My understanding is that the Linux kernel timer resolution is 1ms, iOS is 17ms (using CADisplayLink), and Windows is 15ms. To have a thread idle at better resolution than that, the emulator would need to execute a tight loop, which means the CPU will be pegged on that thread.
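One common workaround (a sketch, not necessarily what the emulator should do) is to sleep in coarse OS ticks and only spin for the last stretch; SDL_Delay and SDL_GetPerformanceCounter are real SDL2 calls:

```c
#include <SDL.h>

/* Block until the high-resolution counter reaches `deadline`. */
void wait_until(Uint64 deadline) {
    const Uint64 freq = SDL_GetPerformanceFrequency();
    for (;;) {
        Uint64 now = SDL_GetPerformanceCounter();
        if (now >= deadline)
            break;
        double remaining_ms = (double)(deadline - now) * 1000.0 / (double)freq;
        if (remaining_ms > 2.0)
            SDL_Delay((Uint32)(remaining_ms - 2.0));  /* coarse sleep */
        /* else: busy-wait the last ~2ms, trading CPU for timing accuracy */
    }
}
```

The busy-wait tail is the part that shows up as CPU load even when there is little real work to do.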

The platform seems difficult to parallelize without sacrificing accuracy, but I think everyone's open to suggestions. We also need to make sure that we don't touch any SDL2 draw surfaces from other threads while the main thread is in SDL code.

As for the low-power mode, what about demos that animate without needing keyboard input?

tomxp411 commented 4 years ago

Okay, I get the thread synchronization problem. It's not enough to sync up every video frame; you need to sync on every scan line. The CPU runs at 8MHz, or roughly 1.6M instructions per second. The horizontal scan rate of VERA is about 30KHz, so we have room for roughly 53 instructions per scan line.

Some people are already writing sprite multiplexers and interrupt driven routines. So the scan line position registers in VERA need to update over the evolution of a frame. I guess the question is... can you synchronize a thread 30,000 times a second without adding more overhead than you save? I would expect we'd have to try it to see what happens.

As for the low-power mode, what about demos that animate without needing keyboard input?

It would only affect the "wait for a key and blink the cursor" routine. This could be handled by implementing the WAI instruction in the emulator and then adding a WAI to the BASIC input editor. The CPU basically just waits until the next interrupt.

Since you would only add that to the keyboard wait loop in the full screen editor, it would not affect demos or games that are actively processing data and not simply sitting, waiting for keyboard input.

For those programs that DO sit and wait for a keypress (ie: a text editor), this could be added to a "blink the cursor and wait for a key" KERNAL call. That would allow text editors and the like to take advantage of this automatically.

For those programs that do not want the WAI to be executed, I can think of a few ways to handle that. The simple way is to set a flag in low memory and skip the WAI if the flag is set (or clear).
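For what it's worth, a hedged sketch of WAI support in an emulated 65C02 core (field and helper names are invented; $CB is the real 65C02 WAI opcode):

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    bool waiting;        /* set by WAI, cleared when an interrupt arrives */
    bool irq_pending;
    /* ...registers, flags... */
} cpu65c02;

extern uint8_t fetch8(cpu65c02 *cpu);               /* hypothetical */
extern int execute(cpu65c02 *cpu, uint8_t opcode);  /* hypothetical */

int cpu_step(cpu65c02 *cpu) {
    if (cpu->waiting) {
        if (!cpu->irq_pending)
            return 1;           /* the host can notice this and sleep */
        cpu->waiting = false;   /* interrupt ends the wait */
    }
    uint8_t opcode = fetch8(cpu);
    if (opcode == 0xCB) {       /* WAI */
        cpu->waiting = true;
        return 3;
    }
    return execute(cpu, opcode);
}
```

The host still has to keep the VIA timers, audio, and video running while the CPU waits, as discussed below.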

oziphantom commented 4 years ago

Your maths is a little off. Given the C64 has 65 clocks per line, and 1MHz = 1,000,000 clocks per second, we have about 8,000,000 clocks per second, or about 8x the C64; VGA is double the scan frequency, so 65*8/2 = 260 clocks per VGA line. VGA has 800 pixels of timing per line: 640 visible, then the HBlank sweep + sync porches etc., which means for every CPU clock the VERA needs to draw 3 pixels. It's not sync per line; it needs to be sync per 3 pixels. If I change a sprite X in the middle of it drawing, then I expect the sprite to shear; if I modify a palette entry on line 140, cycle 37, then I would expect any tiles after that entry to show the updated palette entry. So each part of the system needs to walk hand in hand.

Waiting for a key press does not do nothing: it will still update internal timers. The VIA timers are still running and they count clocks; they must be accurate. The jiffy timer still needs to update, the timer counting down to the cursor flash needs to update, the emulated 1541 drive's CPU and VIAs need to update their code and timers and move the head position if need be, and the audio chip's LFSR for white noise needs to update so the sound is correct.

There is a KERNAL call to wait for key input, but it doesn't run on the IRQ, so you need to wait for the IRQ to happen; there is nothing to stop you from playing music, running an NMI, or loading in some data off the drive while it happens.

A Raspberry Pi or other Eee PC-class or thermally constrained PC can't run VICE x64sc at full spec (there is a very custom, scaled-back bare-metal VICE that does run on a Pi, but it's not the proper version), and this machine is 8x faster with 16x the sprites to sort, 2 planes to draw, and a lot more audio channels and more complex audio systems.

indigodarkwolf commented 4 years ago

The VERA calculates an entire scanline and then sequences it out to the display, so the VERA code technically only needs to update every scanline, and this is more-or-less what the emulator does (it does a trivial bit of math to determine if we've reached the next scanline, and if so it draws the line). This was an enormous performance win, too, because it meant only having to process 128 sprites on a per-line basis, not a per-pixel one, as it had previously been calculating new pixels between every single 6502 instruction.
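In sketch form, with invented names and using oziphantom's ~260 clocks-per-line figure from below:

```c
#include <stdint.h>

#define CPU_CLOCKS_PER_LINE 260     /* 65 * 8 / 2, per the thread */
#define VGA_TOTAL_LINES     525     /* visible lines plus blanking */

extern void render_line(int line);  /* draws one whole scanline */

static uint32_t clock_accum  = 0;
static int      current_line = 0;

/* Called after each 6502 instruction with the cycles it consumed. */
void video_step_sketch(uint32_t cpu_clocks) {
    clock_accum += cpu_clocks;
    while (clock_accum >= CPU_CLOCKS_PER_LINE) {
        clock_accum -= CPU_CLOCKS_PER_LINE;
        render_line(current_line);
        current_line = (current_line + 1) % VGA_TOTAL_LINES;
    }
}
```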

oziphantom commented 4 years ago

So it pulls all data into a line buffer during the HBlank, doing all the tile lookups, palette lookups, etc. in the HBlank? And the update from the port is disabled during this period?

indigodarkwolf commented 4 years ago

I don't know all the specifics, only that when the current implementation went in, which draws whole lines at a time, I asked whether this was accurate to the VERA hardware's behavior and was told "yes".

Same for how the emulator limits sprite draw costs.

tomxp411 commented 4 years ago

Your maths is a little off. Given the C64 has 65 clocks per line, and 1MHz = 1,000,000 clocks per second, we have about 8,000,000 clocks per second, or about 8x the C64; VGA is double the scan frequency, so 65*8/2 = 260 clocks per VGA line. VGA has 800 pixels of timing per line: 640 visible, then the HBlank sweep + sync porches etc., which means for every CPU clock the VERA needs to draw 3 pixels. It's not sync per line; it needs to be sync per 3 pixels. If I change a sprite X in the middle of it drawing, then I expect the sprite to shear; if I modify a palette entry on line 140, cycle 37, then I would expect any tiles after that entry to show the updated palette entry. So each part of the system needs to walk hand in hand.

My math is fine. 8MHz with ~5 cycles per operation (on average) is 1.6MIPS. 1.6MIPS / 30,000 scan lines per second is 53 instructions per line.

Waiting for a key press does not do nothing.. it will still update internal timers, the VIA timers are still running and they count clocks, they must be accurate. The Jiffy timer still needs to update, the timer to count down to the flash needs to update, the emulated 1541 drives CPU And VIAs needs to update their code, their timer, move the head position if need be. the Audio chips LSFR for white noise needs to update so the sound is correct.

WAI just stops the CPU. It doesn't stop VIA timers, interrupts, or the actual system clock. Nor does it need to.

There is a Kernal call for wait for key input, however it doesn't run on the IRQ, so you need to wait for the IRQ to happen,

That doesn't actually matter. The actual keyboard read routine should be handled by an interrupt. If it's not... that's going to cause some real problems. There's a reason the AT and PS/2 keyboard has its own interrupt on a PC.

there is nothing to stop you from playing music, running an NMI or loading in some data of the drive while it happens.

WAI waits for an interrupt. All of those activities are interrupt driven, at least if they are happening at the BASIC editor screen or while the system otherwise expects keyboard input.

And the update from the port is disabled during this period?

Why would you think that? VERA uses dual-port RAM. You can write to it from one port while reading from the other port.

oziphantom commented 4 years ago

Well, OK: if by instructions you mean running full opcodes, then on average you get 53 instructions per line, where the full range is 33~130 instructions per line. It is an odd way to talk about it, so I wrongly assumed you were talking about clocks, but with your correction and the addition of 'on average' I agree.

WAI does just halt the CPU, but making it sleep the process would not work, as the process (where "the process" is the emulator) has other things it needs to update while the CPU halts; that is my point.

WAI does wait for an interrupt, after which it carries on execution of the code from after the WAI instruction and does not fire the normal interrupt handlers. So if the NMI fires, it will carry on, not jump to and handle the NMI interrupt routine. So putting in a WAI could potentially cause the system to fail, as it won't be doing what it was told to (i.e. run the interrupt handler), and it will then go into the keyboard scanning code when it actually hasn't got an interrupt from the keyboard system. If it is an NMI and the standard routine doesn't ack said NMI, it will form an infinite loop.

The standard IRQ does call 'key', which polls the keyboard for new events and updates the keyboard buffers. However, the KERNAL call one makes, CHRIN (which halts and waits for a key press), should not be called from an interrupt or you will deadlock: it will spin waiting for new data to be entered into the keyboard buffer, and since you are in an interrupt, the standard IRQ will not be able to respond and hence it never gets new data. Adding a WAI to this routine would also be bad, as when the interrupt does fire, it will then look for a new byte in the buffer, not find it (the IRQ handler has not been called), and you will remain in deadlock. Making it call the IRQ handler manually would probably work in most cases, but it would fail if the person had an NMI set up, as it would stop the NMI from happening.

For the purposes of saving CPU time in the emulator this is academic anyway, as you still need to make sure everything else happens; you just don't run the CPU, and that is not going to save you much in terms of host cost, since during the WAI it will exit very quickly from its update loop.

I don't think the VERA would be blocking internal updates during that phase, but I was checking to be sure. So the idea that we can run the CPU free and draw the line in one go to help speed up the emulator is somewhat moot: while the VERA is doing its "build the line" step, you have to keep the VERA and CPU lock-stepped. Having the code run lock-stepped and then switch to just running the CPU is probably more effort than it is worth, as the VERA routine would just return basically immediately if it has nothing to do. We don't get away from the need for lock-stepping, so it's not a big saving or a safe way to parallelize the CPU and VERA; an early return in the VERA code vs. an if to detect whether we need to run it would probably cost the same in terms of CPU performance and greatly simplify the code.