dbalsom / martypc

An IBM PC/XT emulator written in Rust.

Timing accuracy measurement or benchmark #117

Closed · s0rent closed this issue 4 months ago

s0rent commented 5 months ago

If it does not already exist (I have not found anything), I think it would be useful to know the timing accuracy of the emulator, either as a measurement delivered in real time inside the emulator (a graph or similar) or as a benchmark that can be run on demand. As a side note, if someone has any measurements they made themselves, I would be interested in knowing the results.

Why: Cycle accuracy is awesome, but for tight assembly programming with IO (COM ports, audio), it is also relevant to know the delta time (in absolute terms) between emulated CPU cycles, how much jitter there is as a percentage, and how large the spikes are.

In any case, keep up the good work :)

dbalsom commented 5 months ago

To respond to your request I have to explain a little bit about how emulators are typically designed.

It is natural to consider at first some sort of emulator that runs in realtime - perhaps we calculate the time delta between cycles given our CPU's frequency, which at 4.77 MHz is ~210ns, so we could just execute a cycle, then sleep() for ~210ns, then do another cycle. The nice thing about that is that the emulator could process inputs and outputs in realtime, at the same rate as real hardware.
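In sketch form, that naive loop might look like the following (the `Cpu` type and its `tick()` method are stand-ins for illustration, not MartyPC's actual API):

```rust
use std::thread;
use std::time::Duration;

// Stand-in CPU type for illustration; not MartyPC's actual API.
struct Cpu;

impl Cpu {
    fn tick(&mut self) { /* execute one clock cycle */ }
}

fn run_realtime(cpu: &mut Cpu) {
    // One 4.77 MHz cycle lasts ~210ns (1 / 4,772,727 Hz ≈ 209.5ns).
    let cycle_time = Duration::from_nanos(210);
    loop {
        cpu.tick();
        thread::sleep(cycle_time); // in practice, far too coarse
    }
}
```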

The problem is that there is no truly portable way to access timing facilities that precise - and you would indeed have issues with jitter and delays. Functions that do some sort of sleep() are usually implemented by yielding the thread, and resuming the thread is not precise; it can be delayed for any number of reasons by the OS. We could spin-loop instead, but that means our emulator will occupy 100% of a core.

There are additional issues - if a particular cycle triggers some processing by an IO device, and that emulation takes longer than ~210ns, we now have a big spike in jitter. We'd have to somehow ensure that every operation in the emulator can complete comfortably in under ~210ns, and also shorten our sleep/spin-loop by the time already taken. There are lots of potential areas where an emulator may have to do a relatively big batch of work all at once - rendering a video device by scanline, fast-forwarding a scheduled timer, or committing a sector's worth of data to the disk image, etc.

The problem gets worse the faster the chip you are emulating, as your time slices get correspondingly smaller.

So most emulators do not work like that. Instead, a much larger time slice is calculated - often by, say, the time taken to draw an entire frame of graphics, or for the audio subsystem to produce N samples. MartyPC currently uses frame-based timeslices. With a 60Hz virtual video card, that means I execute 16.7ms worth of CPU cycles (79,545 cycles at 60Hz) in one go, as fast as possible. In realtime, the delta between successive CPU cycles ideally approaches 0.
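A minimal sketch of that frame-based loop, reusing the stand-in `Cpu` from the sketch above (the constants are illustrative, not MartyPC's actual configuration):

```rust
use std::time::{Duration, Instant};

const CPU_HZ: u64 = 4_772_700;                     // ~4.77 MHz 8088
const FRAME_RATE: u64 = 60;                        // 60Hz virtual video card
const CYCLES_PER_FRAME: u64 = CPU_HZ / FRAME_RATE; // 79,545 cycles

// Execute one frame's worth of cycles as fast as the host allows and
// report how long it took (the 'total frame time' discussed below).
fn run_frame(cpu: &mut Cpu) -> Duration {
    let start = Instant::now();
    for _ in 0..CYCLES_PER_FRAME {
        cpu.tick();
    }
    start.elapsed() // must stay under ~16.7ms to keep pace with real hardware
}
```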

[screenshot: the performance view window]

The performance view window can give you some of this information. The 'cycle target' is how many cycles per time-slice/frame we will emulate - 79,545 cycles for a 4.77 MHz CPU. The 'total frame time' is how much time we spent emulating the time slice. You could think of 'total frame time' : 16.7ms as our 'time compression' ratio.

The graph at the bottom shows how well your system is staying under the maximum time-slice budget. If we exceed 16.7ms then we start to emulate the system slower than the real hardware would run. You will occasionally see spikes in this graph, but that doesn't necessarily mean emulation is affected. As long as the spikes stay under 16.7ms when it is time to present the frame and play the frame's worth of audio, you will have the same experience at 16.6ms as you would at 0.1ms. Just don't go over, or the host audio device will want to read more samples than the emulator produced, and the sound will get crackly.

As far as IO latency goes, if you are talking to an internally emulated device in MartyPC, there is zero virtual latency - at least none that isn't intentional. IO devices are ticked in synchronization with the 14.318 MHz system clock, with the number of ticks derived from the number of CPU cycles executed times the CPU's clock divisor. Therefore your code running within the emulator will never experience any latency or jitter with internal devices. The hardware you interact with inside the emulator should, ideally, operate exactly as it would on a real machine from a timing perspective. This is not just speculation - I have been able to align logic analyzer dumps produced by the emulator with ones taken from real hardware, and they compare favorably.
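As a rough sketch of that synchronization, assuming a stand-in `IoDevice` trait rather than MartyPC's real one:

```rust
// Stand-in device trait for illustration; not MartyPC's actual API.
trait IoDevice {
    fn tick(&mut self, sys_ticks: u64);
}

// On the PC/XT, the CPU clock is the 14.318 MHz system clock divided by 3,
// so each CPU cycle corresponds to exactly 3 system clock ticks.
const CPU_CLOCK_DIVISOR: u64 = 3;

fn tick_devices(devices: &mut [Box<dyn IoDevice>], cpu_cycles: u64) {
    let sys_ticks = cpu_cycles * CPU_CLOCK_DIVISOR;
    for dev in devices.iter_mut() {
        // Devices advance in lockstep with the CPU, so code inside the
        // emulator never observes host-induced latency or jitter.
        dev.tick(sys_ticks);
    }
}
```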

Now, there is a certain amount of latency in receiving input and in presenting our virtual time-slice to the user. The audio samples played back to the user will necessarily start with some 16.7ms of latency, but the entire audio system has a larger buffer than that anyway, as did most of the actual audio devices of the day. Keyboard input events are queued and delivered to the emulator in per-frame batches with no particular timestamps (see the sketch below). We could perhaps improve that by converting the real input timestamp to a virtual one, but I haven't noticed all the keystrokes from a single frame arriving at once being enough of a problem - most keystrokes arriving that fast are due to rollover, not intentional timing, and it is difficult to mash a single key at 60Hz. Mouse input is handled in a similar way - an original serial Microsoft mouse has a 50Hz update rate, but I drive the mouse at 60Hz just so I have one mouse update per frame, which makes the bookkeeping easier.
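A rough sketch of that per-frame input batching, with stand-in types rather than MartyPC's real input plumbing:

```rust
use std::collections::VecDeque;

// Stand-in input event; real events carry more information.
struct KeyEvent {
    scancode: u8,
    pressed: bool,
}

#[derive(Default)]
struct InputQueue {
    pending: VecDeque<KeyEvent>,
}

impl InputQueue {
    // Host window events are pushed here as they arrive in real time...
    fn push(&mut self, ev: KeyEvent) {
        self.pending.push_back(ev);
    }

    // ...and drained once per emulated frame, so a frame's worth of
    // keystrokes reaches the virtual keyboard controller in one batch.
    fn drain_frame(&mut self) -> Vec<KeyEvent> {
        self.pending.drain(..).collect()
    }
}
```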

The only really tricky bit is interfacing with the serial passthrough, since we are converting from virtual time to realtime and there are precise data rates we have to honor. This is handled with queues and scheduling. The virtual serial port queues a frame's worth of output, which is played back to the real serial port at the configured baud rate; when we receive input on the real serial port, we queue it to the virtual serial port, which can only read it out at the virtual baud rate.
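Here is a rough sketch of the output side of that scheme, assuming 8N1 framing and stand-in types; the real implementation differs in its details:

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

struct SerialPassthrough {
    tx_queue: VecDeque<u8>, // bytes the virtual port has queued for output
    byte_time: Duration,    // real time per byte at the configured baud rate
    next_send: Instant,
}

impl SerialPassthrough {
    fn new(baud: u32) -> Self {
        // 8N1 framing: 10 bit times per byte (start + 8 data + stop).
        let byte_time = Duration::from_secs_f64(10.0 / baud as f64);
        SerialPassthrough {
            tx_queue: VecDeque::new(),
            byte_time,
            next_send: Instant::now(),
        }
    }

    // Polled in real time; releases at most one queued byte per byte_time,
    // so the real port never sees data faster than the configured baud rate.
    fn poll_tx(&mut self) -> Option<u8> {
        if Instant::now() >= self.next_send {
            if let Some(b) = self.tx_queue.pop_front() {
                self.next_send = Instant::now() + self.byte_time;
                return Some(b);
            }
        }
        None
    }
}
```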

If we had to communicate with something that needed a response back much faster than 16.7ms, this method wouldn't work, but we could always break our frame-based timeslice into smaller steps.

I hope this clears things up a bit - if I misunderstood your question and went on a huge rant, I apologize. If there's a specific performance scenario with some code you are concerned about, I'd be happy to try to specifically address it.

dbalsom commented 5 months ago

@s0rent Did that answer your question?