PCSX2 / pcsx2

PCSX2 - The Playstation 2 Emulator
https://pcsx2.net
GNU General Public License v3.0

A 64-bit version of dynarec: what are the advantages? #2055

Closed hlide closed 3 years ago

hlide commented 7 years ago

Hey,

I'm Hlide and I'm fond of console emulation, mostly for the technical side of it.

I contributed several parts (the Allegrex core w/ VFPU and HLE modules) to JPCSP and worked on several versions of PSP emulators (the first being PCSP and the last being PSPE4ALL in PSCE4ALL) until PPSSPP came along.

The last version was written from scratch; I mostly worked on the dynarec and made some discoveries.

Here is a link to the PSP CompilerPerf benchmarks I ran on the same machine. It shows my interpreter-like dynarec (each guest instruction is translated into a host basic block) outperforming the other dynarecs, which normally translate a guest basic block into a host basic block. Weird? Yes, that was my first feeling too.

Here is another link to the details of how the dynarec works, but be aware it is a little old and there have probably been some changes since then. And a short image here: bne

My first dynarec was able to emit both 32-bit and 64-bit instructions depending on the binary you wanted. But I quickly decided to drop the 32-bit version because it isn't worth it - the 64-bit version offers some goodies that I cannot have with a 32-bit version.

The key win for a 64-bit dynarec is not register allocation (very marginal gain), not even reducing memory accesses (accesses to guest register slots), but the block chaining you can do and being free of the ABI constraints. The more time your CPU core spends inside your generated basic blocks, the better. The second link shows how I link two basic blocks through a huge ICACHE, so one basic block just (indirectly) jumps to another basic block without spending cycles in C/C++ code (with its constraining ABI). I also added generation of specific fast wrapper basic blocks to call C/C++ functions when needed.

With the interpreter-like mode, I also added some optional features:

When launching the emulator with debugging enabled, another process is launched with a Qt-based GUI which acts as a Windows debugger, using the Windows User Mode Debugging API to attach to the emulator process so I can debug a PSP program (step in/out/over, etc.).

Here are some images of the debugger (i392, i393):

I can display the generated x86 basic block associated with the guest instruction highlighted by the mouse.

I also added some global instrumentation:

Now, I was wondering if some adapted version of the dynarec I used for PSP emulation could be applied to the two MIPS cores (R3K and R5K), and maybe to the coprocessors as well.

What do you think?

ghost commented 7 years ago

It can be a disadvantage (it uses more RAM) but offers a small perf increase. For the x86 architecture, the improvement is less impressive than on ARM. But in my opinion, it is more important to fix compatibility issues and remove per-game workarounds, as well as improve the accuracy of GSdx. But let a dev member explain more :).

hlide commented 7 years ago

Hum, well, my title is not that good. I was talking about a dynarec compiled as 64-bit code. I may also post another topic about the graphics part, as I have some ideas (I also worked on software and hardware renderers for the PSP GE), but I prefer to wait until I have a good knowledge of the GS.

gregory38 commented 7 years ago

It isn't very clear to me. What did you do? So you implement each opcode as a host block, is that correct? So one host block == 1 emulated instruction + a jump.

The key win for a 64-bit dynarec is not register allocation (very marginal gain), not even reducing memory accesses (accesses to guest register slots), but the block chaining you can do and being free of the ABI constraints. The more time your CPU core spends inside your generated basic blocks, the better. The second link shows how I link two basic blocks through a huge ICACHE, so one basic block just (indirectly) jumps to another basic block without spending cycles in C/C++ code (with its constraining ABI). I also added generation of specific fast wrapper basic blocks to call C/C++ functions when needed.

Why is chaining limited to 64 bits? And how do you handle events/interrupts of the CPU? You still need to leave the dynarec.

hlide commented 7 years ago

My dynarec has several modes:

1) interpreter-like mode: each MIPS instruction (+ its delay slot instruction, if any) is translated into one x86-64 basic block (not a function!) with a dynamic indirect native jump in the epilog through the huge icache table (which can be up to a 4GB virtual memory segment). So you are -almost- correct, except when a delay slot instruction must also be translated into the same basic block as the branching instruction. This mode acts as a very fast interpreter - hence the name - and it allows the tricks I enumerated in my first post. That's the one I used in the benchmark comparison link showing it outperforms most "standard" dynarecs (Java and JS use their JIT engines, and PPSSPP has its dynarec done the usual way). Again, the result only concerns the x86 architecture; I'm pretty sure I would get far less decent results on another architecture. EDIT: the PPSSPP benchmark is not comparable, so it cannot be taken into account here. I knew there was something strange about it.

2) full-dynarec mode: each basic block of a MIPS program is translated into one x86-64 basic block (still not a function!) with direct native jumps between host basic blocks, or a dynamic indirect jump in the same way as the other mode if jumping through a MIPS register. In this case, the block chaining is even more direct - it jumps straight to the native address of the next x86 basic block, with several MIPS instructions inside a basic block instead of only one.

In both cases, there are optional optimizations regarding register usage: reducing memory accesses by rewriting some instructions, and reordering registers to emit fewer x86-64 instructions.

And how do you handle event/interrupt of the CPU, you still need to leave the dynarec

I agree with you that this emulator is of the low-level emulation type (vs. the high-level emulation type found in PSP emulation), so you need to address external events. But external events are exceptional, so they shouldn't need to be handled per instruction; for the PSP I usually test for them when reaching a syscall, because PSP threads tend to sit in syscalls very often. In the case of the PS2, I think you can easily check for a pending external event in the epilog of a host basic block - usually a simple read of a word (something I already did in assembly code with my previous PSP emulator before adding a dynarec, and which worked pretty well and very fast). If the flag says new events have occurred, we can jump to the event handling block, which calls a C++ function to examine the events to handle.

In my case, I have a "main" function which grows in size with the dynamic creation of x86 basic blocks - some of them will call a C++ function (syscall). The only time I leave it is when I leave the emulator. Because I'm executing a chain of blocks through jumps, I don't need the stack and don't need to save/restore local registers. To handle events, I just need to call a C++ function (it happens at the end of a basic block, where everything is clean with respect to ABI constraints) if an event occurs; it returns the next address to jump to (EPC), and the code then jumps there.

I usually handle events through a FIFO queue with callbacks, in two steps: check a global event flag to determine if something exceptional must be done; if so, call a function which drains that FIFO queue, invoking the callbacks for the new events.

Ok, there are the special cases of exceptions like bad memory accesses to handle. I agree there could be more brainstorming to do here, since it looks like the PS2 may have more complex memory handling (virtual memory through a TLB?).

Why chaining is limited to 64 bits ?

You can do it in 32-bit or not, depending upon how the guest memory mapping works. I was able to with PSP emulation because there is no need to handle kernel access (memory access is limited to the lower part), the PSP has a limited RAM range, and there is no direct access to the I/O register mapping (everything is done through syscalls). But there are severe constraints: 32-bit address space is very limited, and you may need several full 4GB memory mappings. In my case I have a huge DCACHE (guest data memory) and ICACHE (guest instruction memory, which is an array of x86-64 basic block addresses to jump to - there are special basic blocks, like one which calls a recompiler function to create the basic block for a guest instruction block, and some others for special usage).

In the case of the PS2, I can simply reserve a 4GB virtual DCACHE and a 4GB virtual ICACHE, making them non-overlapping if there is a need to emulate ALL of the PS2's 32-bit address space.

Moreover, the R5900 is a 64-bit CPU (well, I'm not totally sure, as most docs I read are not clear about it), so issuing x86-32 instructions becomes tricky due to a very limited register set, whereas x86-64 is totally compatible (natural zero-extending of 32-bit operations, the same way MIPS64 does it). With my PSP emulator, I had both 32-bit and 64-bit dynarecs, but I decided to stop the 32-bit one as it was severely restrictive, and maintaining both was a pain because they may diverge a lot strategically when you want the best emulation.

hlide commented 7 years ago

So you implement each opcode into a host block, is it correct ? So one host block == 1 emulated instruction + a jump.

For instructions with no delay slot, you're correct. When a delay slot is involved, its instruction is also translated into the same basic block (so two instructions + a jump).

The first figure shows an old version in full-dynarec mode (not in interpreter-like mode), but you can see what happens when there is a branching instruction:

In interpreter-like mode, the basic block would only contain the yellow part and the blue part, with a dynamic indirect jump as an epilog - which you cannot see in this figure, since it uses a direct conditional jump instead (full-dynarec mode).

hlide commented 7 years ago

Regarding memory access, I use the virtual memory mapping of Windows and also use memory mirroring (several address ranges mapping the same content, as is common with MIPS). I guess the PS2 can execute kernel code and so access I/O address ranges, unlike the PSP, where a game developer cannot run a program in kernel mode and cannot directly access I/O registers.

The first thing that comes to mind: use page exceptions. Hum, well, that has a disadvantage: it can be very slow, because Windows exceptions handled in user land are not known to be fast, especially when reading/writing I/O registers in a loop.

Supposing you track constant addresses (I don't track constant propagation, as it is not a big deal for PSP emulation), there are ways to issue instructions to handle that specific hardware. Well, it's probably simpler - and enough - to handle it through a generic I/O register read/write function.

The same effect can also be achieved without tracking constant addresses: use a memory exception to patch the offending instruction with a direct jump to a generated special block, which calls the generic I/O register function, puts the result in the right host register, then jumps back after the offending instruction (I know it is a little more complex than that, but I'll skip the details for now).

But what if the constant addresses are passed to a MIPS function handling similar hardware - for instance several instances of timers or DMA, and so on? Or worse, what if the address may be of different categories (normal RAM or I/O registers) when calling a MIPS function, like a memset or memcpy call? The block patched by the memory exception leads to a code block which first checks whether the address is of the same category. If not, it falls back to a more generic one.

If normal memory accesses won't suffer too much from an indirect jump, we could use a huge cache like DCACHE and ICACHE containing, for each address, the address of a code block which accesses the memory by category:

Or better, just recompile the code once with both access paths. Just alter the lowest bits of the address to call either the normal operation or the I/O operation. For instance, my code blocks are aligned to 16 bytes. The first 8 bytes may contain the normal memory operation (a simple mov and ret), and 8 bytes later, the more complex code calling a generic I/O function.

JohnLoveJoy commented 7 years ago

Are you saying your code exceeds the performance of PPSSPP? That's amazing.

I'm sure @Hrydgard would like to know.

hlide commented 7 years ago

As I already answered, this dynarec is specific to the x86 architecture and may not run faster on other architectures like ARM. Since PPSSPP is multi-architecture and has a kind of common JIT infrastructure, it may be more difficult to adapt this approach to PPSSPP for that reason. In fact, I looked into whether it could happen (as I wanted to concentrate my effort on dynarec improvements and not on the rest, like the HLE firmware), but several reasons dissuaded me from doing so.

gregory38 commented 7 years ago
Moreover, the R5900 is a 64-bit CPU (well, I'm not totally sure, as most docs I read are not clear about it), so issuing x86-32 instructions becomes tricky due to a very limited register set, whereas x86-64 is totally compatible (natural zero-extending of 32-bit operations, the same way MIPS64 does it).

The EE supports some 64-bit instructions, and some 128-bit instructions (SSE equivalent). But most code uses 32-bit instructions. We don't have any issue with the limited register set in 32 bits, as we no longer do register allocation (we used to, but it was removed). So far we manually extend the MSB.

I did a branch in the past to replace memory accesses with a nearly direct mapping of the 4GB process space (virtual mem + mirroring). I did a trick to compress it to ~128/256MB (i.e. remove the useless bits in the middle). I didn't notice any speed increase (I didn't bench it a lot). Most registers are directly mapped (not 100% sure) to save space.

We have some page protection to detect self-modifying code. And maybe some basic block chaining, but I think we often need to check events/interrupts.

Anyway, a 64-bit recompiler would be nice. Pure 64-bit operations + register allocation + direct memory access (i.e. no TLB) might provide a small speedup even without an advanced recompiler. However, I don't think emulation will be much faster, due to the floating-point emulation in the VU recompiler.

The main 64-bit issue is that we need to port not only the EE but also the VU and IOP dynarecs. And the JIT in GSdx must be ported too (I partially did the AVX part; it's missing mipmapping support).

hlide commented 7 years ago

So there are:

1) EE

2) IOP

So in the end, we may consider needing 3 dynarecs (R5.9K + VU0, R3K, VU1).

For each dynarec, a thread to allow them to operate in parallel.

And the JIT on GSdx must be done too

It looks as if VU1 is heavily coupled with the GS. Is that what you called the JIT in GSdx? Or is it another unrelated processor inside the GS which handles the commands (much like the GE processor in the PSP)?

hlide commented 7 years ago

However I don't think emulation will be faster due to floating emulation on the VU recompiler

Can you elaborate here?

hlide commented 7 years ago

What makes the EE different from other MIPS processors is that it has Multimedia Instructions (MMI); these are specialized instructions that do 128-bit data operations. The EE's General Purpose Registers are 128 bits wide. The MMIs are quite handy when you work with data that gets sent off to either the VUs or the GS. However, I have not seen much usage of these instructions in PS2DEV.

Ok, so MMI is not something related to VU0.

Edit: Is MMI completely unrelated to the macro mode of VU0?

You can use the VU0 in two modes, macro or micro. Macro means that you inline the VU0 instructions along with the rest of your EE code; in micro mode you create a microprogram which you then upload to the VU0 and execute.

I see. Does that macro mode imply using the COP2 opcode space of the R5900 to execute special instructions through the VU0 pipeline? Are macro mode and micro mode exclusive, so that it would be safe to emulate VU0 instructions in the R5900 core (by allowing access to VU0 registers and state from the R5900 core)?

gregory38 commented 7 years ago

The floating-point numbers of the PS2 aren't IEEE-compatible. There is no NaN or infinity; instead you get bigger numbers. Rounding behavior is different too. So emulating a single VU instruction requires tons of code. Besides (I'm not sure it impacts us), we don't emulate pipeline stalls, therefore we can execute more VU instructions than the PS2 does. Someone posted a basic loop that managed to process 10x more polygons than a real PS2. I guess it depends on how games are designed.

We already have a dedicated thread for the VU1 (called MTVU). So far the IOP runs in the EE thread. Perf-wise, I'm not sure we need a thread for the IOP. I hope that one day the ROM code will be HLEd. Besides, even if EE/IOP become asynchronous, I'm afraid some (bad) games might depend on the relative speed of the CPUs.

MMI is a kind of SSE for the EE. The VUs are additional vector units (kind of like GPU shaders). The EE and VU0 are tightly coupled; this combination is often used to process video (FMV).

We have another dynarec to handle the VIF unpack.

It looks as if VU1 is heavily coupled with GS, is that what you called the JIT on GSdx? or is it another unrelated processor inside GS to handle the commands (mostly like the GE processor in PSP)?

Yes, they are coupled, and the overhead isn't small. But I meant the SW rasterizer of GSdx. The JIT allows optimizing away all the branches of the shader processing.

IMHO, having a decent (common?), easy-to-maintain (AVX-only?) recompiler for EE/IOP should be enough. The VU is another story.

hlide commented 7 years ago

Yes, I just happened to find out what the differences were.

Well, I have a UP² Ultra with an N4200 (4 cores running up to 2.2GHz) and HD Graphics 505. I tried God of War II with PCSX2 and had to choose preset 5 and activate MTVU to get something close to 90-95%. Without those changes I was around 30-50%. I realized PCSX2 was only a 32-bit program and wondered why.

I'm inclined to make something to get PS2 emulation running smoothly on my UP², or even on PCs with an x5-Z8550 (less powerful than the N4200), so I was looking here to see whether it would be possible to reuse PCSX2 in 64-bit and concentrate my effort on the dynarecs for all the cores - including writing an accurate GS "software renderer" based on a library able to emit generated code using an AVX2, AVX512, OpenGL, OpenCL or CUDA compiler. By the way, is your software renderer accurate enough that I could take it as a base? Just for your information, gid15 and I wrote a GE software renderer which generates SSE4/AVX2 code for fragment shaders, with the goal of being as accurate as possible. I hope to be able to do something equivalent, but with more generator providers (SSE4/AVX2/AVX512/OpenGL/OpenCL/CUDA).

hlide commented 7 years ago

Oh by the way, @gregory38 are you "francophone"?

hlide commented 7 years ago

We already have a dedicated thread for the VU1 (called MTVU).

So basically, you have EE (R5900 + VU0) + IOP (R3000A) in the same thread, right? And a thread for the GS, with VU1 as well if MTVU is not enabled? And an optional thread for VU1 when MTVU is enabled?

gregory38 commented 7 years ago

I doubt you will get a factor of 2 even with a shiny recompiler. GoW is quite heavy on the VU1, so even if you split the EE thread into 100 threads, you will still be limited by the VU1 thread.

Without MTVU, VU1 runs inside the EE thread (to ensure synchronous EE/VU communication).

Yes, the software renderer is accurate. In 32 bits, we have SSE2/4 and AVX1/2. And an OpenCL renderer, Windows-only (but it should be close to working on Linux too).

Yes, I'm French. Is my English that bad? :)

hlide commented 7 years ago

Nah, that's because of your name in the email ^^. I'm also French.

mirh commented 7 years ago

even PCs having a x5-z8550

Hardly, considering they are SSE4.2, OGL 4.3 and OCL 1.2 at most.

If on the other hand you meant "SoC in that power range", with intel nuking their SoC department I think we'll have to wait for amd to die shrink their embedded solutions for anything better in that ballpark.

Then, maybe your SoC is a bit more powerful, and also supports Vulkan but nothing else. Actually wikipedia tells me even the atom supports Vulkan (and ogl 4.5) on linux. Stupid datasheets.

CPU-instructions-wise there's nothing new/different though. The only improvement I could see is maybe a Vulkan renderer then. Not even that. EDIT: as always, if you want something quite interesting to wonder, I'd recommend checking HSA EDIT2: (another read on why it's more than the nvidia thing) EDIT3: this paper is enchanting EDIT4: also EDIT5: the definitive one. But TL;DR we either wait for oneAPI or HIP

hlide commented 7 years ago

From memory, Linux OpenGL wasn't even on par with Windows OpenGL, whether on nVidia or AMD, so I wouldn't be surprised if it's the same with Vulkan. Moreover, for a better OpenGL experience with Intel HD Graphics, you should use Clear Linux (that is, Intel's Linux).

From what I could see with God of War, it was almost full speed with MTVU and DirectX, so it looks like GSdx is very CPU-dependent - unless it sits idle too long waiting for the EE to give it orders, because one EE + VU1 thread takes too much time.

mirh commented 7 years ago

Your memory is pretty bad then. Nvidia, since... I dunno, ever, has always been praised for its basically 1:1 Linux driver. Any performance difference, if any, is only due to bad porting. On the other hand, for as much as both Intel and AMD suck on Windows, they are prolly even better than Nvidia on Linux. And that comparison is between perfect GL on Linux vs perfect Vulkan on Linux, so that's it. Since gregory took care to optimize GL to a great extent, I'm not expecting it to perform worse on Linux than DX does on Windows (trivia: afaik D3D has nobody actively maintaining it). You should give it a check.

Also, Clear Linux's slightly superior performance is more about shiny CPU tweaking than actual GL, for the record. ... In other news, I know zero-copy is a quite abused term these days, but albeit with a huge amount of caveats, it seemingly was already supported in OCL 1.2 (or hell, else I don't know why Intel itself would talk about it). In a world where you had infinite time, and our CL renderer weren't just quite basic, maybe you'd be able to squeeze out some additional performance.

hlide commented 7 years ago

I don't keep track of them, especially Linux versions. And I said "wasn't", not "isn't". I guess with SteamOS it makes sense for nVidia. In fact, I don't care; that's not the topic here. If you say GSgl is better on Windows, I'll try it again (I didn't try it with MTVU on).

mirh commented 7 years ago

GL itself is better. GL + windows + intel ... Unsure.

EDIT: if you are CPU-limited though, it's quite difficult to predict what will happen. If intel gl dispatcher sucks (looks at amd), you may even get worse performance.

But if even their dx driver is not really the bestest (which I mean, they do igps after all, not like anybody is pressing on them for benchmarks) the benefit of linux could be huge. Because I mean.. When I previously used the words "perfect" it was more literal than anything. ... Or maybe you are just totally VU1 limited and the graphics API won't change anything, who knows 🙃🙃

fagoatse commented 7 years ago

Play! already has 64-bit recompilers for x86 and ARM (https://github.com/jpd002/Play--CodeGen). Wouldn't it make more sense to have a generic recompiler akin to https://github.com/MerryMage/dynarmic, so that both projects (and future ones) could benefit?

hlide commented 7 years ago

Just to say, I tried to benchmark my PSPE4ALL and PPSSPP on an N4200 and found a curious result for PPSSPP (the same performance as on my main PC), so I decided to tweak one parameter, because it seemed to imply you can set the virtual CPU frequency from the default 222MHz up to a maximum of 1000MHz. And yes, that was it: PPSSPP seems to insert off-CPU time to slow down the number of MIPS instructions executed, so it stays close to the real number of MIPS instructions executed on a true PSP. Since I cannot override that 1000MHz maximum (or disable this synchronization - at least through this parameter), the PPSSPP benchmark result is indeed not comparable, and PPSSPP would certainly run faster than the other emulators (full-dynarec) without that frequency synchronization (does it burn a lot of CPU instructions!? or does it sleep some time!? I don't think it sleeps, because the PC timer is very coarse).

@JohnLoveJoy So, no, I don't think PSPE4ALL outperforms PPSSPP: if there were a way to lift the frequency synchronization, PPSSPP at a 1000MHz frequency would be a little faster than PSPE4ALL on the UP², due to its full-dynarec nature. On my main PC it isn't, but I cannot really tell, because I would need to set the PPSSPP virtual frequency to a higher value.

gregory38 commented 7 years ago

Well, the big speed gain of a recompiler is the fact that it is a recompiler. Implementation details are a matter of percentages.

In our case, the slow part is the floating-point emulation. So 64-bit integer operations are nice but not so important. Extra registers could help, but I'm afraid the read/write savings are probably quite light compared to the quantity of floating-point operations.

The biggest 64-bit gain will be an easier install, and people will stop complaining about it ;) And it is a good opportunity to write nice EE/IOP recompilers.

hrydgard commented 6 years ago

@hlide If you want to experiment, you can trivially remove the 1000MHz cap with a simple source change; it's not like PPSSPP is hard to build. Indeed, we do sleep to limit execution; it works fine to regulate the speed (timeBeginPeriod helps).

hlide commented 6 years ago

@hrydgard timeBeginPeriod gives a 1ms minimum period (0.5ms if using the kernel32 version), so it is a coarse timer. I guess you are executing a big batch of instructions, then sleeping a "long" time (over the 1ms limit) - probably an event wait with a timeout, so an external event can wake up the Allegrex processor at once if necessary, right? As for the simple source change, I'll try it if the build doesn't involve external dependencies I'd have to retrieve manually.

hlide commented 6 years ago

@hrydgard Ok, while PPSSPP's visuals are great when running, reading its source is totally giving me a headache (especially with functions like "NativeUpdate", which VS2017 is unable to follow properly). A "simple source change" is not that simple when you don't know which key to search for. Do you happen to know which source line I should tweak?

hrydgard commented 6 years ago

Just change the UI to increase the range.

GameSettingsScreen.cpp

PopupSliderChoice *lockedMhz = systemSettings->Add(new PopupSliderChoice(&g_Config.iLockedCPUSpeed, 0, 1000, sy->T("Change CPU Clock", "Change CPU Clock (unstable)"), screenManager(), sy->T("MHz, 0:default")));

Change 1000 to whatever you want.

Or it might even work to just hack the setting in the .ini (memstick/PSP/SYSTEM/PPSSPP.ini), but that will probably revert to the correct range if you don't do the source change.

hlide commented 6 years ago

Thanks, I'm checking that.

hlide commented 6 years ago

@hrydgard Ok, I made some tests on my Haswell i7-4770K@3.7GHz, and I cannot set a desired PSP frequency above 2147MHz; otherwise the PSP program does not run and PPSSPP stops responding. While increasing the PSP frequency, the overall performance index converges to a stable 1490ms. PSPE4ALL in interpreter-like mode is around 1500-1600ms in comparison. I thought it would be worse.

hrydgard commented 6 years ago

Hm, I guess that hasn't been tested too well, as that's a preposterous frequency to run a PSP at.

Oh yeah, if you wanna see it top out without being limited by 60fps btw, you can just hold the Tab key to unthrottle it. It will still try to run the requested number of cycles per frame though.

hlide commented 6 years ago

Yes, I wanted to make "comparable" benchmarks between the PSP emulators I know, so I can have a rough idea of how they perform on the CPU emulation side with no throttling. If I try it while holding Tab, it still converges to a stable 1490ms whatever the PSP frequency setting (including an insane 10000MHz).