evmar / retrowin32

windows emulator
https://evmar.github.io/retrowin32/
Apache License 2.0

Support native execution on x86-64 Linux #11

Open evmar opened 5 months ago

evmar commented 5 months ago

Many Linux users will have native x86-64 hardware, and we could use their CPU directly in the same way Rosetta worked, by using the processor's 32-bit compatibility mode.

In other words, the idea here is to take the existing x86-64 Mac support and port it to Linux.

I tinkered a bit in this area here: https://github.com/evmar/retrowin32/compare/linux?expand=1

Some notes:

For most of this I think the answer will be roughly "dig through Wine to see how they did it".

evmar commented 5 months ago

We have some functions defined in assembly.

To link them on Mac they are named e.g. _tramp64, but you write the name without the underscore in the Rust extern declaration; this is some cdecl name mangling(?).

To link them on Linux you don't add the underscore, so I'd somehow have to make the code handle both.
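For context, Mach-O prepends an underscore to C symbol names while ELF does not, which is why the two platforms disagree. One way to paper over it (a minimal sketch, not the repo's actual code; tramp64 and its one-instruction body are placeholders) is to emit the trampoline from Rust with global_asm! and gate the assembly-level symbol name on the target:

```rust
use std::arch::global_asm;

// Sketch only: pick the assembly-level symbol name per target so the same
// source links on both Mach-O (leading underscore) and ELF (no prefix).
#[cfg(target_os = "macos")]
global_asm!(".globl _tramp64", "_tramp64:", "ret");

#[cfg(not(target_os = "macos"))]
global_asm!(".globl tramp64", "tramp64:", "ret");

extern "C" {
    // The Rust declaration stays the same on both platforms, because rustc
    // applies the platform's C symbol convention automatically.
    fn tramp64();
}
```

Keeping the code in a standalone .S file and making the label conditional there would work equally well; it is the same decision, just made by the assembler instead of rustc.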

spiffyguy commented 5 months ago

> We have some functions defined in assembly.
>
> To link them on Mac they are named e.g. _tramp64, but you write the name without the underscore in the Rust extern declaration; this is some cdecl name mangling(?).
>
> To link them on Linux you don't add the underscore, so I'd somehow have to make the code handle both.

Hi @evmar, big fan of what you are creating here. I am not proficient in Rust, Assembly or the win32 api... but I like to think I can read and somewhat get what's going on...

That being said... I saw this assembly code in a completely different project, petool (MIT licensed), that I thought might apply to the assembly you are writing here.

Notice how in petool's assembly file incbin.S there are both underscore and non-underscore definitions for each function, ahead of the included assembly. I believe this was purposely written so the code compiles easily on macOS, Linux, or Windows (via MinGW). (A rough Rust equivalent of the trick is sketched after this comment.)

https://github.com/FunkyFr3sh/petool/blob/master/src/incbin.S

I hope seeing this helps, if not... sorry for the noise.
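The pattern in the linked incbin.S is to define both symbol names for the same code, so whichever name the platform's linker looks for will resolve. A rough equivalent expressed with Rust's global_asm! might look like this (a sketch only; tramp64 is the placeholder name from the earlier comment and the ret body stands in for the real routine, none of this is petool's or retrowin32's code):

```rust
use std::arch::global_asm;

// Sketch: export the same code under both the Mach-O name (_tramp64) and
// the ELF name (tramp64). The extra label is harmless on the other platform.
global_asm!(
    ".globl _tramp64",
    ".globl tramp64",
    "_tramp64:",
    "tramp64:",
    "ret",
);
```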

mateli commented 5 months ago

On an x64 processor, the compatibility mode for 32-bit applications provides significantly less performance than running 64-bit applications: for example, there is no access to the extra registers and x64 SIMD extensions. Furthermore, switching between 32-bit compatibility mode and 64-bit mode is expensive.

Trying to make all code faster is premature optimization: for 95% of application code it's not going to make any difference. Finding the 5% where applications spend most of their time and replacing that with optimized native code would provide more of a performance win.

Take, for example, your favorite file-zipping application. If we run it up to the point where it would execute its own implementation of DEFLATE and instead call a zlib-ng library compiled for x64, we get all the x64 optimizations, including the extra SIMD instructions. The DEFLATE part of the application will then probably be faster than when the application runs in 32-bit compatibility mode. Other parts of the application are unlikely to matter much. It may even be faster than the actual 64-bit version of the same application, since we are using what is currently the best DEFLATE implementation.

What I suggest instead is to create an emulator that can do this without expensive CPU mode switching, and furthermore to enhance the emulator with good profiling tools so that we can figure out where an application spends its time and supply optimized native code that is faster. Sometimes this is as simple as recompiling the same code natively; in other cases, slow algorithms can be replaced by faster ones. Taking zlib as an example, anything that ships an older, less efficient zlib would benefit from being redirected to the latest zlib-ng. There is no scenario where DEFLATE running in 32-bit mode is as fast as zlib-ng running in x64 mode, taking full advantage of SIMD and the extra registers.
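To make the redirection idea concrete, here is a purely illustrative sketch (the Machine type, the hook table, and the address-keyed dispatch are hypothetical, not retrowin32's actual API): when emulated execution reaches an address known to be a hot guest routine, run a native host implementation instead of emulating the guest code.

```rust
use std::collections::HashMap;

// Hypothetical emulator state: registers, guest memory, and so on.
struct Machine { /* registers, guest memory, ... */ }

// A native replacement for a guest function, operating on emulator state.
type NativeHook = fn(&mut Machine);

// Before emulating the instruction at `eip`, check whether a native hook is
// registered for that address; if so, run it and skip the guest code.
fn maybe_dispatch_hook(
    hooks: &HashMap<u32, NativeHook>,
    eip: u32,
    machine: &mut Machine,
) -> bool {
    if let Some(&hook) = hooks.get(&eip) {
        hook(machine); // e.g. a call into zlib-ng built natively for x86-64
        true // tell the caller to skip emulating the guest routine
    } else {
        false
    }
}
```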

While an emulator is slower when it simply runs an application instruction by instruction, it may very well be the fastest way to run applications if the above approach is used. It could even make applications faster when running 32-bit applications in 32-bit x86 mode: being able to insert better algorithms and other performance optimizations into an application has that effect. Even just replacing old binary code with something compiled by a newer, better compiler can produce great leaps in performance.

There is, of course, a significant amount of work needed for this approach. For the DEFLATE use case we would have to investigate every application that uses it and make sure it gets redirected to zlib-ng. But once that is done, zipping will be as fast as it can be on any hardware. And if zlib-ng improves, or another library replaces it as the best DEFLATE implementation, we can easily upgrade or switch and make all those applications faster. That said, for many applications DEFLATE is a very minor part of what they do, and making it faster will not change their overall performance, which is why profiling is important.

hardBSDk commented 5 months ago

The 32-bit compatibility mode is removed in the upcoming x86-S architecture.

mateli commented 5 months ago

Meaning 32-bit code will require emulation or binary translation.

evmar commented 3 months ago

@cadmic got retrowin32 running on x86-64 Mac, which means we have at least all the CPU-initialization bits in place such that actual x86 hardware believes us. So in theory all that's left for x86-64 Linux is the memory layout and LDT initialization code, probably not too bad?

https://github.com/evmar/retrowin32/commit/02fa9202d6040a6a3a0894a8daa7051a0b39e278
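For context, the Linux-specific LDT piece boils down to a call like the one sketched below: install a 32-bit code segment in the process's LDT via the modify_ldt syscall so that a far call/jump can drop the CPU into compatibility mode. This is a rough sketch under assumptions, not retrowin32's code: the struct mirrors struct user_desc from <asm/ldt.h>, and the flag bits and selector math are illustrative.

```rust
// Sketch: install a 32-bit, flat code segment as LDT entry `entry`.
// Assumes x86-64 Linux and the `libc` crate.
#[repr(C)]
struct UserDesc {
    entry_number: u32,
    base_addr: u32,
    limit: u32,
    // Packed bitfields from struct user_desc: seg_32bit, contents,
    // read_exec_only, limit_in_pages, seg_not_present, useable.
    flags: u32,
}

fn install_32bit_code_segment(entry: u32) -> std::io::Result<u16> {
    let desc = UserDesc {
        entry_number: entry,
        base_addr: 0,
        limit: 0xfffff, // in pages: covers the full 4 GiB 32-bit address space
        // seg_32bit=1, contents=code(2), limit_in_pages=1, useable=1
        flags: 0b101_0101,
    };
    let ret = unsafe {
        libc::syscall(
            libc::SYS_modify_ldt,
            1usize, // func 1: write an LDT entry
            &desc as *const UserDesc,
            std::mem::size_of::<UserDesc>(),
        )
    };
    if ret != 0 {
        return Err(std::io::Error::last_os_error());
    }
    // Build the segment selector: index, TI=1 (LDT), RPL=3.
    Ok(((entry as u16) << 3) | 0b111)
}
```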

mateli commented 3 months ago

I am way more interested in being able to run x86-64 applications on ARM, such as on modern Macs and the Raspberry Pi.