JHRobotics / softgpu

SW and HW accelerated GPU driver for Windows 9x Virtual Machines
MIT License

DOSbox-X support #22

Open Torinde opened 1 year ago

Torinde commented 1 year ago

DOSBox-X supports Win9x, so it would benefit greatly from the SoftGPU 3D-accelerated driver.

What will be needed to achieve that?

DOSBox Pure is another DOSBox fork that officially supports Win9x. The 86Box "Virtual PC" machine and related emulators would benefit from support as well. (86Box discussion, dosemu2 discussion)

JHRobotics commented 1 year ago

Hello, the main problem behind the SSE requirement was the CRT (C runtime) in the MSYS MinGW distribution: it was compiled with SSE instructions, so they ended up in the code regardless of the “-march” flag. I was able to compile version 17.x with an older compiler, but not the newer versions (C++14 is required).

However, I found a working MinGW distribution (from here: https://github.com/niXman/mingw-builds-binaries/releases/tag/13.1.0-rt_v11-rev1) without this behaviour, so in the last release the “Windows 95” binaries have no SSE instructions in the runtime, but LLVMpipe can still use SSE instructions if they are present.

I appreciate the effort to get some additional graphics acceleration options into DOSBox, but I'm not entirely sure VMware SVGA is the way to go. The problem is that quite a lot of the computation is done in the guest system, and in the newer version of the protocol (GPU gen. 10 or SVGA-III), for example, all the surfaces (textures, framebuffer, and other work buffers) and graphics structures, which on real HW are located in VRAM, are stored in ordinary RAM. Of course, this has its own logic: if you run multiple virtual machines, you must allocate resources as efficiently as possible. If everything lives in guest memory, there is no hidden overhead, you can inflate the allocated RAM (memory ballooning) according to how much you really need, and you don't have dead textures sitting somewhere in host memory.

In an emulator, on the contrary, you have to compute as little as possible inside the guest, because even with a dynamic recompiler, emulation is very, very slow compared to native code. Therefore, it seems to me that a much better way is to use shared memory for surfaces (textures, framebuffer, ... = wherever the guest needs to write and read its data), plus a FIFO queue into which guest API calls are pushed.

3D acceleration is done this way in, for example, https://github.com/kjliew/qemu-3dfx, and basically also in DOSBox when GLIDE is emulated (the entire HW is not emulated; individual calls are passed to the library on the host system instead). The only problem is that a driver has to be written for it, which is not entirely easy in the case of Windows 9x, but it still seems easier to me than either optimizing a complex driver to run faster in the emulator or emulating real (and mostly poorly documented) graphics HW.
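As a rough illustration of the "shared memory plus FIFO of guest API calls" idea above, here is a minimal sketch of a command ring. It is not taken from qemu-3dfx or DOSBox: the `CommandFifo` class, the `[opcode, argc, args...]` word layout, and the use of a `bytearray` as a stand-in for the shared region are all assumptions made for the example.

```python
import struct


class CommandFifo:
    """Ring buffer of 32-bit words; the bytearray stands in for shared memory."""

    def __init__(self, capacity_words):
        self.buf = bytearray(capacity_words * 4)
        self.capacity = capacity_words
        self.head = 0  # index of the next word the host will read
        self.tail = 0  # index of the next word the guest will write
        self.used = 0  # number of words currently queued

    def push(self, opcode, args):
        """Guest side: enqueue [opcode, argc, args...]; False if it does not fit."""
        words = [opcode, len(args), *args]
        if len(words) > self.capacity - self.used:
            return False
        for word in words:
            struct.pack_into("<I", self.buf, self.tail * 4, word)
            self.tail = (self.tail + 1) % self.capacity
        self.used += len(words)
        return True

    def _read_word(self):
        (word,) = struct.unpack_from("<I", self.buf, self.head * 4)
        self.head = (self.head + 1) % self.capacity
        self.used -= 1
        return word

    def pop(self):
        """Host side: dequeue one command as (opcode, args), or None if empty."""
        if self.used == 0:
            return None
        opcode = self._read_word()
        argc = self._read_word()
        return opcode, [self._read_word() for _ in range(argc)]
```

In this picture the guest driver only packs small integers into the ring, and the host side pops each command and forwards it to a native graphics library, so the expensive work never runs under emulation.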

Speaking of DOSBox, it should also be mentioned that some S3 Trio and especially S3 ViRGE cards are capable of 3D acceleration. Although the DDI version is only 5.0 (DirectX 7 maximum) and the real HW was so slow that it was nicknamed the “decelerator”, in theory it could work better in the emulator. In addition, there is a driver with source code (part of DDK98).

Sorry for the long answer, but even in a virtual machine the performance of the driver depends a lot on the performance of the CPU itself (e.g. comparing an Intel i7 4th gen. + GTX 1650 against an AMD Ryzen 5 3rd gen. or an Intel i5 11th gen. + integrated GPU, the newer CPU wins by quite a lot, regardless of the graphics card used). And I have a feeling that my driver will be really very slow in the emulator.

But I don't want to just say "won't work". I'll try to follow the development of the implementation and, if I'm able, I'll try to contribute some code.

Torinde commented 1 year ago

Thank you! I appreciate the long answer!

So, you're saying that your implementation relies on high CPU performance in the guest, so you worry it may be too slow in emulators (vs hypervisors, joncampbell123/dosbox-x/issues/1089). Still worth a try.

Also, another implementation could be written that shifts the computation to the host and whose guest driver uses minimal resources.

@joncampbell123 - FYI, maybe you would be interested in working with @JHRobotics on that?

joncampbell123 commented 1 year ago

Maybe on x86/x86-64 platforms that support SSE, the emulator could execute the same or similar instruction for the guest to speed things up? Sort of like all the FPU code inherited from SVN.

Torinde commented 1 year ago

Latest readme says

If you decide to use 98/Me build, your (virtual) CPU needs support these instructions MMX, SSE, SSE2, SSE3, SSSE3, CX16, SAHF and FXSR (Intel Core2).

So, basically any real SSSE3 CPU will cover the requirements, except the 32-bit Atom, which lacks CMPXCHG16B (but maybe that instruction can be emulated additionally). For emulators lacking one of the above, there is the 95 build.
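The choice between the two builds can be written down as a simple set check. This is only a sketch: the feature names come from the readme line quoted above, while `REQUIRED_98ME`, `missing_for_98me` and `pick_build` are hypothetical names, and how an emulator would actually report its CPU features is left open.

```python
# Feature list as quoted from the readme for the 98/Me build.
REQUIRED_98ME = frozenset(
    {"MMX", "SSE", "SSE2", "SSE3", "SSSE3", "CX16", "SAHF", "FXSR"}
)


def missing_for_98me(cpu_features):
    """Return the sorted list of required features the (virtual) CPU lacks."""
    available = {name.upper() for name in cpu_features}
    return sorted(REQUIRED_98ME - available)


def pick_build(cpu_features):
    """Suggest the 98/Me build only if nothing is missing, otherwise the 95 build."""
    return "95" if missing_for_98me(cpu_features) else "98/Me"
```

For example, a virtual CPU that offers everything except CX16 would be steered to the 95 build, with `missing_for_98me` naming the gap.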

JHRobotics commented 1 year ago

The instruction requirements are from the GCC manual: https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html

In theory, the build can be optimized for any CPU you want.

My only worry is the code complexity of Mesa3D, because vmwsgl32.dll without debug symbols is still 15 MB. But if Mesa in software mode works in QEMU without acceleration (KVM or WHPX), it'll work in DOSBox :-)

Torinde commented 1 year ago

Thanks, so it's CMPXCHG16B.

EDIT: The GCC link actually says "This option enables GCC to generate CMPXCHG16B instructions in 64-bit code" for the -mcx16 option, so I assume the 32-bit build won't use any of those (i.e. that switch is ignored for SoftGPU)?

The same "in 64-bit code" is mentioned for the -msahf option, but SAHF/LAHF are always supported in 32-bit CPUs/modes anyway.

Torinde commented 1 year ago

the emulator could execute the same or similar instruction for the guest to speed things up?

Isn't that similar to implementing a hypervisor core? Or you plan some (easier to implement?) middle ground where only parts of the CPU execution are transferred to the host, but it's not a full hypervisor core?

Will that use AVX (or more) from the host (llvmpipe SoftGPU benefits from AVX)? Or can it only go up to the maximum emulated by DOSBox-X (SSE)?

JHRobotics commented 1 year ago

Isn't that similar to implementing a hypervisor core? Or you plan some (easier to implement?) middle ground where only parts of the CPU execution are transferred to the host, but it's not a full hypervisor core?

It's relatively simple: for example, if the emulator finds the byte sequence 0F 58 CA (ADDPS xmm1, xmm2), it executes ADDPS xmm1, xmm2 on the host. The x87 FPU is implemented in DOSBox the same way (or it was; it's been a long time since I examined the DOSBox core, and this works only on x86-compatible hosts). The dynamic DOSBox core is a bit similar to the dynamic recompilers in older hypervisors (more than 10 years ago, for virtualization without HW assistance, now unsupported). But there is a huge difference in purpose: a hypervisor is designed to run 32- or 64-bit RING-3 code with minimal performance penalty, and the rest is emulated (very precisely, but very slowly), so every I/O or privileged instruction carries a very large performance penalty. In DOSBox, speed in real/virtual x86 mode is what matters, and even a DOS game that runs in PM-32[^1] still uses I/O or BIOS interrupts to communicate with HW. DOSBox also needs to emulate precise instruction timing, which is absolutely unimportant for hypervisors.
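To make the 0F 58 CA example concrete, here is a small interpreter-style sketch that recognizes the register-to-register form of ADDPS and applies it to emulated XMM registers. It is not DOSBox code: `XmmState`, `step` and `f32` are hypothetical names, and the lane-wise Python addition merely stands in for the single host ADDPS instruction a real emulator would execute or emit.

```python
import struct


def f32(value):
    """Round a Python float to IEEE-754 single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


class XmmState:
    """Eight emulated XMM registers, each holding four single-precision lanes."""

    def __init__(self):
        self.xmm = [[0.0] * 4 for _ in range(8)]


def step(state, code, ip):
    """Execute one instruction at code[ip] and return the next instruction offset."""
    if code[ip:ip + 2] == b"\x0f\x58":  # ADDPS xmm, xmm/m128
        modrm = code[ip + 2]
        mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
        if mod != 3:
            raise NotImplementedError("memory operands are not handled in this sketch")
        # reg is the destination, rm the source: 0xCA -> reg=1 (xmm1), rm=2 (xmm2).
        dst, src = state.xmm[reg], state.xmm[rm]
        state.xmm[reg] = [f32(a + b) for a, b in zip(dst, src)]
        return ip + 3
    raise NotImplementedError("only ADDPS is handled in this sketch")
```

The ModRM byte 0xCA decodes to mod=3, reg=1, rm=2, which is why the sequence means "add xmm2 into xmm1"; everything else about a real dynamic core (memory operands, flags, timing) is deliberately left out.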

Will that use AVX (or more) from the host (llvmpipe SoftGPU benefits from AVX)? Or can it only go up to the maximum emulated by DOSBox-X (SSE)?

AVX is useful only if rendering is pure software (with llvmpipe), but that is slow even on real hardware, and in a VM it is about half as fast. I did some tests on QEMU with the CPU accelerator disabled and it's really slow[^2]. Pure software 3D in the guest isn't the way for now.

Anyway, I'm currently trying to pass 3D commands through to the host's GPU (like qemu-3dfx) in DOSBox-X with a modified S3 ViRGE driver. If I'm successful, I'll share the results (and code) :-)

[^1]: 32-bit protected mode

[^2]: "slow" isn't exactly the right word, I think a new word needs to be invented for it, because only "slow" doesn't describe the "slowness" of this operation.

Torinde commented 1 year ago

Great to hear that!

Regarding DDI/DirectX levels

:) new word for slow... virgespeed?