gamercade-io / gamercade_console

A Neo-Retro Fantasy Console. Make WASM-powered, networked multiplayer games.
https://gamercade.io
Apache License 2.0

Gauging Interest in: Parallel Compute Instruction Set, Vector Processing Unit, or GPU-lite interface. #105

Closed RobDavenport closed 2 months ago

RobDavenport commented 2 months ago

Hey all,

I was reading up on some interesting details about 5th-generation consoles like the N64 and PlayStation. They had special hardware built for parallel data processing, similar to a modern GPU.

Is there any interest in a set of special parallel-processing instructions? Currently, the WASM standard and runtime support 128-bit SIMD, i.e. 4x 32-bit integers or 4x f32 floats at once. Modern CPUs have 256-bit and even 512-bit operations, but that pales in comparison to the N64's GPU. We're looking at maybe 256 bits over 4-8 cores (for a modest device), which is still 32 to 64 f32 operations "at once."
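For context, here's a minimal sketch of what the existing 128-bit WASM SIMD already gives a cartridge today, assuming the game is built for wasm32 with the `simd128` target feature enabled (this is plain `core::arch::wasm32`, nothing Gamercade-specific):

```rust
// Build with RUSTFLAGS="-C target-feature=+simd128" for a wasm32 target.
#[cfg(target_arch = "wasm32")]
use core::arch::wasm32::{f32x4, f32x4_add, f32x4_extract_lane, f32x4_mul, v128};

/// Lane-wise (a * b) + c: four f32 multiply-adds per call.
#[cfg(target_arch = "wasm32")]
fn madd4(a: [f32; 4], b: [f32; 4], c: [f32; 4]) -> [f32; 4] {
    let va: v128 = f32x4(a[0], a[1], a[2], a[3]);
    let vb: v128 = f32x4(b[0], b[1], b[2], b[3]);
    let vc: v128 = f32x4(c[0], c[1], c[2], c[3]);
    let r = f32x4_add(f32x4_mul(va, vb), vc);
    [
        f32x4_extract_lane::<0>(r),
        f32x4_extract_lane::<1>(r),
        f32x4_extract_lane::<2>(r),
        f32x4_extract_lane::<3>(r),
    ]
}
```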

According to this source, the N64's Vector Processing Unit had 32x 128-bit wide vector registers. That's 128 f32 operations at once (4x f32 * 32 vectors). The PS1 also had a special Geometry Transformation Engine. I can't find exact specs on it, but I assume it's similar to a vertex shader in the modern GPU pipeline, as it worked together with the actual PS1 GPU to draw pixels to the screen.

I think this could be accomplished in a few different ways:

Method (1) Exposing a few raw, simple large vector operations which can be run across multiple threads on the host machine

This is the lowest-level option. It would be similar to writing manual SIMD code like this. We would probably have to add some additional parameters for pointers so the host can read and write values in guest memory. It's not very easy to use, and with the added complexity of passing larger values between WASM and the host, it could get really annoying. On top of that, all the individual calls (load values, add values, multiply values, and push them in and out of memory locations) might just kill any real performance benefit. Perhaps some kind of "Execution Buffer" could batch many instructions together and reduce the module<-->host call count.
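For concreteness, a hedged sketch of what a Method 1 host binding might look like, assuming a wasmtime-style `Linker`; the `vec_mul_add` name and its pointer-plus-length calling convention are made up for illustration:

```rust
use wasmtime::{Caller, Linker};

// vec_mul_add(dst, a, b, c, len): dst[i] = a[i] * b[i] + c[i] for i in 0..len.
// The guest passes byte offsets into its own linear memory; the host does the
// wide math and could spread it across threads without the guest knowing.
fn register_vector_ops(linker: &mut Linker<()>) -> anyhow::Result<()> {
    linker.func_wrap(
        "env",
        "vec_mul_add",
        |mut caller: Caller<'_, ()>, dst: u32, a: u32, b: u32, c: u32, len: u32| {
            let memory = caller
                .get_export("memory")
                .and_then(|e| e.into_memory())
                .expect("guest must export `memory`");
            let data = memory.data_mut(&mut caller);
            let len = len as usize;

            // Read `len` f32 values starting at a byte offset in guest memory.
            let read = |data: &[u8], ptr: u32| -> Vec<f32> {
                data[ptr as usize..ptr as usize + len * 4]
                    .chunks_exact(4)
                    .map(|b| f32::from_le_bytes([b[0], b[1], b[2], b[3]]))
                    .collect()
            };
            let (av, bv, cv) = (read(data, a), read(data, b), read(data, c));

            // Plain loop for clarity; the host side is free to use SIMD or a
            // thread pool here since the result is deterministic either way.
            for i in 0..len {
                let out = av[i] * bv[i] + cv[i];
                let dst_off = dst as usize + i * 4;
                data[dst_off..dst_off + 4].copy_from_slice(&out.to_le_bytes());
            }
        },
    )?;
    Ok(())
}
```

The guest would declare `vec_mul_add` as an `extern "C"` import and hand over offsets of its own arrays, which is exactly the "pushing values in and out of memory locations" overhead mentioned above.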

Pros:

Cons:

Method (2) Exposing some kind of "fork-join" API within the console, letting the game define some code to call in parallel over a dataset.

This keeps everything in WASM land, but it would also be super easy to cause a desync or worse if it isn't handled correctly. WASM doesn't support read-only memory, so it would be very easy to access memory outside of the expected region. Honestly, I think this is the best choice, but it passes so much responsibility onto the developer that I'm not sure we want to open up such an easy path to desyncs. When done properly, though, I could see this being a really fun and powerful thing to experiment with, and I expect an API similar to CUDA could exist here. But it's easy to imagine how unsafe this could be...
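To make the shape of this concrete, here's a hypothetical cartridge-side view of such a fork-join API; the `parallel_for` import, the exported `kernel` convention, and the chunk size are all invented for illustration, not a real Gamercade API:

```rust
extern "C" {
    // Hypothetical host import: splits [start, end) into chunks, calls the
    // exported `kernel` on several threads, and blocks until all chunks finish.
    fn parallel_for(start: u32, end: u32, chunk_size: u32);
}

// Shared buffers the kernel reads and writes. This is exactly the risk called
// out above: nothing stops a buggy kernel from writing outside its assigned
// range, and any race or ordering dependence is a desync waiting to happen.
static mut POSITIONS: [f32; 4096] = [0.0; 4096];
static mut VELOCITIES: [f32; 4096] = [0.0; 4096];

#[no_mangle]
pub extern "C" fn kernel(start: u32, end: u32) {
    // Each invocation integrates its own disjoint slice of particles.
    for i in start as usize..end as usize {
        unsafe { POSITIONS[i] += VELOCITIES[i] * (1.0 / 60.0) };
    }
}

#[no_mangle]
pub extern "C" fn update() {
    // Fan out the per-particle work, 256 indices per chunk (the "fork"),
    // then continue once parallel_for returns (the "join").
    unsafe { parallel_for(0, 4096, 256) };
}
```

Determinism would hinge on every chunk touching only its own disjoint index range; nothing in WASM itself enforces that.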

Pros:

Cons:

Method (3) Exposing compute shaders, or a simplified GPU pipeline accessible and configurable from the game itself.

This is kind of the "easy way out." It would require devs to write their own shaders (typical vertex, fragment, or compute shaders); alternatively, Gamercade could provide a set of built-in shaders/shaders-as-functions to be called over a dataset. This method is still quite open-ended, as I'm personally not too experienced with modern graphics APIs like wgpu or Vulkan.
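A rough sketch of what the guest-facing surface of Method 3 might look like, assuming the console owns all of the actual wgpu/Vulkan plumbing and the cartridge only refers to buffers and (built-in or precompiled) shaders by id; every name here is illustrative:

```rust
extern "C" {
    // Hypothetical host imports: upload guest memory into a console-owned GPU
    // buffer, dispatch a compute shader over it, then read the results back.
    fn gpu_write_buffer(buffer_id: u32, ptr: *const u8, len: u32);
    fn gpu_dispatch(shader_id: u32, buffer_id: u32, workgroups: u32);
    fn gpu_read_buffer(buffer_id: u32, ptr: *mut u8, len: u32);
}

// Illustrative ids for a built-in "transform points" shader and a scratch buffer.
const SHADER_TRANSFORM_POINTS: u32 = 0;
const BUFFER_SCRATCH: u32 = 0;

#[no_mangle]
pub extern "C" fn update() {
    let mut points = [0.0f32; 1024];
    // ... fill `points` with vertex data ...
    let byte_len = (points.len() * 4) as u32;
    unsafe {
        gpu_write_buffer(BUFFER_SCRATCH, points.as_ptr() as *const u8, byte_len);
        gpu_dispatch(SHADER_TRANSFORM_POINTS, BUFFER_SCRATCH, 16);
        gpu_read_buffer(BUFFER_SCRATCH, points.as_mut_ptr() as *mut u8, byte_len);
    }
}
```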

Pros:

Cons:

Method (4) Something else...

The three methods listed above aren't completely researched, and I'm sure there are ways to implement some of them that solve the larger issues. For example, there is a wasm2spirv crate which compiles WASM code into shader code, which could greatly benefit Method 3. I'm personally a big fan of Method 2, but the lack of safety and of any way to prevent desyncs makes me hesitant.

RobDavenport commented 2 months ago

Closing this in favor of the upcoming 3D discussion post.