Metal Testing Guide and Performance Improvements

This issue is an overview of the macOS client testing methodology and remaining/known performance issues.

Testing

It's important on macOS to cover all GPU driver vendors for both OpenGL and Metal. Plasma on macOS supports any Metal 1.0 GPU. Supported vendors with Metal 1.0 drivers are:

AMD
Apple
Intel
Nvidia

Additionally, Metal is only supported on 2012 Macs and newer. This follows from the set of GPUs shipped in that generation and is not an artificial limit. (An older Mac Pro can also run Metal with an upgraded GPU.)

The graphics drivers on macOS can be of varying quality. Apple and AMD are thought to have the most reliable GPU drivers, with Nvidia and Intel having the least reliable drivers.

Driver updates are also packaged with macOS updates and never provided separately. For testing, it is assumed each device will be running the latest eligible version of macOS. We are not yet equipped to test older macOS versions, which will have different driver versions.

The currently tested and certified configurations are:

Hardware	macOS Version	GPU
Macbook Pro 15" 2012	macOS 10.15	Intel HD Graphics 4000 + GeForce 650m
Macbook Air 13" 2014	macOS 11	Intel HD Graphics 5000
Mac Mini 2014	macOS 12	Intel Iris Graphics 5100
Macbook Air 11" 2015	macOS 12	Intel HD Graphics 6000
Mac Pro 2019	macOS 14	AMD Radeon W5700X + AMD Radeon RX 6900 XT
Macbook Pro 15" 2021	macOS 14	Apple M1 Max
Macbook Air 13" 2022	macOS 14	Apple M2

For Intel graphics, the Metal debugger requires a 6000 series or better GPU for the full set of shader debugging tools.

Per GPU features

Plasma's Metal renderer has several per-GPU features, and that number is expected to grow over time.

Apple groups GPU feature sets into GPU families, which are comparable to tiers in other APIs. While Apple originally grouped Mac GPUs into Mac families, they've since moved on to having Apple GPU specific families. https://developer.apple.com/documentation/metal/mtlgpufamily?language=objc

The full set of possible GPU specific features is described in the Metal Feature Table, available here: https://developer.apple.com/metal/Metal-Feature-Set-Tables.pdf

Expected Performance

macOS Plasma is targeted to run at 120 fps on Apple Silicon hardware at native panel resolution.

Plasma on Intel hardware should run as well as the Windows version running natively (not through WINE or virtualization). This has not been thoroughly tested.

Most games do not support Intel graphics hardware due to driver issues on macOS. The Gametap release of Uru on Mac excluded Macs with Intel GPUs. So far, we expect Plasma to run on Macs with Intel graphics.

Current Results

Within the testing matrix, Plasma seems to be performing reasonably.

On the 15" M1 Max Macbook Pro, Plasma is currently achieving the goal of 120 fps at panel resolution, with frame renders taking approximately a millisecond or less.

On a newer Intel Mac with a GPU like the AMD Radeon W5700X, performance is good, with a frame render taking several milliseconds or less at 5k resolution (5120x2880).

Older Intel hardware runs Plasma reasonably. Both the 2014 and 2015 Macbook Air can run Plasma at panel resolution at between 30 and 60 fps depending on the scene. 60 fps is reliable at 800x600.

The 2012 Macbook Pro has a difficult time with Plasma due to its relatively high resolution (2880x1800) and its slow GPU. Halving the resolution to a more reasonable 1440x900 causes Plasma to behave better on this hardware. This was not uncommon on the first Retina Macs, which often had panels at much higher resolutions than the onboard GPU could handle in games.
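To put the numbers in this section side by side, here is a small arithmetic sketch. The helper names are made up for illustration and are not part of Plasma. It only restates figures already given above: the 120 fps target leaves roughly 8.3 ms per frame, and halving each dimension of the 2012 Macbook Pro panel cuts the pixel count to a quarter.

```cpp
#include <cassert>
#include <cstdint>

// Milliseconds available to render one frame at a given target frame rate.
inline double frameBudgetMs(double fps)
{
    assert(fps > 0.0);
    return 1000.0 / fps;
}

// Total number of pixels shaded per frame at a given resolution.
inline std::uint64_t pixelCount(std::uint32_t width, std::uint32_t height)
{
    return static_cast<std::uint64_t>(width) * height;
}

// frameBudgetMs(120.0)   is about 8.33 ms
// pixelCount(2880, 1800) is 5,184,000 pixels (2012 Macbook Pro panel)
// pixelCount(1440, 900)  is 1,296,000 pixels, one quarter of the above
// pixelCount(5120, 2880) is 14,745,600 pixels (5k)
```

Pixel count is only a rough proxy for GPU load, but it helps explain why the halved resolution behaves so much better on the 650m.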

@Hoikas has noted an issue on his 2014 Mac Mini with an Intel Iris Pro 5100 running macOS 12. The Iris Pro is not in our test matrix, but the 5000 series GPUs from Intel are represented. We have not tested macOS 12 with a 5000 series GPU as the Mac Mini is the only possible machine with that configuration. I am working on acquiring a 2014 Mac Mini with comparable specs and adding it to the test matrix.

Focused Performance Areas For Metal

Redundant Binding

The current Metal renderer does not track the state of all resources it is binding (such as textures) and thus may encode the same texture binding repeatedly. Unlike some other graphics APIs - bound resources are valid for the entire render pass in Metal and rebinding should be avoided. We need to track the currently bound resources across different shader passes and only bind the resources that have changed.
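As a rough illustration of the tracking described above (not the renderer's actual code), a per-pass cache can remember what each slot holds and report whether a bind needs to be encoded at all. BindingCache, TextureId, and the slot count of 8 are hypothetical names and values chosen for this sketch. A real implementation would compare Metal texture objects and would likely cover buffers and samplers as well.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a texture handle.
using TextureId = std::uint32_t;
constexpr TextureId kNoTexture = 0;

// Assumed number of texture slots tracked per shader stage.
constexpr std::size_t kTextureSlots = 8;

// Remembers which texture is bound to each slot during the current render
// pass so that redundant bind calls can be skipped.
class BindingCache {
public:
    // Returns true if the caller must encode the binding, or false if the
    // slot already holds this texture and encoding it again is redundant.
    bool needsBind(std::size_t slot, TextureId texture)
    {
        assert(slot < kTextureSlots);
        if (fBound[slot] == texture)
            return false;
        fBound[slot] = texture;
        return true;
    }

    // Bindings do not carry over to a new render pass, so the cache has to
    // be reset whenever a new pass begins.
    void beginPass() { fBound.fill(kNoTexture); }

private:
    std::array<TextureId, kTextureSlots> fBound{};
};
```

The encoder call would then be wrapped in a check such as `if (cache.needsBind(slot, texture)) { /* encode the bind */ }`, with `beginPass()` called each time a new render pass starts.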

The Metal debugger will produce a list of redundant bindings for us, so all it requires is loading up different scenes, running the debugger, and tracking down redundant bindings.

There would likely be a minor performance improvement here, especially on machines like early Macbook Airs that have slower CPUs. Encoding GPU instructions mostly impacts CPU usage and bandwidth. In traces, redundant binding has not shown up as a major performance contributor, but it is a clear issue with easy traceability that should be solved.

Tile Memory on Apple Silicon

There will likely be a series of optimizations specifically for Apple Silicon. Apple Silicon is bandwidth constrained, so these optimizations should focus on reducing memory traffic. This will be helpful on lower end Apple Silicon and possibly iPad and Vision Pro hardware.

Apple Silicon does not have traditional VRAM - but it does have a tile based cache called tile memory. Specialized shaders can work in tile memory, reducing the need to move buffers on and off main memory through the system bus.

One specific thing I'm looking at is gamma correction in tile memory. Right now we need to flush the framebuffer, and then recall it as a source buffer to do gamma correction. With a traditional GPU it's not considered safe to read and write directly to the current framebuffer. With tile memory - this is a safe operation. We may be able to do a gamma correction directly in tile memory. The gamma correction can be an expensive part of the render.
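The access pattern in question is "read a pixel, correct it, write it back to the same location". The sketch below shows that pattern on the CPU only. It is not tile shader code, and it assumes a simple power-law gamma curve, which may not match the gamma ramp Plasma actually applies. Pixel and the function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical floating point RGBA pixel.
struct Pixel {
    float r;
    float g;
    float b;
    float a;
};

// Assumed power-law curve; Plasma's real gamma ramp may differ.
inline float applyGamma(float channel, float gamma)
{
    return std::pow(channel, 1.0f / gamma);
}

// Corrects every pixel in place. Each pixel is read and written at the same
// location and no second buffer is involved, which is the pattern that tile
// memory makes safe on Apple Silicon. Alpha is left untouched.
inline void gammaCorrectInPlace(std::vector<Pixel>& tile, float gamma)
{
    assert(gamma > 0.0f);
    for (Pixel& p : tile) {
        p.r = applyGamma(p.r, gamma);
        p.g = applyGamma(p.g, gamma);
        p.b = applyGamma(p.b, gamma);
    }
}
```

Because each output pixel depends only on the same input pixel, nothing in the operation itself requires a round trip through main memory.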

Lighting

The Metal renderer emulates Direct3D style fixed function lighting, where lights are culled/enabled/disabled each time a mesh is rendered. This is expensive and consumes a lot of CPU, and bandwidth as the new lights are uploaded. It would be easier if the lights could all be loaded onto the GPU at the start of the pass, and then culled in tiles.

Plasma sometimes changes lights mid pass - which makes this less trivial to implement.
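To make "culled in tiles" concrete, here is a CPU-side sketch of the idea. The whole light list is provided once, and each screen tile ends up with the indices of the lights that can reach it. ScreenLight and cullLightsPerTile are hypothetical, the lights are assumed to be already projected to screen space with a circular area of influence, and a real implementation would run on the GPU and would still have to deal with lights changing mid pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical light, already projected to screen space (units: pixels).
struct ScreenLight {
    float x;
    float y;
    float radius;
};

// One list of light indices per tile, stored row by row.
using TileLightLists = std::vector<std::vector<std::size_t>>;

inline TileLightLists cullLightsPerTile(const std::vector<ScreenLight>& lights,
                                        int width, int height, int tileSize)
{
    assert(width > 0 && height > 0 && tileSize > 0);
    const int tilesX = (width + tileSize - 1) / tileSize;
    const int tilesY = (height + tileSize - 1) / tileSize;
    TileLightLists lists(static_cast<std::size_t>(tilesX) *
                         static_cast<std::size_t>(tilesY));

    for (int ty = 0; ty < tilesY; ++ty) {
        for (int tx = 0; tx < tilesX; ++tx) {
            const float minX = static_cast<float>(tx * tileSize);
            const float minY = static_cast<float>(ty * tileSize);
            const float maxX = minX + static_cast<float>(tileSize);
            const float maxY = minY + static_cast<float>(tileSize);
            std::vector<std::size_t>& tile =
                lists[static_cast<std::size_t>(ty) * tilesX + tx];

            for (std::size_t i = 0; i < lights.size(); ++i) {
                const ScreenLight& light = lights[i];
                // Closest point of the tile rectangle to the light centre.
                const float cx = std::max(minX, std::min(light.x, maxX));
                const float cy = std::max(minY, std::min(light.y, maxY));
                const float dx = light.x - cx;
                const float dy = light.y - cy;
                // Keep the light if its circle touches the tile.
                if (dx * dx + dy * dy <= light.radius * light.radius)
                    tile.push_back(i);
            }
        }
    }
    return lists;
}
```

With this shape, the per-mesh enable/disable work is replaced by a per-tile lookup, at the cost of having to rebuild or patch the lists whenever the light set changes.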

UBO Support/Argument Buffers

It's not uncommon in modern renderers to cache all the uniforms for a material pass on the GPU, and just rebind the GPU state instead of rebinding each shader argument individually. This provides a large CPU and bandwidth savings by only needing to point to memory on the GPU instead of transferring an entire set of shader arguments.

The Metal renderer originally started to implement this path (fast encoding) by pre-caching material states onto the GPU but ran into several issues:

Layer overlays change material state on the fly - invalidating the data cached on the GPU.
Piggyback layers change the material state - also invalidating the data cached on the GPU.
The large number of inputs to the shader was causing register overflows, which in turn caused registers to be swapped to memory. Swapping registers back to memory is extremely expensive on Apple Silicon which is bandwidth limited and where registers need to be swapped all the way back to system memory. This was solved by moving a lot of uniforms to precompiled inputs - and caching multiple versions of the shader.
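One possible way to handle the invalidation problem from the first two points is sketched below. This is an assumption about a workable approach, not what the fast encoding path actually did. Each material carries a revision counter that layer overlays and piggyback layers would bump, and the cached GPU copy is reused only while the revision matches. MaterialArgumentCache, MaterialId, and EncodeAction are hypothetical names.

```cpp
#include <cstdint>
#include <unordered_map>

// Hypothetical material identifier.
using MaterialId = std::uint32_t;

enum class EncodeAction {
    RebindCached, // GPU copy is current: just point the encoder at it
    Reupload      // material changed: rewrite the GPU copy, then bind it
};

// Remembers which revision of each material's uniforms is resident on the GPU.
class MaterialArgumentCache {
public:
    // `revision` is assumed to be incremented by anything that mutates the
    // material state, such as a layer overlay or a piggyback layer.
    EncodeAction prepare(MaterialId id, std::uint64_t revision)
    {
        auto it = fUploaded.find(id);
        if (it != fUploaded.end() && it->second == revision)
            return EncodeAction::RebindCached;
        fUploaded[id] = revision;
        return EncodeAction::Reupload;
    }

private:
    std::unordered_map<MaterialId, std::uint64_t> fUploaded;
};
```

Materials that are modified on nearly every draw would see little benefit from this, so it would probably need to be paired with a fallback to individual argument binding.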

Character animation through GPU compute

The GPU currently seems underused in Plasma, while the CPU is overused. There is currently a branch that implements character animation as a GPU compute stage instead of performing this work on the CPU.

This branch is not quite ready: because the work is done on the GPU, the buffers on the CPU remain unfilled, and this state should be better tracked. Some parts of Plasma take these empty or unskinned buffers and still push them onto GPU memory, unaware that the GPU has already done the skinning and filled its copy of the buffer.
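The missing state tracking could look something like the sketch below. It is a hypothetical illustration, not code from the branch. Each skinned buffer records where its most recent data lives, and the upload path asks before pushing the CPU copy to the GPU.

```cpp
#include <cassert>

// Tracks which side holds the current skinned vertex data for one buffer.
class SkinnedBufferState {
public:
    // The CPU skinning path filled the CPU copy; the GPU copy is now stale.
    void markSkinnedOnCPU() { fState = State::CPUAhead; }

    // The CPU copy was pushed to the GPU; both copies now match.
    void markUploaded()
    {
        assert(fState == State::CPUAhead);
        fState = State::InSync;
    }

    // A compute pass skinned directly into the GPU copy; the CPU copy is
    // empty or unskinned and must not be uploaded over it.
    void markSkinnedOnGPU() { fState = State::GPUOnly; }

    // True only when the CPU holds data the GPU does not have yet.
    bool shouldUpload() const { return fState == State::CPUAhead; }

    // True when CPU-side code may safely read the skinned vertices.
    bool cpuCopyValid() const
    {
        return fState == State::CPUAhead || fState == State::InSync;
    }

private:
    enum class State { Empty, CPUAhead, InSync, GPUOnly };
    State fState = State::Empty;
};
```

Code that currently uploads unconditionally would check `shouldUpload()` first, and anything that reads skinned vertices on the CPU would check `cpuCopyValid()`.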

Performance seems reasonable. On newer Macs - several dozen characters could be skinned on the GPU for free within the unused GPU time. However - we may want to load balance work between the CPU and GPU.

General Performance Issues

Threading

Plasma should take better advantage of threading. On older Macs - we're frequently getting crushed by low single core performance. Tasks like processing audio buffers in a software renderer do have a measurable impact. We should look at moving these submissions onto a thread. Even if these audio buffers originate in the main thread - not holding back the entire rendering process for audio buffer rendering would be preferable.
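A minimal sketch of moving audio buffer processing off the main thread, using only the C++ standard library. AudioSubmitQueue and AudioBuffer are hypothetical and do not correspond to existing Plasma classes. The sample counter stands in for the expensive work (mixing, or submission to the audio backend).

```cpp
#include <atomic>
#include <condition_variable>
#include <cstddef>
#include <deque>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

using AudioBuffer = std::vector<float>;

// Accepts audio buffers from the main thread and processes them on a worker,
// so that the render loop only pays for a queue push.
class AudioSubmitQueue {
public:
    AudioSubmitQueue() : fWorker([this] { run(); }) {}
    ~AudioSubmitQueue() { stop(); }

    AudioSubmitQueue(const AudioSubmitQueue&) = delete;
    AudioSubmitQueue& operator=(const AudioSubmitQueue&) = delete;

    // Called from the main thread. Buffers submitted after stop() are ignored.
    void submit(AudioBuffer buffer)
    {
        {
            std::lock_guard<std::mutex> lock(fMutex);
            fPending.push_back(std::move(buffer));
        }
        fCond.notify_one();
    }

    // Processes everything still queued, then joins the worker thread.
    void stop()
    {
        {
            std::lock_guard<std::mutex> lock(fMutex);
            fStopping = true;
        }
        fCond.notify_one();
        if (fWorker.joinable())
            fWorker.join();
    }

    std::size_t processedSamples() const { return fProcessedSamples.load(); }

private:
    void run()
    {
        for (;;) {
            AudioBuffer buffer;
            {
                std::unique_lock<std::mutex> lock(fMutex);
                fCond.wait(lock, [this] { return fStopping || !fPending.empty(); });
                if (fPending.empty())
                    return; // stopping and fully drained
                buffer = std::move(fPending.front());
                fPending.pop_front();
            }
            // Stand-in for the real work, done outside the lock.
            fProcessedSamples += buffer.size();
        }
    }

    std::mutex fMutex;
    std::condition_variable fCond;
    std::deque<AudioBuffer> fPending;
    bool fStopping = false;
    std::atomic<std::size_t> fProcessedSamples{0};
    std::thread fWorker; // declared last so it starts after the other members exist
};
```

The main thread still produces the buffers, as noted above, but it no longer waits for them to be processed before continuing with the frame.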
