Platform specific render pipeline implementations

james7132 commented 2 years ago

What problem does this solve or what need does it fill?

There are a number of platform related rendering limitations that has been a consistent lowest common denominator in our rendering pipelines. Features like storage buffer support and compute shaders may be addressed when WebGPU formally lands, but other options like push constants as a form of optimization are likely not to land until a V2 spec lands. Common engine features like morph targets, particle systems, etc. all will require optimizing for these limitations, which may hamper more intensive native use cases.

What solution would you like?

Split the bevy_pbr pipelines into two, one behind a #[cfg(target="wasm32")] and the other behind the opposite. We should attempt to reuse as much of the infrastructure made available right now between the two, but use case specific pipelines for meshes, particle systems, shadows, etc. should attempt to leverage as much platform support as possible instead of being limited

This can increase maintenance burden unless we figure out a good way to reconcile the these platform differences transparently.

What alternative(s) have you considered?

Relying on WgpuLimits and wgpu_types::Features to alter the pipeline depending on platform feature support. This would ulimately be the most flexible solution, though likely the hardest to debug and maintain going forward.

superdump commented 2 years ago

My gut reaction to this is that it’s a bit too soon to ‘split pbr’ according to some feature sets and limits where we don’t know where the boundaries will be exactly and until we have the features, we don’t need the profiles.

I also don’t like testing for platforms when the platform could be a moving/variable target or have differences for some reason such that dividing things by platform would impose artificial limitations.

For now I think our goals should be to try to build things in ways that support WebGL2/OpenGLES 3.0 and if we cannot, then that feature would become available where the exact dependency (e.g. storage buffers) become available.

The limitation for storage buffers for example is not wasm32. One situation where where storage buffers are not available is when targeting WebGL2. However, wgpu also supports using OpenGL as a backend and I think it is constrained to OpenGLES 3.0, which corresponds to the same limitations (I think, I’m not sure if it’s an exact match) as WebGL2 but is supported on native platforms.

If/when we at some later point have render features depending on multiple different wgpu features and limitations, and we want to provide some kind of limited set of targets, which I agree we likely will, I think it should probably be done by creating profiles that are basically wgpu Limits and Features as a set of requirements for the profile and test if the available features and limits from the adapter meet the profile and gate whether render features are enabled/disabled based on that profile. However, I think the actual render features will still need to test for more specific things to be able to take advantage of more or fewer bindings bring available or more or less binding/texture size or say if the adapter and platform support mappable primary buffers then we want to avoid unnecessary memory copies and duplication. Those won’t be parts of render features but will be things we would not want to artificially limit.

james7132 commented 2 years ago

Perhaps limiting by compilation target isn't the right solution here, but IMO our rendering solutions for certain platforms should not be limited by the least common denominator amongst them all. Size limits and others can be scaled to match the base limitations, but support for entire classes of low level features seems like it's a maintenance nightmare to test every combination of exposed features here. Particularly when these limitations may require rearchitecting larger parts of the renderer to match.

For example, if push constants are both available and provably more efficient for mesh rendering, we would need to add conditional runtime checks at each draw call against the supported feature set to enable its use, as well as disable the mesh uniform or instance buffer as a fallback when it isn't available. This has both a runtime performance and developer maintenance cost.

Another example here is morph targets, where the most common way I've seen it handled now is via repeated runs of compute shaders to generate a final displacement storage buffer which is read during the vertex shader stage. This requires compute shaders, storage buffers, and the feature to read write-capable storage buffers in the vertex shader. The alternative is to use texture arrays, which also has fragmented support and has heavy size constraints. Both solutions here have limitations on platform support and are not localized in how it affects the end to end pipeline: it'd require rearchitecting to provide some level of support on each platform.

It would be easier to define a base feature set for each targeted platform, factor out the commonalities, then architect solutions for supporting renderer features on each of them instead of relying on runtime specified feature set.

superdump commented 2 years ago

Ok, then I propose the following coarse feature sets:

WebGL2/OpenGLES 3.0 (WgpuSettingsPriority::WebGL2)
WebGPU (includes compute shaders and storage buffers - you can see what is available using WgpuSettingsPriority::Compatibility)
'Full' bindless - my understanding is that the level of bindless supported in WebGPU is limited and therefore to make use of it for a full GPU-driven renderer, native extensions through wgpu are needed

I would appreciate input from @cwfitzgerald and @aclysma on this, though I think they were advocates for the bottom two and were less interested in WebGL2 support for their own renderers (rend3, and rafx respectively).

Aside from that, @james7132, I think the feature to be able to read write-capable storage buffers could be an optimisation rather than a requirement. You could have 2 storage buffers with buffer usages of writeable and copy_src for the compute shader dispatch, then copy it into another buffer that has copy_dst and is readable for the vertex shader. As it is a VRAM to VRAM copy it should be super fast.

bevyengine / bevy