Scthe / nanite-webgpu

UE5's Nanite implementation using WebGPU. Includes the meshlet LOD hierarchy, a software rasterizer, and billboard impostors. Culling is done on both a per-instance and a per-meshlet basis.
https://scthe.github.io/nanite-webgpu/?scene_file=jinxCombined&impostors_threshold=4000&softwarerasterizer_threshold=1360&nanite_errorthreshold=0.1
MIT License

Questions #1

Closed: alienself closed this issue 3 weeks ago

alienself commented 3 months ago

Hi there!

First of all, incredible project! Hats off!

I had a few questions for you:

Scthe commented 2 months ago

Why does the software rasterizer output untextured meshes? In the Jinx demo, there are white-ish bodies/heads visible in the background, which disappear when the software rasterizer is turned off.

With a hardware rasterizer, the depth test does the following (pseudocode):

if (fragmentPosition.z < depthTexture[fragmentPosition.xy]) {
  // All 3 writes happen only if the depth test above passes.
  depthTexture[fragmentPosition.xy] = fragmentPosition.z;
  gBufferTexture0[fragmentPosition.xy] = color;
  gBufferTexture1[fragmentPosition.xy] = normalVector;
}

The write to each of the textures depends on the comparison. If many threads execute this concurrently, you get a race condition. Hardware can implement this easily; think something like Java's synchronized blocks.
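To make that analogy concrete, below is a minimal CPU-side sketch (all names are hypothetical, none of this is from the repo) of the guarantee the hardware gives you: the depth test and every write that depends on it execute as a single critical section per pixel.

// Hypothetical CPU-side sketch of what the hardware guarantees per pixel:
// the depth test and ALL the dependent writes form one critical section.
#include <cstdint>
#include <mutex>
#include <vector>

constexpr int W = 640, H = 480;

std::vector<std::mutex> pixelLocks(W * H);          // one lock per pixel
std::vector<float>      depthTexture(W * H, 1.0f);  // far plane = 1.0
std::vector<uint32_t>   gBufferTexture0(W * H);     // color
std::vector<uint32_t>   gBufferTexture1(W * H);     // packed normal

void writeFragment(int x, int y, float depth, uint32_t color, uint32_t normal) {
  const int idx = y * W + x;
  std::lock_guard<std::mutex> guard(pixelLocks[idx]);  // the "synchronized" block
  if (depth < depthTexture[idx]) {
    depthTexture[idx]    = depth;
    gBufferTexture0[idx] = color;
    gBufferTexture1[idx] = normal;
  }
}

int main() {
  // Any number of threads could call this concurrently for the same pixel.
  writeFragment(200, 120, 0.25f, /*color=*/0xff0000ffu, /*normal=*/0x00ff7fffu);
}

A GPU does not literally take a mutex per pixel, of course; the point is only that the comparison and the three writes are indivisible.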

Software rasterizers cannot do this. You only get atomic operations, which are not enough. With millions of triangles, each affecting multiple pixels on every frame (so at 60/144 Hz), it's not a question of if the race condition happens, but when.

The solution is a visibility buffer. For each pixel, the rasterizer outputs the sceneUniqueTriangleId (a combination of instanceId + meshletId + triangleId, 32 bits total) of the closest triangle. Combine it with a 32-bit depth into a single 64-bit value: (depth << 32) | sceneUniqueTriangleId. Notice that a comparison between two such values is always decided by the depth. We can safely use 64-bit atomic operations without worrying about race conditions. In a separate pass, we retrieve the sceneUniqueTriangleId, rasterize the triangle again, compute barycentric coordinates, and shade the fragment. Surprisingly, it's not that expensive.
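Here is a minimal sketch of that packing trick, again with hypothetical names; host-side C++ with std::atomic stands in for the GPU's 64-bit atomics.

// Hypothetical host-side sketch of the visibility-buffer trick. A real GPU
// version would use a 64-bit atomic min per pixel; std::atomic stands in here.
#include <atomic>
#include <cstdint>
#include <cstdio>
#include <cstring>
#include <vector>

// Pack depth into the high 32 bits, triangle id into the low 32 bits.
// For depths in [0, 1] the IEEE-754 bit pattern is monotonically increasing,
// so comparing the packed u64s compares depth first, id only on exact ties.
uint64_t packDepthAndId(float depth, uint32_t sceneUniqueTriangleId) {
  uint32_t depthBits;
  std::memcpy(&depthBits, &depth, sizeof(depthBits));
  return (uint64_t(depthBits) << 32) | uint64_t(sceneUniqueTriangleId);
}

int main() {
  constexpr int W = 640, H = 480;
  // One 64-bit value per pixel; UINT64_MAX means "nothing rasterized yet".
  std::vector<std::atomic<uint64_t>> visibilityBuffer(W * H);
  for (auto& px : visibilityBuffer) px.store(UINT64_MAX);

  // A rasterizer thread covering pixel (200, 120) with triangle 42 at depth 0.25.
  // The CAS loop below is an atomic min: the smaller depth always wins,
  // regardless of the order in which threads arrive. No race condition.
  const int idx = 120 * W + 200;
  const uint64_t packed = packDepthAndId(0.25f, 42);
  uint64_t prev = visibilityBuffer[idx].load();
  while (packed < prev &&
         !visibilityBuffer[idx].compare_exchange_weak(prev, packed)) {
  }

  // The later shading pass unpacks the id and re-rasterizes that triangle.
  const uint64_t winner = visibilityBuffer[idx].load();
  std::printf("closest triangle id: %u\n", uint32_t(winner & 0xffffffffu));
}

Putting the depth in the high bits is what turns a plain integer min into a depth test; the triangle id rides along for free in the low bits.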

Unfortunately, WebGPU lacks 64-bit atomics, even if the hardware and the driver support them, so we cannot do what I've outlined above. There are other algorithms that achieve this, but they are much slower. And people will want to use my app to reimplement Nanite in other APIs (which do have this feature). There is no point in bogging down my implementation for the sake of an API that barely anyone uses.

With this limitation, my only goal for this app is to show that the software rasterization works. If you see a software-rasterized model in the background, it will be white and it will have reasonable shading. Reprojecting depth and "compressing" normals is enough to get something... not offensive.


What kind of performance can be expected in a more realistic use case? Consider a scenario with a variety of static meshes and less aggressive reuse of a single mesh, such as a terrain featuring multiple rocks (e.g. 10-15 different types, each instanced 20-30 times).

Well, you haven't specified how many triangles per rock, but I'm not going to be able to provide a direct answer either way. It just depends on too many factors.

Keep in mind this repo is intended as a research project, not an app for everyday use. A lot of people will want to know how one can implement Nanite, and the code solves most of the "how to implement it in the API" problems. Simplification and error metrics are more of a research problem. Nanite will never be a simple drop-in for an existing codebase; it's a total rewrite of the rendering pipeline.


Any plan for a lumen-webgpu?

No, it does not seem that interesting to implement.


PS. Sorry for the late reply. A bit unexpectedly, there were some other Nanite-related discussions on GitHub, and I wanted to make sure I had time to pay full attention to both.