Meshlet simplification might be a bottleneck in bevy's Nanite solution

Scthe commented 1 month ago

I don't think this classifies as a feature request, but there is no better issue category for this report. I'm not a Bevy user, but I hope you will find the content valuable. I also know that Bevy's Nanite tech is still WIP.

Hi, I've recently finished my implementation of Nanite - WebGPU. It includes meshlet hierarchy, software rasterizer, impostors, etc. It's a webpage and has sliders and checkboxes for every setting imaginable. Then I noticed that the Bevy team also started working on this feature. We use the same algorithm to simplify meshlets (based on meshoptimizer). Unfortunately, it will not be enough. Examples (triangles count summed over the root/coarse level meshlets):

Jinx
- Based on: Arcane - Jinx 3D model by Sketchfab user Craft Tama.
- Original mesh: 44.4k triangles.
- Nanite WebGPU simplification: 3k triangles (59 root meshlets).
- Bevy simplification: 5k triangles (120 root meshlets).
Modular Mecha Doll Neon Mask by Sketchfab user chambersu1996.
- Preprocess: I've merged duplicate vertices and recalculated normals in blender.
- Original mesh: 900k triangles.
- Nanite WebGPU simplification: 160k triangles (3k root meshlets).
- Bevy simplification: 246k triangles (5k root meshlets).

3k triangles for Jinx's most coarse meshlets level is not amazing, but workable. OTOH 170k triangles are unusable for the robot mesh. It's unusable with Nanite-like technology. With enough man-hours, you could fix the algorithm (in my readme.md I've thrown some ideas). But you don't have to.

Iterating on Nanite

The problem with the simplification is that it happens on the 4-merged-meshlets level. Resigning from the meshlet hierarchy allows you to implement a large part of culling solutions first. And software rasterizer is a big win for both dense meshes and objects that are far (but have a bit more triangles than expected). Meshlets could then be based on the old-school discrete LOD-levels. The flow would be similar to the following presentations:

"Optimizing the Graphics Pipeline with Compute" by Graham Wihlidal.
"GPU-Driven Rendering Pipelines" by Ulrich Haar and Sebastian Aaltonen.

Another plus: without Nanite you don't have to research the error function that selects the correct meshlet level. From my own tests, e.g. normals are a pain.

Ofc. nothing stops you from going all-in on Nanite from the start. However the optimal implementation will require a lot of experiments and heuristics.

Pinging @JMS55 as she is the main architect for this feature from Bevy's side.

Disclaimer

I don't know much about Bevy, nor the current plans for it. I suppose you will want to do your own research. Yet I suppose knowing about the potential problems upfront will be beneficial. I don't assume you will drop your plans for Nanite, but having more data points might affect prioritization.

JMS55 commented 1 month ago

Hey, thanks for opening this! I haven't seen your project before - awesome to see more work in this space, especially with so many toggles for trying different things out. This is valuable data to have.

I've only tested on the Stanford Bunny so far, I haven't tested anything complex yet. The bunny is an easy simplification case as it has simple materials and pretty uniform geometry. Thanks for the links to more models, I'll have to test my code on them.

I'll also definitely be checking out your code, I'm interested in seeing how you did occlusion culling and software raster. Occlusion culling has been bugged in my code (https://github.com/bevyengine/bevy/issues/14062), which I need to fix :P

To answer stuff you've brought up and in your project's readme:

I plan to stick with hierarchical meshlet LODs. The initial plan at the start of the project was to use discrete LODs broken up into meshlets, but eventually I switched as you really want hierarchical if the goal is "unlimited geometry". If part of a huge mountain is far away, and part is close, you very much want to be able to render those with different level of detail, and be able to stream out the unneeded sections.

From what I've seen, meshoptimizer does not actually perform that well, especially since it doesn't account for vertex attributes at the moment. I saw another Nanite-like project that switched from meshoptimizer to a custom implementation and got much better results. Iirc @zeux is also researching this area. In general, I haven't really worried about the offline preprocessing step of the project all that much, just an initial implementation that's "good enough" to test the runtime section, which is my main focus. I'm highly suspect that switching from e.g. 64 -> 128 triangles per meshlet would improve raster performance a lot, but I haven't experimented yet as I already struggle to get >32 triangles/meshlet as-is. Definitely an area that Bevy needs to work on a lot more. As the Nanite slides say, this is the hardest part of Nanite to get right :)

In case you haven't see it btw, the current PR for Bevy is software raster https://github.com/bevyengine/bevy/pull/14623 (haven't seen large perf improvements over my HW method, potentially because of low triangle fill rate per meshlet). After that, I have a roadmap up here: https://github.com/bevyengine/bevy/issues/11518, of which a large part is parity with Nanite.

zeux commented 1 month ago

From what I've seen, meshoptimizer does not actually perform that well, especially since it doesn't account for vertex attributes at the moment. I saw another Nanite-like project that switched from meshoptimizer to a custom implementation and got much better results.

FWIW this would be more precisely formulated as "Bevy currently uses the version of the algorithm that doesn't take the value of vertex attributes into account". I'm also curious about "much better results", unsure if you have references - while incremental work is planned to improve various aspects of simplification and clustering, I'm not aware of an existing alternative that is obviously superior. There's some caveats around being able to generate new vertices during simplification which is a difficult tradeoff.

zeux commented 1 month ago

Wrt mecha doll model, I'm not sure what exactly the limiting factor there is - it simplifies down to ~25k triangles in gltfpack (this is not using clustered hierarchical simplification to be clear!). In general it feels like the hierarchical simplification should always, in principle, be able to reach the same number of triangles as the full-mesh simplification - while there's restrictions around meshlet connectivity, and currently clustering definitely can sometimes be suboptimal, you can always increase the number of meshlets you merge per step if you see the simplifier stuck.

For full mesh the debug visualization shows a particular region of the model that is all locked from the simplifier perspective; this is probably unwelded vertices for whatever reason which restricts the simplifier's freedom and takes the budget away from the rest of the process. In general simplification needs to be combined with some amount of mesh pre-processing that can get more aggressive with smaller LOD levels, and maybe that's what would be necessary here - don't have time to analyze this model in detail atm.

edit: Ah and a couple more regions like these; these are all using faceted shading so normals create attribute discontinuities that restrict simplification. Sufficiently aggressive welding fixes this; I also plan to look into providing some option to do this as a byproduct of simplification. This is a notable area where "custom" simplifiers can get this wrong if they simplify across discontinuities in an uncontrollable fashion, which improves the behavior for models like this but can break other cases where that creates issues with UVs or skinning.

Scthe commented 1 month ago

Response to JMS55's comment:

I'm interested in seeing how you did occlusion culling and software raster. Occlusion culling has been bugged in my code (#14062), which I need to fix :P

Oh, no! There are some bugs in my code too! I also don't know WGSL that much. I was only interested in getting something "passable". I don't even use two-pass occlusion culling! Without even the most barebone occlusion culling, the whole app.. struggles.

I haven't really worried about the offline preprocessing step of the project all that much, just an initial implementation that's "good enough" to test the runtime section, which is my main focus.

I had the same approach. My last real commit is named "Holistic system overview" where I checked how the whole project works a whole. Then I saw that the meshlet hierarchy simplification does not work that well.

(...) the current PR for Bevy is software raster (haven't seen large perf improvements over my HW method, potentially because of low triangle fill rate per meshlet).

Maybe my hardware raster path is just crap :P. But you definitely want a scene that plays to software rasterizer strengths. Tons of tiny triangles per pixel (!), tons of overdraw. Have dense mesh up close. Or have a 3k triangle model rendered into a 10x10 px square. The first link in my readme is 640m tris, the second is 1.7b. IIRC the opening of UE5's reveal had 1b. Ofc. in my app I can always cheat and use impostor billboards. But I also have a slider to check how the scene performs both hardware and software rasterized. The change is noticeable.

I've only tested on the Stanford Bunny so far, I haven't tested anything complex yet.

Well, at least now you have more info about possible roadblocks! I consider this value-added!

JMS55 commented 1 month ago

@zeux sorry, I wrote my response in a rush, didn't mean to diss your work on meshoptimizer. I'm aware of the experimental vertex attribute-aware simplifier function, but haven't tested it yet.

No hard data about the project that did better than meshoptimizer, but you can find the post/author in the DirectX discord: https://discord.com/channels/590611987420020747/1259493087395184680/1259493252151775233

I am really curious why I'm getting poor meshlet fill-rate (at least on the stanford bunny/dragon) though (independent of simplifcation); I originally decided to go down to 64 from 128 triangles because I could barely get 32. Something to investigate for the future though, I'm still focused on runtime atm and don't have a ton of free time unfortunately.

JMS55 commented 1 month ago

In general simplification needs to be combined with some amount of mesh pre-processing that can get more aggressive with smaller LOD levels.

Yeah currently I allow iirc 1-20% deformation per meshlet as the LOD level increases, but it's very much just a guess, I haven't spent time tweaking it. Could be we need to be much more aggressive.

I've heard from other authors that vertex welding helps a lot for faceted meshes (e.g. the GLTF flight helmet), but I haven't tested that myself.

JMS55 commented 1 month ago

Well, at least now you have more info about possible roadblocks! I consider this value-added!

Yep for sure, thanks for opening this!

My test scene is currently a large grid of bunnies (3092), more details in https://jms55.github.io/posts/2024-06-09-virtual-geometry-bevy-0-14/#frame-breakdown. I definitely need a more realistic scene to test on though, probably using the free quixel megascans.

Scthe commented 1 month ago

Response to zeux's comment:

Wrt mecha doll model, I'm not sure what exactly the limiting factor there is - it simplifies down to ~25k triangles in gltfpack (this is not using clustered hierarchical simplification to be clear!).

Unfortunately, it's the meshlet hierarchy that is responsible for the problem. I did 0 investigation on where the actual problem is. I was hoping that someone else had a similar issue. I assume that the proper solution would be to import the meshlets into Blender and see what is actually happening. But since nanite-webgpu was just a throwaway side project for me..

So the issue is that Nanite meshlet hierarchy is a specialised use case. I admit, I don't know much about the simplification algorithms.

PS. The model itself is just a random mesh that I've grabbed from Sketchfab. I assume any other high poly models would have the same problem.

PS2. If you want, you can close this GitHub issue. Not many actionable things to do right now. (I've realized I should also respond to zeux's comment, so I've removed this postscriptum from my previous response.)

zeux commented 1 month ago

Unfortunately, it's the meshlet hierarchy that is responsible for the problem. I did 0 investigation on where the actual problem is.

Yeah I understand that - and if you have a restricted merge (eg only merge up to 4) then it's possible that issues in clusterization are compounded through a long simplification process. This is on my list to investigate wrt meshopt's clustering, I'm just making an observation that it doesn't necessarily need to impact simplification if you remove restrictions on merges. An ideal outer flow of clustering+simplification should be able to have correct reduction down to a minimum number of meshlets regardless of the defects in clustering. For example, if recursive split/merge stops when the cluster can't be simplified substantially, it may be worthwhile to merge on a higher level (aka a larger group) and retry as that reduces the number of locked edges. There's still going to be cases where mesh topology prevents simplification of course - I believe UE's Nanite simplification suffers from similar issues as I don't think they allow completely unrestricted merges - so a completely arbitrary model from Sketchfab may still experience issues easily in collapse based simplification workflows.

zeux commented 1 month ago

No hard data about the project that did better than meshoptimizer, but you can find the post/author in the DirectX discord: https://discord.com/channels/590611987420020747/1259493087395184680/1259493252151775233

Thanks. Hard to evaluate without visual differences, and this topic is deep and complicated; just noting that attribute simplification is working fairly well now but also some more improvements will be coming later this year to the simplifier in this and other areas.

And yeah the "optimizing position placement" is the tradeoff I was referring to: UE5's Nanite does do this, and I have plans to implement this as an option, but crucially it means that every level in the meshlet hierarchy when you simplify ends up generating a lot of new vertices. This, in aggregate, is a significant increase in vertex memory; Nanite solves this with LOD streaming and highly efficient compression which makes the total system work acceptably but is not a guaranteed win overall, as you can rebalance LOD levels in certain cases to accomodate that. There is also a ton of complications wrt attributes with this. Overall I think it will eventually be a useful option to add, and some visual issues in shading are difficult to fix without introducing new vertex data, but I don't think it's the case that every use of simplification would benefit from it.

JMS55 commented 4 weeks ago

@zeux I've spent some more time on this project and come to the conclusion that the simplify-cluster step is a bottleneck, and producing low-quality meshlet DAGs. Besides trying out the new attribute-based simplification and tuning my cluster grouping heuristics for METIS, I'd like to know more about improving meshoptimizer for "optimizing position placement".

Would it be ok to open an issue on the meshoptimizer github so we can discuss this further? Maybe you can give me some pointers on how difficult this would be for me to implement, if you're open to adding it to the library.

zeux commented 4 weeks ago

Isn't the DAG quality a property of the clusterizer? I don't think attribute simplification will help if you are running into the graphs being suboptimal, and the same goes for position placement. But I'm not sure where the problem lies exactly; I'd recommend creating a discussion thread.

Scthe commented 4 weeks ago

I've already moved to other project, but I still had some thoughts about this problem (readme updated). I think there might be another approach to simplification. ATM Meshoptimizer allows you to specify approx. how many triangles do you want in the result.

Let's introduce a guarantee/constraint: "You will always get the EXACT number of triangles you want". If you have "fully filled" meshlets (128 tris each), merge them, and meshlet split/simplify them into 128 tris, you will get an exact 128 tris ALWAYS. Due to the recursive/transitive process of tree building this guarantees "fully filled" meshlets on all DAG levels. You also get an efficient log2 number of levels. The only problem is to have the original/bottom level meshlets be also "fully filled".

What's funny, since you know all meshlets are "fully filled" you no longer think in terms of the triangles. There is DAG, there are nodes and it's all you have to know.

I'm not sure how one would implement this. I'd guess some kind of greedy algorithm (I think also mentioned by Brian Karis, not sure in this context). But inevitably also adding new vertices (I think?). As I've said, I don't know much about simplification algorithms.

This analysis matches following postscript from Brian Karis' presentation:

The number of triangles being worked with is always in multiples of some fixed granularity meaning we can exactly fill our 128 triangle clusters. We do not form groups with individual triangles. The merge and split formulation in symmetrical and understands clearly that no correspondence or constraints in terms of triangle to cluster assignment needs to be made between parents and children. All siblings are connected to all their parents. The structure ultimately being built is clearly a DAG which enables reasoning about it correctly.

JMS55 commented 4 weeks ago

Thank you, that's an interesting point. Something like the greedy algorithm from https://jcgt.org/published/0012/02/01/paper-lowres.pdf night perform better if we want to ensure we always get 128 triangles.

zeux commented 4 weeks ago

Greedy algorithm is meshopt_buildMeshletsScan (don't remember OTOH if this is exposed via Rust), but it has the same constraint as meshopt_buildMeshlets: if you restrict the number of unique vertices used by a meshlet, you will not get to the target number of triangles in some cases because your meshlets will be vertex bound. Of course if this constraint is lifted, then you will get "perfect" splits in terms of counts - for both algorithms. But counts alone may not be enough.

Similarly, because simplification is constrained by the cluster boundary, you have no guarantee of reaching that number. I suspect there's always going to be cases where you simplify down to eg 80 triangles while targeting 64 due to a boundary restriction (that you have to have to avoid cracks) and end up with a group that has to be split into under-filled meshlets; I would be very surprised if Nanite has any guarantees about the cluster fill in the general case. I think their split routine doesn't try to necessarily reach the maximum triangle count either but I'm not 100% sure on this.

Scthe commented 3 weeks ago

Yes, I agree - locking meshlet borders means it's not possible to have meshlet's tris count always be a multiple of 128.

I assume "Simplify-cluster step is a bottleneck, and producing low-quality meshlet DAGs" is a hypothesis? Would be quite hard to measure the impact, esp. without better DAG ("Batched Multi Triangulation" section 3.2) to compare. But e.g. it does not count things like meshlet fill rate or thin diagonal triangles, etc.

I took a quick look at Bevy's simplification code and a few things stand out. If you want to experiment, here are a few ideas.

What is a border?

You give some triangles to meshoptimizer to simplify with LOCK_BORDER. Mesh optimizer has its own definition of a border. But is it the same as Nanite's? The vertices we want to lock are the borders between meshlets. This is not the same as meshlet borders. See the image below for 3 meshlets (red, green, and blue). If we do not have to persist the "contour" we can simplify each meshlet into 3 tris (red outline).

meshlet-1-2-3

A fair simplification

For a quick check you can try following fair simplification algorithm:

Meshlet := mergeMeshletsSelectedByMetis()
BorderVertices := meshlet vertices shared with other meshlets. We cannot touch them.
1. You already have find_connected_meshlets() that creates a list of border edges. When you have merged 4 meshlets, all you have to do is check which border edges are internal and which are external after merge.
InsideVertices := meshlet's vertices that are not a BorderVertex.
Repeat:
1. Pick random InsideVertex.
2. EdgeCollapse it into a random neighbor. The edge does not have restrictions if the other vertex is BorderVertex or InsideVertex.
3. Repetition continues till:
  1. Reached assigned triangle count, or
  2. There are no InsideVertices left.

Edge collapsing a random vertex will introduce artefacts. This is something you can improve (as well as using heuristic, handling concave surfaces, etc.). See below for a discussion of other simplification constraints. But might be good enough to at least test your hypothesis - "Simplify-cluster step is a bottleneck, and produces low-quality meshlet DAGs".

Can you combine this with meshoptimizer? E.g. optimize with meshoptimizer first, and (if triangle count is still too high) edge collapse at random? The above algorithm might not improve much, only change the definition of the border vertex. However, it's implemented inside your Rust code, which allows for fast iteration of other ideas.

TL;DR: Every vertex that is not a border vertex (using shared meshlet vertices definition) can potentially be removed.

Error metric DOF

Error metric is another degree of freedom. In your current implementation it is restricted by meshoptimizer's target_error (it's not the only limitation, but indulge me). Let's imagine you collapse the entire mesh into a tetrahedron. Would it be noticeable if the mesh is rendered into a single pixel? What happens if you increase the target_error value? Does Nanite's simplification algorithm even need target_error? What if you drop other constraints too? The simplification algorithm I've outlined above does exactly this.

Basically, you can try dropping all constraints from simplification. Instead, "bake" them into the value used for selecting meshlets at runtime (simplificationError). The only remaining constraint is Nanite's - for immutable border edges between meshlets. This should still guarantee a valid DAG? It would probably be better if we start the simplification following the constraints, and only drop them if needed.

Simplify to a single meshlet

You might be surprised, but Bevy's implementation already simplifies any mesh into a single meshlet. At least with a little bit of wordplay. You might think the problem is that your DAG has "many roots". If you cannot simplify more, you stop processing the meshlets. You add nothing to simplification_queue to be processed in the next iterations. Yet, you can take such "fake" roots and assign them to a parent meshlet that has simplification error INFINITY. Continue recursively till a single meshlet is left. Such parents are never rendered, thus we never bother to assign triangles to them.

Now, if we only were able to analytically assign some other value than INFINITY. No matter how high the value is (which indicates stupid things done during simplification), we leave for an error metric to decide if the meshlet is actually useful.

Jokes aside, I had described in my readme how this works in practice. Sometimes, even a mesh with terrible simplification can end up as a single meshlet. At a price of many "useless" intermediate LOD levels.

Other stuff

Using greedy algorithm (meshopt_buildMeshletsScan()) should increase the count of "filled" meshlets. There is still a problem of leftover triangles above 2 * 128 that were not simplified cleanly. But you already have this problem right now. Does this change anything?
Here you are halving the triangle count. The simplification tris target depends on the input. But if your goal is to have "full" meshlets, you could simplify to 256 (2 full meshlets) instead.
- The current group_indices.len() / 2 is also correct. This probably is a tradeoff of tree size vs ideal triangles per meshlet. I'd say there is no reason to change it. I don't remember if there was a reason for one way or the other.
How would your code handle a mesh that has flat shading? Every vertex is unique wrt. position and normal combination.
If your merged meshlet contains a hole inside (think the bottom of Stanford bunny), can you collapse it? The size of the collapsed hole is roughly proportional to the meshlet size, which plays nicely with the error metric.

bevyengine / bevy