BVE-Reborn / rend3

MAINTENANCE MODE ---- Easy-to-use, customizable, efficient 3D renderer library built on wgpu.
https://rend3.rs
Apache License 2.0
1.07k stars 59 forks

Refactor Vertex Formats to use Less Data #346

Closed John-Nagle closed 6 months ago

John-Nagle commented 2 years ago

I'm using NVidia X server settings to see how much GPU memory is in use on an 8 GB RTX 3070.

Idle: 427 MB in use. Loaded "Babbage Palisade" scene: 3684 MB in use.

Scene stats:

Loaded: 14231 meshes, 19765372 vertices, 58364409 triangles, 14623 textures, 357197184 texture bytes.
Reused: meshes: 39054,  textures: 32514

"texture bytes" is the total number of bytes fed into create_texture_2d. Each mesh loaded or reused has its own Object and Material.

So that's roughly 20 million vertices, 58 million triangles, and 357 million texture bytes. Each vertex should be 12 bytes (3 of f32) and each triangle 12 bytes (3 of u32). So: 936 MB of mesh plus 357 MB of texture is 1.293 GB; adding the initial usage of 427 MB gives 1.720 GB total.

Actual usage is about twice that.
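For reference, that estimate can be reproduced mechanically from the quoted scene stats; a minimal sketch (figures copied from the stats block above, not measured here):

```rust
// Back-of-envelope GPU memory estimate from the scene stats quoted above:
// 12 bytes per vertex (f32x3 position), 12 bytes per triangle (u32x3),
// plus raw texture bytes and the 427 MB idle baseline.
const VERTICES: u64 = 19_765_372;
const TRIANGLES: u64 = 58_364_409;
const TEXTURE_BYTES: u64 = 357_197_184;
const IDLE_MB: u64 = 427;

fn expected_total_mb() -> u64 {
    let mesh_bytes = VERTICES * 12 + TRIANGLES * 12; // positions + index triples
    (mesh_bytes + TEXTURE_BYTES) / 1_000_000 + IDLE_MB
}

fn main() {
    // Roughly 1.72 GB expected, versus the ~3.7 GB observed.
    println!("expected GPU usage: ~{} MB", expected_total_mb());
}
```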

Memory usage increases as textures are swapped in and out, and I've had the 8GB GPU 79% full. But that situation is more complicated. This situation is simple - no texture or mesh deletions have occurred.

This is the version of Rend3 from the repository as of two days ago. Memory consumption seems to be much higher than in Rend3 0.2.2. But that needs to be reverified with an old version of my program.

Also, the frame rate now maxes out around 47 FPS, where before it was at 59 FPS with the scene initially loaded. The only compute in progress is the refresh loop; everything else is quiet. CPU utilization is 100% of one CPU.

cwfitzgerald commented 2 years ago

The one thing that changed is that there are now more vertex attributes attached to each mesh, so each mesh is bigger.

That's 80 bytes/vertex. Add to that that vertex buffer lengths are rounded up to the next power of two, and you get a capacity of 2^25 ≈ 33M vertices. 33M verts * 80 bytes a vertex = ~2.6 GB of vertex data. Add the 168M indices (rounded up to 268M), which should be about 1 GB of indices. That's 3.6 GB right there.
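To make the rounding concrete, here is a sketch of that arithmetic. Rust's `next_power_of_two` does the rounding; the index count here is derived as triangles × 3, which lands in the same 2^28 bucket as the 168M figure:

```rust
// Power-of-two rounding of buffer capacities, as described above.
fn round_up_pow2(n: u64) -> u64 {
    n.next_power_of_two()
}

fn main() {
    let verts = round_up_pow2(19_765_372); // 2^25 = 33_554_432
    let indices = round_up_pow2(58_364_409 * 3); // 2^28 = 268_435_456
    println!("vertex data: ~{:.1} GB", (verts * 80) as f64 / 1e9);
    println!("index data:  ~{:.1} GB", (indices * 4) as f64 / 1e9);
}
```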

That being said, this shouldn't substantially affect performance. The vertices are intentionally stored in separate buffers, so there should be little to no runtime performance loss from these excess attributes (just memory usage). I will eventually allow these other attributes to be omitted entirely (and custom ones to be added), but that requires more internal reworks.

I'd need more profiling information to diagnose the performance regression.

cwfitzgerald commented 2 years ago

I can think of two ways the memory usage can be improved. First, instead of rounding up to the nearest power of two, I can round up to a multiple of a fixed power of two (maybe 2M/4M verts, which is a bit over 150/300 MB maximum waste).

I'll also look into using smaller datatypes if possible without loss of needed precision.
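A sketch of the difference between the two rounding strategies, assuming a 4M-vertex block size (the block size is my assumption, taken from the 2M/4M figure above):

```rust
// Compare next-power-of-two rounding against rounding up to a multiple
// of a fixed block. A block of 4M vertices is assumed, per the comment.
const BLOCK: u64 = 4 * 1024 * 1024;

fn round_pow2(n: u64) -> u64 {
    n.next_power_of_two()
}

fn round_block(n: u64) -> u64 {
    (n + BLOCK - 1) / BLOCK * BLOCK
}

fn main() {
    let n = 19_765_372; // vertex count from the scene above
    println!("pow2 capacity:  {}", round_pow2(n)); // 33_554_432
    println!("block capacity: {}", round_block(n)); // 20_971_520
}
```

With this scene, fixed-block rounding wastes about 1.2M slots instead of roughly 13.8M.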

John-Nagle commented 2 years ago

OK, thanks. So all vertices are set up for rigged mesh, whether or not it is needed. 24 more bytes per vertex explains the increase. That's a high cost.

Let me look more at how the memory consumption grows with texture loading and releasing. I may have a memory leak.

Is there any way yet to monitor GPU memory consumption from inside a program? That just became a lot more important.
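One stopgap for monitoring, on an NVIDIA/Linux setup like the one described above, is to poll the driver externally. A hypothetical sketch (this is not a rend3 or wgpu API; the query flags are standard nvidia-smi options, and the number reported is whole-GPU usage, not this process's allocations):

```rust
use std::process::Command;

// Poll total GPU memory in use (MiB) by shelling out to nvidia-smi.
// NVIDIA driver only; returns None if nvidia-smi isn't available.
fn gpu_memory_used_mib() -> Option<u64> {
    let out = Command::new("nvidia-smi")
        .args(["--query-gpu=memory.used", "--format=csv,noheader,nounits"])
        .output()
        .ok()?;
    parse_mib(std::str::from_utf8(&out.stdout).ok()?)
}

// Parse the first GPU's line, e.g. "3684\n" -> Some(3684).
fn parse_mib(s: &str) -> Option<u64> {
    s.lines().next()?.trim().parse().ok()
}

fn main() {
    match gpu_memory_used_mib() {
        Some(mib) => println!("GPU memory used: {} MiB", mib),
        None => println!("nvidia-smi not available"),
    }
}
```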

Any instructions on how to profile?

John-Nagle commented 2 years ago

Suggestion:

Position (f32x3 = 12 bytes)
Normals (f32x3 = 12 bytes) - consider i16x3. Errors will be subpixel.
Tangents (f32x3 = 12 bytes) - consider i16x3.
Uv0 (f32x2 = 8 bytes)
Uv1 (f32x2 = 8 bytes) - optional, unless using the second set of UVs.
Color (u8x4 = 4 bytes)
Joint Index (u16x4 = 8 bytes) - optional, unless rigged mesh
Joint Weight (f32x4 = 16 bytes) - optional, unless rigged mesh

That gets it down to 36 bytes.

Typical Steam user has a 6 GB GPU.
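Tallying the proposed layout, counting Uv1 and the joint attributes as optional per the notes above, confirms the 36-byte figure; a quick sketch:

```rust
// Byte budget for the proposed format, with Uv1 and joint data optional.
fn base_size() -> u32 {
    let position = 12; // f32x3
    let normal = 6;    // i16x3, as suggested above
    let tangent = 6;   // i16x3
    let uv0 = 8;       // f32x2
    let color = 4;     // u8x4
    position + normal + tangent + uv0 + color
}

fn main() {
    println!("required bytes/vertex: {}", base_size()); // 36
}
```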

cwfitzgerald commented 2 years ago

Optional components I can't do immediately because of complexity issues, but I had some discussions about what other engines do and I think I can get it down even further.

Position (f32x3 = 12 bytes)
Normals (i8x3 = 4 bytes)
Tangents (i8x3 = 4 bytes)
Uv0 (f16x2 = 4 bytes)
Uv1 (f16x2 = 4 bytes)
Color (u8x4 = 4 bytes)
Joint Index (u16x4 = 8 bytes)
Joint Weight (u8x4 = 4 bytes)

That's 44 bytes/vertex without optional-ness and 24 with. (In reality, the only one I'm going to require is position, though normals and tangents will be required for skinning)
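The i8 normals rely on snorm quantization; a minimal sketch of the round trip, showing the per-component error stays under 1/127 (this is generic snorm math, not rend3's actual packing code):

```rust
// Quantize a component in [-1, 1] to a signed byte (snorm8) and back.
fn quantize_snorm8(v: f32) -> i8 {
    (v.clamp(-1.0, 1.0) * 127.0).round() as i8
}

fn dequantize_snorm8(q: i8) -> f32 {
    (q as f32 / 127.0).clamp(-1.0, 1.0)
}

fn main() {
    let n = 0.5_f32;
    let err = (dequantize_snorm8(quantize_snorm8(n)) - n).abs();
    println!("round-trip error: {err}"); // well below 1/127
}
```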

John-Nagle commented 2 years ago

That sounds very good. Thanks.

John-Nagle commented 2 years ago

Regarding texture memory usage, I now have a better idea of what's going on.

The only tool I have for seeing what the GPU is doing is NVidia's X Server utility, which tells me the total memory in use on the GPU. What I see is that, after loading and unloading many textures, the memory usage number goes up and never comes back down much (it drops by 5-10% or so). The amount shown closely tracks peak texture usage, not current texture usage. It turns out my peak is too big: as the program frantically loads and unloads textures while the camera moves, the textures taking up the most screen space have priority, so downsizing distant textures is low priority, and fast camera movement causes a large but temporary spike in texture memory usage.

So this looks like a situation where the GPU memory allocator reserves memory for textures and does not give it back in a way that the NVidia X server utility can see. This is common behavior for allocators that draw their memory from a lower-level allocator, such as the operating system's, and either don't give it back at all or only give it back when they can return a large block.
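As an illustration of that allocator pattern (a toy model, not rend3's or the driver's actual allocator): reserved memory grows in large blocks, freed space goes back onto an internal free list, and so the figure an external tool sees tracks the peak.

```rust
// Toy sub-allocator: grows in 256 MB blocks (size assumed) and never
// returns blocks to the lower-level allocator; frees only update the
// internal accounting, so `reserved` tracks peak usage.
struct BlockAllocator {
    reserved: u64, // what an external tool would see
    in_use: u64,   // what the application actually holds
}

impl BlockAllocator {
    const BLOCK: u64 = 256 * 1024 * 1024;

    fn new() -> Self {
        Self { reserved: 0, in_use: 0 }
    }

    fn alloc(&mut self, bytes: u64) {
        self.in_use += bytes;
        while self.reserved < self.in_use {
            self.reserved += Self::BLOCK; // grow, never shrink
        }
    }

    fn free(&mut self, bytes: u64) {
        self.in_use -= bytes; // returned to the free list, not the OS
    }
}

fn main() {
    let mut a = BlockAllocator::new();
    a.alloc(1_000_000_000);
    a.free(900_000_000);
    // `reserved` stays at the peak even though `in_use` dropped.
    println!(
        "reserved {} MB, in use {} MB",
        a.reserved / 1_000_000,
        a.in_use / 1_000_000
    );
}
```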

Is that what's going on down at the allocation level? That would explain what I'm seeing. Thanks.

cwfitzgerald commented 1 year ago

After #449 is merged, this isn't entirely solved, but it is vastly improved. The only required attribute is position; the rest are optional and only needed if you use the features that depend on them. If you use all attributes, we're still larger than we have to be.