gfx-rs / wgpu

A cross-platform, safe, pure-Rust graphics API.
https://wgpu.rs
Apache License 2.0

DX12 Sampler Heap Requirements Overly Strict #3350

Open cwfitzgerald opened 1 year ago

cwfitzgerald commented 1 year ago

DX12 is totally weird about samplers. At any one time there may be exactly one sampler heap bound to a command buffer. Each sampler heap may contain no more than 2048 sampler descriptors. Currently we only ever use a single sampler heap. This means that the maximum number of sampler bindings we can allocate across all live bind groups is 2048. This is extremely restrictive, especially when compared to other APIs.

You can switch which sampler heap is bound to the command buffer at any time, but according to the command list documentation this may be expensive and cause flushes.

We need a better strategy for dealing with sampler descriptors. I suspect this will involve some kind of sampler descriptor cache where we try to re-use the same sampler heap as much as possible, only overflowing into another heap if we run out of space, and then switching sampler heaps as rarely as possible.

Davidster commented 1 year ago

After experimenting with PR #3499 and some discussions in Matrix, we've made a more concrete plan for how we'll attempt to solve this.

The naïve solution is to dedupe the samplers that are uploaded to the GPU sampler heap. The problem with this is that the root signature is currently set up to have 1 descriptor table for every bind group in the pipeline where that bind group's samplers are stored. This means that the samplers of each bind group currently need to be contiguous in the heap, since descriptor tables are contiguous ranges in memory. This is a problem for deduping, since the samplers would no longer be contiguous.

One option would be to add more descriptor tables, but this would mean that we'd have one descriptor table per sampler in the root signature. This is bad since the root signature is precious space, containing a max of 64 DWORDs where each descriptor table takes up 1 DWORD. As is recommended by the DX12 docs, we don't want to waste any space in the root signature as it is used for lots of stuff such as push constants.

One proposed solution to this is to use root constants instead of descriptor tables to store indices into the sampler heap. The entire sampler heap will be bound as a single descriptor table pointing to an array of all the samplers. Since there's a maximum of 2048 samplers in the heap, we would only need 12 bits to represent the sampler's index in the heap. This means that we could store 32 sampler indices in the root signature in 12 root constants (root constants are 1 DWORD or 32 bits each, and 32*12bits == 12*32bits). So that would be 13 DWORDs (12 root constants + 1 descriptor table) instead of 32 DWORDs if we added 1 descriptor table per sampler as per the previous paragraph.
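To make the packing arithmetic concrete, here is a minimal Rust sketch of storing 32 sampler indices in 12 DWORDs. The helper names and constants are hypothetical and not part of wgpu. It uses the 12-bit width from the paragraph above (2048 entries would strictly fit in 11 bits), and only models the bit layout, not how root constants are actually uploaded.

```rust
/// Bits per sampler index, as assumed in the comment above.
const INDEX_BITS: usize = 12;
/// Number of sampler indices to pack (the example figure from the comment).
const INDEX_COUNT: usize = 32;
/// 32 indices * 12 bits = 384 bits = 12 DWORDs.
const DWORD_COUNT: usize = INDEX_COUNT * INDEX_BITS / 32;

/// Packs 12-bit sampler indices into 32-bit words (hypothetical helper).
fn pack_sampler_indices(indices: &[u16; INDEX_COUNT]) -> [u32; DWORD_COUNT] {
    let mut dwords = [0u32; DWORD_COUNT];
    for (i, &index) in indices.iter().enumerate() {
        assert!((index as usize) < (1usize << INDEX_BITS), "index does not fit in 12 bits");
        let bit = i * INDEX_BITS;
        let (word, shift) = (bit / 32, bit % 32);
        dwords[word] |= (index as u32) << shift;
        // An index may straddle two DWORDs; spill its high bits into the next one.
        if shift + INDEX_BITS > 32 {
            dwords[word + 1] |= (index as u32) >> (32 - shift);
        }
    }
    dwords
}

/// Reads back the i-th packed index (hypothetical helper).
fn unpack_sampler_index(dwords: &[u32; DWORD_COUNT], i: usize) -> u16 {
    let bit = i * INDEX_BITS;
    let (word, shift) = (bit / 32, bit % 32);
    let mut value = dwords[word] >> shift;
    if shift + INDEX_BITS > 32 {
        value |= dwords[word + 1] << (32 - shift);
    }
    (value & ((1u32 << INDEX_BITS) - 1)) as u16
}
```

A shader would have to perform the equivalent of `unpack_sampler_index` on every sampler access, which is part of the cost of this option.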

Another similar solution is to allocate an extra buffer which contains arrays of sampler indices. We can use that to store the list of samplers for each bind group and then store a pointer to that list in the root signature. This would add an extra indirection to fetching the samplers but would take up less space in the root signature (1 DWORD per bind group + 1 for the sampler array). It's also a simpler solution that would work for both fixed samplers and also sampler arrays. In fact, even if we use the other approach for fixed samplers, we would still need to do this to keep supporting sampler arrays. @cwfitzgerald and I are in agreement that we start with this approach for all cases and see how it goes.
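As a rough comparison of the three options discussed so far, the root signature cost can be written out as below. This is a sketch with hypothetical function names. It only counts the DWORDs mentioned in this thread and ignores everything else that lives in the root signature.

```rust
/// Cost in DWORDs if every sampler gets its own descriptor table.
fn cost_table_per_sampler(sampler_count: u32) -> u32 {
    sampler_count // 1 DWORD per descriptor table
}

/// Cost if indices are packed into root constants, plus one table for the heap.
fn cost_packed_root_constants(sampler_count: u32, bits_per_index: u32) -> u32 {
    let constant_dwords = (sampler_count * bits_per_index + 31) / 32; // round up
    constant_dwords + 1
}

/// Cost of the index-buffer approach: one DWORD per bind group,
/// plus one for the table that binds the whole sampler heap.
fn cost_index_buffers(bind_group_count: u32) -> u32 {
    bind_group_count + 1
}
```

For 32 samplers this gives 32, 13 and, assuming for illustration that the samplers are spread over 4 bind groups, 5 DWORDs respectively.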

Some other relevant info:

teoxoy commented 1 year ago

Thanks for the investigation; I'd also say that the 2nd approach sounds like a good start!

cwfitzgerald commented 1 year ago

Nice! Also looping in @jimblandy on this

jimblandy commented 11 months ago

@teoxoy If the CTS were running into problems caused by this not-great allocation system in Mozilla CI, what would the symptoms be?

Basically, as long as this is an optimality problem, it's low priority; once it's a blocker to CTS health, then it's a higher priority.

Davidster commented 11 months ago

You would get a panic with the "not enough memory left" error message, as in this issue: https://github.com/gfx-rs/wgpu/issues/2857

teoxoy commented 9 months ago

WebGPU requires FL11_1 or (FL11_0 + resource binding tier 2) (ref).

The table here lists FL11_1 as having a minimum resource binding tier of 3, so I thought we could assume tier 2 as a minimum, but the table here lists Haswell and Broadwell as being FL11_1 with a resource binding tier of 1.

We already dropped support for Haswell on D3D12 (https://github.com/gfx-rs/wgpu/pull/4709) but I thought we could still support Broadwell.

Is the table on Wikipedia wrong? If not, is it valid that Broadwell only supports resource binding tier 1 given that the Microsoft docs say FL11_1 should support tier 3?

cwfitzgerald commented 9 months ago

The table is probably wrong in some way - I'm not particularly interested in supporting it, especially when there's no good solution for samplers without at least RBT2.

teoxoy commented 9 months ago

I guess dropping it would be fine since tier 1 has the additional limit of:

Max Samplers in all descriptor tables per shader stage: 16

from https://microsoft.github.io/DirectX-Specs/d3d/ResourceBinding.html#levels-of-hardware-support

which sounds even more restrictive.

It would be good to know whether the table is wrong or whether Broadwell really only supports tier 1. Are you aware of any other resources that list this kind of information? That table on Wikipedia is the only resource I could find.

cwfitzgerald commented 8 months ago

Detailing the resolved plan a little more:

Samplers are fully de-duplicated in the backend. On creation, they are added to the sampler heap. On destruction, that slot in the sampler heap is up for re-use. Sampler creation fails if there are no more slots on the heap. Writing to the heap in place like this is allowed because we only update descriptors that are not actively used by the GPU.
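A minimal CPU-side sketch of this bookkeeping might look like the following. All of the types are hypothetical stand-ins rather than wgpu's actual code. The reference count is an assumption added so that a de-duplicated slot is only freed once its last user is destroyed, and the sketch does not model waiting for the GPU to finish with a slot.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the parameters that define a sampler.
#[derive(Clone, PartialEq, Eq, Hash, Debug)]
struct SamplerDesc {
    filter_linear: bool,
    address_mode: u8,
}

/// Tracks which heap slot each unique sampler occupies.
struct SamplerHeap {
    capacity: u32,
    next_unused: u32,
    free_slots: Vec<u32>,
    /// Descriptor -> (heap slot, reference count). The ref count is an assumption.
    live: HashMap<SamplerDesc, (u32, u32)>,
}

impl SamplerHeap {
    fn new(capacity: u32) -> Self {
        Self { capacity, next_unused: 0, free_slots: Vec::new(), live: HashMap::new() }
    }

    /// Returns the heap slot for `desc`, sharing the slot of an identical
    /// live sampler. Returns `None` when the heap is full.
    fn create(&mut self, desc: &SamplerDesc) -> Option<u32> {
        if let Some((slot, refs)) = self.live.get_mut(desc) {
            *refs += 1;
            return Some(*slot);
        }
        let slot = match self.free_slots.pop() {
            Some(slot) => slot,
            None if self.next_unused < self.capacity => {
                self.next_unused += 1;
                self.next_unused - 1
            }
            None => return None,
        };
        self.live.insert(desc.clone(), (slot, 1));
        Some(slot)
    }

    /// Drops one reference; the slot becomes reusable when none remain.
    fn destroy(&mut self, desc: &SamplerDesc) {
        if let Some((slot, refs)) = self.live.get_mut(desc) {
            *refs -= 1;
            if *refs == 0 {
                let slot = *slot;
                self.live.remove(desc);
                self.free_slots.push(slot);
            }
        }
    }
}
```

With a capacity of 2048, creating the same sampler many times consumes a single slot, so the limit applies to unique samplers rather than to sampler bindings.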

Globally, for all shaders that bind samplers, there is a single descriptor table which binds the entire sampler heap. This is not part of a bind group.

For each bind group that contains a sampler, there is a single descriptor pointing to a buffer of u32s. Each u32 is an index into the sampler heap for the corresponding sampler in the bind group. When accessing a sampler, the appropriate position in the buffer is read and the resulting index is used to index into the sampler descriptor table binding.

When creating a bind group, this sampler index buffer is created and populated with all the appropriate sampler indexes. This is then bound as part of binding a bind group.

@group(0) @binding(0) var sampler_0: sampler; // sampler buffer 0 index 0
@group(0) @binding(1) var texture_0: texture_2d<f32>;
@group(0) @binding(2) var sampler_1: sampler; // sampler buffer 0 index 1

@group(1) @binding(0) var sampler_2: sampler; // sampler buffer 1 index 0 
@group(1) @binding(1) var sampler_3: sampler; // sampler buffer 1 index 1
@group(1) @binding(2) var sampler_4: sampler; // sampler buffer 1 index 2

// In the HLSL output, the equiv of:
var sampler_descriptors: binding_array<sampler>; // Contains 2032 possibly invalid samplers.
@group(0) @binding(wherever) var<storage> sampler_array_bind_group_0: array<u32>; // Contains two values in this example.
@group(1) @binding(wherever) var<storage> sampler_array_bind_group_1: array<u32>; // Contains three values in this example.

// When reading a sampler, wgsl of this:
textureSample(texture_0, sampler_1, uv);
textureSample(texture_0, sampler_4, uv);

// Gets translated to:
textureSample(texture_0, sampler_descriptors[sampler_array_bind_group_0[1]], uv);
textureSample(texture_0, sampler_descriptors[sampler_array_bind_group_1[2]], uv);
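The double lookup in the example above can be modeled on the CPU as follows. This is only an illustrative sketch: the struct, its fields and the string labels standing in for sampler descriptors are all hypothetical.

```rust
/// Hypothetical CPU-side model of what the shader sees.
struct BoundSamplers {
    /// Stand-in for the global sampler heap binding; each entry is just a label.
    sampler_descriptors: Vec<&'static str>,
    /// One index buffer per bind group, one u32 per sampler in that group.
    index_buffers: Vec<Vec<u32>>,
}

impl BoundSamplers {
    /// Mirrors `sampler_descriptors[sampler_array_bind_group_N[i]]`.
    fn resolve(&self, group: usize, index_in_group: usize) -> &'static str {
        let heap_index = self.index_buffers[group][index_in_group];
        self.sampler_descriptors[heap_index as usize]
    }
}

/// Builds bindings shaped like the example: two samplers in group 0 and
/// three in group 1. The heap contents are made up for illustration.
fn example_bindings() -> BoundSamplers {
    BoundSamplers {
        sampler_descriptors: vec!["linear_clamp", "nearest_repeat", "linear_repeat"],
        index_buffers: vec![vec![0, 1], vec![2, 0, 1]],
    }
}
```

Here `resolve(0, 1)` and `resolve(1, 2)` correspond to `sampler_1` and `sampler_4`. In this made-up data both point at heap slot 1, which is the kind of sharing that de-duplication allows.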

Note for the Future

If we use SM6.6 bindless, we can index directly into the sampler heap, skipping the sampler_descriptors binding.

nical commented 8 months ago

I suspect that some bevy demos such as https://bevyengine.org/examples-webgpu/3D%20Rendering/load-gltf/ are hitting this issue very easily.