dougallj / applegpu

Apple G13 GPU architecture docs and tools
BSD 3-Clause "New" or "Revised" License
Texture and sampler descriptors #37

Open mr-mobster opened 1 year ago

mr-mobster commented 1 year ago

I was trying to understand how resource descriptors work at a low level (especially in the context of this very informative blog post). I hope this is a good place to ask.

The texture_load instruction is sparsely documented and I've been having difficulty wrapping my head around it. From various code snippets (using Metal texture slots as well as argument buffers), it appears that both the texture and the sampler descriptor can come from some sort of dedicated register space (tsN, ssN). Could anyone tell me more about it? Are they mapped to uniform registers, or is this completely separate hardware state? The instruction itself is disassembled like this (what is the purpose of that last uniform register?):

// tex0.sample(s0, float2(0, 0)).x 
texture_sample  0, 0b00, 0b01100, 0b0, 0b00000, x, 0b000, r0, None, ts0, ss0, tex_2d, r0_r1.discard, lod_min, u2l
                                                           ^        ^ texture+sampler ^ coordinates           ^  what's this?
                                                           |
                                                           |
                                                       output register

When using argument buffers, texture_load can also use regular registers, and if I understand the disassembly correctly, it's always a contiguous register pair, which presumably holds a 64-bit value. This is the disassembly:

// bindings[i].tex.sample(bindings[i].s, float2(0, 0)).x 
texture_sample   0, 0b00, 0b01100, 0b0, 0b00000, x, 0b000, r7, u0_u1, r0, r2l, tex_2d, r4_r5.discard, lod_min, u6l
                                                            ^  ^       ^   ^ sampler descriptor (previously loaded into r2_r3)
                                                            |  |       | 
                                                            |  |       texture  descriptor (previously loaded into r0_r1)                  
                                                            |  |                  
                                                            |  what is the purpose of this uniform?
                                                            |
                                                            output register

A couple of mysteries here: what is the purpose of the u0_u1 uniform? This seems to be a base address of some sort, but it's not the address of the argument buffer. Why is the sampler register referred to as r2l and not r2? And again, there is that mystery uniform at the end of the instruction — what's that about?

Bonus question: is there any information about what these descriptors represent? Are they pointers or some sort of table entries or something else entirely? Maybe there is something in the Asahi driver code? I tried to have a look but don't know how to navigate the codebase...

Thank you for any pointers you might have!

alyssarosenzweig commented 1 year ago
  1. The uniform passed to lod_min is the minimum LOD, used as a clamp.
  2. With the traditional "binding table" path, there are dedicated "texture state" and "sampler state" registers which are read-only and initialized by the driver. Logically these hold the addresses of textures/samplers; how they are implemented physically is unknown. These are tsX and ssY.
  3. With the bindless path (argument buffers), textures are specified as pointers to the texture descriptor in GPU memory, specified as a 64-bit uniform base address plus a 32-bit dynamic (GPR) offset.
  4. With the bindless path, samplers seem to be specified as 16-bit indices into a global heap of sampler descriptors (meaning at most 64k samplers can be used with argument buffers at the same time; Metal's limit is much lower and reflects this weird hardware detail). Details are TBD here; I haven't figured out where the heap actually lives in memory yet.
  5. Texture descriptors in memory are https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/asahi/lib/cmdbuf.xml#L240 and sampler descriptors are https://gitlab.freedesktop.org/mesa/mesa/-/blob/main/src/asahi/lib/cmdbuf.xml#L303 . These map closely to the APIs.
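
The bindless addressing in points 3 and 4 can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not driver code: the function names are made up for this example, and treating the 32-bit offset as an unsigned byte offset that is zero-extended before the add is an assumption, not something confirmed in this thread.

```python
# Hypothetical helper names; only the bit widths come from the discussion above.
SAMPLER_INDEX_BITS = 16  # samplers appear to be 16-bit indices into a global heap


def texture_descriptor_address(uniform_base: int, gpr_offset: int) -> int:
    """Combine a 64-bit uniform base with a 32-bit dynamic (GPR) offset.

    Assumption: the offset is unsigned and zero-extended, and the sum
    wraps at 64 bits.
    """
    if not 0 <= uniform_base < (1 << 64):
        raise ValueError("base must fit in 64 bits")
    if not 0 <= gpr_offset < (1 << 32):
        raise ValueError("offset must fit in 32 bits")
    return (uniform_base + gpr_offset) & ((1 << 64) - 1)


def max_bindless_samplers() -> int:
    """Upper bound on samplers addressable with a 16-bit heap index."""
    return 1 << SAMPLER_INDEX_BITS


# Example with an arbitrary, made-up base address.
print(hex(texture_descriptor_address(0x1_0000_0000, 48)))
print(max_bindless_samplers())
```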

HTH. Please send a documentation patch with what you've learned :)

dougallj commented 1 year ago

Good questions, good answers. I'm leaving this issue open to track adding this to the documentation.

mr-mobster commented 1 year ago

Thanks Alyssa, that's exactly the in-depth reply I was hoping to get! The bit about hardware texture and sampler state registers is very surprising; I would have expected these to be done with buffer-based bindings.

I was also very curious about the 32-bit texture index/16-bit sampler index, so I did some experiments on the Metal side. Creating textures gives you objects with the GPU IDs 24, 48, 72, etc., exactly matching the texture descriptor size! I also experimented with different Metal shaders using different buffer configurations. The 64-bit uniform base is always the same, which leads me to believe that this is probably an application-specific base address. Using unsigned 32-bit indices therefore allows ~180 million unique texture objects (2^32 bytes / 24 bytes per descriptor). I was not able to get even close to this amount in my tests, as creating close to 500k textures already slows things down to a crawl (no crash, and new textures continue to be created, but at a much, much slower rate). Incidentally, that's the limit Apple mentions with Metal. Probably an implementation detail that hits a slow path when allocating/resizing tables.
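
The arithmetic behind these numbers can be checked with a short sketch. It assumes, as the observed IDs suggest, that a texture's GPU ID is a byte offset into a table of 24-byte descriptors; the names `TEXTURE_DESCRIPTOR_SIZE` and `gpu_id_to_slot` are made up for this example.

```python
TEXTURE_DESCRIPTOR_SIZE = 24   # bytes, matching the observed ID stride (24, 48, 72, ...)
OFFSET_SPACE = 1 << 32         # range of an unsigned 32-bit offset


def gpu_id_to_slot(gpu_id: int) -> int:
    """Interpret a GPU ID as a byte offset and return the descriptor slot.

    Assumption: IDs are always multiples of the descriptor size.
    """
    slot, remainder = divmod(gpu_id, TEXTURE_DESCRIPTOR_SIZE)
    if remainder != 0:
        raise ValueError("ID is not aligned to the descriptor size")
    return slot


# How many whole 24-byte descriptors fit in a 32-bit offset space.
max_descriptors = OFFSET_SPACE // TEXTURE_DESCRIPTOR_SIZE

print([gpu_id_to_slot(i) for i in (24, 48, 72)])  # [1, 2, 3]
print(max_descriptors)                            # 178956970
```

That is roughly 179 million descriptors, far above the ~500k textures at which the slowdown was observed.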

With samplers, things are also interesting. Metal will cache samplers created with equal descriptors: all of them get the same ID. If the descriptor is different, you get incremental IDs, starting with 29 (on 13.4 22F5037d). It's impossible to create more than 1024 unique samplers per application; Metal will simply crash. If you instead set the descriptor's supportArgumentBuffers to false, you can create as many as you want (I did up to a million, although the system started lagging like crazy). For such samplers, the GPU ID is always 28 (this will probably trigger a fault when one tries to use it in an argument buffer; I didn't verify this).
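
The observed sampler behaviour can be summarised as a toy model. This is not how Metal is implemented; it only reproduces the observations above (deduplication of equal descriptors, IDs starting at 29, a cap of 1024 unique samplers, and the fixed ID 28 for samplers without argument buffer support). The class and its names are hypothetical, and descriptors are stood in for by plain tuples.

```python
class SamplerHeapModel:
    """Toy model of the sampler ID behaviour observed in this thread."""

    FIRST_ID = 29        # first ID seen for an argument-buffer-capable sampler
    NON_ARGBUF_ID = 28   # ID seen when supportArgumentBuffers is false
    MAX_UNIQUE = 1024    # observed cap on unique samplers per application

    def __init__(self) -> None:
        self._ids: dict[tuple, int] = {}

    def create(self, descriptor: tuple, support_argument_buffers: bool = True) -> int:
        if not support_argument_buffers:
            # These samplers never consume a heap slot in this model.
            return self.NON_ARGBUF_ID
        if descriptor in self._ids:
            # Equal descriptors share one cached sampler and one ID.
            return self._ids[descriptor]
        if len(self._ids) >= self.MAX_UNIQUE:
            # Metal was observed to crash here; the model raises instead.
            raise RuntimeError("sampler heap exhausted")
        new_id = self.FIRST_ID + len(self._ids)
        self._ids[descriptor] = new_id
        return new_id


heap = SamplerHeapModel()
print(heap.create(("linear", "repeat")))          # 29
print(heap.create(("linear", "repeat")))          # 29 again (cached)
print(heap.create(("nearest", "clamp")))          # 30
print(heap.create(("nearest", "clamp"), False))   # 28
```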

Preliminary conclusions (following Faith Ekstrand's nomenclature): Apple uses a descriptor buffer (B) for textures and a global heap (H) for samplers. These tables and their details are completely hidden from the user in Metal; it's not possible, for example, to use multiple texture buffers, even though the hardware appears to be capable of that via different base pointers. Instead, creating a new texture will simply add a new descriptor to the (per-application?) descriptor buffer and you get back an index. Metal's bindless model revolves around packing these indices into the regular data buffer (and unlike textures and samplers, you get full access to the actual buffer pointer and can manipulate and dereference it freely using regular C rules).

This is probably most comparable to the recent DX12 binding model with heaps (and fairly similar to what Nvidia seems to do), except that Metal hides the heap details and handles the indirection sugar for you. Now I can understand the challenge of mapping Vulkan's model to Metal a bit better. Direct access to the texture descriptor buffer would be advantageous. It also remains to be seen whether the 500k texture limit is some sort of hardware limitation or an implementation limit (e.g. how tables are grown).

I'll try to find some time to organise all this info and submit a PR for the docs!

alyssarosenzweig commented 1 year ago

Everything you wrote agrees with my understanding :+1: