KhronosGroup / Vulkan-Docs

The Vulkan API Specification and related tools

Image view handle/reference #2311

Closed timeester3648 closed 6 months ago

timeester3648 commented 7 months ago

It would be very nice if you could get the handle of a VkImageView, similar to a buffer reference, to use in a shader. In the shader, you would then convert the uint64_t (or a uvec2 for GPUs that do not support uint64_t, as with GL_EXT_buffer_reference_uvec2) to a samplerXXX type (e.g. sampler2D, samplerCube, etc.).

This can speed up a GPU-driven renderer immensely. By using custom vertex fetch and image/buffer handles, descriptors become almost obsolete, aside from the buffer that holds those handles (which could itself be a reference in a push constant). This prevents potentially huge descriptor updates. VK_EXT_descriptor_buffer has somewhat alleviated this, but not many GPUs support that extension, and it does not fix the underlying issue; it only makes it a bit less painful for the small percentage of GPUs that do support it. In an architecture where the set of descriptors in use and their order keep changing, you still need to retrieve a handle every time (especially since not all GPUs support using the same handle for different image layouts).

Nvidia already has such an extension, VK_NVX_image_view_handle, but it is not cross-vendor, and it only retrieves an opaque handle that you cannot do anything with in a shader. Nvidia does require this extension for DLSS, though, which suggests they already have this functionality internally.

OpenGL also supports something similar: ARB_bindless_texture, which both Nvidia and AMD support.

This would be even better if you could also retrieve a handle from a VkSampler and then use it like: texture(sampler2D(texture_handle, sampler_handle), texture_coords).rgb, but I do not know if this is viable on current GPU architectures.

In the future, if this is (hopefully) accepted, it could be combined with something like #1758, or with pipeline tables where an indirect draw call has a buffer of indices selecting, per draw call, which pipeline to use. That would make sorting materials/pipelines beforehand obsolete, meaning reduced CPU load and a streamlined rendering architecture, and it would eliminate the need for structures that group similar materials/pipelines, which could decrease RAM usage. However, realizing these benefits requires this extension/proposal to be accepted.

nanokatze commented 7 months ago

VK_EXT_descriptor_indexing (which is core but optional in 1.2) does exactly what you want; it has existed for 7 years at this point and is very widely supported.
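
For reference, a minimal sketch of the kind of setup being described here, assuming a device created with the relevant descriptor-indexing features enabled (runtimeDescriptorArray, descriptorBindingPartiallyBound, descriptorBindingSampledImageUpdateAfterBind); the capacity and names are illustrative, not from this thread:

```cpp
#include <vulkan/vulkan.h>

// Sketch: one large, always-bound array of sampled images, updatable after bind.
VkDescriptorSetLayout CreateBindlessLayout(VkDevice device) {
    constexpr uint32_t kMaxBindless = 65536;  // arbitrary illustrative capacity

    VkDescriptorBindingFlags bindingFlags =
        VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT |          // update while the set stays bound
        VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT |            // unused slots may remain invalid
        VK_DESCRIPTOR_BINDING_UPDATE_UNUSED_WHILE_PENDING_BIT;

    VkDescriptorSetLayoutBindingFlagsCreateInfo flagsInfo{
        VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_BINDING_FLAGS_CREATE_INFO};
    flagsInfo.bindingCount  = 1;
    flagsInfo.pBindingFlags = &bindingFlags;

    VkDescriptorSetLayoutBinding binding{};
    binding.binding         = 0;
    binding.descriptorType  = VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE;
    binding.descriptorCount = kMaxBindless;  // indexed with nonuniformEXT in shaders
    binding.stageFlags      = VK_SHADER_STAGE_ALL;

    VkDescriptorSetLayoutCreateInfo layoutInfo{
        VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO};
    layoutInfo.pNext        = &flagsInfo;
    layoutInfo.flags        = VK_DESCRIPTOR_SET_LAYOUT_CREATE_UPDATE_AFTER_BIND_POOL_BIT;
    layoutInfo.bindingCount = 1;
    layoutInfo.pBindings    = &binding;

    VkDescriptorSetLayout layout = VK_NULL_HANDLE;
    vkCreateDescriptorSetLayout(device, &layoutInfo, nullptr, &layout);
    return layout;
}
```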

The VK_NVX_image_view_handle extension you're mentioning exists purely for interop purposes.

Hugobros3 commented 7 months ago

Duplicate of https://github.com/KhronosGroup/Vulkan-Docs/issues/871 ?

timeester3648 commented 7 months ago

I already said that VK_EXT_descriptor_indexing is slow and not properly usable, although I said it indirectly, so that may be my fault. VK_EXT_descriptor_indexing is widely supported (core in Vulkan 1.2), but as I mentioned, it makes descriptors huge and slow. Updating a descriptor is still needed (which can take a long time), and there are limits on some descriptor counts, though those mostly affect uniform buffers on older hardware (Nvidia 10XX GPUs), so that matters less. VK_EXT_descriptor_indexing is nice to have compared to the normal Vulkan descriptor model, but it is a failed attempt to fix Vulkan's lack of bindless support.

timeester3648 commented 7 months ago

> Duplicate of #871 ?

Seems like it, although that was more of a question than a request, I think.

And I guess the issue I was trying to raise has already been recognized; quoting @oddhack in issue #871: "We recognize that VK_EXT_descriptor_indexing does not fully address all GL_ARB_bindless_texture capabilities".

DethRaid commented 7 months ago

Slow and not properly usable? I've found descriptor indexing to be very fast and very usable. Not sure what issues you're running into, but you should probably profile your code and identify your bottlenecks

Hugobros3 commented 7 months ago

You might also enjoy reading Faith Ekstrand's blog post on descriptors, after which it should be clear that "descriptor handles" the way you want them aren't really workable, not in a cross-platform way anyway. tl;dr: some hardware uses magical tables internally and can't be helped.

timeester3648 commented 7 months ago

> Slow and not properly usable? I've found descriptor indexing to be very fast and very usable. Not sure what issues you're running into, but you should probably profile your code and identify your bottlenecks

In a dynamic scene, the meshes that get rendered, and the order they are rendered in, constantly change, so caching descriptors is not an option. This means descriptors have to be updated every frame. In a GPU-driven architecture especially, it is not only the updating itself but also getting the descriptors ready and splitting the draw calls to avoid very big descriptor sets. With only 4k instances renderable in one call, ~32k (combined image sampler) descriptors are needed for a simple PBR material. So the limit has to be set lower, and then you have to hope there are not too many differing materials, so that multiple instances can use the same descriptor index (right now I limit one call to at most 64 different materials, or fewer if the current GPU does not support that many).

Although it would not be as clean and nice to work with, descriptor indexing (which enough GPUs support) combined with VK_EXT_descriptor_buffer (which not enough GPUs support) could alleviate the problems, if only more GPUs supported it, and the GPUs that do support it also supported VkPhysicalDeviceDescriptorBufferFeaturesEXT::descriptorBufferImageLayoutIgnored. Without that feature, the handle cannot be easily and efficiently cached (multiple layouts get used in a large renderer, e.g. for rendering and for compute shaders); with it, all "binds" basically become just a std::memcpy. Although I could maybe hack something together for the common layouts, if it turns out that not many different layouts are used.

> You might also enjoy reading Faith Ekstrand's blog post on descriptors, after which it should be clear that "descriptor handles" the way you want them aren't really workable, not in a cross-platform way anyway. tl;dr: some hardware uses magical tables internally and can't be helped.

But GL_ARB_bindless_texture offered this over 10 years ago, and it seems widely supported. Does that mean GL_ARB_bindless_texture is sort of "emulated" on GPUs? (I guess aside from Nvidia, at least, given their internal use of it for VK_NVX_image_view_handle?)

Hugobros3 commented 7 months ago

A common approach these days is to "bind everything" in a giant runtime-sized descriptor array, which can be partially updated with relaxed sync requirements thanks to the descriptorIndexing extension/feature. This "everything" set just stays bound all the time and is updated in-place when new resources are created, essentially eliminating overhead.
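
As a rough sketch of that in-place update, assuming a `globalSet` allocated from a bindless-style layout like the one sketched above (the slot allocator is hypothetical):

```cpp
#include <vulkan/vulkan.h>

// Sketch: when a new VkImageView is created, write it into a free slot of the
// always-bound global set. The array index becomes the shader-visible "handle".
uint32_t RegisterView(VkDevice device, VkDescriptorSet globalSet,
                      VkImageView newView, uint32_t slot /* from a free-list */) {
    VkDescriptorImageInfo imageInfo{};
    imageInfo.imageView   = newView;
    imageInfo.imageLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;

    VkWriteDescriptorSet write{VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET};
    write.dstSet          = globalSet;
    write.dstBinding      = 0;
    write.dstArrayElement = slot;   // the only per-resource "bind" that ever happens
    write.descriptorCount = 1;
    write.descriptorType  = VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE;
    write.pImageInfo      = &imageInfo;

    vkUpdateDescriptorSets(device, 1, &write, 0, nullptr);
    return slot;  // store in material data; no rebinding needed afterwards
}
```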

> But GL_ARB_bindless_texture offered this over 10 years ago, and it seems widely supported. Does that mean GL_ARB_bindless_texture is sort of "emulated" on GPUs? (I guess aside from Nvidia, at least, given their internal use of it for VK_NVX_image_view_handle?)

Yes. In fact, Jeff Bolz said about as much in the other issue: https://github.com/KhronosGroup/Vulkan-Docs/issues/871#issuecomment-480002208. I feel like this issue is just treading over the same ground again...

nanokatze commented 7 months ago

> but as I mentioned, it makes descriptors huge and slow. Updating a descriptor is still needed (which can take a long time), and there are limits on some descriptor counts, though those mostly affect uniform buffers on older hardware (Nvidia 10XX GPUs), so that matters less.

This makes zero sense to me. Drivers for NV's and Intel's hardware already use a huge table (on the order of several MB); it's there all the time, and every vkCreateImageView/vkCreateSampler call and the respective destroy calls access that table and do the bookkeeping.

vkUpdateDescriptorSets writing a single descriptor absolutely does not take "long", and you only have to do it once (one vkUpdateDescriptorSets per vkCreateImageView or vkCreateSampler) unless there are constraints (e.g. an existing codebase) that preclude you from doing so.

On drivers for other hardware, VK_EXT_descriptor_indexing is very close to how things actually work. You will not get any performance win with the vkGetImageViewHandle/ARB_bindless_texture approach there.

> This means descriptors have to be updated every frame. In a GPU-driven architecture especially, it is not only the updating itself but also getting the descriptors ready and splitting the draw calls to avoid very big descriptor sets. With only 4k instances renderable in one call, ~32k (combined image sampler) descriptors are needed for a simple PBR material. So the limit has to be set lower, and then you have to hope there are not too many differing materials, so that multiple instances can use the same descriptor index (right now I limit one call to at most 64 different materials, or fewer if the current GPU does not support that many).

Create a descriptor set with 1e6 entries. The index into this Single Huge Descriptor Set is your descriptor. You can make the index as big as you want: uint16, uint32, ... Put those indices into your PBR material struct. This will allow you to render with 1e6 different textures, all within a single draw call, with at most 4 bytes per descriptor (which are just indices).

Use sampled image and sampler descriptors instead of combined image-sampler descriptors, so that you don't need a vkUpdateDescriptorSets call for each combination. Combine them in your shader at runtime, as in texture(nonuniformEXT(sampler2D(textures[i], samplers[j])), ...).
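
A small illustrative sketch of this index-as-descriptor idea; the struct layout is hypothetical, not from this thread, and would be mirrored 1:1 in a shader storage buffer:

```cpp
#include <cstdint>

// Each material stores plain indices into the big descriptor arrays; the
// 4-byte index is the only per-material "descriptor" cost mentioned above.
struct PbrMaterial {
    uint32_t albedoTexture;             // index into textures[]
    uint32_t normalTexture;             // index into textures[]
    uint32_t metallicRoughnessTexture;  // index into textures[]
    uint32_t samplerIndex;              // index into samplers[]
};
// Shader side, as quoted above:
//   texture(nonuniformEXT(sampler2D(textures[m.albedoTexture], samplers[m.samplerIndex])), uv)
// With separate image and sampler bindings, the set needs |textures| + |samplers|
// descriptor writes in total, rather than one write per (texture, sampler) pair.
```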

What "limits" are you talking about here?

> Although it would not be as clean and nice to work with, descriptor indexing (which enough GPUs support) combined with VK_EXT_descriptor_buffer (which not enough GPUs support) could alleviate the problems, if only more GPUs supported it

You don't need VK_EXT_descriptor_buffer here. That extension helps with other problems.

> and the GPUs that do support it also supported VkPhysicalDeviceDescriptorBufferFeaturesEXT::descriptorBufferImageLayoutIgnored. Without that feature, the handle cannot be easily and efficiently cached (multiple layouts get used in a large renderer, e.g. for rendering and for compute shaders)

With sync2, there are pretty much just 2 layouts a shader-accessible subresource should ever be in: GENERAL and READ_ONLY_OPTIMAL (notice it's just READ_ONLY, not SHADER_READ_ONLY or some_other_thing_READ_ONLY). Images that are only ever imageStore'd into and then sampled from can just be kept in GENERAL at likely no performance loss (READ_ONLY helps when you're transitioning away from ATTACHMENT_OPTIMAL, on some hardware).
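
A minimal sync2 sketch of that two-layout scheme, requiring Vulkan 1.3 or VK_KHR_synchronization2 (the handle names are assumed):

```cpp
#include <vulkan/vulkan.h>

// Sketch: transition a color attachment to the single READ_ONLY_OPTIMAL layout
// used for all subsequent shader reads, sampled from fragment or compute.
void ToReadOnly(VkCommandBuffer cmd, VkImage image) {
    VkImageMemoryBarrier2 barrier{VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER_2};
    barrier.srcStageMask        = VK_PIPELINE_STAGE_2_COLOR_ATTACHMENT_OUTPUT_BIT;
    barrier.srcAccessMask       = VK_ACCESS_2_COLOR_ATTACHMENT_WRITE_BIT;
    barrier.dstStageMask        = VK_PIPELINE_STAGE_2_FRAGMENT_SHADER_BIT |
                                  VK_PIPELINE_STAGE_2_COMPUTE_SHADER_BIT;
    barrier.dstAccessMask       = VK_ACCESS_2_SHADER_SAMPLED_READ_BIT;
    barrier.oldLayout           = VK_IMAGE_LAYOUT_COLOR_ATTACHMENT_OPTIMAL;
    barrier.newLayout           = VK_IMAGE_LAYOUT_READ_ONLY_OPTIMAL;  // just READ_ONLY, not SHADER_READ_ONLY
    barrier.srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED;
    barrier.image               = image;
    barrier.subresourceRange    = {VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1};

    VkDependencyInfo dep{VK_STRUCTURE_TYPE_DEPENDENCY_INFO};
    dep.imageMemoryBarrierCount = 1;
    dep.pImageMemoryBarriers    = &barrier;
    vkCmdPipelineBarrier2(cmd, &dep);
}
```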

so all "binds" basically become just a std::memcpy. Although I could maybe hack something together for the common layouts.

See the second paragraph of this comment.

> But GL_ARB_bindless_texture offered this over 10 years ago, and it seems widely supported. Does that mean GL_ARB_bindless_texture is sort of "emulated" on GPUs? (I guess aside from Nvidia, at least, given their internal use of it for VK_NVX_image_view_handle?)

See the first paragraph of this comment.

The ergonomics of the current approach aren't the best, that much I agree, but I'm extremely skeptical that the thing you're proposing (which I read as "image_view_handle but KHR") is the right tradeoff, at the very least in the way it currently appears in the NVX extension.

timeester3648 commented 7 months ago

> This makes zero sense to me. Drivers for NV's and Intel's hardware already use a huge table (on the order of several MB); it's there all the time, and every vkCreateImageView/vkCreateSampler call and the respective destroy calls access that table and do the bookkeeping.

This may be worsened by validation layers, but since most development has to be done with them enabled to actually catch errors and fix them quickly, debugging any large scene (even on high-end hardware) becomes slow. I do not know if it is still the case (especially since you did not mention AMD here), but I noticed huge update times on an AMD laptop; they may have improved that, I don't know.

> On drivers for other hardware, VK_EXT_descriptor_indexing is very close to how things actually work. You will not get any performance win with the vkGetImageViewHandle/ARB_bindless_texture approach there.

This proposal was not about making GPUs render more efficiently, but about making the CPU side far simpler, cleaner, and lower overhead.

> Create a descriptor set with 1e6 entries. The index into this Single Huge Descriptor Set is your descriptor. You can make the index as big as you want: uint16, uint32, ... Put those indices into your PBR material struct. This will allow you to render with 1e6 different textures, all within a single draw call, with at most 4 bytes per descriptor (which are just indices).

Can I, though? When I designed this architecture some time ago, I came across GitHub comments from AMD driver engineers stating that their driver immediately allocates the entire descriptor size, so they recommended against creating huge descriptors, since one will take all the memory it could ever use before it even needs it, even if it never needs the entire buffer in its lifetime. I do not know if this has changed, though.

I'm already using the indices-in-the-PBR-struct technique plus descriptor indexing (and descriptor buffer where supported).

> Use sampled image and sampler descriptors instead of combined image-sampler descriptors, so that you don't need a vkUpdateDescriptorSets call for each combination. Combine them in your shader at runtime, as in texture(nonuniformEXT(sampler2D(textures[i], samplers[j])), ...).

That makes syncing the images and samplers more of a pain, since now you have to handle them separately (I already support this, but as an option rather than a requirement). And I don't think it would help that much: the texture has to be put in a descriptor anyway, and the sampler can go in the same struct used for updating the descriptor (or for getting the handle with descriptor_buffer). So instead of setting just the image (which optionally has a custom sampler, else the default one is used), you have to set the image AND the sampler, which are now two entries per item in the descriptor set. It could lower the number of samplers due to reuse, but the same number of images would still have to be bound, plus the (admittedly smaller number of) samplers.

What "limits" are you talking about here?

If you're referring to this part of my response:

> right now I limit one call to at most 64 different materials, or fewer if the current GPU does not support that many

The answer lies in the paragraph above about AMD's driver engineers.

> You don't need VK_EXT_descriptor_buffer here. That extension helps with other problems.

I did not say that I need it, but that it helps, since I could then just std::memcpy and that would be the "bind" (no driver calls, i.e. no vkGetDescriptorEXT or vkUpdateDescriptorSets). And since descriptor_buffer is almost the same as image handles (just pretend the retrieved descriptor data is the supposed "image handle"), it could help, though it would of course not be as nice.
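
A hedged sketch of that memcpy-"bind" flow with VK_EXT_descriptor_buffer, assuming descriptorBufferImageLayoutIgnored is available so one cached blob can serve all layouts (all names are illustrative; vkGetDescriptorEXT would be loaded via vkGetDeviceProcAddr in a real app):

```cpp
#include <vulkan/vulkan.h>
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch: query the opaque descriptor blob once, at view-creation time.
// `props` is the queried VkPhysicalDeviceDescriptorBufferPropertiesEXT.
std::vector<uint8_t> CacheSampledImageDescriptor(
        VkDevice device, VkImageView view,
        const VkPhysicalDeviceDescriptorBufferPropertiesEXT& props) {
    VkDescriptorImageInfo imageInfo{};
    imageInfo.imageView   = view;
    imageInfo.imageLayout = VK_IMAGE_LAYOUT_READ_ONLY_OPTIMAL;  // ignored when the feature is enabled

    VkDescriptorGetInfoEXT getInfo{VK_STRUCTURE_TYPE_DESCRIPTOR_GET_INFO_EXT};
    getInfo.type = VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE;
    getInfo.data.pSampledImage = &imageInfo;

    std::vector<uint8_t> blob(props.sampledImageDescriptorSize);
    vkGetDescriptorEXT(device, &getInfo, blob.size(), blob.data());
    return blob;  // computed once per view...
}

// ...so that each later "bind" is nothing but a memcpy into the host-mapped
// descriptor buffer, with no driver call involved.
void BindCached(uint8_t* mappedDescriptorBuffer, size_t slotOffset,
                const std::vector<uint8_t>& blob) {
    std::memcpy(mappedDescriptorBuffer + slotOffset, blob.data(), blob.size());
}
```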

> With sync2, there are pretty much just 2 layouts a shader-accessible subresource should ever be in: GENERAL and READ_ONLY_OPTIMAL (notice it's just READ_ONLY, not SHADER_READ_ONLY or some_other_thing_READ_ONLY). Images that are only ever imageStore'd into and then sampled from can just be kept in GENERAL at likely no performance loss (READ_ONLY helps when you're transitioning away from ATTACHMENT_OPTIMAL, on some hardware).

I'm already using sync2 (I'm using Vulkan 1.3 at the moment), and that's also why I said:

> Although I could maybe hack something together for the common layouts.

So that I can still cache them.

> See the second paragraph of this comment.

Again, the AMD issue; I do not know if Nvidia and Intel are different, but I guess I'll have to test on Nvidia.

> The ergonomics of the current approach aren't the best, that much I agree, but I'm extremely skeptical that the thing you're proposing (which I read as "image_view_handle but KHR") is the right tradeoff, at the very least in the way it currently appears in the NVX extension.

Even if it's not faster on the GPU, it would still make the CPU side a lot easier. I just hope the image side gets improved the way the buffer side did with buffer_reference, which I assume does not have to be "emulated" the way images are. Or that descriptor_buffer becomes more widespread and the descriptor size limits less of a problem.

LVSTRI commented 7 months ago

> Can I, though? When I designed this architecture some time ago, I came across GitHub comments from AMD driver engineers stating that their driver immediately allocates the entire descriptor size, so they recommended against creating huge descriptors, since one will take all the memory it could ever use before it even needs it, even if it never needs the entire buffer in its lifetime.

I don't understand why this is a problem: descriptors themselves take very little memory, and even if you allocate a million VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE descriptors, the impact on memory should be extremely limited (at a few tens of bytes per descriptor, a million entries is on the order of tens of MB), not to mention easily managed.

HansKristian-Work commented 6 months ago

> so they recommended against creating huge descriptors, since one will take all the memory it could ever use before it even needs it, even if it never needs the entire buffer in its lifetime. I do not know if this has changed, though.

I don't think that advice was in the context of descriptor indexing; most likely it was about creating huge VkDescriptorPools that wouldn't be fully used by descriptor sets. The normal approach for bindless is to allocate some large-ish buffer up-front. Do note that if all drivers had to support GetViewHandle, most drivers would be forced to allocate some $large internal shadow buffer in the driver, which might spuriously hit out-of-memory conditions once enough views have been created.
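
A rough sketch of that up-front allocation, sized once for the application's lifetime; the capacity and names are illustrative and match the layout sketch earlier in the thread:

```cpp
#include <vulkan/vulkan.h>

// Sketch: one pool and one long-lived set, created once at startup.
VkDescriptorSet AllocateGlobalSet(VkDevice device, VkDescriptorSetLayout bindlessLayout) {
    constexpr uint32_t kMaxBindless = 65536;  // same illustrative capacity as the layout

    VkDescriptorPoolSize poolSize{VK_DESCRIPTOR_TYPE_SAMPLED_IMAGE, kMaxBindless};
    VkDescriptorPoolCreateInfo poolInfo{VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO};
    poolInfo.flags         = VK_DESCRIPTOR_POOL_CREATE_UPDATE_AFTER_BIND_BIT;
    poolInfo.maxSets       = 1;
    poolInfo.poolSizeCount = 1;
    poolInfo.pPoolSizes    = &poolSize;

    VkDescriptorPool pool = VK_NULL_HANDLE;
    vkCreateDescriptorPool(device, &poolInfo, nullptr, &pool);

    VkDescriptorSetAllocateInfo allocInfo{VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO};
    allocInfo.descriptorPool     = pool;
    allocInfo.descriptorSetCount = 1;
    allocInfo.pSetLayouts        = &bindlessLayout;

    VkDescriptorSet set = VK_NULL_HANDLE;
    vkAllocateDescriptorSets(device, &allocInfo, &set);
    return set;  // stays allocated and bound for the rest of the run
}
```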

> but I noticed huge update times on an AMD laptop; they may have improved that, I don't know.

Descriptor indexing with PARTIALLY_BOUND_BIT isn't that slow, since it blocks the CPU from doing validation. Descriptor indexing without PARTIALLY_BOUND_BIT means the CPU is expected to validate a ton of descriptors, which can perform poorly in some situations, that is true; but ViewHandle just pushes the problem to GPU VA, similarly to what PARTIALLY_BOUND_BIT does.

I think @nanokatze summarized things well, so no need to explain further.

TomOlson commented 6 months ago

Thanks all for this. This issue has come up in the last couple of WG meetings, and @HansKristian-Work's comments reflect that discussion. We agree that @nanokatze's comment is the right way to think about it. To be clear, we absolutely do recommend using descriptor indexing wherever it is supported: performance is supposed to be good on all supporting platforms (modulo the very hard problem of validation), and it is required in Roadmap 2022, so over time it will become available everywhere except on very low-end / deeply embedded devices.

Having said that, resource referencing is one of the fundamental problems in both GPU and rendering engine design, and there is likely to be further evolution in how Vulkan exposes it. View handles are not a great direction for reasons others have discussed on this thread. But we are always grateful to hear about use cases that current solutions aren't working for.

If there are no open questions about this specific proposal, I'd suggest we close the issue and use #871 for more general discussion of resource referencing, or new issues for new use cases or bugs in the current model.

timeester3648 commented 6 months ago

Then I'll close this issue.