Try / Tempest

3d graphics engine
MIT License

Bindless support #36

Open Try opened 2 years ago

Try commented 2 years ago

Bindless is quite messy in every API, so the engine needs a clean top-level API with a reasonable underlying implementation.

GLSL

GLSL is the main shading language in Tempest, so it deserves a dedicated section. GLSL offers two approaches:

  1. Unbound array of descriptors: nice and easy to use.
  2. Device address: not portable to Metal; hard to track hazards.
layout(binding = 0) uniform sampler2D tex[]; // unbound array of textures
layout(binding = 1) uniform sampler2D img[]; // another unbound array of textures
layout(binding = 2, std140) readonly buffer Input { // separate binding: 1 is already taken by img[]
  vec4 val[];
  } ssbo[]; // unbound array of buffers

Engine-side

std::vector<const Tempest::Texture2d*> ptex(tex.size());
for(size_t i=0; i<tex.size(); ++i)
  ptex[i] = &tex[i];
auto desc = device.descriptors(pso);
desc.set(0,ptex); // taking vector or c-array

This doesn't fit the engine perfectly: support for separate (non-combined) samplers and textures has to be added on top of it.

Vulkan

Caps-list:

VkPhysicalDeviceDescriptorIndexingFeatures::runtimeDescriptorArray; // support for unbound array declaration (tex[])
// Support of nonuniformEXT, per resource-type 
VkPhysicalDeviceDescriptorIndexingFeatures::shaderUniformBufferArrayNonUniformIndexing;
VkPhysicalDeviceDescriptorIndexingFeatures::shaderSampledImageArrayNonUniformIndexing;
VkPhysicalDeviceDescriptorIndexingFeatures::shaderStorageBufferArrayNonUniformIndexing;
VkPhysicalDeviceDescriptorIndexingFeatures::shaderStorageImageArrayNonUniformIndexing;
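
The caps-list above could be folded into a small engine-side check. This is only a sketch: `BindlessCaps`, `ResourceKind`, `supportsRuntimeArray` and `supportsNonUniform` are hypothetical names, and the struct is a stand-in for the real `VkPhysicalDeviceDescriptorIndexingFeatures` so that the example compiles without Vulkan headers.

```cpp
#include <cstdint>

// Hypothetical mirror of the VkPhysicalDeviceDescriptorIndexingFeatures
// fields listed above; a real backend would copy them from the struct
// filled in by the driver.
struct BindlessCaps {
  bool runtimeDescriptorArray                     = false;
  bool shaderUniformBufferArrayNonUniformIndexing = false;
  bool shaderSampledImageArrayNonUniformIndexing  = false;
  bool shaderStorageBufferArrayNonUniformIndexing = false;
  bool shaderStorageImageArrayNonUniformIndexing  = false;
};

enum class ResourceKind : uint8_t {
  UniformBuffer, SampledImage, StorageBuffer, StorageImage
};

// True if unbound declarations such as `tex[]` are legal at all.
inline bool supportsRuntimeArray(const BindlessCaps& c) {
  return c.runtimeDescriptorArray;
}

// True if an array of `kind` may be indexed with nonuniformEXT.
inline bool supportsNonUniform(const BindlessCaps& c, ResourceKind kind) {
  switch(kind) {
    case ResourceKind::UniformBuffer: return c.shaderUniformBufferArrayNonUniformIndexing;
    case ResourceKind::SampledImage:  return c.shaderSampledImageArrayNonUniformIndexing;
    case ResourceKind::StorageBuffer: return c.shaderStorageBufferArrayNonUniformIndexing;
    case ResourceKind::StorageImage:  return c.shaderStorageImageArrayNonUniformIndexing;
  }
  return false;
}
```

Keeping the per-resource-type query separate from the runtime-array query matches the caps-list: the two are reported independently by the driver.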

VK_DESCRIPTOR_BINDING_VARIABLE_DESCRIPTOR_COUNT_BIT can be used (in theory), but only for the very last binding in a descriptor set, which doesn't fit the GLSL side. Alternatively, it's sufficient to use VK_DESCRIPTOR_BINDING_PARTIALLY_BOUND_BIT_EXT with a very large descriptor array. The size of the array has to be defined in C++ upfront, at VkDescriptorSetLayout creation. The current implementation of Tempest can recreate VkDescriptorSetLayout and VkDescriptorSet on the fly if the preallocated array is not big enough. But this also requires recreating VkPipeline at runtime, based on the descriptor set size, which is hard to implement without extra performance cost.
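
The grow-on-demand strategy described above might be sketched as follows. `GrowResult` and `nextCapacity` are hypothetical names, not Tempest API; the sketch only picks the array size to declare in the layout and reports whether the layout (and therefore the set and pipeline) would have to be rebuilt.

```cpp
#include <algorithm>
#include <cstdint>

struct GrowResult {
  uint32_t capacity; // descriptor count to declare in the set layout
  bool     rebuild;  // true if layout/set/pipeline must be recreated
  bool     fits;     // false if `required` exceeds the device limit
};

// Assumption: `deviceLimit` is whatever the backend decided is the usable
// maximum for this resource type.
inline GrowResult nextCapacity(uint32_t current, uint32_t required, uint32_t deviceLimit) {
  if(required > deviceLimit)
    return {current, false, false};
  if(required <= current)
    return {current, false, true};
  // Grow geometrically, so repeated small increases do not trigger
  // a rebuild every time.
  uint64_t cap = std::max<uint64_t>(current, 1);
  while(cap < required)
    cap *= 2;
  cap = std::min<uint64_t>(cap, deviceLimit);
  return {static_cast<uint32_t>(cap), true, true};
}
```

Geometric growth does not remove the pipeline-recreation cost, it only makes it rarer; that trade-off is the reason a large fixed array with partially-bound descriptors looks attractive.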

VK_DESCRIPTOR_BINDING_UPDATE_AFTER_BIND_BIT is useless by itself, but the spec defines special behavior for this type of descriptor:

... layouts which may be much higher than the pre-existing limits. The old limits only count descriptors in non-updateAfterBind descriptor set layouts, and the new limits count descriptors in all descriptor set layouts in the pipeline layout.


maxUpdateAfterBindDescriptorsInAllPools = 500,000+ // Eh, probably can't do anything sensible about it
maxPerStageUpdateAfterBindResources   = 500,000+

maxPerStageDescriptorUpdateAfterBindSamplers               = 500,000+
maxPerStageDescriptorUpdateAfterBindUniformBuffers         = 12+
maxPerStageDescriptorUpdateAfterBindStorageBuffers         = 500,000+
maxPerStageDescriptorUpdateAfterBindSampledImages          = 500,000+
maxPerStageDescriptorUpdateAfterBindStorageImages          = 500,000+
maxPerStageDescriptorUpdateAfterBindAccelerationStructures = 500,000+

maxDescriptorSetUpdateAfterBindSamplers               = 500,000+
maxDescriptorSetUpdateAfterBindUniformBuffers         = 72+ // n × PerStage
maxDescriptorSetUpdateAfterBindStorageBuffers         = 500,000+
maxDescriptorSetUpdateAfterBindSampledImages          = 500,000+
maxDescriptorSetUpdateAfterBindStorageImages          = 500,000+
maxDescriptorSetUpdateAfterBindAccelerationStructures = 500,000+

Naturally, as there is only a single descriptor set, the engine can just take the min of the `PerStage` and `DescriptorSet` limits.
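
Taking the min of the two limits is a one-liner, sketched here under the assumption that the backend stores each pair of limits in a small struct. `LimitPair` and `effectiveLimit` are illustrative names only.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical holder for one resource type's pair of limits, e.g.
// maxPerStageDescriptorUpdateAfterBindSampledImages and
// maxDescriptorSetUpdateAfterBindSampledImages.
struct LimitPair {
  uint32_t perStage;
  uint32_t perSet;
};

// With one descriptor set shared by all stages, the usable array size
// is bounded by the smaller of the two limits.
inline uint32_t effectiveLimit(const LimitPair& l) {
  return std::min(l.perStage, l.perSet);
}
```

With the minimum values listed above, uniform buffers come out at 12 and the other resource types at 500,000.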

Other limits of concern (obsolete):

VkPhysicalDeviceLimits::maxPerStageDescriptorSamplers       = 16+;
VkPhysicalDeviceLimits::maxPerStageDescriptorUniformBuffers = 12+;
VkPhysicalDeviceLimits::maxPerStageDescriptorStorageBuffers = 4+;
VkPhysicalDeviceLimits::maxPerStageDescriptorSampledImages  = 16+;
VkPhysicalDeviceLimits::maxPerStageDescriptorStorageImages  = 4+;
VkPhysicalDeviceLimits::maxPerStageResources                = 128+;

VkPhysicalDeviceLimits::maxDescriptorSetSamplers       = 96+;
VkPhysicalDeviceLimits::maxDescriptorSetUniformBuffers = 72+;
VkPhysicalDeviceLimits::maxDescriptorSetStorageBuffers = 24+;
VkPhysicalDeviceLimits::maxDescriptorSetSampledImages  = 96+;
VkPhysicalDeviceLimits::maxDescriptorSetStorageImages  = 24+;

With such limits, `realloc` has to manage the per-stage, per-resource and per-set limits somehow.

#### DirectX12
Note: Tempest uses spirv-cross to generate HLSL, but the produced HLSL is not valid:

// error: more than one unbounded resource (ssbo and tex) in space 0
ByteAddressBuffer ssbo[] : register(t1, space0);
Texture2D tex[] : register(t0, space0);
SamplerState _tex_sampler[] : register(s0, space0);
RWTexture2D ret : register(u2, space0);

Apparently spirv-cross follows the `VARIABLE_DESCRIPTOR_COUNT` workflow. This maps directly to
`D3D12_DESCRIPTOR_RANGE::NumDescriptors = -1`, with the same limitation of only one runtime array per set. In theory, this can be worked around by instrumenting the SPIR-V:
`OpDecorate %tex DescriptorSet 0 -> OpDecorate %tex DescriptorSet UNIQ_SPACE`
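
A minimal sketch of that instrumentation pass, working directly on the SPIR-V word stream before it is handed to spirv-cross. The opcode and decoration values come from the SPIR-V specification; `patchDescriptorSet` is a hypothetical helper, and finding which ids are runtime arrays (and which set number is free) is assumed to happen elsewhere.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Values from the SPIR-V specification.
constexpr uint32_t kSpirvMagic        = 0x07230203u;
constexpr uint32_t kOpDecorate        = 71u;
constexpr uint32_t kDecorationDescSet = 34u; // Decoration::DescriptorSet
constexpr size_t   kHeaderWords       = 5;

// Rewrites `OpDecorate %targetId DescriptorSet N` so that it uses `newSet`.
// Returns the number of decorations that were patched.
inline size_t patchDescriptorSet(std::vector<uint32_t>& code,
                                 uint32_t targetId, uint32_t newSet) {
  if(code.size() < kHeaderWords || code[0] != kSpirvMagic)
    return 0;
  size_t patched = 0;
  for(size_t i = kHeaderWords; i < code.size();) {
    const uint32_t wordCount = code[i] >> 16;
    const uint32_t opcode    = code[i] & 0xFFFFu;
    if(wordCount == 0 || i + wordCount > code.size())
      break; // malformed stream: stop rather than read out of bounds
    // OpDecorate layout: [header, target id, decoration, literal...]
    if(opcode == kOpDecorate && wordCount >= 4 &&
       code[i + 1] == targetId && code[i + 2] == kDecorationDescSet) {
      code[i + 3] = newSet;
      ++patched;
    }
    i += wordCount;
  }
  return patched;
}
```

Since only a literal operand changes, the instruction stream keeps its length and no ids need remapping. The cost is that every runtime array ends up in its own register space, which the root signature then has to reflect.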

Limits:
| Resources Available to the Pipeline | Tier 1 | Tier 2 | Tier 3 |
|----------------|-------|-------|---|
|  Feature levels | 11.0+ | 11.0+ | 11.1+ |
| Maximum number of descriptors in a CBV/SRV/UAV heap used for rendering  | 1,000,000 | 1,000,000 | 1,000,000+ |
|  Maximum number of CBV in all descriptor tables per shader stage | 14 | 14 | full heap |
|  Maximum number of SRV in all descriptor tables per shader stage | 128 | full heap | full heap |
|  Maximum number of UAV in all descriptor tables per shader stage | 64 for feature levels 11.1+, 8 for feature level 11 | 64 | full heap |
|  Maximum number of Samplers in all descriptor tables per shader stage |  16 | 2048 | 2048 |

---
`ID3D12GraphicsCommandList::SetDescriptorHeaps`
Only one descriptor heap of each type can be set at one time, which means a maximum of 2 heaps (one sampler, one CBV/SRV/UAV) can be set at one time.
DX12 is a bit awkward, because the limit is shared by all descriptor types except samplers. The heap can probably "just" be split into equal partitions.
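
Splitting the shared heap into equal partitions could be as simple as the sketch below. `HeapRange` and `heapPartition` are hypothetical names; the heap size is whatever was passed when the CBV/SRV/UAV heap was created.

```cpp
#include <cstdint>

// One slice of the shader-visible CBV/SRV/UAV heap.
struct HeapRange {
  uint32_t offset; // first descriptor index of the slice
  uint32_t count;  // number of descriptors in the slice
};

// Splits `heapSize` descriptors into `parts` equal slices and returns
// slice number `index`. Any remainder stays unused at the end of the heap.
// An invalid request yields an empty range.
inline HeapRange heapPartition(uint32_t heapSize, uint32_t parts, uint32_t index) {
  if(parts == 0 || index >= parts)
    return {0, 0};
  const uint32_t count = heapSize / parts;
  return {index * count, count};
}
```

For a 1,000,000-entry heap split in two, this gives 500,000 descriptors per partition, which lines up with the Metal tier 2 numbers.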

#### Metal [3]

Limits (per-app resources available at any given time):
| Resources Available to the Pipeline | Tier 1 (iOS) | Tier 1 | Tier 2  |
|-------------------------------------|--------------|--------|---------|
| Buffers (and TLASes)                | 31           | 64     | 500,000 |
| Textures                            | 31           | 128    | 500,000 |
| Samplers                            | 16           | 16     | 2048    |

For both tiers, the maximum number of argument buffer entries in each function argument table is 8.

*Writable textures aren’t supported within an argument buffer.
Tier 1 argument buffers can’t be accessed through pointer indexing, nor can they include pointers to other argument buffers.
Tier 2 argument buffers can be accessed through pointer indexing.

Tier 1 argument buffers are practically the same as descriptor sets in Vulkan and offer nothing useful here.
Tier 2 allows pointer indexing and can be leveraged for a bindless array.

Sources:
https://gist.github.com/DethRaid/0171f3cfcce51950ee4ef96c64f59617
https://docs.microsoft.com/en-us/windows/win32/api/d3d12/ns-d3d12-d3d12_descriptor_range
https://learn.microsoft.com/en-us/windows/win32/direct3d12/hardware-support?redirectedfrom=MSDN
https://developer.apple.com/documentation/metal/buffers/about_argument_buffers
https://developer.apple.com/documentation/metal/buffers/managing_groups_of_resources_with_argument_buffers

### GLSL
An unbound array of descriptors has two meanings:
Base spec:
`uniform sampler2D tex[]` -> `OpTypeArray %8 %uint_1`
The size of the array depends on the highest index used in the code.

`GL_EXT_nonuniform_qualifier`:
May work the same as the base spec if no runtime index is used; otherwise:
`uniform sampler2D tex[]` -> `OpTypeRuntimeArray %8` // legal only if the driver supports descriptor indexing

### Engine side
[wip]
Generally, a Metal-like model is a good middle ground:

maxUAV      = 500'000; // ssbo + tlas + imageStore
maxTextures = 500'000;
maxSamplers = 2048;
// can skip maxUbo - hard in vulkan and not very useful
// combined image consumes both Texture and Samplers limits


In DX, the UAV/Tex split can be achieved by splitting the heap into 2 parts.
In Vulkan, the UAV limit is probably the min over all applicable resource limits.
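
The Metal-like limits listed above could be tracked with a small budget counter. This is a sketch only: `BindlessLimits`, `Kind` and `BindlessUsage` are hypothetical names. The one rule it encodes from the list is that a combined image counts against both the texture and the sampler limit.

```cpp
#include <cstdint>

// Engine-level limits, with the values from the list above.
struct BindlessLimits {
  uint32_t maxUAV      = 500000; // ssbo + tlas + imageStore
  uint32_t maxTextures = 500000;
  uint32_t maxSamplers = 2048;
};

enum class Kind { Ssbo, Tlas, StorageImage, Texture, Sampler, CombinedImage };

// Running totals of descriptors requested by a pipeline.
struct BindlessUsage {
  uint64_t uav = 0, textures = 0, samplers = 0;

  void add(Kind k, uint32_t count) {
    switch(k) {
      case Kind::Ssbo:
      case Kind::Tlas:
      case Kind::StorageImage:  uav += count; break;
      case Kind::Texture:       textures += count; break;
      case Kind::Sampler:       samplers += count; break;
      // A combined image consumes both the texture and the sampler budget.
      case Kind::CombinedImage: textures += count; samplers += count; break;
    }
  }

  bool fits(const BindlessLimits& l) const {
    return uav <= l.maxUAV && textures <= l.maxTextures && samplers <= l.maxSamplers;
  }
};
```

One consequence is visible right away: an unbound array of `sampler2D` is capped by the sampler limit (2048), not the texture limit, which is another argument for supporting separate samplers and textures.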
Try commented 1 year ago

TODO, for DX12:

Try commented 2 months ago

error: number of textures with read_write access exceeds maximum supported (8)

Apparently undocumented. MoltenVK allows 500k if argument buffer tier 2 is supported (why?) and 8 otherwise.

Try commented 5 days ago

New Mac/iOS feature to track residency of resources: https://developer.apple.com/documentation/metal/resource_fundamentals/simplifying_gpu_resource_management_with_residency_sets?language=objc

According to Apple: You don’t need to call the following methods for any allocation in a residency set that you associate with the command buffer: useResource, useHeap