MircoWerner / VkRadixSort

GPU Radix Sort implemented in Vulkan and GLSL.
MIT License

barrier issue on Intel UHD Graphics #3

Closed · hyperlogic closed this issue 6 months ago

hyperlogic commented 6 months ago

Hi, I'm using the VkRadixSort shader in my application, and it's working great on NVIDIA cards! However, I ran into an issue on Intel UHD Graphics, possibly due to this warning during shader compilation:

[ERROR] shader compilation warning for "shader/multi_radixsort.glsl"!
WARNING: 0:101: '' : barrier call inside control flow
WARNING: 0:116: '' : barrier call inside control flow
WARNING: 0:137: '' : barrier call inside control flow

Any ideas?

One big difference is that my app uses OpenGL, not Vulkan. Do I maybe need to change the GLSL version number (currently #version 460)? Any advice would be appreciated.

If you need more context, the source for my app is here.

The line numbers might be different from your original source, so here's a full dump of the version I'm using:

0001: /**
0002: * VkRadixSort written by Mirco Werner: https://github.com/MircoWerner/VkRadixSort
0003: * Based on implementation of Intel's Embree: https://github.com/embree/embree/blob/v4.0.0-ploc/kernels/rthwif/builder/gpu/sort.h
0004: */
0005: #version 460
0006: //#extension GL_GOOGLE_include_directive: enable
0007: #extension GL_KHR_shader_subgroup_basic: enable
0008: #extension GL_KHR_shader_subgroup_arithmetic: enable
0009: #extension GL_KHR_shader_subgroup_ballot: enable
0010:
0011: #define WORKGROUP_SIZE 256// assert WORKGROUP_SIZE >= RADIX_SORT_BINS
0012: #define RADIX_SORT_BINS 256
0013: #define SUBGROUP_SIZE 32// 32 NVIDIA; 64 AMD
0014:
0015: #define BITS 32// sorting uint32_t
0016:
0017: layout (local_size_x = WORKGROUP_SIZE) in;
0018:
0019: uniform uint g_num_elements;
0020: uniform uint g_shift;
0021: uniform uint g_num_workgroups;
0022: uniform uint g_num_blocks_per_workgroup;
0023:
0024: layout (std430, set = 1, binding = 0) buffer elements_in {
0025:     uint g_elements_in[];
0026: };
0027:
0028: layout (std430, set = 1, binding = 1) buffer elements_out {
0029:     uint g_elements_out[];
0030: };
0031:
0032: layout (std430, set = 1, binding = 2) buffer indices_in {
0033:     uint g_indices_in[];
0034: };
0035:
0036: layout (std430, set = 1, binding = 3) buffer indices_out {
0037:     uint g_indices_out[];
0038: };
0039:
0040: layout (std430, set = 1, binding = 4) buffer histograms {
0041: // [histogram_of_workgroup_0 | histogram_of_workgroup_1 | ... ]
0042:     uint g_histograms[];// |g_histograms| = RADIX_SORT_BINS * #WORKGROUPS = RADIX_SORT_BINS * g_num_workgroups
0043: };
0044:
0045: shared uint[RADIX_SORT_BINS / SUBGROUP_SIZE] sums;// subgroup reductions
0046: shared uint[RADIX_SORT_BINS] global_offsets;// global exclusive scan (prefix sum)
0047:
0048: struct BinFlags {
0049:     uint flags[WORKGROUP_SIZE / BITS];
0050: };
0051: shared BinFlags[RADIX_SORT_BINS] bin_flags;
0052:
0053: void main() {
0054:     uint gID = gl_GlobalInvocationID.x;
0055:     uint lID = gl_LocalInvocationID.x;
0056:     uint wID = gl_WorkGroupID.x;
0057:     uint sID = gl_SubgroupID;
0058:     uint lsID = gl_SubgroupInvocationID;
0059:
0060:     uint local_histogram = 0;
0061:     uint prefix_sum = 0;
0062:     uint histogram_count = 0;
0063:
0064:     if (lID < RADIX_SORT_BINS) {
0065:         uint count = 0;
0066:         for (uint j = 0; j < g_num_workgroups; j++) {
0067:             const uint t = g_histograms[RADIX_SORT_BINS * j + lID];
0068:             local_histogram = (j == wID) ? count : local_histogram;
0069:             count += t;
0070:         }
0071:         histogram_count = count;
0072:         const uint sum = subgroupAdd(histogram_count);
0073:         prefix_sum = subgroupExclusiveAdd(histogram_count);
0074:         if (subgroupElect()) {
0075:             // one thread inside the warp/subgroup enters this section
0076:             sums[sID] = sum;
0077:         }
0078:     }
0079:     barrier();
0080:
0081:     if (lID < RADIX_SORT_BINS) {
0082:         const uint sums_prefix_sum = subgroupBroadcast(subgroupExclusiveAdd(sums[lsID]), sID);
0083:         const uint global_histogram = sums_prefix_sum + prefix_sum;
0084:         global_offsets[lID] = global_histogram + local_histogram;
0085:     }
0086:
0087:     //     ==== scatter keys according to global offsets =====
0088:     const uint flags_bin = lID / BITS;
0089:     const uint flags_bit = 1 << (lID % BITS);
0090:
0091:     for (uint index = 0; index < g_num_blocks_per_workgroup; index++) {
0092:         uint elementId = wID * g_num_blocks_per_workgroup * WORKGROUP_SIZE + index * WORKGROUP_SIZE + lID;
0093:
0094:         // initialize bin flags
0095:         if (lID < RADIX_SORT_BINS) {
0096:             for (int i = 0; i < WORKGROUP_SIZE / BITS; i++) {
0097:                 bin_flags[lID].flags[i] = 0U;// init all bin flags to 0
0098:             }
0099:         }
0100:         barrier();
0101:
0102:         uint element_in = 0;
0103:         uint index_in = 0;
0104:         uint binID = 0;
0105:         uint binOffset = 0;
0106:         if (elementId < g_num_elements) {
0107:             element_in = g_elements_in[elementId];
0108:             index_in = g_indices_in[elementId];
0109:             binID = (element_in >> g_shift) & uint(RADIX_SORT_BINS - 1);
0110:             // offset for group
0111:             binOffset = global_offsets[binID];
0112:             // add bit to flag
0113:             atomicAdd(bin_flags[binID].flags[flags_bin], flags_bit);
0114:         }
0115:         barrier();
0116:
0117:         if (elementId < g_num_elements) {
0118:             // calculate output index of element
0119:             uint prefix = 0;
0120:             uint count = 0;
0121:             for (uint i = 0; i < WORKGROUP_SIZE / BITS; i++) {
0122:                 const uint bits = bin_flags[binID].flags[i];
0123:                 const uint full_count = bitCount(bits);
0124:                 const uint partial_count = bitCount(bits & (flags_bit - 1));
0125:                 prefix += (i < flags_bin) ? full_count : 0U;
0126:                 prefix += (i == flags_bin) ? partial_count : 0U;
0127:                 count += full_count;
0128:             }
0129:             g_elements_out[binOffset + prefix] = element_in;
0130:             g_indices_out[binOffset + prefix] = index_in;
0131:             if (prefix == count - 1) {
0132:                 atomicAdd(global_offsets[binID], count);
0133:             }
0134:         }
0135:
0136:         barrier();
0137:     }
0138: }
hyperlogic commented 6 months ago

Switching #define SUBGROUP_SIZE from 32 to 64 fixed the issue. The warning is still there, but the sort is correct. I guess I'll have to change that define depending on the platform.
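For future reference, rather than hard-coding the value, something like this might work (just a sketch, not what's in the repo; it assumes the host queries the subgroup size, e.g. GL_SUBGROUP_SIZE_KHR from KHR_shader_subgroup on OpenGL or VkPhysicalDeviceSubgroupProperties::subgroupSize on Vulkan, and prepends the matching #define before compiling):

// Sketch only: the host prepends the queried value, e.g. "#define SUBGROUP_SIZE 32",
// before handing the source to the compiler.
#ifndef SUBGROUP_SIZE
#define SUBGROUP_SIZE 32 // fallback; must match the hardware subgroup size
#endif

// The value matters because it sizes the shared array used for the per-subgroup
// reductions when the per-workgroup histograms are merged:
shared uint[RADIX_SORT_BINS / SUBGROUP_SIZE] sums;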

Closing

Thanks for making this open source!

hyperlogic commented 6 months ago

Er, I take that back. For small inputs (< 32 elements) the sort is correct, but for larger inputs it is not.

hyperlogic commented 6 months ago

Circling back... the sort works! The warning is still there, however. My error was something different (the z-buffer depth defaulted to 16 bits on Intel UHD, resulting in z-fighting). Closing this issue.

Thanks again.

MircoWerner commented 5 months ago

Hi, glad that you figured it out! Just for completeness: the warning is correct that the barrier is inside control flow. However, since g_num_blocks_per_workgroup is constant across the dispatch, every invocation of a workgroup enters the for-loop the same number of times and all of them call the barriers, so no deadlocks can occur.
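To illustrate the pattern the compiler is warning about (a simplified sketch, not the exact shader code):

// g_num_blocks_per_workgroup has the same value for every invocation, so the loop
// below runs the same number of times across the whole workgroup.
for (uint index = 0; index < g_num_blocks_per_workgroup; index++) {
    uint elementId = wID * g_num_blocks_per_workgroup * WORKGROUP_SIZE
                   + index * WORKGROUP_SIZE + lID;

    if (elementId < g_num_elements) {
        // divergent, per-invocation work between barriers is fine
    }

    // Reached by every invocation on every iteration because the loop condition is
    // uniform; the compiler only sees "barrier call inside control flow" and warns.
    barrier();
}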