I was analyzing how the Multi Radix Sort works and I don't fully understand how you make sure it works correctly in all circumstances.
The part that concerns me is the scattering/reordering part. Here, if I get it right you use bit masks to count the number of occurrences of a particular binID within the workgroup, next you account this difference on the global level with atomicAdd. So if workgroups were scheduled one after another, "sequentially", your approach would be 100% correct, however the workgroup order is generally undefined and workgroups can be executed in parallel, so I wonder how your algorithm deals with the case when two workgroups are scheduled in parallel or out of order. For the sake of simplicity let's see the parallel case:
So it looks to me that p1 and p2 are local to the respective WG, so in the case of parallel execution of WG1 and WG2, [output[binOffset + p{1,2}] = input] can step on each other toes and write to the same output indices, just because the global level atomicAdd(global_offsets[binID], c{1,2} hasn't been reached yet.
Hi!
I was analyzing how the Multi Radix Sort works and I don't fully understand how you make sure it works correctly in all circumstances. The part that concerns me is the scattering/reordering part. Here, if I get it right you use bit masks to count the number of occurrences of a particular
binID
within the workgroup, next you account this difference on the global level with atomicAdd. So if workgroups were scheduled one after another, "sequentially", your approach would be 100% correct, however the workgroup order is generally undefined and workgroups can be executed in parallel, so I wonder how your algorithm deals with the case when two workgroups are scheduled in parallel or out of order. For the sake of simplicity let's see the parallel case:So it looks to me that
p1
andp2
are local to the respective WG, so in the case of parallel execution of WG1 and WG2,[output[binOffset + p{1,2}] = input]
can step on each other toes and write to the same output indices, just because the global levelatomicAdd(global_offsets[binID], c{1,2}
hasn't been reached yet.Or do I miss something obvious?