Continuous-Collision-Detection / Scalable-CCD

Sweep and Tiniest Queue & Tight-Inclusion GPU CCD
Apache License 2.0

What makes STQ faster versus SAP on the GPU? #2

Closed. Q-Minh closed this issue 1 month ago.

Q-Minh commented 1 month ago

Hi @zfergus,

I was recently learning a bit more about parallelizing contact simulation on the GPU, with a specific focus on broad-phase collision candidate determination. Your related article, Time of Impact Dataset for Continuous Collision Detection and a Scalable Conservative Algorithm, and this public repository were both quite helpful to me.

I wanted to ask about the reasoning behind STQ mapping better to the GPU than SAP, as I'm a bit confused by this. In both approaches, the preprocessing and sorting phases are the same (roughly sketched after the list below), i.e. we:

  1. Compute bounding boxes
  2. Compute their statistical mean and variance
  3. Sort them along largest variance axis
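
To make sure I'm describing the same preprocessing you have in mind, here is a rough host-side sketch of those three steps (the `AABB` struct and function names are placeholders of mine, not your code, and in the actual pipeline these steps would of course run on the GPU):

```cuda
#include <algorithm>
#include <vector>

// Placeholder axis-aligned bounding box; `id` indexes the source primitive.
struct AABB {
    float min[3];
    float max[3];
    int id;
};

// Pick the axis along which the box centers have the largest variance
// (the sum of squared deviations is enough to pick the argmax).
int largest_variance_axis(const std::vector<AABB>& boxes)
{
    double mean[3] = { 0, 0, 0 }, var[3] = { 0, 0, 0 };
    for (const AABB& b : boxes)
        for (int a = 0; a < 3; ++a)
            mean[a] += 0.5 * (b.min[a] + b.max[a]);
    for (int a = 0; a < 3; ++a)
        mean[a] /= boxes.size();
    for (const AABB& b : boxes)
        for (int a = 0; a < 3; ++a) {
            const double c = 0.5 * (b.min[a] + b.max[a]) - mean[a];
            var[a] += c * c;
        }
    int axis = 0;
    if (var[1] > var[axis]) axis = 1;
    if (var[2] > var[axis]) axis = 2;
    return axis;
}

// Sort the boxes by their lower bound along the chosen axis.
void sort_along_axis(std::vector<AABB>& boxes, const int axis)
{
    std::sort(boxes.begin(), boxes.end(),
        [axis](const AABB& a, const AABB& b) { return a.min[axis] < b.min[axis]; });
}
```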

The difference between STQ and SAP is the sweep phase. In both approaches, algorithmically, we assign one thread per bounding box i, which incrementally performs look-aheads: starting from its immediate neighbor in sorted order, it keeps examining the following box as a potential overlap candidate j until j no longer overlaps i along the sort axis. For every potential overlap candidate j, if there is no topological incidence between i and j and if they also overlap along both other axes, the pair is enqueued as a collision candidate in a device-wide atomic fashion. However, STQ (as opposed to SAP) has additional synchronization steps, i.e. warp-wide synchronization, on top of the baseline device-wide synchronization. Work does not seem to be shared between threads of a common warp, i.e. there is no work stealing, so each thread in STQ has the same amount of work as it would have in SAP.
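
In kernel form, the per-thread sweep I have in mind looks roughly like the following (again just my own sketch, reusing the placeholder `AABB` above; `shares_vertex` stands in for the topological-incidence filter and is not a function from your code):

```cuda
// Placeholder for the topological-incidence filter; a real implementation
// would compare the primitives' vertex indices instead of returning false.
__device__ bool shares_vertex(const AABB& /*a*/, const AABB& /*b*/)
{
    return false;
}

// One thread per box i: look ahead in sorted order until the next box no
// longer overlaps along the sort axis; candidates are appended with a
// device-wide atomic counter.
__global__ void sweep_kernel(
    const AABB* boxes,   // sorted by lower bound along `axis`
    int n,
    int axis,            // axis of largest variance
    int2* candidates,    // output buffer of candidate pairs
    int* num_candidates, // device-wide atomic counter
    int max_candidates)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n)
        return;

    const AABB a = boxes[i];
    for (int j = i + 1; j < n; ++j) {
        const AABB b = boxes[j];
        // Stop as soon as b starts past a's end on the sort axis; every
        // later box starts even further along, so it cannot overlap either.
        if (b.min[axis] > a.max[axis])
            break;
        // Overlap test on the two remaining axes plus the incidence filter.
        bool overlap = true;
        for (int k = 0; k < 3; ++k)
            if (k != axis && (a.max[k] < b.min[k] || b.max[k] < a.min[k]))
                overlap = false;
        if (overlap && !shares_vertex(a, b)) {
            const int slot = atomicAdd(num_candidates, 1);
            if (slot < max_candidates)
                candidates[slot] = make_int2(a.id, b.id);
        }
    }
}
```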

Do you mind elaborating on why STQ is a more favorable approach than SAP on the GPU? I'm sure I'm missing something; there must be a reason that using a warp-wide shared queue is preferable.

zfergus commented 1 month ago

Hello,

Thank you for your interest in our work.

The different CUDA implementations can be found here. The difference between SAP and STQ on the GPU is how the work is divided between the threads.

In SAP, much like on the CPU, each thread is assigned a box and it iterates until it finds the next box in the sorted list that does not intersect. This has the limitation that the workload can be unbalanced between different threads. E.g., one thread may only find 1 intersection while another may find 100. On the CPU this works well as the thread with little work can be freed and assigned a new box. However, on the GPU the thread with little work will have to wait for all threads in its block to finish before being assigned a new box to process.

In contrast, STQ uses a shared queue of boxes to examine and each thread pulls a single box from this queue, processes it, and (possibly) pushes a single box onto the queue to be processed in the next iteration. By processing boxes in this way, each thread will be assigned the same amount of work. The only downside is that threads need to sync up at times to make sure the queue's size is not subject to race conditions.
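
Schematically, and heavily simplified (this is not the actual implementation; it reuses the placeholder `AABB`/`shares_vertex` from your sketches, uses block-wide rather than warp-wide synchronization for brevity, and `QUEUE_CAPACITY` is made up), the queue-based loop looks something like this, where each queue entry is one pending comparison (i, j):

```cuda
constexpr int QUEUE_CAPACITY = 1024; // enough for one entry per thread

__global__ void stq_like_kernel(
    const AABB* boxes, int n, int axis,
    int2* candidates, int* num_candidates, int max_candidates)
{
    // Double-buffered shared queue of pending (i, j) comparisons. Each
    // entry spawns at most one continuation, so the queue never holds
    // more than blockDim.x entries.
    __shared__ int2 queue[2][QUEUE_CAPACITY];
    __shared__ int count[2];

    const int tid = threadIdx.x;
    const int i = blockIdx.x * blockDim.x + tid; // this thread's seed box

    if (tid == 0) { count[0] = 0; count[1] = 0; }
    __syncthreads();

    // Seed: one initial comparison (i, i + 1) per box owned by this block.
    if (i + 1 < n && boxes[i + 1].min[axis] <= boxes[i].max[axis]) {
        const int slot = atomicAdd(&count[0], 1);
        queue[0][slot] = make_int2(i, i + 1);
    }
    __syncthreads();

    int cur = 0;
    while (count[cur] > 0) { // uniform across the block after __syncthreads()
        const int next = 1 - cur;
        const int pending = count[cur];

        // Every thread pops one pending comparison (strided just in case),
        // so the work per thread is the same in each iteration.
        for (int q = tid; q < pending; q += blockDim.x) {
            const int2 pair = queue[cur][q];
            const AABB a = boxes[pair.x];
            const AABB b = boxes[pair.y];

            // Full overlap test on the two non-sort axes + incidence filter.
            bool overlap = true;
            for (int k = 0; k < 3; ++k)
                if (k != axis && (a.max[k] < b.min[k] || b.max[k] < a.min[k]))
                    overlap = false;
            if (overlap && !shares_vertex(a, b)) {
                const int out = atomicAdd(num_candidates, 1);
                if (out < max_candidates)
                    candidates[out] = make_int2(a.id, b.id);
            }

            // Continue box pair.x's sweep if the next box still overlaps
            // it along the sort axis.
            const int j = pair.y + 1;
            if (j < n && boxes[j].min[axis] <= a.max[axis]) {
                const int slot = atomicAdd(&count[next], 1);
                queue[next][slot] = make_int2(pair.x, j);
            }
        }
        // These block-wide syncs are the extra synchronization you mention:
        // they keep the queue counters consistent between iterations.
        __syncthreads();
        if (tid == 0)
            count[cur] = 0; // recycle the buffer we just drained
        cur = next;
        __syncthreads();
    }
}
```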

Hopefully, this clears up the difference between the two algorithms. Feel free to follow up with any other questions, and hopefully, we will have an updated publication to share soon.

Best, Zach