StreamHPC / gromacs

OpenCL porting of the GROMACS molecular simulation toolkit
http://www.gromacs.org

List the warp dependent characteristics of the CUDA kernel #38

Closed ancahamuraru closed 9 years ago

ancahamuraru commented 9 years ago

The CUDA kernel relies on warp size and warp behavior, e.g. for synchronization. Read the code and list such characteristics.

dkarkoulis commented 9 years ago

The original kernel uses a block of 64 threads, which is 2 warps on NVIDIA hardware. The kernel is configured to take the number of warps into account, explicitly or implicitly. Since, with this work-group configuration, an AMD GPU runs one warp (wave-front) per block and the CPU (used for debugging) effectively runs 64 one-thread warps, several issues can arise.

1) Local memory related

1a) xqib and exclusion forces: This is, thankfully, not an issue. There is an explicit barrier between the writing and the reading of the local memory, so it causes no problem on an AMD GPU or the CPU. The barrier is redundant on an AMD GPU, as there is only one warp.

1b) cjs and pre-loading cj: This was an issue on the CPU, because the use of shared memory depends on the implicit synchronisation within a warp. It is not an issue on an AMD GPU. For the CPU it has been resolved by adding barriers (it could also be resolved by increasing the size of the local memory, but we are not interested in performance for the CPU case). On an AMD GPU we will only need half of that local memory (other buffers can be reduced too; this will be explained later).

1c) Reduction pow2 functions: local reduction. It is readily usable on an AMD GPU (the offset of the local reduction buffer, "tidx&WARP_SIZE", will always be 0, and wrapping the index around WARP_SIZE has no effect, so the call can be simplified). It should also work properly for other warp sizes, except on the CPU, where some modifications are needed, or the generic reduction can be used instead.

2) Warp-dependent allocations

2a) im_ei (pl_cj4[j4].imei[widx]) and pair for the pair-lists: im_ei has one entry per warp (2), and this is hard-coded on both the host and the device side. pair has the size of a warp (32) and is also hard-coded. Using them on a device with a different warp size should not be an issue as long as the warp size is >= 32. For the CPU, barriers are needed (and have been added). For an AMD GPU this part can be simplified.

2b) Warp vote and pruning: warp voting has been implemented using shared memory (one element per warp); for now it is hard-coded for 2 warps. It causes no problem on an AMD GPU or the CPU. I am not sure that simply replacing the two per-warp pruning masks with one will be correct, as it might need modifications to the input mask on the host side.

dkarkoulis commented 9 years ago

Concerning (1b) "cjs and pre-loading cj": on the CPU the barriers were not correct and were found to cause issues. They were removed, along with the pre-loading of cj into local memory, for the case of 1-thread warps. This way no synchronisation is needed at all.