StreamHPC / gromacs

OpenCL porting of the GROMACS molecular simulation toolkit
http://www.gromacs.org

Optimize data layout for f, f_shift, shift_vec device buffers (float3) #27

Closed ancahamuraru closed 9 years ago

ancahamuraru commented 9 years ago

These buffers contain elements of type (float, float, float). Because an OpenCL float3 value occupies 4 x 4 = 16 bytes (the same as a float4), the data type of f, f_shift and shift_vec has been changed to float.

However, the content of the buffers has remained the same. Each element consists of 3 consecutive float values.
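To illustrate, a minimal sketch of how such a packed float buffer is addressed on the device side (the kernel and names are hypothetical, not the actual GROMACS code):

```c
/* Hypothetical sketch: f is a plain float buffer in which atom i's force
 * occupies f[3*i], f[3*i+1], f[3*i+2] (12 bytes per atom, no padding). */
__kernel void scale_forces(__global float *f, float s, int natoms)
{
    int i = get_global_id(0);

    if (i < natoms)
    {
        /* Component-wise, float-by-float access: */
        f[3 * i]     *= s;
        f[3 * i + 1] *= s;
        f[3 * i + 2] *= s;
    }

    /* The same packed data could also be read/written as a vector with
     * vload3()/vstore3(); declaring the buffer as __global float3*
     * would instead pad each element to 16 bytes. */
}
```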

This needs to be optimized.

Possible solutions:

See also issue #17

pszi1ard commented 9 years ago

Why is that important? At least on NVIDIA we make sure to avoid shared memory bank conflicts, and global memory is accessed through atomic operations anyway, so coalescing is less of an issue. In the CUDA implementation I initially used float4 force buffers (because of the well-known coalescing issues with float3), but as this turned out not to be relevant I soon switched to float3 to shave 25% off the force D2H transfer time (12 instead of 16 bytes per atom).

ancahamuraru commented 9 years ago

First of all, there is no atomic_add for float3 (nor for plain float) in OpenCL 1.1 or 1.2. As a consequence, the OpenCL kernels issue one atomic_add-like call for each of the 3 components of a float3 value.

Second, the shared-memory buffers used to implement the reductions also store the float3 components in separate planes.

The only downside of changing the data layout of those float3 global memory buffers is that it requires a translation from the host-side layout to the device-side layout.
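For reference, the usual way around the missing float atomics in OpenCL 1.1/1.2 is a compare-and-swap loop per component; a minimal sketch (the helper names are illustrative, not the actual kernel code):

```c
/* Emulated atomic add on a global float using atomic_cmpxchg
 * (OpenCL 1.1/1.2 provide only 32-bit integer atomics). */
inline void atomic_add_f(volatile __global float *addr, float val)
{
    unsigned int old_bits, new_bits;

    do
    {
        old_bits = as_uint(*addr);
        new_bits = as_uint(as_float(old_bits) + val);
    }
    while (atomic_cmpxchg((volatile __global unsigned int *)addr,
                          old_bits, new_bits) != old_bits);
}

/* One such call per component of a float3, as described above. */
inline void atomic_add_f3(volatile __global float *addr, float3 val)
{
    atomic_add_f(addr,     val.x);
    atomic_add_f(addr + 1, val.y);
    atomic_add_f(addr + 2, val.z);
}
```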

pszi1ard commented 9 years ago

There are no atomic operations for vector types in CUDA either; I simply overload atomicAdd() to make the code a bit more elegant, see src/gromacs/gmxlib/cuda_tools/vectype_ops.cuh.

What's the advantage of using arrays of force components rather than arrays of float3's? I wouldn't think it's more efficient on NVIDIA; is this a problem on AMD?

Unless there is a reasonably large gain from it, I don't think we should transform the global memory buffer layout. Also note that with domain decomposition the current storage layout allows blocking coordinates/forces so that the first part of the array holds local data and the second part non-local data. Splitting into components would mean that three copies would need to be done for a single force buffer transfer, which has both API overhead and, as it reduces the size of the buffer copied in one transfer, makes it even harder to get close to peak PCI-E bandwidth.
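To make the transfer-count argument concrete, a hedged host-side sketch of the two alternatives (function names, buffer handles and sizes are hypothetical):

```c
#include <CL/cl.h>

/* Current interleaved layout: local and non-local forces are contiguous,
 * so any atom range [start, start+natoms) is fetched with one call. */
cl_int read_forces_interleaved(cl_command_queue q, cl_mem d_f, float *h_f,
                               size_t start, size_t natoms)
{
    return clEnqueueReadBuffer(q, d_f, CL_TRUE, /* blocking read for simplicity */
                               start * 3 * sizeof(float),
                               natoms * 3 * sizeof(float),
                               h_f + start * 3, 0, NULL, NULL);
}

/* Component-separated ("plane") layout: the same range needs three calls,
 * each transferring a third of the data (more API overhead, smaller copies). */
cl_int read_forces_planes(cl_command_queue q, cl_mem d_f, float *h_f,
                          size_t start, size_t natoms, size_t natoms_total)
{
    cl_int err = CL_SUCCESS;
    for (int c = 0; c < 3 && err == CL_SUCCESS; c++)
    {
        err = clEnqueueReadBuffer(q, d_f, CL_TRUE,
                                  (c * natoms_total + start) * sizeof(float),
                                  natoms * sizeof(float),
                                  h_f + c * natoms_total + start,
                                  0, NULL, NULL);
    }
    return err;
}
```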

If such an optimization is still warranted, the transformation should be done in nbnxn_atomdata.c:nbnxn_atomdata_add_nbat_f_to_f().

ancahamuraru commented 9 years ago

Because the f, fshift and shift_vec arrays are all declared as float buffers instead of float3 buffers (see "3 component vector data type size" here: https://github.com/StreamComputing/gromacs/wiki/A1.3-log), their elements are always accessed float by float instead of float3 by float3.

This raised the question of whether memory coalescing could be achieved by separating the float3 components into 3 different planes. However, after rechecking how the kernels access the three buffers and considering your notes, such a change would bring little to no performance improvement.
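For clarity, this is roughly what the plane-separated layout under consideration would look like on the device side (a hypothetical sketch, not actual kernel code):

```c
/* Hypothetical plane-separated ("structure of arrays") layout: all x
 * components first, then all y, then all z. Consecutive work-items touch
 * consecutive floats within each plane, so per-component accesses coalesce. */
__kernel void scale_forces_planes(__global float *f, float s, int natoms)
{
    int i = get_global_id(0);

    if (i < natoms)
    {
        f[i]              *= s;  /* x plane */
        f[natoms + i]     *= s;  /* y plane */
        f[2 * natoms + i] *= s;  /* z plane */
    }
}
```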

This is how the kernels do reads/writes for the three buffers (correct me if I'm wrong):