StreamHPC / gromacs

OpenCL porting of the GROMACS molecular simulation toolkit
http://www.gromacs.org

Optimize data layout for f, f_shift, shift_vec device buffers (float3) #27

Closed ancahamuraru closed 9 years ago

ancahamuraru commented 9 years ago

These buffers contain elements of type (float, float, float). Because an OpenCL float3 value occupies 4 x 4 = 16 bytes (the same as a float4), the data type of f, f_shift and shift_vec has been changed to float.

However, the content of the buffers has remained the same. Each element consists of 3 consecutive float values.
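To illustrate, a minimal sketch of how such a packed float buffer is addressed on the device side (the kernel and names are hypothetical, not the actual GROMACS code):

```c
/* Hypothetical sketch: f is a plain float buffer in which atom i's force
 * occupies f[3*i], f[3*i+1], f[3*i+2] (12 bytes per atom, no padding). */
__kernel void scale_forces(__global float *f, float s, int natoms)
{
    int i = get_global_id(0);

    if (i < natoms)
    {
        /* Component-wise, float-by-float access: */
        f[3 * i]     *= s;
        f[3 * i + 1] *= s;
        f[3 * i + 2] *= s;
    }

    /* The same packed data could also be read/written as a vector with
     * vload3()/vstore3(); declaring the buffer as __global float3*
     * would instead pad each element to 16 bytes. */
}
```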

This needs to be optimized.

Possible solutions:

See also issue #17

pszi1ard commented 9 years ago

Why is that important? At least on NVIDIA we make sure to avoid shared memory bank conflicts, and global memory is accessed through atomic operations anyway, so coalescing is less of an issue. In the CUDA implementation I initially used float4 force buffers (because of the well-known coalescing issues with float3), but as this turned out not to be relevant I soon switched to float3 to shave 25% off the force D2H transfer time (12 instead of 16 bytes per atom).

ancahamuraru commented 9 years ago

First of all, there is no atomic_add for float3 (nor for plain float) in OpenCL 1.1 or 1.2. As a consequence, the OpenCL kernels issue one atomic_add-like call for each of the 3 components of a float3 value.

Second, the shared-memory buffers used to implement the reductions also store the float3 components in separate planes.

The only downside of changing the data layout of those float3 global memory buffers is that it requires a translation from the host-side layout to the device-side layout.
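For reference, the usual way around the missing float atomics in OpenCL 1.1/1.2 is a compare-and-swap loop per component; a minimal sketch (the helper names are illustrative, not the actual kernel code):

```c
/* Emulated atomic add on a global float using atomic_cmpxchg
 * (OpenCL 1.1/1.2 provide only 32-bit integer atomics). */
inline void atomic_add_f(volatile __global float *addr, float val)
{
    unsigned int old_bits, new_bits;

    do
    {
        old_bits = as_uint(*addr);
        new_bits = as_uint(as_float(old_bits) + val);
    }
    while (atomic_cmpxchg((volatile __global unsigned int *)addr,
                          old_bits, new_bits) != old_bits);
}

/* One such call per component of a float3, as described above. */
inline void atomic_add_f3(volatile __global float *addr, float3 val)
{
    atomic_add_f(addr,     val.x);
    atomic_add_f(addr + 1, val.y);
    atomic_add_f(addr + 2, val.z);
}
```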

pszi1ard commented 9 years ago

There are no atomic operations for vector types in CUDA either; I simply overload atomicAdd() to make the code a bit more elegant, see src/gromacs/gmxlib/cuda_tools/vectype_ops.cuh.

What's the advantage of using arrays of force components rather than arrays of float3's? I wouldn't think it's more efficient on NVIDIA; is this a problem on AMD?

Unless there is a reasonably large gain from it, I don't think we should transform the global memory buffer layout. Also note that with domain decomposition the current storage layout allows blocking coordinates/forces so that the first part of the array holds local data and the second part non-local data. Splitting into components would mean that three copies would need to be done for a single force buffer transfer, which has both API overhead and, as it reduces the size of the buffer copied in one transfer, makes it even harder to get close to peak PCI-E bandwidth.
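To make the transfer-count argument concrete, a hedged host-side sketch of the two alternatives (function names, buffer handles and sizes are hypothetical):

```c
#include <CL/cl.h>

/* Current interleaved layout: local and non-local forces are contiguous,
 * so any atom range [start, start+natoms) is fetched with one call. */
cl_int read_forces_interleaved(cl_command_queue q, cl_mem d_f, float *h_f,
                               size_t start, size_t natoms)
{
    return clEnqueueReadBuffer(q, d_f, CL_TRUE, /* blocking read for simplicity */
                               start * 3 * sizeof(float),
                               natoms * 3 * sizeof(float),
                               h_f + start * 3, 0, NULL, NULL);
}

/* Component-separated ("plane") layout: the same range needs three calls,
 * each transferring a third of the data (more API overhead, smaller copies). */
cl_int read_forces_planes(cl_command_queue q, cl_mem d_f, float *h_f,
                          size_t start, size_t natoms, size_t natoms_total)
{
    cl_int err = CL_SUCCESS;
    for (int c = 0; c < 3 && err == CL_SUCCESS; c++)
    {
        err = clEnqueueReadBuffer(q, d_f, CL_TRUE,
                                  (c * natoms_total + start) * sizeof(float),
                                  natoms * sizeof(float),
                                  h_f + c * natoms_total + start,
                                  0, NULL, NULL);
    }
    return err;
}
```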

If such an optimization is still warranted, the transformation should be done in nbnxn_atomdata.c:nbnxn_atomdata_add_nbat_f_to_f().

ancahamuraru commented 9 years ago

Because the f, fshift and shift_vec arrays are all declared as float buffers instead of float3 buffers (see "3 component vector data type size" here: https://github.com/StreamComputing/gromacs/wiki/A1.3-log), their elements are always accessed float by float instead of float3 by float3.

This raised the question of whether memory coalescing could be achieved by separating the float3 components into 3 different planes. However, after rechecking how the kernels access the three buffers and considering your notes, such a change would bring little to no performance improvement.
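For clarity, this is roughly what the plane-separated layout under consideration would look like on the device side (a hypothetical sketch, not actual kernel code):

```c
/* Hypothetical plane-separated ("structure of arrays") layout: all x
 * components first, then all y, then all z. Consecutive work-items touch
 * consecutive floats within each plane, so per-component accesses coalesce. */
__kernel void scale_forces_planes(__global float *f, float s, int natoms)
{
    int i = get_global_id(0);

    if (i < natoms)
    {
        f[i]              *= s;  /* x plane */
        f[natoms + i]     *= s;  /* y plane */
        f[2 * natoms + i] *= s;  /* z plane */
    }
}
```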

This is how the kernels do reads/writes for the three buffers (correct me if I'm wrong):