Fixing gpudirect issue where gather kernels aren't synced

This pull request fixes #239.

The problem was that the CUDA calls in gather_data_to_buffer_ptr_cuda were only being synced in op_download_buffer_async (i.e. gathering the data into the buffer on the device was synced just before the buffer was downloaded from the GPU to the host). However, op_download_buffer_async is not called when using the -gpudirect flag, so gathering data into the halo buffer is not synced before the halo is sent over MPI.

The fix I have implemented is to add op_gather_sync() (similar to op_scatter_sync()) so that the gather kernels can be explicitly synced when using -gpudirect. Then, depending on OP_gpu_direct, either op_gather_sync or op_download_buffer_sync is called just before the halos are sent over MPI. I've tested this with both the airfoil sample app and my high-order FEM code, and the issue no longer occurs in either.
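
To illustrate the control flow described above, here is a minimal sketch in plain C. It is not the code from this PR: the two sync functions are stubs that only record which one ran (the real ones issue CUDA calls), and `sync_before_halo_send` and `last_sync` are hypothetical names introduced purely for the example.

```c
#include <stdio.h>
#include <string.h>

/* Stand-in for the global set by the -gpudirect flag. */
int OP_gpu_direct = 0;

/* Hypothetical: records which sync ran, for illustration only. */
const char *last_sync = "none";

/* Stub: the real routine waits for the gather kernels on the device. */
void op_gather_sync(void) { last_sync = "gather"; }

/* Stub: the real routine waits for the device-to-host buffer download. */
void op_download_buffer_sync(void) { last_sync = "download"; }

/* Hypothetical helper: pick the sync to run just before the halo is
 * sent over MPI. With GPUDirect the device buffer is sent directly, so
 * no download happens and the gather itself must be synced. */
const char *sync_before_halo_send(int gpu_direct) {
  OP_gpu_direct = gpu_direct;
  if (OP_gpu_direct) {
    op_gather_sync();
  } else {
    op_download_buffer_sync();
  }
  return last_sync;
}
```

Calling `sync_before_halo_send(1)` takes the op_gather_sync path and `sync_before_halo_send(0)` takes the op_download_buffer_sync path, mirroring the two cases in the description.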

If there is a better way to fix this, just let me know and I'll implement that instead.

Thanks, Toby
