The 80 GB/s figure was measured on a Tesla C1060 rather than a GeForce GTX 280, so the true regression is not as severe. Still, the current implementation is measurably slower than one with fixed block and/or grid dimensions. The 'initial' code chooses a blocksize of 512, so the difference between it and the 'blocksize (512)' result is due to the overhead of arch::max_blocksize_with_highest_occupancy(). The 'blocksize (256)' result happens to be faster still. Finally, setting the gridsize statically and cutting out the call to arch::max_active_blocks() improves performance by another 5% (a sketch of such a fixed-configuration launch appears after the numbers below).
initial:                         109 GB/s @ 16M and 123 GB/s MAX
blocksize (256):                 116 GB/s @ 16M and 128 GB/s MAX
blocksize (512):                 114 GB/s @ 16M and 124 GB/s MAX
blocksize (256) gridsize (120):  122 GB/s @ 16M and 129 GB/s MAX
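For illustration, a fixed-configuration launch might look like the sketch below. Everything except the numbers is invented for this example (reduce_kernel and launch_reduce are not Thrust internals); only the 256-thread blocksize and 120-block gridsize come from the fastest row above.

    #include <cuda_runtime.h>

    // Illustrative kernel: each block reduces a grid-strided slice of the
    // input to one partial sum.
    template <unsigned int BlockSize>
    __global__ void reduce_kernel(const float *in, size_t n, float *block_sums)
    {
        __shared__ float smem[BlockSize];

        float sum = 0.0f;
        for (size_t i = blockIdx.x * BlockSize + threadIdx.x; i < n;
             i += size_t(gridDim.x) * BlockSize)
            sum += in[i];

        smem[threadIdx.x] = sum;
        __syncthreads();

        // Tree reduction within the block.
        for (unsigned int s = BlockSize / 2; s > 0; s >>= 1)
        {
            if (threadIdx.x < s)
                smem[threadIdx.x] += smem[threadIdx.x + s];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            block_sums[blockIdx.x] = smem[0];
    }

    void launch_reduce(const float *in, size_t n, float *block_sums)
    {
        // Compile-time configuration: no arch::max_blocksize_with_highest_occupancy()
        // or arch::max_active_blocks() calls on the launch path.
        const unsigned int gridsize = 120;  // block_sums must hold 120 partials
        reduce_kernel<256><<<gridsize, 256>>>(in, n, block_sums);
    }

A second single-block pass (or a host-side loop) would then combine the 120 partial sums.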
Pushing this back to Thrust v1.3, where we'll take a more comprehensive look at performance.
Original comment by wnbell
on 28 Feb 2010 at 2:31
We should also consider conserving __shared__ memory by using less than one value per thread. For example, we could use half the space by doing

    if (threadIdx.x < blockDim.x / 2)
        smem[threadIdx.x] = sum;
    __syncthreads();
    if (threadIdx.x >= blockDim.x / 2)
        smem[threadIdx.x - blockDim.x / 2] = binary_op(smem[threadIdx.x - blockDim.x / 2], sum);

This would allow us to improve thread occupancy for large data types.
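Fleshed out into a complete block-wide reduction, that idea might look like the sketch below. This is illustrative rather than actual Thrust code: block_reduce_half_smem is an invented name, BlockSize is assumed to be a power of two, and T/BinaryOp stand for the value type and reduction operator.

    // Sketch: block-wide reduction using BlockSize/2 shared slots instead of
    // BlockSize, halving the __shared__ footprint for large value types.
    template <unsigned int BlockSize, typename T, typename BinaryOp>
    __device__ T block_reduce_half_smem(T sum, BinaryOp binary_op)
    {
        __shared__ T smem[BlockSize / 2];  // half the usual allocation

        // First half of the threads stage their partial results.
        if (threadIdx.x < BlockSize / 2)
            smem[threadIdx.x] = sum;
        __syncthreads();

        // Second half folds its partial results into the staged values.
        if (threadIdx.x >= BlockSize / 2)
            smem[threadIdx.x - BlockSize / 2] =
                binary_op(smem[threadIdx.x - BlockSize / 2], sum);
        __syncthreads();

        // Standard tree reduction over the remaining BlockSize/2 values.
        for (unsigned int s = BlockSize / 4; s > 0; s >>= 1)
        {
            if (threadIdx.x < s)
                smem[threadIdx.x] = binary_op(smem[threadIdx.x], smem[threadIdx.x + s]);
            __syncthreads();
        }

        return smem[0];  // all threads see the result after the final sync
    }

Since shared memory per multiprocessor is fixed, halving each block's footprint lets more blocks be resident when sizeof(T) is large, which is the occupancy gain the comment is after.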
Original comment by wnbell
on 17 Mar 2010 at 4:32
Revisions ba327ef5c5 and 0ee976ed07 cache the results of expensive CUDART calls made by reduce.
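The pattern behind those revisions is to memoize slow CUDART queries per device. A rough sketch of that pattern, with cudaGetDeviceProperties() standing in for whichever calls the revisions actually cache (and ignoring thread safety):

    #include <cuda_runtime.h>
    #include <map>

    // Sketch: return cached device properties, issuing the expensive
    // cudaGetDeviceProperties() call at most once per device.
    const cudaDeviceProp &cached_device_properties(int device)
    {
        static std::map<int, cudaDeviceProp> cache;

        std::map<int, cudaDeviceProp>::iterator it = cache.find(device);
        if (it == cache.end())
        {
            cudaDeviceProp props;
            cudaGetDeviceProperties(&props, device);  // expensive CUDART call
            it = cache.insert(std::make_pair(device, props)).first;
        }
        return it->second;
    }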
Original comment by jaredhoberock
on 6 Jul 2010 at 8:45
Moving the remaining optimization mentioned in comment #2 to Milestone 1.x
Original comment by wnbell
on 1 Sep 2010 at 5:48
Fixed by revision a828f9be9fbb
Original comment by wnbell
on 21 Aug 2011 at 10:28
Original issue reported on code.google.com by wnbell
on 28 Feb 2010 at 12:10