The 80 GB/s figure was measured on a Tesla C1060 rather than a GeForce GTX 280, so the true regression is not as severe. Still, the current implementation is measurably slower than one with fixed block and/or grid dimensions. The 'initial' code chooses a blocksize of 512, so the difference between it and the 'blocksize (512)' result is due to the overhead of arch::max_blocksize_with_highest_occupancy(). The 'blocksize (256)' result happens to be faster still. Finally, setting the gridsize statically and cutting out the call to arch::max_active_blocks() improves performance by another 5% (a sketch of such a fixed-configuration launch appears after the numbers below).
initial:                         109 GB/s @ 16M and 123 GB/s MAX
blocksize (256):                 116 GB/s @ 16M and 128 GB/s MAX
blocksize (512):                 114 GB/s @ 16M and 124 GB/s MAX
blocksize (256) gridsize (120):  122 GB/s @ 16M and 129 GB/s MAX
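For illustration, a fixed-configuration launch might look like the sketch below. Everything except the numbers is invented for this example (reduce_kernel and launch_reduce are not Thrust internals); only the 256-thread blocksize and 120-block gridsize come from the fastest row above.

    #include <cuda_runtime.h>

    // Illustrative kernel: each block reduces a grid-strided slice of the
    // input to one partial sum.
    template <unsigned int BlockSize>
    __global__ void reduce_kernel(const float *in, size_t n, float *block_sums)
    {
        __shared__ float smem[BlockSize];

        float sum = 0.0f;
        for (size_t i = blockIdx.x * BlockSize + threadIdx.x; i < n;
             i += size_t(gridDim.x) * BlockSize)
            sum += in[i];

        smem[threadIdx.x] = sum;
        __syncthreads();

        // Tree reduction within the block.
        for (unsigned int s = BlockSize / 2; s > 0; s >>= 1)
        {
            if (threadIdx.x < s)
                smem[threadIdx.x] += smem[threadIdx.x + s];
            __syncthreads();
        }

        if (threadIdx.x == 0)
            block_sums[blockIdx.x] = smem[0];
    }

    void launch_reduce(const float *in, size_t n, float *block_sums)
    {
        // Compile-time configuration: no arch::max_blocksize_with_highest_occupancy()
        // or arch::max_active_blocks() calls on the launch path.
        const unsigned int gridsize = 120;  // block_sums must hold 120 partials
        reduce_kernel<256><<<gridsize, 256>>>(in, n, block_sums);
    }

A second single-block pass (or a host-side loop) would then combine the 120 partial sums.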
Pushing this back to Thrust v1.3, where we'll take a more comprehensive look at performance.
Original comment by wnbell
on 28 Feb 2010 at 2:31
We should also consider conserving __shared__ memory by using less than one value per thread. For example, we could use half the space by doing

    if (threadIdx.x < blockDim.x / 2)
        smem[threadIdx.x] = sum;
    __syncthreads();
    if (threadIdx.x >= blockDim.x / 2)
        smem[threadIdx.x - blockDim.x / 2] = binary_op(smem[threadIdx.x - blockDim.x / 2], sum);

This would allow us to improve thread occupancy for large data types.
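Fleshed out into a complete block-wide reduction, that idea might look like the sketch below. This is illustrative rather than actual Thrust code: block_reduce_half_smem is an invented name, BlockSize is assumed to be a power of two, and T/BinaryOp stand for the value type and reduction operator.

    // Sketch: block-wide reduction using BlockSize/2 shared slots instead of
    // BlockSize, halving the __shared__ footprint for large value types.
    template <unsigned int BlockSize, typename T, typename BinaryOp>
    __device__ T block_reduce_half_smem(T sum, BinaryOp binary_op)
    {
        __shared__ T smem[BlockSize / 2];  // half the usual allocation

        // First half of the threads stage their partial results.
        if (threadIdx.x < BlockSize / 2)
            smem[threadIdx.x] = sum;
        __syncthreads();

        // Second half folds its partial results into the staged values.
        if (threadIdx.x >= BlockSize / 2)
            smem[threadIdx.x - BlockSize / 2] =
                binary_op(smem[threadIdx.x - BlockSize / 2], sum);
        __syncthreads();

        // Standard tree reduction over the remaining BlockSize/2 values.
        for (unsigned int s = BlockSize / 4; s > 0; s >>= 1)
        {
            if (threadIdx.x < s)
                smem[threadIdx.x] = binary_op(smem[threadIdx.x], smem[threadIdx.x + s]);
            __syncthreads();
        }

        return smem[0];  // all threads see the result after the final sync
    }

Since shared memory per multiprocessor is fixed, halving each block's footprint lets more blocks be resident when sizeof(T) is large, which is the occupancy gain the comment is after.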
Original comment by wnbell
on 17 Mar 2010 at 4:32
Revisions ba327ef5c5 and 0ee976ed07 cache the results of expensive CUDART calls made by reduce.
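The pattern behind those revisions is to memoize slow CUDART queries per device. A rough sketch of that pattern, with cudaGetDeviceProperties() standing in for whichever calls the revisions actually cache (and ignoring thread safety):

    #include <cuda_runtime.h>
    #include <map>

    // Sketch: return cached device properties, issuing the expensive
    // cudaGetDeviceProperties() call at most once per device.
    const cudaDeviceProp &cached_device_properties(int device)
    {
        static std::map<int, cudaDeviceProp> cache;

        std::map<int, cudaDeviceProp>::iterator it = cache.find(device);
        if (it == cache.end())
        {
            cudaDeviceProp props;
            cudaGetDeviceProperties(&props, device);  // expensive CUDART call
            it = cache.insert(std::make_pair(device, props)).first;
        }
        return it->second;
    }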
Original comment by jaredhoberock
on 6 Jul 2010 at 8:45
Moving the remaining optimization mentioned in comment #2 to Milestone 1.x
Original comment by wnbell
on 1 Sep 2010 at 5:48
Fixed by revision a828f9be9fbb
Original comment by wnbell
on 21 Aug 2011 at 10:28
Original issue reported on code.google.com by wnbell
on 28 Feb 2010 at 12:10