gzagaris opened this issue 1 year ago
This is expected behavior. The reducers need memory to function, so the first reducer call causes allocations and initialization in internal memory pools (device, device-zeroed, and pinned pools). You should be able to see this happening if you profile with something like Nsight Systems.
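For illustration, one way to keep that first-call cost out of timed regions is to run a tiny throwaway reduction once at startup so the pools are created beforehand. A minimal sketch (the function name, block size, and range are arbitrary; this is not a RAJA-provided facility):

// Warm-up: trigger RAJA's internal pool allocations once, outside any timed region.
void warm_up_raja_cuda_reductions()
{
  RAJA::ReduceSum<RAJA::cuda_reduce, int> sum(0);
  RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, 1),
    [=] RAJA_HOST_DEVICE (int /*i*/) { sum += 1; });
  (void) sum.get();  // read the result so the reduction (and initialization) completes here
}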
I am thinking about how to allow users to pass in their own allocators so we don't have these separate, RAJA-only pools of memory.
In other reduction news, try using the policy RAJA::cuda_exec_occ_calc_async<BLOCK_SIZE> for loops with reductions in them. Note that using this policy adds overhead the first time each loop using it is run, because the occupancy calculator is called. It improves reduction performance significantly for loops with extents over a million. The policy uses the occupancy calculator to launch only as many GPU threads as can run concurrently on the device, then uses a grid-stride loop to run all the iterates of the loop. This results in a simpler reduction tree and less temporary memory usage in the implementation. These policies will likely become the recommended policies to use with reductions in the near future.
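For reference, a minimal sketch of a sum reduction using that policy (BLOCK_SIZE, N, and the data pointer here are placeholders):

constexpr int BLOCK_SIZE = 256;
RAJA::ReduceSum<RAJA::cuda_reduce, double> sum(0.0);
RAJA::forall<RAJA::cuda_exec_occ_calc_async<BLOCK_SIZE>>(
  RAJA::RangeSegment(0, N),
  [=] RAJA_HOST_DEVICE (int i) {
    sum += data[i];  // the policy launches only as many threads as fit concurrently
  });                // and grid-strides over the full range
double total = sum.get();  // read the final reduced value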
Thank you for the detailed explanation and suggestions @MrBurmark!
The RAJA::cuda_exec_occ_calc_async<BLOCK_SIZE> policy sounds fancy! I'll definitely try it out.
Can this policy also be used with RAJA::kernel, i.e., as the policy given to a RAJA::statement::For? Would you recommend that?

On a slight tangent, I was thinking a bit more about this the other day, and I was wondering whether hiding initial overheads like this justifies providing methods such as RAJA::initialize(argc, argv) and RAJA::finalize(). Moreover, having these methods may give users a good way to pass allocators and memory pools from the application space for RAJA to use internally. I am not certain whether that has been considered in the past and wanted to bring it up for your consideration.
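To make the suggestion concrete, a purely hypothetical usage sketch (RAJA::initialize/RAJA::finalize and the behavior described in the comments do not exist in RAJA today; they are only the idea above):

int main(int argc, char** argv)
{
  // Hypothetical: set up RAJA-internal pools (and optionally hand RAJA
  // application-owned allocators) before any timed kernels run.
  RAJA::initialize(argc, argv);

  // ... application loops and reductions, without the first-call allocation cost ...

  RAJA::finalize();  // hypothetical: release RAJA-internal resources
  return 0;
}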
Thank you again for all your help.
You can't use that policy with RAJA::kernel.
There are occupancy calculator policies for kernel already, but using them is a bit more complicated. You have to use an occupancy-calculator CudaKernel statement for the occupancy calculator to be used when choosing launch parameters such as the number of blocks. In addition, block_direct and global_direct policies should probably be replaced with block_loop or global_loop policies to ensure the number of blocks required stays below the bounds imposed by the occupancy calculator. For example:
RAJA::KernelPolicy<
  RAJA::statement::CudaKernelFixedAsync<block_sz,
    RAJA::statement::For<0, RAJA::cuda_global_size_x_direct<block_sz>,
      RAJA::statement::For<1, RAJA::cuda_block_y_direct,
        RAJA::statement::For<2, RAJA::cuda_thread_y_direct,
          RAJA::statement::Lambda<0> > > > > >;
vs
RAJA::KernelPolicy<
  RAJA::statement::CudaKernelOccAsync<
    // or RAJA::statement::CudaKernelExt<cuda_occ_calc_launch<block_sz, true>,
    RAJA::statement::For<0, RAJA::cuda_global_size_x_loop<block_sz>,
      RAJA::statement::For<1, RAJA::cuda_block_y_loop,
        RAJA::statement::For<2, RAJA::cuda_thread_y_direct,
          RAJA::statement::Lambda<0> > > > > >;
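For completeness, the second (occupancy calculator) policy would be invoked with RAJA::kernel roughly as follows; the occ_policy alias, ranges, and lambda body below are placeholders for illustration:

using occ_policy = RAJA::KernelPolicy<
  RAJA::statement::CudaKernelOccAsync<
    RAJA::statement::For<0, RAJA::cuda_global_size_x_loop<block_sz>,
      RAJA::statement::For<1, RAJA::cuda_block_y_loop,
        RAJA::statement::For<2, RAJA::cuda_thread_y_direct,
          RAJA::statement::Lambda<0> > > > > >;

RAJA::ReduceSum<RAJA::cuda_reduce, double> sum(0.0);
RAJA::kernel<occ_policy>(
  RAJA::make_tuple(RAJA::RangeSegment(0, Ni),
                   RAJA::RangeSegment(0, Nj),
                   RAJA::RangeSegment(0, Nk)),
  [=] RAJA_HOST_DEVICE (int i, int j, int k) {
    sum += field[i + Ni*(j + Nj*k)];  // placeholder reduction body
  });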
Thanks @MrBurmark -- it's always complicated with RAJA::kernel 😄
The example you provided is exactly what I was looking for!
@gzagaris several years ago, we considered adding RAJA::initialize and RAJA::finalize methods. Kokkos does that, for example. We didn't really see a strong need at the time and thought RAJA would be more flexible without it. We will reconsider and let you know.
Sounds good, thanks @rhornung67!
Hi everyone,
We've observed that the execution time of the first reduction is notably high (in some cases, slower than sequential). However, subsequent reductions do not exhibit this behavior, suggesting that there might be some overhead (perhaps internal initialization?) with the first reduction that is invoked by the application.
At your convenience, can you confirm if that is expected behavior with the current implementation and elaborate a bit on what is happening?
Steps To Reproduce
using LOOP_EXEC   = RAJA::cuda_exec< 256 >;
using REDUCE_EXEC = RAJA::cuda_reduce;

// Allocate and initialize the data
int* data_1 = nullptr;
cudaMalloc((void**)&data_1, NUM_BYTES);
RAJA::forall< LOOP_EXEC >(RAJA::RangeSegment(0, NUM_ITEMS),
  [=] RAJA_HOST_DEVICE(const int& idx) {
    data_1[ idx ] = TEST_VAL_1;
  });

// Reduce data
{
  auto start = std::chrono::high_resolution_clock::now();
  // ... (reduction and timing code not included in the snippet)
}

// Reduce data again
{
  auto start = std::chrono::high_resolution_clock::now();
  // ... (reduction and timing code not included in the snippet)
}

cudaFree(data_1);
elapsed[1] = 0.0221074   // <--------------- EXECUTION TIME OF FIRST REDUCTION
elapsed[1] = 1.5151e-05  // <--------------- RUNNING THE REDUCTION AGAIN IS SIGNIFICANTLY FASTER
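For context, the timed region is not shown in the snippet above; a reduction consistent with the aliases it defines might look like the following (illustrative only, not the exact code that produced the timings):

// Reduce data (illustrative body for each timed block above)
{
  auto start = std::chrono::high_resolution_clock::now();

  RAJA::ReduceSum< REDUCE_EXEC, int > sum(0);
  RAJA::forall< LOOP_EXEC >(RAJA::RangeSegment(0, NUM_ITEMS),
    [=] RAJA_HOST_DEVICE(const int& idx) {
      sum += data_1[ idx ];
    });
  const int total = sum.get();  // read the reduced value

  auto stop = std::chrono::high_resolution_clock::now();
  std::chrono::duration<double> elapsed = stop - start;
  std::cout << "elapsed = " << elapsed.count() << " total = " << total << "\n";
}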