Open delcmo opened 3 years ago
You can call a function inside a kernel. This is completely independent of RAJA. Here's an example: https://github.com/LLNL/RAJAPerf/blob/develop/src/basic/TRAP_INT-Cuda.cpp#L139 The macro defining the kernel body is here: https://github.com/LLNL/RAJAPerf/blob/develop/src/basic/TRAP_INT.hpp#L39 and the function being called is defined here: https://github.com/LLNL/RAJAPerf/blob/develop/src/basic/TRAP_INT-Cuda.cpp#L27
@rhornung67 thanks for your email. In the example you provided, the function is defined in the RAJA space:
RAJA_INLINE
RAJA_DEVICE
Real_type trap_int_func(Real_type x,
Real_type y,
Real_type xp,
Real_type yp)
{
Real_type denom = (x - xp)*(x - xp) + (y - yp)*(y - yp);
denom = 1.0/sqrt(denom);
return denom;
}
My code was written for CPU platforms and all functions are written in C++. Based on the example you provided, I would have to re-define / convert all C++ functions to RAJA so that they can be called from a RAJA kernel. Am I correct?
What do you mean by "RAJA space"? What do you mean by "convert all C++ functions to RAJA"?
You shouldn't have to convert any functions, just add the appropriate host/device annotations so the compiler will generate appropriate versions that are callable in host or device code.
In the above example, the macros RAJA_DEVICE and RAJA_INLINE precede the definition of the C++ function. If I were to run the same C++ function on a CPU machine, I would have to change the value of RAJA_DEVICE and RAJA_INLINE. Where are these macros defined?
The macros are defined here https://github.com/LLNL/RAJA/blob/develop/include/RAJA/util/macros.hpp#L41 and here https://github.com/LLNL/RAJA/blob/develop/include/RAJA/util/macros.hpp#L41. We use them inside RAJA and provide them to users as a convenience. You don't have to use them. You can define your own if you want.
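If you go that route, a minimal sketch of what project-local macros could look like is below (the MY_* names are made up for illustration; they are not part of RAJA):
// Sketch of user-defined host/device annotation macros (illustrative names).
// Under a CUDA or HIP compiler they expand to the device qualifiers;
// in a CPU-only build they expand to plain C++.
#if defined(__CUDACC__) || defined(__HIPCC__)
#define MY_HOST_DEVICE __host__ __device__
#define MY_INLINE __forceinline__
#else
#define MY_HOST_DEVICE
#define MY_INLINE inline
#endif

MY_INLINE MY_HOST_DEVICE
double my_func(double x) { return 2.0 * x; }  // callable from host or device code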
The questions you are asking are related to topics independent of RAJA. For example, if you wrote native CUDA code, you still have to deal with making your functions callable from host and/or device code sections as needed. There is no avoiding that.
Hi @delcmo, RAJA_DEVICE is just a macro for CUDA/HIP's __device__ qualifier. In general, if you want to call a function on the device you need to add the __device__ qualifier. For example:
__device__
void foo()
{
}
RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(RAJA::RangeSegment(0, 10),
[=] __device__ (int i) {
foo();
});
If you have an example, I can try to help. Additionally, you can add both __host__ __device__ to the functions and they will be callable from the CPU or GPU.
Hi @artv3 thanks for your reply. This is what I am after. I just started with RAJA and may have overlooked a few things in the documentation.
If I understand your response correctly, by including __device__ in the definition of the function foo() I should be able to call it from either CPU or GPU. I will have to try it with a simple hello function.
If you want the option to call it from either CPU or GPU, you will need to include the __host__ qualifier as well. For example:
__host__ __device__
void foo()
{
printf("Hello world");
}
foo(); //call on the CPU
RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(RAJA::RangeSegment(0, 10),
[=] __host__ __device__ (int i) {
foo(); // call from the GPU
});
Ok, perfect let me try this and see if I can get it to compile and run. Thanks for taking the time to reply and to provide that working example. I will get back to you if I have any questions.
Sounds good, I can definitely help. I use RAJA in a couple of applications and can help with integration/usage questions.
Hi,
I was able to compile and call the function from both the CPU and GPU using the following syntax:
__device__ __host__
void foo()
{
printf("Hello world");
}
foo(); // call on the CPU
cout << "GPU call" << endl;
const int CUDA_BLOCK_SIZE = 16;
RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(RAJA::RangeSegment(0, 10),
[=] __device__ (int i) {
foo(); // call from the GPU.
});
cout << "end GPU call" << endl;
I am now trying to convert one of my functions to get it to run on GPU:
__host__ __device__
void compute_macro_srt2(int indices [], double *f, double *g,
bool is_energy, double &rho_xyz, double &T_xyz,
double &jx_xyz, double &jy_xyz, double &jz_xyz)
{
int xcpp = indices[0];
int ycpp = indices[1];
int zcpp = indices[2];
int rowscpp = indices[3];
int colscpp = indices[4];
int depthcpp = indices[5];
int ndir = indices[6];
int vjx [] = {1,-1,0,0,0,0,1,-1,1,-1,1,-1,1,-1,0,0,0,0,0};
int vjy [] = {0,0,1,-1,0,0,1,1,-1,-1,0,0,0,0,1,-1,1,-1,0};
int vjz [] = {0,0,0,0,1,-1,0,0,0,0,1,1,-1,-1,1,1,-1,-1,0};
rho_xyz = 0.0;
T_xyz = 0.0;
jx_xyz = 0.0;
jy_xyz = 0.0;
jz_xyz = 0.0;
for (int d = 0; d < ndir; d++)
{
int offsetxyza = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * d) );
rho_xyz += f[offsetxyza];
if (is_energy) T_xyz += g[offsetxyza];
jx_xyz += vjx[d] * f[offsetxyza];
jy_xyz += vjy[d] * f[offsetxyza];
jz_xyz += vjz[d] * f[offsetxyza];
}
}
This function computes the macro variables (density, temperature and momentum components) in a Lattice Boltzmann Method code. I encountered two issues here:
The variables *f and *g are distribution functions and are passed to the GPU using:
const int sizef = ndir * (rowscpp) * (colscpp) * (depthcpp);
RAJA::View<double, RAJA::Layout<1>> f_(f, sizef);
RAJA::View<double, RAJA::Layout<1>> g_(g, sizef);
When calling the function compute_macro_srt2 with f_ and g_ I get a compilation error message:
error: no suitable conversion function from "const RAJA::View<double, RAJA::Layout<1UL, RAJA::Index_type, -1L>, double *>" to "double *" exists
The code compiles and runs fine if I use the following method to declare f_:
double *f_;
cudaMallocManaged(&f_, sizef*sizeof(double));
What is the correct way to pass an array from the host to the device? Is my approach correct here? If not how should I change it?
The second issue is related to the function compute_macro_srt2 itself. Initially that function was declared and defined in a different file. I naively added the __host__ and __device__ qualifiers to its definition but got the following compilation error message:
/gpfs/alpine/cfd136/proj-shared/delchini/lbmapp/PRATHAM/src/cpp_mrtutilities.C(24): warning: a __host__ function("compute_macro_srt2") redeclared with __host__ __device__, hence treated as a __host__ __device__ function
/gpfs/alpine/cfd136/proj-shared/delchini/lbmapp/PRATHAM/src/cpp_mrtcollision.C(417): error: calling a host function("compute_macro_srt2") from a device function(" const") is not allowed
/gpfs/alpine/cfd136/proj-shared/delchini/lbmapp/PRATHAM/src/cpp_mrtcollision.C(417): error: identifier "compute_macro_srt2" is undefined in device code
I avoided the above issue by moving the function from `cpp_mrtutilities.C` to `cpp_mrtcollision.C` where it is being called.
How do I make a function defined in a different file available both on the host and the device?
Thanks,
@delcmo Sounds like you're getting the hang of things. This is great.
1) Why are you making RAJA Views if you are not using them in your kernel? The args to your function are double* types. View types do not convert to that type. Also, Views do not allocate memory. They hold a pointer that you pass to their constructor. They are mostly used to simplify multi-dimensional indexing, pointer offset computations, etc.
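As an illustration of that last point, a View simply wraps a pointer you already own and turns the index arithmetic into an operator() call (a generic sketch, not code from your application):
// Sketch: a 2D View over an existing allocation (assumes RAJA headers are included).
const int nx = 4, ny = 3;
double* data = new double[nx * ny];                  // allocation is still up to you
RAJA::View<double, RAJA::Layout<2>> A(data, nx, ny);
A(2, 1) = 1.0;   // same element as data[2*ny + 1] with the default row-major layout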
Also, a word of caution. cudaMalloc calls are expensive, much more than CPU malloc calls. Moreover, cudaMallocManaged calls are even more costly, plus you incur the cost of host-device memory transfers triggered by runtime page faults. The allocation sizes will be an integer multiple of the page size, which is large, even if you think you are allocating a few bytes. On systems like Summit or Sierra, these are implemented in hardware. If you run on other devices that do not support managed memory in hardware, then the transfers are handled in runtime software and are more expensive. If you are going this route just to get the code to run, that's fine. But you need to think carefully about host-device memory management and should try to minimize memory space transfers as much as possible for performance reasons.
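If you do decide to manage transfers explicitly instead of relying on managed memory, the usual CUDA pattern is a device allocation plus explicit copies (a generic sketch, reusing sizef from your snippets; nothing RAJA-specific):
// Sketch: explicit device allocation and transfers instead of cudaMallocManaged.
double* h_f = new double[sizef];                     // host data, filled by the application
double* d_f = nullptr;
cudaMalloc(&d_f, sizef * sizeof(double));            // device allocation
cudaMemcpy(d_f, h_f, sizef * sizeof(double), cudaMemcpyHostToDevice);
// ... run RAJA kernels that read/write d_f ...
cudaMemcpy(h_f, d_f, sizef * sizeof(double), cudaMemcpyDeviceToHost);
cudaFree(d_f);
delete[] h_f;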
2) It's difficult to answer this without knowing the organization of your source code and how you are building/linking it.
Hi @delcmo,
Regarding 1.
The compute_macro_srt2 function signature expects a double * for f and g, not RAJA::View<double, RAJA::Layout<1>> types, which is what led to the compilation error.
If you would like to pass in RAJA::View<double, RAJA::Layout<1>> then the function signature would have to change to
void compute_macro_srt2(int indices [], RAJA::View<double, RAJA::Layout<1>> f, RAJA::View<double, RAJA::Layout<1>> g,
bool is_energy, double &rho_xyz, double &T_xyz,
double &jx_xyz, double &jy_xyz, double &jz_xyz)
and inside the function you would have to use the parenthesis operator to access data in f and g. The role of the views, however, is not to transfer data from the CPU to the GPU, but rather to enable different reshaping of the data. For example, you could choose to treat the data as a 4-dimensional array using views:
RAJA::View<double, RAJA::Layout<4>> f_(f,ndir, rowscpp, colscpp, depthcpp);
f_(dir,row, col, depth); //access data this way.
Using the cudaMallocManaged method as you have is fine. In fact, using cudaMallocManaged makes the data available on both the host and the device (CUDA will take care of moving data automatically).
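To make the parenthesis-operator point concrete, a view-taking variant of your function could look roughly like this (a sketch; compute_macro_srt2_view is a hypothetical name):
// Sketch (hypothetical name): same accumulation as compute_macro_srt2,
// but taking 1D views and using operator() for element access.
RAJA_HOST_DEVICE
void compute_macro_srt2_view(int indices [],
                             RAJA::View<double, RAJA::Layout<1>> f,
                             RAJA::View<double, RAJA::Layout<1>> g,
                             bool is_energy, double &rho_xyz, double &T_xyz,
                             double &jx_xyz, double &jy_xyz, double &jz_xyz)
{
  int xcpp = indices[0], ycpp = indices[1], zcpp = indices[2];
  int rowscpp = indices[3], colscpp = indices[4], depthcpp = indices[5];
  int ndir = indices[6];
  int vjx [] = {1,-1,0,0,0,0,1,-1,1,-1,1,-1,1,-1,0,0,0,0,0};
  int vjy [] = {0,0,1,-1,0,0,1,1,-1,-1,0,0,0,0,1,-1,1,-1,0};
  int vjz [] = {0,0,0,0,1,-1,0,0,0,0,1,1,-1,-1,1,1,-1,-1,0};
  rho_xyz = T_xyz = jx_xyz = jy_xyz = jz_xyz = 0.0;
  for (int d = 0; d < ndir; d++)
  {
    int offsetxyza = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * d) );
    rho_xyz += f(offsetxyza);               // parenthesis operator instead of f[offsetxyza]
    if (is_energy) T_xyz += g(offsetxyza);
    jx_xyz += vjx[d] * f(offsetxyza);
    jy_xyz += vjy[d] * f(offsetxyza);
    jz_xyz += vjz[d] * f(offsetxyza);
  }
}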
Regarding 2.
I believe you want to add __host__ __device__ to both the header and implementation file.
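A minimal sketch of what that could look like with your file names (assuming the .C files are compiled as CUDA code; note that calling a __device__ function defined in a different translation unit generally also requires building with relocatable device code, e.g. nvcc's -rdc=true):
// cpp_mrtutilities.h -- the declaration carries the qualifiers
__host__ __device__
void compute_macro_srt2(int indices [], double *f, double *g,
                        bool is_energy, double &rho_xyz, double &T_xyz,
                        double &jx_xyz, double &jy_xyz, double &jz_xyz);

// cpp_mrtutilities.C -- the definition carries the same qualifiers
__host__ __device__
void compute_macro_srt2(int indices [], double *f, double *g,
                        bool is_energy, double &rho_xyz, double &T_xyz,
                        double &jx_xyz, double &jy_xyz, double &jz_xyz)
{
  // ... same body as shown above ...
}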
Hi @rhornung67 and @artv3,
thanks for the reply. The piece of code I provided does not include the RAJA Views. Below is the full code:
RAJA::View<int, RAJA::Layout<1>> sev_(sev, 11);
RAJA::View<double, RAJA::Layout<1>> params_(params, 9);
int dims [] = {rowscpp, colscpp, depthcpp, ndir};
RAJA::View<int, RAJA::Layout<1>> dims_(dims, 4);
int dim_comp [] = {0, 1, 2}; // {1, 2, 3} in Fortran 90
RAJA::View<int, RAJA::Layout<1>> dim_comp_(dim_comp, 3);
// Allocate memory for the distribution functions 'f' and 'g' on the GPU
const int sizef = ndir * (rowscpp) * (colscpp) * (depthcpp);
RAJA::View<double, RAJA::Layout<1>> f_(f, sizef);
double *f2_, *g2_;
cudaMallocManaged(&f2_, sizef*sizeof(double));
f2_ = f;
cudaMallocManaged(&g2_, sizef*sizeof(double));
g2_ = g;
RAJA::View<double, RAJA::Layout<1>> g_(g, sizef);
// Vectors to compute momentum components
int vjx [] = {1,-1,0,0,0,0,1,-1,1,-1,1,-1,1,-1,0,0,0,0,0};
int vjy [] = {0,0,1,-1,0,0,1,1,-1,-1,0,0,0,0,1,-1,1,-1,0};
int vjz [] = {0,0,0,0,1,-1,0,0,0,0,1,1,-1,-1,1,1,-1,-1,0};
RAJA::View<int, RAJA::Layout<1>> vjx_(vjx, ndir);
RAJA::View<int, RAJA::Layout<1>> vjy_(vjy, ndir);
RAJA::View<int, RAJA::Layout<1>> vjz_(vjz, ndir);
// Pass vector 'obs' to RAJA
const int sizexyz = (rowscpp) * (colscpp) * (depthcpp);
RAJA::View<int, RAJA::Layout<1>> obs_(obs, sizexyz);
// Pass vector 'v' to RAJA
RAJA::View<int, RAJA::Layout<1>> v_(v, 3*ndir);
// Pass vector 'w' to RAJA
RAJA::View<double, RAJA::Layout<1>> w_(w, ndir);
// Load RAJA tools
using RAJA::Segs;
using RAJA::Params;
// Range segment for x, y, z and directions
RAJA::RangeSegment dir_range(0, ndir);
RAJA::RangeSegment x_range(1, ex - sx + 2);
RAJA::RangeSegment y_range(1, ey - sy + 2);
RAJA::RangeSegment z_range(1, ez - sz + 2);
// CUDA kernel policy
using EXEC_POL5 =
RAJA::KernelPolicy<
RAJA::statement::CudaKernel<
RAJA::statement::For<3, RAJA::cuda_thread_z_loop, // RAJA::cuda_block_z_loop,
RAJA::statement::For<2, RAJA::cuda_thread_y_loop, // RAJA::cuda_block_y_loop,
RAJA::statement::For<1, RAJA::cuda_thread_x_loop, // RAJA::cuda_block_x_loop,
RAJA::statement::Lambda<0, Segs<1,2,3>, Params<0,1,2,3,4> >,
RAJA::statement::For<0, RAJA::seq_exec, // cuda_thread_x_loop, // RAJA::cuda_block_x_loop,
RAJA::statement::Lambda<1, Segs<0,1,2,3>, Params<0,1,2,3,4> >
>,
RAJA::statement::Lambda<2, Segs<1,2,3>, Params<0,1,2,3,4> >,
RAJA::statement::For<0, RAJA::seq_exec, // RAJA::cuda_block_x_loop, // RAJA::cuda_block_x_loop,
RAJA::statement::Lambda<3, Segs<0,1,2,3>, Params<0,1,2,3,4> >
>
>
>
>
>
>;
RAJA::kernel_param<EXEC_POL5>(
RAJA::make_tuple(dir_range, x_range, y_range, z_range),
RAJA::tuple<double, double, double, double, double>{0.0, 0.0, 0.0, 0.0, 0.0}, // thread local variable for 'rho'
// lambda 0: initialize macro variables
[=] RAJA_DEVICE (int x, int y, int z, double &rho, double &jx, double &jy, double &jz, double &T) {
// Allocate array with indices to pass to the function computing the macro variables
int indices[7];
indices[0] = x;
indices[1] = y;
indices[2] = z;
indices[3] = dims_[0];
indices[4] = dims_[1];
indices[5] = dims_[2];
indices[6] = dims_[3];
compute_macro_srt2(indices,f2_,g2_,lgclbool_[1],rho,T,jx,jy,jz);
int offsetxyz = x + dims_[0] * (y + dims_[1] * z );
//printf("%f, %f, %f, %f, %d ", rho, jx, jy, jz, offsetxyz);;
//rho = 0.0;
//jx = 0.0; jy = 0.0; jz = 0.0;
//if (lgclbool_[1]) T = 0.0;
},
// lambda 1: compute macro variables by looping over all directions
[=] RAJA_DEVICE (int dir, int x, int y, int z, double &rho, double &jx, double &jy, double &jz, double &T) {
int offsetxyza = x + dims_[0] * (y + dims_[1] * (z + dims_[2] * dir) );
//rho += f_[offsetxyza];
//jx += vjx_[dir] * f_[offsetxyza];
//jy += vjy_[dir] * f_[offsetxyza];
//jz += vjz_[dir] * f_[offsetxyza];
//if (lgclbool_[1]) T += g_[offsetxyza];
},
// lambda 2
[=] RAJA_DEVICE (int x, int y, int z, double &rho, double &jx, double &jy, double &jz, double &T) {
int offsetxyz = x + dims_[0] * (y + dims_[1] * z );
//foo();
//printf("x = %d ", x);
//printf("y = %d ", y);
//printf("z = %d ", z);
//printf("rho_gpu = %f, %ld ", rho, offsetxyz);
//printf("jxyz_gpu = %f, %f, %f, %ld ", jx, jy, jz, offsetxyz);
//printf("jxyz_gpu = %f, %f, %f, %ld ", jx, jy, jz, offsetxyz);
//if (lgclbool_[1]) printf("T = %f, %ld ", T, offsetxyz);
},
// lambda 3: compute equilibrium function for SRT scheme
[=] RAJA_DEVICE (int dir, int x, int y, int z, double &rho, double &jx, double &jy, double &jz, double &T) {
//printf("dir = %d ", dir);
int offsetxyz = x + dims_[0] * (y + dims_[1] * z );
// fluid node
if (obs_[offsetxyz] == 0)
{
//printf("dir = %d %d %d %d ", dir,x,y,z);
// Compute 'taup'
const double taup = params_[2]; // tau0
// Get components for each direction stored in 'v'
//int x_comp = 0, y_comp = 1, z_comp = 2;
int offsetxdir = dir * 3 + dim_comp_[0]; // x_comp;
const double vxa = v_[offsetxdir];
int offsetydir = dir * 3 + dim_comp_[1]; // y_comp;
const double vya = v_[offsetydir];
int offsetzdir = dir * 3 + dim_comp_[2]; // z_comp;
const double vza = v_[offsetzdir];
// Compute 'edotu'
const double edotu = (vxa*jx + vya*jy + vza*jz)/rho;
//printf(" edotu = %f", edotu);
// Compute local equillibrium value for NS equation
const double usq = (jx*jx+jy*jy+jz*jz)/(rho*rho);
double feq = 0.0, fcl = 0.0;
feq = 1.0 + 3.0 * edotu + 4.5 * edotu * edotu - 1.5 * usq;
feq *= rho * w_[dir];
// Compute collision
int offsetfeq = x + dims_[0] * (y + dims_[1] * (z + dims_[2] * dir) );
fcl = (1.0 - 1.0/taup) * f_[offsetfeq] + feq / taup;
//printf(" feq_gpu = %f %d ", feq, offsetfeq);
//printf("fcl_gpu = %f %d %f %f %f ", fcl, offsetfeq, f_[offsetfeq], feq, jz);
// Update 'f_' value
f_[offsetfeq] = fcl;
}
// Energy equation for fluid and solid nodes
if (lgclbool_[1])
{
const double tau_g = params_[4];
// Get components for each direction stored in 'v'
//int x_comp = 0, y_comp = 1, z_comp = 2;
int offsetxdir = dir * 3 + dim_comp_[0]; // x_comp;
const double vxa = v_[offsetxdir];
int offsetydir = dir * 3 + dim_comp_[1]; // y_comp;
const double vya = v_[offsetydir];
int offsetzdir = dir * 3 + dim_comp_[2]; // z_comp;
const double vza = v_[offsetzdir];
// Compute 'edotu'
const double edotu = (vxa*jx + vya*jy + vza*jz)/rho;
// Compute local equillibrium value for NS equation
const double usq = (jx*jx+jy*jy+jz*jz)/(rho*rho);
// Compute local euilibrium value for energy equation
double geq = 1.0 + 3.0 * edotu + 4.5 * edotu * edotu - 1.5 * usq;
geq *= T * w[dir];
// Add 'geq' value to 'g_equilibrium' array
int offsetgeq = x + rowscpp * (y + colscpp * (z + depthcpp * dir) );
// Collision kernel 'gcl' (same for SRT and MRT)
double gcl = (1.0 - 1.0 / tau_g) * g_[offsetgeq];
gcl += geq / tau_g;
}
}
);
When I read the RAJA documentation, the RAJA View objects seemed a good option as they avoid the cudaMallocManaged call. That being said, since the function compute_macro_srt2 takes double * as an input, I will have to re-write a version of the function that is compatible with RAJA Views. I wanted to avoid having the same function implemented twice, one for the host and the other for the device, since we want that code to run on both CPU and GPU platforms.
As for the second issue, the main function is implemented in cpp_mrtcollision.C and the function compute_macro_srt is implemented in cpp_mrtutilities.C. The file cpp_mrtcollision.C contains #include "cpp_mrtutilities.h" at the top. Is there an example that shows how to define a function in a header and then call that same function in the file that implements the main body of the code?
Hi @delcmo, the RAJA Views just hold a pointer; the host code is still responsible for allocating the memory. Views will work on both the host and the device.
The usage should look like this:
double *f_ptr;
cudaMallocManaged(&f_ptr, sizef*sizeof(double)); //allocate memory (cudaMallocManaged enables the data to be accessible on both the CPU and GPU)
RAJA::View<double, RAJA::Layout<1>> f_(f_ptr, sizef); // is accessible on both the CPU and GPU.
I also see you are using RAJA::kernel. Looking at that policy, I believe it is incomplete: the CUDA programming model uses hierarchical parallelism (blocks and threads), and your policy is missing how the blocks should be mapped.
Given the structure of your kernel, have you looked at RAJA Teams? We have a basic example that compares writing C style nested loops and what they could look like with Teams here https://github.com/LLNL/RAJA/blob/develop/examples/tut_teams_basic.cpp and another example of general usage here: https://github.com/LLNL/RAJA/blob/develop/examples/raja-teams.cpp#L156-L177
RAJA Teams creates an execution space where you express kernels as nested for loops. It also supports choosing how to run kernels based on a run-time parameter (host or device).
If you could post the C++ version of your kernel, I can give an outline of what the Teams variant would look like.
Regarding the second issue, let me see if I can come up with an example.
@artv3
I have not looked at the block mapping yet. I spent quite a bit of time learning nested blocks and making sure the code compiles and gives the correct solution. I will have to look at this later.
I looked at the matrix multiplication example to get started with the nested for loops and built from it.
The C++ code is quite long (I think) as you can see below:
// Loop over z, y and x axis, and directions 'dir'
for (int z = sz; z <= ez; z++)
{
for (int y = sy; y <= ey; y++)
{
for (int x = sx; x <= ex; x++)
{
// Shift indices
int xcpp = x - (sx - 1);
int ycpp = y - (sy - 1);
int zcpp = z - (sz - 1);
//cout << "xpp = " << xcpp << " ycpp = " << ycpp << " zcpp = " << zcpp << endl;
// Stored all shifted indices in 'indices' array and number of directions
int indices[7];
indices[0] = xcpp;
indices[1] = ycpp;
indices[2] = zcpp;
indices[3] = rowscpp;
indices[4] = colscpp;
indices[5] = depthcpp;
indices[6] = ndir;
// Compute offset value
int offset = xcpp + rowscpp * (ycpp + colscpp * zcpp);
// Compute local density and components of the 'j' flux
compute_macro_srt(indices,f,g,is_energy,rho_xyz,T_xyz,
jx_xyz,jy_xyz,jz_xyz);
//cout << "x = " << x << " y = " << y << " z = " << z << endl;
//cout << "x = " << x << " ";
//cout << "y = " << y << " ";
//cout << "z = " << z << " ";
//cout << "rho_cpu = " << rho_xyz << " " << offset << " ";
//cout << "jxyz_cpu = " << jx_xyz << " " << jy_xyz << " " << jz_xyz << " " << offset << " ";
// Compute local source terms for NS and energy equations
double force_xyz [] = {0.0, 0.0, 0.0};
double energy_xyz = 0.0;
compute_source_xyz(xcpp,ycpp,zcpp,time,is_energy,
gforceval, E_source,
rho_xyz,T_xyz,jx_xyz,jy_xyz,jz_xyz,
force_xyz,energy_xyz);
// Check if node is a fluid node
if (obs[offset] == 0)
{
// Compute density 'rhop'
const double rhop = is_lbgk ? rho_xyz : rho0;
// Compute 'taup' to be used in collision kernel
double sigma_xyz, taup;
if (is_les)
{
// Compute old values
double rho_xyz_old, T_xyz_old, jx_xyz_old, jy_xyz_old, jz_xyz_old;
compute_macro_srt(indices,f_older,g,is_energy,rho_xyz_old,
T_xyz_old,jx_xyz_old,jy_xyz_old,jz_xyz_old);
// Compute norm of the stress tensor
compute_sigma_les(indices,v,w,f,f_older,rho0,RT,is_lbgk,time,
restart_file_time+1,
rho_xyz_old,jx_xyz_old,jy_xyz_old,jz_xyz_old,
tau[offset],sigma_xyz);
// Compute relaxation time and store it in array 'tau'
taup = tau0;
taup += 0.5 * (sqrt(tau0*tau0 + 18.0*Csmag*Csmag*sigma_xyz) - tau0);
tau[offset] = taup;
}
else
{
taup = tau0;
}
// Extract ux, uy and uz values
const double ux = jx_xyz / rhop;
const double uy = jy_xyz / rhop;
const double uz = jz_xyz / rhop;
// Compute magnitude of velocity vector
const double usq = ux * ux + uy * uy + uz * uz;
// Compute uxcl, uycl and uzcl values for collision kernel
const double uxcl = jx_xyz / rho_xyz;
const double uycl = jy_xyz / rho_xyz;
const double uzcl = jz_xyz / rho_xyz;
// Compute collision term for MRT
if (!is_lbgk)
{
mrt_collision_ns(indices,taup,rho0,is_les,
rho_xyz,jx_xyz,jy_xyz,jz_xyz,
f);
}
// Loop over directions to compute equilibrium and collision kernels
//cout << "x = " << x << " y = " << y << " z = " << z << endl;
for (int a = 0; a < ndir; a++)
{
// Get components for each direction stored in 'v'
int offsetxdir = a * 3 + x_comp;
const double vxa = v[offsetxdir];
int offsetydir = a * 3 + y_comp;
const double vya = v[offsetydir];
int offsetzdir = a * 3 + z_comp;
const double vza = v[offsetzdir];
// Compute 'edotu'
double edotu = vxa*ux + vya*uy + vza*uz;
//cout << " edotu = " << edotu;
// Compute local equillibrium value for NS equation
double feq = 0.0;
if (is_lbgk)
{
feq = 1.0 + 3.0 * edotu + 4.5 * edotu * edotu - 1.5 * usq;
feq *= rho_xyz;
}
else
{
edotu *= 1.0 / RT;
feq = rho_xyz + rho0 * (edotu + 0.5 * edotu * edotu - 0.5 * usq / RT);
}
feq *= w[a];
// Add 'feq' to 'f' array
int offsetfeq = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
//cout << "offsetfeq = " << offsetfeq << " ";
//cout << "dir = " << a << " ";
// Collision kernek 'fcl'
double fcl = 0.0;
if (is_lbgk)
{
fcl = (1.0 - 1.0/taup) * f[offsetfeq] + feq / taup;
//cout << " fcl_cpu = " << fcl << " " << offsetfeq << " " << f[offsetfeq] << " " << feq << " " << jz_xyz << " ";
//cout << " fcl_cpu = " << fcl - f2[offsetfeq] << " ";
f[offsetfeq] = fcl;
}
//cout << " feq_cpu = " << feq << " offsetfeq = " << offsetfeq << " ";
//cout << " feq = " << feq << " fcl = " << fcl;
// Add force terms to the NS equations
fcl = force_xyz[0] * (vxa - uxcl);
fcl += force_xyz[1] * (vya - uycl);
fcl += force_xyz[2] * (vza - uzcl);
fcl *= delta_t * feq / RT;
f[offsetfeq] += fcl;
} // end for loop over direction
//cout << endl;
//cout << endl;
} // end if on fluid node
// Energy equations
if (is_energy)
{
// Compute velocity values to be used in energy equation
const double ux = jx_xyz / rho_xyz;
const double uy = jy_xyz / rho_xyz;
const double uz = jz_xyz / rho_xyz;
// Compute magnitude of velocity vector
const double usq = ux * ux + uy * uy + uz * uz;
// Loop over directions to compute f_equilibrium
for (int a = 0; a < ndir; a++)
{
// Get components for each direction stored in 'v'
int offsetxdir = a * 3 + x_comp;
const double vxa = v[offsetxdir];
int offsetydir = a * 3 + y_comp;
const double vya = v[offsetydir];
int offsetzdir = a * 3 + z_comp;
const double vza = v[offsetzdir];
// Compute 'edotu'
double edotu = vxa*ux + vya*uy + vza*uz;
// Compute local euilibrium value for energy equation
double geq = 1.0 + 3.0 * edotu + 4.5 * edotu * edotu - 1.5 * usq;
geq *= T_xyz * w[a];
// Add 'geq' value to 'g_equilibrium' array
int offsetgeq = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
// Collision kernel 'gcl' (same for SRT and MRT)
double gcl = (1.0 - 1.0 / tau_g) * g[offsetgeq];
gcl += geq / tau_g;
// Add source term to energy equations
gcl += delta_t * w[a] * energy_xyz;
g[offsetgeq] = gcl;
} // end loop over directions
}
} // end for loop over x
} // end for loop over y
} // end for loop over z
I will look at your examples and try to understand the logic of the RAJA Teams. Thanks again for your help.
Hi @delcmo, I took a look at your code, and one possible implementation with RAJA Teams which enables running sequentially or with CUDA could look like this:
This first part could go in a header, away from source code since it can be reused in other kernels.
/*
* Define host/device launch policies
*/
using launch_policy = RAJA::expt::LaunchPolicy<
RAJA::expt::seq_launch_t
#if defined(RAJA_ENABLE_CUDA)
,
RAJA::expt::cuda_launch_t<true>
#endif
>;
/*
//Define global thread policies, regular for loops on the CPU, global thread ID on GPU
*/
using global_thread_x = RAJA::expt::LoopPolicy<RAJA::loop_exec
#if defined(RAJA_DEVICE_ACTIVE)
,
RAJA::expt::cuda_global_thread_x
#endif
>;
using global_thread_y = RAJA::expt::LoopPolicy<RAJA::loop_exec
#if defined(RAJA_DEVICE_ACTIVE)
,
RAJA::expt::cuda_global_thread_y
#endif
>;
using global_thread_z = RAJA::expt::LoopPolicy<RAJA::loop_exec
#if defined(RAJA_DEVICE_ACTIVE)
,
RAJA::expt::cuda_global_thread_z
#endif
>;
And the kernel is below. Here I chose to use global thread policies, which flatten the thread/team model to create unique thread IDs based on the compute grid rather than the standard hierarchical structure. On the GPU each loop iteration will be assigned to a thread, while on the CPU the loops fall back to regular for loops.
RAJA::RangeSegment x_range(sx, ex+1);
RAJA::RangeSegment y_range(sy, ey+1);
RAJA::RangeSegment z_range(sz, ez+1);
const int NThreads = 8; //Each team will have 8 threads per dim
//Calculate number of teams needed for iteration space in each dimension
const int NTeams_x = RAJA_DIVIDE_CEILING_INT(x_range.size(),NThreads);
const int NTeams_y = RAJA_DIVIDE_CEILING_INT(y_range.size(),NThreads);
const int NTeams_z = RAJA_DIVIDE_CEILING_INT(z_range.size(),NThreads);
RAJA::expt::launch<launch_policy>(RAJA::expt::HOST, //Choose RAJA::expt::DEVICE or HOST
RAJA::expt::Resources(RAJA::expt::Teams(NTeams_x,NTeams_y,NTeams_z),
RAJA::expt::Threads(NThreads,NThreads,NThreads)),
[=] RAJA_HOST_DEVICE(RAJA::expt::LaunchContext ctx) {
RAJA::expt::loop<global_thread_z>(ctx, z_range, [&] (int z) {
RAJA::expt::loop<global_thread_y>(ctx, y_range, [&] (int y) {
RAJA::expt::loop<global_thread_x>(ctx, x_range, [&] (int x) {
// Shift indices
int xcpp = x - (sx - 1);
int ycpp = y - (sy - 1);
int zcpp = z - (sz - 1);
//cout << "xpp = " << xcpp << " ycpp = " << ycpp << " zcpp = " << zcpp << endl;
// Stored all shifted indices in 'indices' array and number of directions
int indices[7];
indices[0] = xcpp;
indices[1] = ycpp;
indices[2] = zcpp;
indices[3] = rowscpp;
indices[4] = colscpp;
indices[5] = depthcpp;
indices[6] = ndir;
// Compute offset value
int offset = xcpp + rowscpp * (ycpp + colscpp * zcpp);
// Compute local density and components of the 'j' flux
compute_macro_srt(indices,f,g,is_energy,rho_xyz,T_xyz,
jx_xyz,jy_xyz,jz_xyz);
//cout << "x = " << x << " y = " << y << " z = " << z << endl;
//cout << "x = " << x << " ";
//cout << "y = " << y << " ";
//cout << "z = " << z << " ";
//cout << "rho_cpu = " << rho_xyz << " " << offset << " ";
//cout << "jxyz_cpu = " << jx_xyz << " " << jy_xyz << " " << jz_xyz << " " << offset << " ";
// Compute local source terms for NS and energy equations
double force_xyz [] = {0.0, 0.0, 0.0};
double energy_xyz = 0.0;
compute_source_xyz(xcpp,ycpp,zcpp,time,is_energy,
gforceval, E_source,
rho_xyz,T_xyz,jx_xyz,jy_xyz,jz_xyz,
force_xyz,energy_xyz);
// Check if node is a fluid node
if (obs[offset] == 0)
{
// Compute density 'rhop'
const double rhop = is_lbgk ? rho_xyz : rho0;
// Compute 'taup' to be used in collision kernel
double sigma_xyz, taup;
if (is_les)
{
// Compute old values
double rho_xyz_old, T_xyz_old, jx_xyz_old, jy_xyz_old, jz_xyz_old;
compute_macro_srt(indices,f_older,g,is_energy,rho_xyz_old,
T_xyz_old,jx_xyz_old,jy_xyz_old,jz_xyz_old);
// Compute norm of the stress tensor
compute_sigma_les(indices,v,w,f,f_older,rho0,RT,is_lbgk,time,
restart_file_time+1,
rho_xyz_old,jx_xyz_old,jy_xyz_old,jz_xyz_old,
tau[offset],sigma_xyz);
// Compute relaxation time and store it in array 'tau'
taup = tau0;
taup += 0.5 * (sqrt(tau0*tau0 + 18.0*Csmag*Csmag*sigma_xyz) - tau0);
tau[offset] = taup;
}
else
{
taup = tau0;
}
// Extract ux, uy and uz values
const double ux = jx_xyz / rhop;
const double uy = jy_xyz / rhop;
const double uz = jz_xyz / rhop;
// Compute magnitude of velocity vector
const double usq = ux * ux + uy * uy + uz * uz;
// Compute uxcl, uycl and uzcl values for collision kernel
const double uxcl = jx_xyz / rho_xyz;
const double uycl = jy_xyz / rho_xyz;
const double uzcl = jz_xyz / rho_xyz;
// Compute collision term for MRT
if (!is_lbgk)
{
mrt_collision_ns(indices,taup,rho0,is_les,
rho_xyz,jx_xyz,jy_xyz,jz_xyz,
f);
}
// Loop over directions to compute equilibrium and collision kernels
//cout << "x = " << x << " y = " << y << " z = " << z << endl;
for (int a = 0; a < ndir; a++)
{
// Get components for each direction stored in 'v'
int offsetxdir = a * 3 + x_comp;
const double vxa = v[offsetxdir];
int offsetydir = a * 3 + y_comp;
const double vya = v[offsetydir];
int offsetzdir = a * 3 + z_comp;
const double vza = v[offsetzdir];
// Compute 'edotu'
double edotu = vxa*ux + vya*uy + vza*uz;
//cout << " edotu = " << edotu;
// Compute local equillibrium value for NS equation
double feq = 0.0;
if (is_lbgk)
{
feq = 1.0 + 3.0 * edotu + 4.5 * edotu * edotu - 1.5 * usq;
feq *= rho_xyz;
}
else
{
edotu *= 1.0 / RT;
feq = rho_xyz + rho0 * (edotu + 0.5 * edotu * edotu - 0.5 * usq / RT);
}
feq *= w[a];
// Add 'feq' to 'f' array
int offsetfeq = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
//cout << "offsetfeq = " << offsetfeq << " ";
//cout << "dir = " << a << " ";
// Collision kernek 'fcl'
double fcl = 0.0;
if (is_lbgk)
{
fcl = (1.0 - 1.0/taup) * f[offsetfeq] + feq / taup;
//cout << " fcl_cpu = " << fcl << " " << offsetfeq << " " << f[offsetfeq] << " " << feq << " " \
<< jz_xyz << " ";
//cout << " fcl_cpu = " << fcl - f2[offsetfeq] << " ";
f[offsetfeq] = fcl;
}
//cout << " feq_cpu = " << feq << " offsetfeq = " << offsetfeq << " ";
//cout << " feq = " << feq << " fcl = " << fcl;
// Add force terms to the NS equations
fcl = force_xyz[0] * (vxa - uxcl);
fcl += force_xyz[1] * (vya - uycl);
fcl += force_xyz[2] * (vza - uzcl);
fcl *= delta_t * feq / RT;
f[offsetfeq] += fcl;
} // end for loop over direction
//cout << endl;
//cout << endl;
} // end if on fluid node
// Energy equations
if (is_energy)
{
// Compute velocity values to be used in energy equation
const double ux = jx_xyz / rho_xyz;
const double uy = jy_xyz / rho_xyz;
const double uz = jz_xyz / rho_xyz;
// Compute magnitude of velocity vector
const double usq = ux * ux + uy * uy + uz * uz;
// Loop over directions to compute f_equilibrium
for (int a = 0; a < ndir; a++)
{
// Get components for each direction stored in 'v'
int offsetxdir = a * 3 + x_comp;
const double vxa = v[offsetxdir];
int offsetydir = a * 3 + y_comp;
const double vya = v[offsetydir];
int offsetzdir = a * 3 + z_comp;
const double vza = v[offsetzdir];
// Compute 'edotu'
double edotu = vxa*ux + vya*uy + vza*uz;
// Compute local euilibrium value for energy equation
double geq = 1.0 + 3.0 * edotu + 4.5 * edotu * edotu - 1.5 * usq;
geq *= T_xyz * w[a];
// Add 'geq' value to 'g_equilibrium' array
int offsetgeq = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
// Collision kernel 'gcl' (same for SRT and MRT)
double gcl = (1.0 - 1.0 / tau_g) * g[offsetgeq];
gcl += geq / tau_g;
// Add source term to energy equations
gcl += delta_t * w[a] * energy_xyz;
g[offsetgeq] = gcl;
} // end loop over directions
}
});
});
});
});
One additional detail to note is that I am using macros to guard the CUDA code; this is needed in case we build on a CPU-only platform where the CUDA function calls do not exist.
Additionally, there is a memory manager under examples that will allocate with unified memory when we have a CUDA build, or with new when a CPU-only build is detected - this enables portability when building on specific platforms.
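A minimal sketch of that kind of helper (just the idea, not the actual memoryManager from the RAJA examples):
// Sketch: allocate unified memory for a CUDA build, plain heap memory otherwise.
#include <cstddef>

template <typename T>
T* allocate(std::size_t n)
{
#if defined(RAJA_ENABLE_CUDA)
  T* ptr = nullptr;
  cudaMallocManaged(&ptr, n * sizeof(T));
  return ptr;
#else
  return new T[n];
#endif
}

template <typename T>
void deallocate(T* ptr)
{
#if defined(RAJA_ENABLE_CUDA)
  cudaFree(ptr);
#else
  delete[] ptr;
#endif
}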
@artv3,
thanks for providing the code. I was looking at it and noticed that some of the options like cuda_launch_t or cuda_global_thread_x are not in the documentation. Are these options specific to Teams? Is there documentation for Teams? Is there a Doxygen?
Thanks,
Hi @delcmo , yes those are specific to Teams. There is partial documentation in the develop branch of Read the Docs; I think we currently point the docs to our last released version. cuda_global_thread_x is also a relatively new feature and not yet documented, I'll make a note to add it to the documentation.
@artv3
in the post where you converted the code from C++ to RAJA, you left the loop over the directions unchanged.
for (int a = 0; a < ndir; a++)
Was this by design? How would RAJA handle that loop in the current implementation?
That was a choice I made. Given that the parallelism comes from the top three loops, those inner loops would always be executed sequentially. Converting it into a RAJA loop wouldn't add more capability.
If we chose to convert it to a RAJA loop it could look something like this:
///Execute for loop on either host or device. (Add this policy header)
using seq_loop = RAJA::expt::LoopPolicy<RAJA::loop_exec, RAJA::loop_exec>;
RAJA::expt::loop<seq_loop>(ctx, RAJA::RangeSegment(0, ndir), [&] (int a) {
. . .
});
Additionally, since the loop methods capture by reference, we are able to modify/access anything in the loop hierarchy; this is a different design approach from RAJA::kernel, which requires passing parameters by reference into the lambda arguments.
Hi @artv3,
if I understand you correctly, I can keep the C++ for loop syntax as it is. Does that mean each GPU thread will perform a for loop over the directions?
Yes! Standard C++ loops will be executed sequentially by each thread.
@artv3,
I was able to modify my code and add the RAJA syntax to get it to work on GPUs. I still have some testing to do but I am on the right track. In the meantime I have started to look at a different part of the code that is a little bit more complex:
for (int z = sz; z <= ez; z++)
{
int zcpp = z - (sz - 1);
for (int y = sy; y <= ey; y++)
{
for (int x = sx; x <= ex; x++)
{
// Shift indices
int xcpp = x - (sx - 1);
int ycpp = y - (sy - 1);
a = 2 - 1;
offset1 = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
offset2 = xcpp + 1 + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
f[offset1] = f[offset2];
if (is_energy) g[offset1] = g[offset2];
a = 4 - 1;
offset1 = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
offset2 = xcpp + rowscpp * (ycpp + 1 + colscpp * (zcpp + depthcpp * a) );
f[offset1] = f[offset2];
if (is_energy) g[offset1] = g[offset2];
a = 6 - 1;
offset1 = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
offset2 = xcpp + rowscpp * (ycpp + colscpp * (zcpp + 1 + depthcpp * a) );
f[offset1] = f[offset2];
if (is_energy) g[offset1] = g[offset2];
a = 10 - 1;
offset1 = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
offset2 = xcpp + 1 + rowscpp * (ycpp + 1 + colscpp * (zcpp + depthcpp * a) );
f[offset1] = f[offset2];
if (is_energy) g[offset1] = g[offset2];
a = 14 - 1;
offset1 = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
offset2 = xcpp + 1 + rowscpp * (ycpp + colscpp * (zcpp + 1 + depthcpp * a) );
f[offset1] = f[offset2];
if (is_energy) g[offset1] = g[offset2];
a = 18 - 1;
offset1 = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
offset2 = xcpp + rowscpp * (ycpp + 1 + colscpp * (zcpp + 1 + depthcpp * a) );
f[offset1] = f[offset2];
if (is_energy) g[offset1] = g[offset2];
}
}
}
The integer a denotes a direction. For each direction a, the solution is streamed from index offset2 to offset1. When running the nested for loops over the x, y and z coordinates, it is important to preserve the order in which the entries of the arrays f and g are updated: entry offset2 cannot be modified before entry offset1. Is there a way in RAJA to maintain that order when running the nested for loops on different threads?
The other question I have is related to the different directions a: the work being done on each direction is independent of the others (a = 18 - 1 could be performed at the same time as a = 14 - 1 on two different threads). Is there a way in RAJA to optimize the above syntax to take advantage of the thread architecture?
Thanks, Marco
Hi @delcmo, if I understand correctly -- and we can iterate on this.
Regarding the first part, I think you can use the same global policies for the outer for loops (z,y,x); this will create unique threads and the loop body will be executed sequentially by each thread.
For the second question, if you express the loop body as a for loop over the 6 values where you compute a, offset1, and offset2, then we can use the RAJA loop methods to assign that work to threads in the x-direction (which will get executed in parallel), the x loop could be assigned to threads in the y-direction, and the remaining two loops {y,z} could be assigned to blocks in the x and y directions -- basically hierarchical parallelism. The key though is being able to express the work as for loops.
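A rough sketch of the policy mapping I have in mind is below. The cuda_* policy names follow the Teams policies used in the RAJA examples, but they may differ between RAJA versions, so treat this as an outline rather than drop-in code:
// Sketch: directions -> threads in x, the x loop -> threads in y,
// the y and z loops -> blocks in x and y.
using dir_pol = RAJA::expt::LoopPolicy<RAJA::loop_exec
#if defined(RAJA_DEVICE_ACTIVE)
                                       , RAJA::expt::cuda_thread_x_loop
#endif
                                       >;
using x_pol = RAJA::expt::LoopPolicy<RAJA::loop_exec
#if defined(RAJA_DEVICE_ACTIVE)
                                     , RAJA::expt::cuda_thread_y_loop
#endif
                                     >;
using y_pol = RAJA::expt::LoopPolicy<RAJA::loop_exec
#if defined(RAJA_DEVICE_ACTIVE)
                                     , RAJA::expt::cuda_block_x_loop
#endif
                                     >;
using z_pol = RAJA::expt::LoopPolicy<RAJA::loop_exec
#if defined(RAJA_DEVICE_ACTIVE)
                                     , RAJA::expt::cuda_block_y_loop
#endif
                                     >;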
I was able to re-write that piece of code using a fourth nested loop as follows:
int a, offset1, offset2;
for (int z = sz; z <= ez; z++)
{
// Define local arrays to simplify implementation
int nb_a = 6;
int aa [] = {2-1, 4-1, 6-1, 10-1, 14-1, 18-1};
int xcppa [] = {1, 0, 0, 1, 1, 0};
int ycppa [] = {0, 1, 0, 1, 0, 1};
int zcppa [] = {0, 0, 1, 0, 1, 1};
// Shift indices
int zcpp = z - (sz - 1);
for (int y = sy; y <= ey; y++)
{
// Shift indices
int ycpp = y - (sy - 1);
for (int x = sx; x <= ex; x++)
{
// Shift indices
int xcpp = x - (sx - 1);
// Loop over directions stored in 'aa'
for (int i = 0; i < nb_a; i++)
{
const int a = aa[i];
offset1 = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
offset2 = xcpp + xcppa[i] + rowscpp * (ycpp + ycppa[i] + colscpp * (zcpp + zcppa[i] + depthcpp * a) );
f[offset1] = f[offset2];
if (is_energy) g[offset1] = g[offset2];
}
}
}
}
The above implementation should be easier to convert using RAJA syntax. Would that be appropriate?
I have a question about one of the previous iterations of the piece of code:
const int NThreads = 8; //Each team will have 8 threads per dim
//Calculate number of teams needed for iteration space in each dimension
const int NTeams_x = RAJA_DIVIDE_CEILING_INT(x_range.size(),NThreads);
const int NTeams_y = RAJA_DIVIDE_CEILING_INT(y_range.size(),NThreads);
const int NTeams_z = RAJA_DIVIDE_CEILING_INT(z_range.size(),NThreads);
RAJA::expt::launch<launch_policy>(RAJA::expt::HOST, //Choose RAJA::expt::DEVICE or HOST
RAJA::expt::Resources(RAJA::expt::Teams(NTeams_x,NTeams_y,NTeams_z),
RAJA::expt::Threads(NThreads,NThreads,NThreads)),
[=] RAJA_HOST_DEVICE(RAJA::expt::LaunchContext ctx) {
Under the current implementation, the code would use 8 threads per dimension. How does that relate to the number of GPUs on each node? I run on Summit and there are 6 GPUs per node. I tried to run on the GPUs after changing RAJA::expt::launch<launch_policy>(RAJA::expt::HOST to RAJA::expt::launch<launch_policy>(RAJA::expt::DEVICE and I get an error message:
terminate called after throwing an instance of 'std::runtime_error'
what(): CUDAassert
Program received signal SIGABRT: Process abort signal
Hi @delcmo , yes the proposed code would work.
In regards to the error..... I am actually not sure. Could you try running it through cuda-gdb and share a stack trace? I'm wondering if data is not on the GPU or the number of teams are too large. Could you share the sizes of NTeams_ ?
Additionally RAJA only targets 1 GPU, it is through combining MPI + RAJA that we can target multiple GPUs.
Also, the configuration of threads corresponds to 8^3 = 512 threads per team (a CUDA thread block, which we call a team). The number of blocks is then calculated from the values of NTeams_. Here we are computing on a 3D compute grid.
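For example, with an iteration space of 20 points per dimension and NThreads = 8, RAJA_DIVIDE_CEILING_INT(20, 8) = 3, so the launch would request a 3x3x3 grid of teams, each team running 8x8x8 = 512 threads.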
The code works with NTeams = 8 but gives me the error message when I set NTeams = 10 or higher.
Backtrace for this error:
CUDAassert: invalid configuration argument
/gpfs/alpine/proj-shared/cfd136/raja/build-210412c/include/RAJA/policy/cuda/MemUtils_CUDA.hpp
210
terminate called after throwing an instance of 'std::runtime_error'
what(): CUDAassert
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
ERROR: One or more process (first noticed rank 6) terminated with signal 6
(core dumped)
To double check, are you choosing values for NThreads or NTeams_{x,y,z}? The max number of threads in CUDA per block (RAJA Team) is 1024, so NThreads = 10 should work... I'll see if I can reproduce locally.
I set the number of threads as it is in the original code you provided in the previous email. The number of teams is calculated from
const int NTeams_x = RAJA_DIVIDE_CEILING_INT(x_range_teams.size(),NThreads);
const int NTeams_y = RAJA_DIVIDE_CEILING_INT(y_range_teams.size(),NThreads);
const int NTeams_z = RAJA_DIVIDE_CEILING_INT(z_range_teams.size(),NThreads);
where x_range_teams is 20 (the number of mesh elements in that direction); y_range_teams and z_range_teams are also set to 20.
I ran some more tests and the piece of code runs fine with any value of NThreads <= 8. When the number of teams falls below 4 (NThreads >= 9), I get a CUDA error message. The number of teams is computed from
const int NTeams_x = RAJA_DIVIDE_CEILING_INT(x_range_teams.size(),NThreads);
const int NTeams_y = RAJA_DIVIDE_CEILING_INT(y_range_teams.size(),NThreads);
const int NTeams_z = RAJA_DIVIDE_CEILING_INT(z_range_teams.size(),NThreads);
Does it work if you try launching an empty kernel with the values used directly?
RAJA::expt::launch<launch_policy>
(RAJA::expt::DEVICE,
RAJA::expt::Resources(RAJA::expt::Teams(10, 10, 10), RAJA::expt::Threads(8, 8, 8)),
[=] RAJA_HOST_DEVICE(RAJA::expt::LaunchContext ctx) {
printf("running cuda kernel \n");
}); // outer lambda
The error hints that the compute grid is not being configured correctly, but I'm not sure what is going on.
I added to the code the above lines and run it to get the following:
running cuda kernel
...
running cuda kernel
ERROR: One or more process (first noticed rank 3) terminated with signal 6
(core dumped)
It outputs running cuda kernel for a while and then stops with the above error message.
Can you try running the raja-teams example under examples?
I will have to try the example, but I ran into another problem lately that I can reproduce with the piece of code below:
RAJA::expt::launch<launch_policy>(RAJA::expt::DEVICE,
RAJA::expt::Resources(RAJA::expt::Teams(NTeams_x,NTeams_y,NTeams_z),
RAJA::expt::Threads(NThreads_x,NThreads_y,NThreads_z)),
[=] RAJA_HOST_DEVICE(RAJA::expt::LaunchContext ctx) {
RAJA::expt::loop<global_thread_z>(ctx, z_range_teams, [&] (int z) {
RAJA::expt::loop<global_thread_y>(ctx, y_range_teams, [&] (int y) {
RAJA::expt::loop<global_thread_x>(ctx, x_range_teams, [&] (int x) {
// Shift indices
int xcpp_ = x - (sev_(0) - 1); // sx = sev_(0)
int ycpp_ = y - (sev_(1) - 1); // sy = sev_(1)
int zcpp_ = z - (sev_(2) - 1); // sz = sev_(2)
int offset_ = xcpp_ + dims_(0) * (ycpp_ + dims_(1) * zcpp_ );
// Compute macro variables
int rowscpp_ = dims_(0);
int colscpp_ = dims_(1);
int depthcpp_ = dims_(2);
int ndir_ = sev_(3);
// Fluid nodes
if (obs_(offset_) == 0)
{
//printf("cuda test outside loop %d ", ndir_);
// Loop over directions to compute equilibrium and collision kernels
//printf("Before for loop");
for (int a = 1; a < ndir_; a++)
{
printf("cuda test inside loop: %d \n", a);
}
}
});
});
});
});
The output I get is the following:
cuda test inside loop: 1
which I interpret as the for loop not being correctly evaluated. I am not sure I understand why all of the integer values are not printed to the screen, since I have four nested loops and the fourth loop has to be performed sequentially.
When I look at the RAJA Teams examples, the C++ for loop syntax is not used but replaced with loop_icount. Should I be using loop_icount instead of the for() syntax for the fourth nested loop?
Hi @delcmo , no the loop_icount method is only needed when performing tiling. Where is ndir_ defined? If it is a member or global value, it could be an issue when it comes to lambda capturing. Something else to try is adding cudaDeviceSynchronize(); after your kernel.
Ok thanks for the clarification. I added cudaDeviceSynchronize(); and it seems to output the correct information. ndir_ is defined but I removed that line in the post.
I am still struggling with the error message related to
CUDAassert: too many resources requested for launch
/gpfs/alpine/proj-shared/cfd136/raja/build-210412c/include/RAJA/policy/cuda/MemUtils_CUDA.hpp
210
terminate called after throwing an instance of 'std::runtime_error'
what(): CUDAassert
For some reason, it is triggered this time by adding a printf() statement to the kernel. Could it be related to saturated memory?
Also, when using the RAJA::View class, are there any specific rules to be aware of? If defining RAJA::View<double, RAJA::Layout<1>> f_(f, sizef); can I read and write the values of f_ from a kernel without limitation? Should I restrict the use of RAJA::View to a minimum in my C++ code? I am seeing weird behavior that I cannot explain when using f_-like objects. If I try to read f_ and update it from a kernel I also get the above error message.
Hi @delcmo , the only rule I can think of is that the data in the view has to be in the same memory space that you are trying to access it. If you use unified memory, data transfers should be taken care of for you.
Regarding your first error, I just thought of something. Can you try reducing the number of threads per team, maybe to 6? I came across this article: https://stackoverflow.com/questions/26201172/cuda-too-many-resources-requested-for-launch
and it could be that your kernel uses too much register space. One way to check this would be to comment out the code inside the ::launch method; if it runs, it is probably maxing out register space.
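If you want to see the actual per-kernel register usage, one option (assuming an nvcc-based build) is to pass the verbose PTX assembler flag at compile time, e.g. -Xptxas -v (or --ptxas-options=-v); nvcc will then print the number of registers used per thread for each kernel it compiles.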
I reduced the number of threads and you are correct, it runs. I was previously using the maximum number of threads (Nthreads_x = Nthreads_y = 8 and Nthreads_z = 16). My lack of experience with CUDA programming cost me a few days of work ... ;)
That being said, I now wonder what would be the correct way to set the number of threads?
Choosing the number of threads is a bit of an art, it's hard to say. It depends on your kernel, how much local memory it needs. It is definitely something to play around with. Happy it is now working though.
One suggestion I have though is to change your CUDA launch policy from RAJA::expt::cuda_launch_t<true> to RAJA::expt::cuda_launch_t<false>. This will perform synchronizations after CUDA kernels.
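Concretely, that is just a change to the launch_policy alias from the earlier snippet:
using launch_policy = RAJA::expt::LaunchPolicy<
   RAJA::expt::seq_launch_t
#if defined(RAJA_ENABLE_CUDA)
   ,
   RAJA::expt::cuda_launch_t<false> // false: synchronize after the CUDA kernel
#endif
   >;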
Additionally, the latest RAJA now has CUDA kernels running on the non-default CUDA stream, and if you allocate memory using cudaMallocManaged I think it might be the case that you can access the data on the host before the kernel is done with it. ping: @trws is this right? Could we have race conditions if RAJA is on one stream but we don't specify which stream to use when using unified memory?
One additional option in RAJA is to configure CAMP to use the CUDA default stream. That is done by defining the following variable at compile time: CAMP_USE_PLATFORM_DEFAULT_STREAM.
I stopped using cudaMallocManaged and exclusively rely on the capabilities of the RAJA::View class. I will make the changes to the code based on your recommendations.
What does default stream mean?
How many teams can I use in my code? Am I limited to three, i.e. three nested loops, or can I declare a fourth team to have four nested loops?
A CUDA stream is the queue in which instructions are stored. CUDA has multiple queues so kernels can be executed asynchronously. With unified memory (cudaMallocManaged) we have to make sure we sync before we touch the data on the CPU.
Teams/Threads support up to 3 dimensions (like the CUDA programming model).
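In practice that means something like the following when the data was allocated with cudaMallocManaged and an asynchronous launch policy (cuda_launch_t<true>) is in use (a sketch, reusing names from your earlier snippets):
RAJA::expt::launch<launch_policy>(RAJA::expt::DEVICE,
    RAJA::expt::Resources(RAJA::expt::Teams(NTeams_x,NTeams_y,NTeams_z),
                          RAJA::expt::Threads(NThreads,NThreads,NThreads)),
    [=] RAJA_HOST_DEVICE (RAJA::expt::LaunchContext ctx) {
      // ... kernel body that writes through f_ptr ...
    });
cudaDeviceSynchronize();   // make sure the kernel has finished
double first = f_ptr[0];   // now safe to touch the data on the CPU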
Regarding the RAJA::View, it only holds a pointer, so we are still responsible for moving the data to and from the device.
Which machine are you running on? If you try to compute on the GPU using data from the CPU memory space it would normally segfault.
I run on Summit at ORNL. I will double check the CUDA examples and my code because, based on your previous email, I clearly have something weird in my code.
If it's like Sierra, it's probably moving the data for you. Try running your code with the following flag: --atsdisable with lalloc
I will give it a try, but that would make sense because I did check that the code is running on the GPUs and I no longer get a CUDA error message. I assume that I will have to modify the code to make it compatible with any CUDA machine and not rely on the specifics of Summit.
You are correct, the data are being moved.
I am doing some testing on both CPU and GPU to assess the speedup we now get from using the GPU over the CPU. One thing I am not clear on is what the number of threads means for CPUs when entering the RAJA loop tool. Is that ignored by the CPU, or do I have to be careful what value is used?
On the CPU, the number of teams/threads gets ignored; RAJA loops get converted to standard for loops.
Ok thanks for the quick reply. I will make sure this is correct in my code by running on CPUs with different numbers of threads.
Thanks,
Hello,
I have recently started to use the RAJA package to port some of my C++ subroutines to GPUs, and I was wondering how to call a C++ function from a RAJA kernel. I could not find any related issues and the documentation does not seem to address this.
Does RAJA support calling a C++ function from a kernel? If it does, could you please point me to an example or any documentation?
Thanks,
Marco