LLNL / RAJA

RAJA Performance Portability Layer (C++)
BSD 3-Clause "New" or "Revised" License

Function call from a kernel #1057

Open delcmo opened 3 years ago

delcmo commented 3 years ago

Hello,

I have recently started to use the RAJA package to port some of my C++ subroutines to GPUs, and I was wondering how to call a C++ function from a RAJA kernel. I could not find any related issues, and the documentation does not seem to address this.

Does RAJA support calling a C++ function from a kernel? If it does, could you please point me to an example or any documentation?

Thanks,

Marco

rhornung67 commented 3 years ago

You can call a function inside a kernel. This is completely independent of RAJA. Here's an example: https://github.com/LLNL/RAJAPerf/blob/develop/src/basic/TRAP_INT-Cuda.cpp#L139 The macro defining the kernel body is here: https://github.com/LLNL/RAJAPerf/blob/develop/src/basic/TRAP_INT.hpp#L39 and the function being called is defined here: https://github.com/LLNL/RAJAPerf/blob/develop/src/basic/TRAP_INT-Cuda.cpp#L27

delcmo commented 3 years ago

@rhornung67 thanks for your email. In the example you provided, the function is defined in the RAJA space:

RAJA_INLINE
RAJA_DEVICE
Real_type trap_int_func(Real_type x,
                        Real_type y,
                        Real_type xp,
                        Real_type yp)
{
   Real_type denom = (x - xp)*(x - xp) + (y - yp)*(y - yp);
   denom = 1.0/sqrt(denom);
   return denom;
}

My code was written for CPU platforms and all functions are written in C++. Based on the example you provided, I would have to re-define / convert all C++ functions to RAJA so that they can be called from a RAJA kernel. Am I correct?

rhornung67 commented 3 years ago

What do you mean by "RAJA space"? What do you mean by "convert all C++ functions to RAJA"?

You shouldn't have to convert any functions; just add the appropriate host/device annotations so the compiler will generate appropriate versions that are callable in host or device code.

delcmo commented 3 years ago

In the above example, the variables RAJA_DEVICE and RAJA_INLINE precede the definition of the C++ function. If I were to run the same C++ function on a CPU machine, I would have to change the value of the variables RAJA_DEVICE and RAJA_INLINE. Where are these variables defined?

rhornung67 commented 3 years ago

The macros are defined here: https://github.com/LLNL/RAJA/blob/develop/include/RAJA/util/macros.hpp#L41. We use them inside RAJA and provide them to users as a convenience. You don't have to use them; you can define your own if you want.
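
For illustration, a minimal sketch of rolling your own macros (the MY_* names here are hypothetical, not part of RAJA):

// my_annotations.hpp -- hypothetical user-defined macros
#if defined(__CUDACC__) || defined(__HIPCC__)
  #define MY_HOST_DEVICE __host__ __device__
#else
  #define MY_HOST_DEVICE              // expands to nothing in a CPU-only build
#endif

MY_HOST_DEVICE
inline double square(double x) { return x * x; }   // callable from host or device code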

The questions you are asking are related to topics independent of RAJA. For example, if you wrote native CUDA code, you still have to deal with making your functions callable from host and/or device code sections as needed. There is no avoiding that.

artv3 commented 3 years ago

Hi @delcmo, RAJA_DEVICE is just a macro for CUDA/HIP's __device__ qualifier. In general, if you want to call a function on the device you need to add the __device__ qualifier. For example:

__device__
void foo()
{
}

RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(RAJA::RangeSegment(0, 10),
  [=] __device__ (int i) {
    foo();
});

If you have an example, I can try to help. Additionally you can add both __host__ __device__ to the functions and they will be callable from the CPU or GPU.

delcmo commented 3 years ago

Hi @artv3 thanks for your reply. This is what I am after. I just started with RAJA and may have overlooked a few things in the documentation.

If I understand your response correctly, by including __device__ in the definition of the function foo() I should be able to call it from either the CPU or GPU. I will have to try it with a simple hello function.

artv3 commented 3 years ago

> Hi @artv3 thanks for your reply. This is what I am after. I just started with RAJA and may have overlooked a few things in the documentation.
>
> If I understand your response correctly, by including __device__ in the definition of the function foo() I should be able to call it from either the CPU or GPU. I will have to try it with a simple hello function.

If you want the option to call it from either the CPU or GPU, you will need to include the __host__ qualifier as well. For example:


__host__ __device__
void foo()
{
  printf("Hello world");
}

foo(); // call on the CPU

RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(RAJA::RangeSegment(0, 10),
  [=] __host__ __device__ (int i) {
    foo(); // call from the GPU
});

delcmo commented 3 years ago

Ok, perfect let me try this and see if I can get it to compile and run. Thanks for taking the time to reply and to provide that working example. I will get back to you if I have any questions.

artv3 commented 3 years ago

Sounds good, I can definitely help. I use RAJA in a couple of applications and can help with integration/usage questions.

delcmo commented 3 years ago

Hi,

I was able to compile and call the function from both the CPU and GPU using the following syntax:

__device__ __host__
void foo()
{
  printf("Hello world");
}

foo(); // call from the CPU

cout << "GPU call" << endl;
const int CUDA_BLOCK_SIZE = 16;
RAJA::forall<RAJA::cuda_exec<CUDA_BLOCK_SIZE>>(RAJA::RangeSegment(0, 10),
  [=] __device__ (int i) {
  foo(); //  call from the GPU.
});
cout << "end GPU call" << endl;

I am now trying to convert one of my functions to get it to run on the GPU:

__host__ __device__
void compute_macro_srt2(int indices [], double *f, double *g,
                       bool is_energy, double &rho_xyz, double &T_xyz,
                       double &jx_xyz, double &jy_xyz, double &jz_xyz)
{
  int xcpp = indices[0];
  int ycpp = indices[1];
  int zcpp = indices[2];
  int rowscpp = indices[3];
  int colscpp = indices[4];
  int depthcpp = indices[5];
  int ndir = indices[6];

  int vjx [] = {1,-1,0,0,0,0,1,-1,1,-1,1,-1,1,-1,0,0,0,0,0};
  int vjy [] = {0,0,1,-1,0,0,1,1,-1,-1,0,0,0,0,1,-1,1,-1,0};
  int vjz [] = {0,0,0,0,1,-1,0,0,0,0,1,1,-1,-1,1,1,-1,-1,0};

  rho_xyz = 0.0;
  T_xyz = 0.0;
  jx_xyz = 0.0;
  jy_xyz = 0.0;
  jz_xyz = 0.0;
  for (int d = 0; d < ndir; d++)
  {
    int offsetxyza = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * d) );
    rho_xyz += f[offsetxyza];
    if (is_energy) T_xyz += g[offsetxyza];
    jx_xyz += vjx[d] * f[offsetxyza];
    jy_xyz += vjy[d] * f[offsetxyza];
    jz_xyz += vjz[d] * f[offsetxyza];
  }

}

This function computes the macro variables (density, temperature and momentum components) in a Lattice Boltzmann Method code. I encountered two issues here:

  1. The variables *f and *g are distribution functions and are passed to the GPU using:

    const int sizef = ndir * (rowscpp) * (colscpp) * (depthcpp);
    RAJA::View<double, RAJA::Layout<1>> f_(f, sizef);
    RAJA::View<double, RAJA::Layout<1>> g_(g, sizef);

    When calling the function compute_macro_srt2 with f_ and g_ I get a compilation error message:

    error: no suitable conversion function from "const RAJA::View<double, RAJA::Layout<1UL, RAJA::Index_type, -1L>, double *>" to "double *" exists

    The code compiles and runs fine if I use the following method to declare f_:

    double *f_;
    cudaMallocManaged(&f_, sizef*sizeof(double));

    What is the correct way to pass an array from the host to the device? Is my approach correct here? If not how should I change it?

  2. The second issue is related to the function compute_macro_srt2 itself. Initially that function was declared and defined in a different file. I naively added the __host__ and __device__ qualifiers to its definition but got the following compilation error message:

    
    /gpfs/alpine/cfd136/proj-shared/delchini/lbmapp/PRATHAM/src/cpp_mrtutilities.C(24): warning: a __host__ function("compute_macro_srt2") redeclared with __host__ __device__, hence treated as a __host__ __device__ function

/gpfs/alpine/cfd136/proj-shared/delchini/lbmapp/PRATHAM/src/cpp_mrtcollision.C(417): error: calling a host function("compute_macro_srt2") from a device function(" const") is not allowed

/gpfs/alpine/cfd136/proj-shared/delchini/lbmapp/PRATHAM/src/cpp_mrtcollision.C(417): error: identifier "compute_macro_srt2" is undefined in device code


I avoided the above issue by moving the function from `cpp_mrtutilities.C` to `cpp_mrtcollision.C` where it is being called. 

How do I make a function defined in a different file available both on the host and the device?

Thanks,
rhornung67 commented 3 years ago

@delcmo Sounds like you're getting the hang of things. This is great.

1) Why are you making RAJA Views if you are not using them in your kernel? The args to your function are double* types. View types do not convert to that type. Also, Views do not allocate memory. They hold a pointer that you pass to their constructor. They are mostly used to simplify multi-dimensional indexing, pointer offset computations, etc.

Also, a word of caution. cudaMalloc calls are expensive, much more so than CPU malloc calls. Moreover, cudaMallocManaged calls are even more costly, and you also incur the cost of host-device memory transfers triggered by runtime page faults. The allocation sizes will be an integer multiple of the page size, which is large, even if you think you are allocating only a few bytes. On systems like Summit or Sierra, managed-memory transfers are implemented in hardware. If you run on other devices that do not support managed memory in hardware, the transfers are handled in runtime software and are more expensive. If you are going this route just to get the code to run, that's fine. But you need to think carefully about host-device memory management and should try to minimize memory space transfers as much as possible for performance reasons.
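
For example, a sketch of explicit device allocation with a single up-front copy (assuming sizef and a filled host array, so transfers stay under your control rather than being page-fault driven):

double *f_host = new double[sizef];            // filled on the host
double *f_dev  = nullptr;
cudaMalloc(&f_dev, sizef * sizeof(double));    // device-only allocation, no paging
cudaMemcpy(f_dev, f_host, sizef * sizeof(double), cudaMemcpyHostToDevice); // one explicit copy in
// ... launch kernels that read/write f_dev ...
cudaMemcpy(f_host, f_dev, sizef * sizeof(double), cudaMemcpyDeviceToHost); // one explicit copy out
cudaFree(f_dev);
delete[] f_host;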

2) It's difficult to answer this without knowing the organization of your source code and how you are building/linking it.

artv3 commented 3 years ago

Hi @delcmo,

Regarding 1.

The compute_macro_srt2 function signature expects a double * for f, g, not RAJA::View<double, RAJA::Layout<1>> types which is what led to the compilation error.

If you would like to pass in RAJA::View<double, RAJA::Layout<1>> then the function signature would have to change to

void compute_macro_srt2(int indices [], RAJA::View<double, RAJA::Layout<1>> f,  RAJA::View<double, RAJA::Layout<1>> g,
                       bool is_energy, double &rho_xyz, double &T_xyz,
                       double &jx_xyz, double &jy_xyz, double &jz_xyz)

and inside the function you would have to use the parentheses operator to access data in f and g. The role of the Views, however, is not to transfer data from the CPU to the GPU, but rather to enable different reshapings of the data. For example, you could choose to treat the data as a 4-dimensional array using Views:

RAJA::View<double, RAJA::Layout<4>> f_(f, ndir, rowscpp, colscpp, depthcpp);
f_(dir, row, col, depth); // access data this way

Using the cudaMallocManaged method as you have is fine. In fact using cudaMallocManaged makes it so that the data is available on both the host and the device (CUDA would take care of moving data automatically).

Regarding 2.

I believe you want to add __host__ __device__ to both the header and implementation file.
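
For example, a rough sketch of what that might look like for your function (assuming both files are compiled with nvcc):

// cpp_mrtutilities.h -- declaration carries the qualifiers
__host__ __device__
void compute_macro_srt2(int indices [], double *f, double *g,
                        bool is_energy, double &rho_xyz, double &T_xyz,
                        double &jx_xyz, double &jy_xyz, double &jz_xyz);

// cpp_mrtutilities.C -- definition carries the same qualifiers
__host__ __device__
void compute_macro_srt2(int indices [], double *f, double *g,
                        bool is_energy, double &rho_xyz, double &T_xyz,
                        double &jx_xyz, double &jy_xyz, double &jz_xyz)
{
  // ... same body as before ...
}

Calling a __device__ function that lives in a different translation unit generally also requires building with relocatable device code (e.g. nvcc -rdc=true) so the device linker can resolve the call.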

delcmo commented 3 years ago

Hi @rhornung67 and @artv3,

thanks for the reply. The piece of code I provided does not include the RAJA Views. Below is the full code:

  RAJA::View<int, RAJA::Layout<1>> sev_(sev, 11);
  RAJA::View<double, RAJA::Layout<1>> params_(params, 9);

  int dims [] = {rowscpp, colscpp, depthcpp, ndir};
  RAJA::View<int, RAJA::Layout<1>> dims_(dims, 4);

  int dim_comp [] = {0, 1, 2}; // {1, 2, 3} in Fortran 90
  RAJA::View<int, RAJA::Layout<1>> dim_comp_(dim_comp, 3);

  // Allocate memory for the distribution functions 'f' and 'g' on the GPU
  const int sizef = ndir * (rowscpp) * (colscpp) * (depthcpp);
  RAJA::View<double, RAJA::Layout<1>> f_(f, sizef);
  double *f2_, *g2_;
  cudaMallocManaged(&f2_, sizef*sizeof(double));
  f2_ = f;
  cudaMallocManaged(&g2_, sizef*sizeof(double));
  g2_ = g;

  RAJA::View<double, RAJA::Layout<1>> g_(g, sizef);

  // Vectors to compute momentum components
  int vjx [] = {1,-1,0,0,0,0,1,-1,1,-1,1,-1,1,-1,0,0,0,0,0};
  int vjy [] = {0,0,1,-1,0,0,1,1,-1,-1,0,0,0,0,1,-1,1,-1,0};
  int vjz [] = {0,0,0,0,1,-1,0,0,0,0,1,1,-1,-1,1,1,-1,-1,0};

  RAJA::View<int, RAJA::Layout<1>> vjx_(vjx, ndir);
  RAJA::View<int, RAJA::Layout<1>> vjy_(vjy, ndir);
  RAJA::View<int, RAJA::Layout<1>> vjz_(vjz, ndir);

  // Pass vector 'obs' to RAJA
  const int sizexyz = (rowscpp) * (colscpp) * (depthcpp);
  RAJA::View<int, RAJA::Layout<1>> obs_(obs, sizexyz);

  // Pass vector 'v' to RAJA
  RAJA::View<int, RAJA::Layout<1>> v_(v, 3*ndir);

  // Pass vector 'w' to RAJA
  RAJA::View<double, RAJA::Layout<1>> w_(w, ndir);

  // Load RAJA tools
  using RAJA::Segs;
  using RAJA::Params;

  // Range segment for x, y, z and directions
  RAJA::RangeSegment dir_range(0, ndir);
  RAJA::RangeSegment x_range(1, ex - sx + 2);
  RAJA::RangeSegment y_range(1, ey - sy + 2);
  RAJA::RangeSegment z_range(1, ez - sz + 2);

  // CUDA kernel policy
  using EXEC_POL5 =
    RAJA::KernelPolicy<
      RAJA::statement::CudaKernel<
        RAJA::statement::For<3, RAJA::cuda_thread_z_loop, // RAJA::cuda_block_z_loop,
          RAJA::statement::For<2, RAJA::cuda_thread_y_loop, // RAJA::cuda_block_y_loop,
            RAJA::statement::For<1, RAJA::cuda_thread_x_loop, // RAJA::cuda_block_x_loop,
            RAJA::statement::Lambda<0, Segs<1,2,3>, Params<0,1,2,3,4> >,
            RAJA::statement::For<0, RAJA::seq_exec, // cuda_thread_x_loop,  // RAJA::cuda_block_x_loop,
              RAJA::statement::Lambda<1, Segs<0,1,2,3>, Params<0,1,2,3,4> >
            >,
            RAJA::statement::Lambda<2, Segs<1,2,3>, Params<0,1,2,3,4> >,
            RAJA::statement::For<0, RAJA::seq_exec, // RAJA::cuda_block_x_loop,  // RAJA::cuda_block_x_loop,
              RAJA::statement::Lambda<3, Segs<0,1,2,3>, Params<0,1,2,3,4> >
            >
            >
          >
        >
      >
    >;

  RAJA::kernel_param<EXEC_POL5>(
    RAJA::make_tuple(dir_range, x_range, y_range, z_range),
    RAJA::tuple<double, double, double, double, double>{0.0, 0.0, 0.0, 0.0, 0.0},    // thread local variable for 'rho'

    // lambda 0: initialize macro variables
    [=] RAJA_DEVICE (int x, int y, int z, double &rho, double &jx, double &jy, double &jz, double &T) {
      // Allocate array with indices to pass to the function computing the macro variables
      int  indices[7];
      indices[0] = x;
      indices[1] = y;
      indices[2] = z;
      indices[3] = dims_[0];
      indices[4] = dims_[1];
      indices[5] = dims_[2];
      indices[6] = dims_[3];

      compute_macro_srt2(indices,f2_,g2_,lgclbool_[1],rho,T,jx,jy,jz);
      int offsetxyz = x + dims_[0] * (y + dims_[1] * z );
      //printf("%f, %f, %f, %f, %d ", rho, jx, jy, jz, offsetxyz);;

      //rho = 0.0;
      //jx = 0.0; jy = 0.0; jz = 0.0;
      //if (lgclbool_[1]) T = 0.0;
    },

    // lambda 1: compute macro variables by looping over all directions
    [=] RAJA_DEVICE (int dir, int x, int y, int z, double &rho, double &jx, double &jy, double &jz, double &T) {
      int offsetxyza = x + dims_[0] * (y + dims_[1] * (z + dims_[2] * dir) );
      //rho += f_[offsetxyza];
      //jx += vjx_[dir] * f_[offsetxyza];
      //jy += vjy_[dir] * f_[offsetxyza];
      //jz += vjz_[dir] * f_[offsetxyza];
      //if (lgclbool_[1]) T += g_[offsetxyza];

    },

    // lambda 2
    [=] RAJA_DEVICE (int x, int y, int z, double &rho, double &jx, double &jy, double &jz, double &T) {
      int offsetxyz = x + dims_[0] * (y + dims_[1] * z );
      //foo();
      //printf("x = %d ", x);
      //printf("y = %d ", y);
      //printf("z = %d ", z);
      //printf("rho_gpu = %f, %ld ", rho, offsetxyz);
      //printf("jxyz_gpu = %f, %f, %f, %ld ", jx, jy, jz, offsetxyz);
      //printf("jxyz_gpu = %f, %f, %f, %ld ", jx, jy, jz, offsetxyz);
      //if (lgclbool_[1]) printf("T = %f, %ld ", T, offsetxyz);
    },

    // lambda 3: compute equilibrium function for SRT scheme
    [=] RAJA_DEVICE (int dir, int x, int y, int z, double &rho, double &jx, double &jy, double &jz, double &T) {
      //printf("dir = %d ", dir);
      int offsetxyz = x + dims_[0] * (y + dims_[1] * z );
      // fluid node
      if (obs_[offsetxyz] == 0)
      {
        //printf("dir = %d %d %d %d ", dir,x,y,z);
        // Compute 'taup'
        const double taup = params_[2]; // tau0
        // Get components for each direction stored in 'v'
        //int x_comp = 0, y_comp = 1, z_comp = 2;
        int offsetxdir = dir * 3 + dim_comp_[0]; // x_comp;
        const double vxa = v_[offsetxdir];
        int offsetydir = dir * 3 + dim_comp_[1]; // y_comp;
        const double vya = v_[offsetydir];
        int offsetzdir = dir * 3 + dim_comp_[2]; // z_comp;
        const double vza = v_[offsetzdir];
        // Compute 'edotu'
        const double edotu = (vxa*jx + vya*jy + vza*jz)/rho;
        //printf(" edotu = %f", edotu);
        // Compute local equillibrium value for NS equation
        const double usq = (jx*jx+jy*jy+jz*jz)/(rho*rho);
        double feq = 0.0, fcl = 0.0;
        feq = 1.0 + 3.0 * edotu + 4.5 * edotu * edotu - 1.5 * usq;
        feq *= rho * w_[dir];
        // Compute collision
        int offsetfeq = x + dims_[0] * (y + dims_[1] * (z + dims_[2] * dir) );
        fcl = (1.0 - 1.0/taup) * f_[offsetfeq] + feq / taup;
        //printf(" feq_gpu = %f %d ", feq, offsetfeq);
        //printf("fcl_gpu = %f %d %f %f %f ", fcl, offsetfeq, f_[offsetfeq], feq, jz);
        // Update 'f_' value
        f_[offsetfeq] = fcl;
      }

      // Energy equation for fluid and solid nodes
      if (lgclbool_[1])
      {
        const double tau_g = params_[4];
        // Get components for each direction stored in 'v'
        //int x_comp = 0, y_comp = 1, z_comp = 2;
        int offsetxdir = dir * 3 + dim_comp_[0]; // x_comp;
        const double vxa = v_[offsetxdir];
        int offsetydir = dir * 3 + dim_comp_[1]; // y_comp;
        const double vya = v_[offsetydir];
        int offsetzdir = dir * 3 + dim_comp_[2]; // z_comp;
        const double vza = v_[offsetzdir];
        // Compute 'edotu'
        const double edotu = (vxa*jx + vya*jy + vza*jz)/rho;
        // Compute local equillibrium value for NS equation
        const double usq = (jx*jx+jy*jy+jz*jz)/(rho*rho);
        // Compute local euilibrium value for energy equation
        double geq = 1.0 + 3.0 * edotu + 4.5 * edotu * edotu - 1.5 * usq;
        geq *= T * w[dir];
        // Add 'geq' value to 'g_equilibrium' array
        int offsetgeq = x + rowscpp * (y + colscpp * (z + depthcpp * dir) );
        // Collision kernel 'gcl' (same for SRT and MRT)
        double gcl = (1.0 - 1.0 / tau_g) * g_[offsetgeq];
        gcl += geq / tau_g;
      }

    }

   );

When I read the RAJA documentation, the RAJA View objects seemed like a good option as they avoid the cudaMallocManaged call. That being said, since the function compute_macro_srt2 takes a double * as input, I would have to re-write a version of the function that is compatible with the RAJA Views. I want to avoid having the same function implemented twice, once for the host and once for the device, since we want the code to run on both CPU and GPU platforms.

As for the second issue, the main function is implemented in cpp_mrtcollision.C and the function compute_macro_srt is implemented in cpp_mrtutilities.C. The file cpp_mrtcollision.C contains #include "cpp_mrtutilities.h" at the top. Is there an example that shows how to define a function in a header and then call that same function in the file that implements the main body of the code?

artv3 commented 3 years ago

Hi @delcmo, the RAJA Views just hold a pointer; the host code is still responsible for allocating the memory. Views will work on both the host and the device.

The usage should look like this:

double *f_ptr;
cudaMallocManaged(&f_ptr, sizef*sizeof(double)); // allocate memory (cudaMallocManaged makes the data accessible on both the CPU and GPU)
RAJA::View<double, RAJA::Layout<1>> f_(f_ptr, sizef); // the View is then usable on both the CPU and GPU

I also see you are using RAJA::kernel; looking at that policy, I believe it is incomplete. The CUDA programming model uses hierarchical parallelism (blocks and threads), and your policy does not specify how the blocks should be mapped.
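
For reference, a rough (untested) sketch of a policy that also maps loop levels to CUDA blocks could look like the following; the exact nesting and Lambda/Segs/Params arguments would need to be adapted to your kernel:

using EXEC_POL_BLOCKED =
  RAJA::KernelPolicy<
    RAJA::statement::CudaKernel<
      RAJA::statement::For<3, RAJA::cuda_block_z_loop,      // map z to blocks in z
        RAJA::statement::For<2, RAJA::cuda_block_y_loop,    // map y to blocks in y
          RAJA::statement::For<1, RAJA::cuda_thread_x_loop, // map x to threads in x
            RAJA::statement::For<0, RAJA::seq_exec,         // directions stay sequential
              RAJA::statement::Lambda<0>
            >
          >
        >
      >
    >
  >;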

Given the structure of your kernel, have you looked at RAJA Teams? We have a basic example that compares writing C-style nested loops with what they could look like with Teams here: https://github.com/LLNL/RAJA/blob/develop/examples/tut_teams_basic.cpp and another example of general usage here: https://github.com/LLNL/RAJA/blob/develop/examples/raja-teams.cpp#L156-L177

RAJA Teams creates an execution space where you express kernels as nested for loops. It also supports choosing how to run kernels based on a run-time parameter (host or device).

If you could post the C++ version of your kernel, I can give an outline of what the Teams variant would look like.

Regarding the second issue, let me see if I can come up with an example.

delcmo commented 3 years ago

@artv3

I have not looked at the block mapping yet. I spent quite a bit of time learning nested blocks and making sure the code compiles and gives the correct solution. I will have to look at this later.

I looked at the matrix multiplication example to get started with the nested for loops and built from it.

The C++ code is quite long (I think) as you can see below:


  // Loop over z, y and x axis, and directions 'dir'
  for (int z = sz; z <= ez; z++)
  {
    for (int y = sy; y <= ey; y++)
    {
      for (int x = sx; x <= ex; x++)
      {
        // Shift indices
    int xcpp = x - (sx - 1);
    int ycpp = y - (sy - 1);
        int zcpp = z - (sz - 1);
        //cout << "xpp = " << xcpp << " ycpp = " << ycpp << " zcpp = " << zcpp << endl;
    // Stored all shifted indices in 'indices' array and number of directions
        int indices[7];
    indices[0] = xcpp;
    indices[1] = ycpp;
    indices[2] = zcpp;
    indices[3] = rowscpp;
    indices[4] = colscpp;
    indices[5] = depthcpp;
        indices[6] = ndir;
        // Compute offset value
        int offset = xcpp + rowscpp * (ycpp + colscpp * zcpp);
        // Compute local density and components of the 'j' flux
    compute_macro_srt(indices,f,g,is_energy,rho_xyz,T_xyz,
              jx_xyz,jy_xyz,jz_xyz);
        //cout << "x = " << x << " y = " << y << " z = " << z << endl;
        //cout << "x = " << x << " ";
        //cout << "y = " << y << " ";
        //cout << "z = " << z << " ";
        //cout << "rho_cpu = " << rho_xyz << " " << offset << " ";
        //cout << "jxyz_cpu = " << jx_xyz << " " << jy_xyz << " " << jz_xyz << " " << offset << " ";
    // Compute local source terms for NS and energy equations
    double force_xyz [] = {0.0, 0.0, 0.0};
    double energy_xyz = 0.0;
    compute_source_xyz(xcpp,ycpp,zcpp,time,is_energy,
               gforceval, E_source,
               rho_xyz,T_xyz,jx_xyz,jy_xyz,jz_xyz,
               force_xyz,energy_xyz);
        // Check if node is a fluid node
        if (obs[offset] == 0)
        {
          // Compute density 'rhop'
          const double rhop = is_lbgk ? rho_xyz : rho0;
      // Compute 'taup' to be used in collision kernel
      double sigma_xyz, taup;
      if (is_les)
      {
        // Compute old values
        double rho_xyz_old, T_xyz_old, jx_xyz_old, jy_xyz_old, jz_xyz_old;
        compute_macro_srt(indices,f_older,g,is_energy,rho_xyz_old,
                      T_xyz_old,jx_xyz_old,jy_xyz_old,jz_xyz_old);
        // Compute norm of the stress tensor
            compute_sigma_les(indices,v,w,f,f_older,rho0,RT,is_lbgk,time,
                  restart_file_time+1,
                      rho_xyz_old,jx_xyz_old,jy_xyz_old,jz_xyz_old,
                  tau[offset],sigma_xyz);
        // Compute relaxation time and store it in array 'tau'
        taup = tau0;
        taup += 0.5 * (sqrt(tau0*tau0 + 18.0*Csmag*Csmag*sigma_xyz) - tau0);
        tau[offset] = taup;
      }
      else
      {
        taup = tau0;
      }
          // Extract ux, uy and uz values
          const double ux = jx_xyz / rhop;
          const double uy = jy_xyz / rhop;
          const double uz = jz_xyz / rhop;
          // Compute magnitude of velocity vector
          const double usq = ux * ux + uy * uy + uz * uz;
      // Compute uxcl, uycl and uzcl values for collision kernel
          const double uxcl = jx_xyz / rho_xyz;
          const double uycl = jy_xyz / rho_xyz;
          const double uzcl = jz_xyz / rho_xyz;
      // Compute collision term for MRT
      if (!is_lbgk)
      {
        mrt_collision_ns(indices,taup,rho0,is_les,
                             rho_xyz,jx_xyz,jy_xyz,jz_xyz,
                             f);
      }
          // Loop over directions to compute equilibrium and collision kernels
          //cout << "x = " << x << " y = " << y << " z = " << z << endl;
          for (int a = 0; a < ndir; a++)
          {
            // Get components for each direction stored in 'v'
            int offsetxdir = a * 3 + x_comp;
            const double vxa = v[offsetxdir];
            int offsetydir = a * 3 + y_comp;
            const double vya = v[offsetydir];
            int offsetzdir = a * 3 + z_comp;
            const double vza = v[offsetzdir];
            // Compute 'edotu'
            double edotu = vxa*ux + vya*uy + vza*uz;
        //cout << " edotu = " << edotu;
            // Compute local equillibrium value for NS equation
            double feq = 0.0;
            if (is_lbgk)
        {
              feq = 1.0 + 3.0 * edotu + 4.5 * edotu * edotu - 1.5 * usq;
              feq *= rho_xyz;
        }
        else
        {
          edotu *= 1.0 / RT;
          feq = rho_xyz + rho0 * (edotu + 0.5 * edotu * edotu - 0.5 * usq / RT);
        }
        feq *= w[a];
        // Add 'feq' to 'f' array
            int offsetfeq = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
        //cout << "offsetfeq = " << offsetfeq << " ";
        //cout << "dir = " << a << " ";

        // Collision kernek 'fcl'
        double fcl = 0.0;
        if (is_lbgk)
        {
          fcl = (1.0 - 1.0/taup) * f[offsetfeq] + feq / taup;
          //cout << " fcl_cpu = " << fcl << " " << offsetfeq << " " << f[offsetfeq] << " " << feq  << " " << jz_xyz << " ";
          //cout << " fcl_cpu = " << fcl - f2[offsetfeq] << " ";
          f[offsetfeq] = fcl;
        }
        //cout << " feq_cpu = " << feq << " offsetfeq = " << offsetfeq << " ";
        //cout << " feq = " << feq << " fcl = " << fcl;
        // Add force terms to the NS equations
        fcl = force_xyz[0] * (vxa - uxcl);
        fcl += force_xyz[1] * (vya - uycl);
        fcl += force_xyz[2] * (vza - uzcl);
        fcl *= delta_t * feq / RT;
        f[offsetfeq] += fcl;
          } // end for loop over direction
        //cout << endl;
        //cout << endl;
        } // end if on fluid node

    // Energy equations
        if (is_energy)
        {
      // Compute velocity values to be used in energy equation
          const double ux = jx_xyz / rho_xyz;
          const double uy = jy_xyz / rho_xyz;
          const double uz = jz_xyz / rho_xyz;
          // Compute magnitude of velocity vector
          const double usq = ux * ux + uy * uy + uz * uz;
          // Loop over directions to compute f_equilibrium
          for (int a = 0; a < ndir; a++)
          {
            // Get components for each direction stored in 'v'
            int offsetxdir = a * 3 + x_comp;
            const double vxa = v[offsetxdir];
            int offsetydir = a * 3 + y_comp;
            const double vya = v[offsetydir];
            int offsetzdir = a * 3 + z_comp;
            const double vza = v[offsetzdir];
            // Compute 'edotu'
            double edotu = vxa*ux + vya*uy + vza*uz;
        // Compute local euilibrium value for energy equation
        double geq = 1.0 + 3.0 * edotu + 4.5 * edotu * edotu - 1.5 * usq;
        geq *= T_xyz * w[a];
        // Add 'geq' value to 'g_equilibrium' array
            int offsetgeq = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );

            // Collision kernel 'gcl' (same for SRT and MRT)
            double gcl = (1.0 - 1.0 / tau_g) * g[offsetgeq];
        gcl += geq / tau_g;
        // Add source term to energy equations
        gcl += delta_t * w[a] * energy_xyz;
            g[offsetgeq] = gcl;
      } // end loop over directions
    }

      } // end for loop over x
    } // end for loop over y
  } // end for loop over z

I will look at your examples and try to understand the logic of the RAJA Teams. Thanks again for your help.

artv3 commented 3 years ago

Hi @delcmo, I took a look at your code, and one possible implementation with RAJA Teams which enables running sequentially or with CUDA could look like this:

This first part could go in a header, away from source code since it can be reused in other kernels.

/*                                                                                                                                                                                                                                        
 * Define host/device launch policies                                                                                                                                                                                                     
 */                                                                                                                                                                                                                                       
using launch_policy = RAJA::expt::LaunchPolicy<                                                                                                                                                                                           
    RAJA::expt::seq_launch_t                                                                                                                                                                                                              
#if defined(RAJA_ENABLE_CUDA)                                                                                                                                                                                                             
    ,                                                                                                                                                                                                                                     
    RAJA::expt::cuda_launch_t<true>                                                                                                                                                                                                       
#endif                                                                                                                                                                                                                                    
    >;    

/*                                                                                                                      
  //Define global thread policies, regular for loops on the CPU, global thread ID on GPU                                                                                     
*/                                                                                                                      
using global_thread_x = RAJA::expt::LoopPolicy<RAJA::loop_exec                                                          
#if defined(RAJA_DEVICE_ACTIVE)                                                                                         
                                       ,                                                                                
                                       RAJA::expt::cuda_global_thread_x                                                 
#endif                                                                                                                  
                                       >;                                                                               

using global_thread_y = RAJA::expt::LoopPolicy<RAJA::loop_exec                                                          
#if defined(RAJA_DEVICE_ACTIVE)                                                                                         
                                       ,                                                                                
                                       RAJA::expt::cuda_global_thread_y                                                 
#endif                                                                                                                  
                                       >;                                                                               

using global_thread_z = RAJA::expt::LoopPolicy<RAJA::loop_exec                                                          
#if defined(RAJA_DEVICE_ACTIVE)                                                                                         
                                       ,                                                                                
                                       RAJA::expt::cuda_global_thread_z                                                 
#endif                                                                                                                  
                                       >;               

And the kernel is below. Here I chose to use global thread policies, which flatten the thread/team model to create unique thread IDs based on the compute grid rather than the standard hierarchical structure. On the GPU each loop iteration will be assigned to a thread, while on the CPU the loops fall back to regular for loops.

  RAJA::RangeSegment x_range(sx, ex+1);                                                                                 
  RAJA::RangeSegment y_range(sy, ey+1);                                                                                 
  RAJA::RangeSegment z_range(sz, ez+1);                                                                                 

  const int NThreads = 8;   //Each team will have 8 threads per dim        
 //Calculate number of teams needed for iteration space  in each dimension                                                                                                             
  const int NTeams_x = RAJA_DIVIDE_CEILING_INT(x_range.size(),NThreads);   
  const int NTeams_y = RAJA_DIVIDE_CEILING_INT(y_range.size(),NThreads);                                                
  const int NTeams_z = RAJA_DIVIDE_CEILING_INT(z_range.size(),NThreads);   

  RAJA::expt::launch<launch_policy>(RAJA::expt::HOST, //Choose RAJA::expt::DEVICE or HOST                               
    RAJA::expt::Resources(RAJA::expt::Teams(NTeams_x,NTeams_y,NTeams_z),                                                
                          RAJA::expt::Threads(NThreads,NThreads,NThreads)),                                             
       [=] RAJA_HOST_DEVICE(RAJA::expt::LaunchContext ctx) {                                                            

   RAJA::expt::loop<global_thread_z>(ctx, z_range, [&] (int z) {                                                        
     RAJA::expt::loop<global_thread_y>(ctx, y_range, [&] (int y) {                                                      
       RAJA::expt::loop<global_thread_x>(ctx, x_range, [&] (int x) {                                                    

           // Shift indices                                                                                             
           int xcpp = x - (sx - 1);                                                                                     
           int ycpp = y - (sy - 1);                                                                                     
           int zcpp = z - (sz - 1);                                                                                     
           //cout << "xpp = " << xcpp << " ycpp = " << ycpp << " zcpp = " << zcpp << endl;                              
           // Stored all shifted indices in 'indices' array and number of directions                                    
           int indices[7];                                                                                              
           indices[0] = xcpp;                                                                                           
           indices[1] = ycpp;                                                                                           
           indices[2] = zcpp;                                                                                           
           indices[3] = rowscpp;                                                                                        
           indices[4] = colscpp;                                                                                        
           indices[5] = depthcpp;                                                                                       
           indices[6] = ndir;                                                                                           
           // Compute offset value                                                                                      
           int offset = xcpp + rowscpp * (ycpp + colscpp * zcpp);                                                       
           // Compute local density and components of the 'j' flux                                                      
           compute_macro_srt(indices,f,g,is_energy,rho_xyz,T_xyz,                                                       
                             jx_xyz,jy_xyz,jz_xyz);          
           //cout << "x = " << x << " y = " << y << " z = " << z << endl;                                               
           //cout << "x = " << x << " ";                                                                                
           //cout << "y = " << y << " ";                                                                                
           //cout << "z = " << z << " ";                                                                                
           //cout << "rho_cpu = " << rho_xyz << " " << offset << " ";                                                   
           //cout << "jxyz_cpu = " << jx_xyz << " " << jy_xyz << " " << jz_xyz << " " << offset << " ";                 
           // Compute local source terms for NS and energy equations                                                    
           double force_xyz [] = {0.0, 0.0, 0.0};                                                                       
           double energy_xyz = 0.0;                                                                                     
           compute_source_xyz(xcpp,ycpp,zcpp,time,is_energy,                                                            
                              gforceval, E_source,                                                                      
                              rho_xyz,T_xyz,jx_xyz,jy_xyz,jz_xyz,                                                       
                              force_xyz,energy_xyz);                                                                    
           // Check if node is a fluid node                                                                             
           if (obs[offset] == 0)                                                                                        
             {                                                                                                          
               // Compute density 'rhop'                                                                                
               const double rhop = is_lbgk ? rho_xyz : rho0;                                                            
               // Compute 'taup' to be used in collision kernel                                                         
               double sigma_xyz, taup;                                                                                  
               if (is_les)                                                                                              
                 {                                                                                                      
                   // Compute old values                                                                                
                   double rho_xyz_old, T_xyz_old, jx_xyz_old, jy_xyz_old, jz_xyz_old;                                   
                   compute_macro_srt(indices,f_older,g,is_energy,rho_xyz_old,                                           
                                     T_xyz_old,jx_xyz_old,jy_xyz_old,jz_xyz_old);                                       
                   // Compute norm of the stress tensor                                                                 
                   compute_sigma_les(indices,v,w,f,f_older,rho0,RT,is_lbgk,time,                                        
                                     restart_file_time+1,                                                               
                                     rho_xyz_old,jx_xyz_old,jy_xyz_old,jz_xyz_old,                                      
                                     tau[offset],sigma_xyz);                                                            
                   // Compute relaxation time and store it in array 'tau'                                               
                   taup = tau0;                                                                                         
                   taup += 0.5 * (sqrt(tau0*tau0 + 18.0*Csmag*Csmag*sigma_xyz) - tau0);                                 
                   tau[offset] = taup;                                                                                  
                 }   
               else                                                                                                     
                 {                                                                                                      
                   taup = tau0;                                                                                         
                 }  
             // Extract ux, uy and uz values                                                                          
               const double ux = jx_xyz / rhop;                                                                         
               const double uy = jy_xyz / rhop;                                                                         
               const double uz = jz_xyz / rhop;                                                                         
               // Compute magnitude of velocity vector                                                                  
               const double usq = ux * ux + uy * uy + uz * uz;                                                          
               // Compute uxcl, uycl and uzcl values for collision kernel                                               
               const double uxcl = jx_xyz / rho_xyz;                                                                    
               const double uycl = jy_xyz / rho_xyz;                                                                    
               const double uzcl = jz_xyz / rho_xyz;                                                                    
               // Compute collision term for MRT                                                                        
               if (!is_lbgk)                                                                                            
                 {                                                                                                      
                   mrt_collision_ns(indices,taup,rho0,is_les,                                                           
                                    rho_xyz,jx_xyz,jy_xyz,jz_xyz,                                                       
                                    f);                                                                                 
                 }                                                                                                      
               // Loop over directions to compute equilibrium and collision kernels                                     
               //cout << "x = " << x << " y = " << y << " z = " << z << endl;                                           
               for (int a = 0; a < ndir; a++)                                                                           
                 {                                                                                                      
                   // Get components for each direction stored in 'v'                                                   
                   int offsetxdir = a * 3 + x_comp;                                                                     
                   const double vxa = v[offsetxdir];                                                                    
                   int offsetydir = a * 3 + y_comp;                                                                     
                   const double vya = v[offsetydir];                                                                    
                   int offsetzdir = a * 3 + z_comp;                                                                     
                   const double vza = v[offsetzdir];                                                                    
                   // Compute 'edotu'                                                                                   
                   double edotu = vxa*ux + vya*uy + vza*uz;                                                             
                   //cout << " edotu = " << edotu;                                                                      
                   // Compute local equillibrium value for NS equation                                                  
                   double feq = 0.0;                                                                                    
                   if (is_lbgk)                                                                                         
                     {                                                                                                  
                       feq = 1.0 + 3.0 * edotu + 4.5 * edotu * edotu - 1.5 * usq;                                       
                       feq *= rho_xyz;                                                                                  
                     }                                                                                                  
                   else                                                                                                 
                     {                                                                                                  
                       edotu *= 1.0 / RT;                                                                               
                       feq = rho_xyz + rho0 * (edotu + 0.5 * edotu * edotu - 0.5 * usq / RT);                           
                     }                                                                                                  
                   feq *= w[a];                                                                                         
                   // Add 'feq' to 'f' array                                                                            
                   int offsetfeq = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );                          
                   //cout << "offsetfeq = " << offsetfeq << " ";                                                        
                   //cout << "dir = " << a << " ";                          
                   // Collision kernek 'fcl'                                                                            
                   double fcl = 0.0;                                                                                    
                   if (is_lbgk)                                                                                         
                     {                                                                                                  
                       fcl = (1.0 - 1.0/taup) * f[offsetfeq] + feq / taup;                                              
                       //cout << " fcl_cpu = " << fcl << " " << offsetfeq << " " << f[offsetfeq] << " " << feq  << " " \
<< jz_xyz << " ";                                                                                                       
                       //cout << " fcl_cpu = " << fcl - f2[offsetfeq] << " ";                                           
                       f[offsetfeq] = fcl;                                                                              
                     }                                                                                                  
                   //cout << " feq_cpu = " << feq << " offsetfeq = " << offsetfeq << " ";                               
                   //cout << " feq = " << feq << " fcl = " << fcl;                                                      
                   // Add force terms to the NS equations                                                               
                   fcl = force_xyz[0] * (vxa - uxcl);                                                                   
                   fcl += force_xyz[1] * (vya - uycl);                                                                  
                   fcl += force_xyz[2] * (vza - uzcl);                                                                  
                   fcl *= delta_t * feq / RT;                                                                           
                   f[offsetfeq] += fcl;                                                                                 
                 } // end for loop over direction                                                                       
               //cout << endl;                                                                                          
               //cout << endl;                                                                                          
             } // end if on fluid node      
           // Energy equations                                                                                          
           if (is_energy)                                                                                               
             {                                                                                                          
               // Compute velocity values to be used in energy equation                                                 
               const double ux = jx_xyz / rho_xyz;                                                                      
               const double uy = jy_xyz / rho_xyz;                                                                      
               const double uz = jz_xyz / rho_xyz;                                                                      
               // Compute magnitude of velocity vector                                                                  
               const double usq = ux * ux + uy * uy + uz * uz;                                                          
               // Loop over directions to compute f_equilibrium                                                         
               for (int a = 0; a < ndir; a++)                                                                           
                 {                                                                                                      
                   // Get components for each direction stored in 'v'                                                   
                   int offsetxdir = a * 3 + x_comp;                                                                     
                   const double vxa = v[offsetxdir];                                                                    
                   int offsetydir = a * 3 + y_comp;                                                                     
                   const double vya = v[offsetydir];                                                                    
                   int offsetzdir = a * 3 + z_comp;                                                                     
                   const double vza = v[offsetzdir];                                                                    
                   // Compute 'edotu'                                                                                   
                   double edotu = vxa*ux + vya*uy + vza*uz;                                                             
                   // Compute local euilibrium value for energy equation                                                
                   double geq = 1.0 + 3.0 * edotu + 4.5 * edotu * edotu - 1.5 * usq;                                    
                   geq *= T_xyz * w[a];                                                                                 
                   // Add 'geq' value to 'g_equilibrium' array                                                          
                   int offsetgeq = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );                          

                   // Collision kernel 'gcl' (same for SRT and MRT)                                                     
                   double gcl = (1.0 - 1.0 / tau_g) * g[offsetgeq];                                                     
                   gcl += geq / tau_g;                                                                                  
                   // Add source term to energy equations                                                               
                   gcl += delta_t * w[a] * energy_xyz;                                                                  
                   g[offsetgeq] = gcl;                                                                                  
                 } // end loop over directions                                                                          
             }      
       });                                                                                                              
     });                                                                                                                
   });                                                                                                                  

  });       
artv3 commented 3 years ago

One additional detail to note is that I am using macros to guard the CUDA code; this is needed in case we build on a CPU-only platform where the CUDA function calls do not exist.

Additionally, there is a memory manager under examples that will allocate with unified memory for a CUDA build, or with new when a CPU-only build is detected; this enables portability when building for specific platforms.
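
The idea is roughly the following (a sketch only, not the actual memory manager shipped in the RAJA examples):

#include <cstddef>

template <typename T>
T* allocate(std::size_t n)
{
#if defined(RAJA_ENABLE_CUDA)
  T *ptr;
  cudaMallocManaged(&ptr, n * sizeof(T));  // unified memory: accessible on host and device
  return ptr;
#else
  return new T[n];                         // plain heap allocation for CPU-only builds
#endif
}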

delcmo commented 3 years ago

@artv3,

thanks for providing the code. I was looking at it and noticed that some of the options, like cuda_launch_t or cuda_global_thread_x, are not in the documentation. Are these options specific to Teams? Is there documentation for Teams? Is there a Doxygen?

Thanks,

artv3 commented 3 years ago

Hi @delcmo, yes, those are specific to Teams. There is partial documentation in the develop branch on Read the Docs:

https://raja.readthedocs.io/en/develop/sphinx/user_guide/feature/loop_basic.html#team-based-loops-raja-launch

I think the released docs currently point to our last released version. cuda_global_thread_x is also a relatively new feature and not yet documented; I'll make a note to add it to the documentation.

delcmo commented 3 years ago

@artv3

In the post where you converted the code from C++ to RAJA, you left the loop over the directions unchanged.

for (int a = 0; a < ndir; a++) 

Was this by design? How would RAJA handle that loop in the current implementation?

artv3 commented 3 years ago

@artv3

in the post where you converted the code from C++ to RAJA, you left the loop over the direction unchanged.

for (int a = 0; a < ndir; a++) 

Was this by design? How would RAJA handle that loop in the current implementation?

That was a choice I made. Given that the parallelism comes from the top three loops, those inner loops would always be executed sequentially. Converting it into a RAJA loop wouldn't add more capability.

If we chose to convert it to a RAJA loop it could look something like this:

///Execute for loop on either host or device. (Add this policy header) 
  using seq_loop =  RAJA::expt::LoopPolicy<RAJA::loop_exec, RAJA::loop_exec>;     

  RAJA::expt::loop<seq_loop>(ctx, RAJA::RangeSegment(0, ndir), [&] (int a) {  
        . . . 
   }); 

Additionally, since the loop methods capture by reference, we are also able to modify/access anything in the loop hierarchy; this is a different design approach from RAJA::kernel, which required passing parameters into the lambda arguments.
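A hedged illustration of that capture-by-reference point, reusing the seq_loop policy above (the global_thread_x alias, x_range, ctx, and ndir are assumed from the earlier snippets):

  RAJA::expt::loop<global_thread_x>(ctx, x_range, [&] (int x) {
    double accum = 0.0;  // local declared in the outer loop lambda
    RAJA::expt::loop<seq_loop>(ctx, RAJA::RangeSegment(0, ndir), [&] (int a) {
      accum += 1.0;  // [&] capture lets the inner loop modify the outer local directly
    });
    // accum is usable here without having been passed as a lambda argument
  });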

delcmo commented 3 years ago

Hi @artv3,

if I understand you correctly, I can keep the C++ for loop syntax as it is. Does that mean each GPU thread will perform a for loop over the directions?

artv3 commented 3 years ago

Hi @artv3,

if I understand you correctly, I can keep the C++ for loop syntax as it is. Does that mean each GPU thread will perform a for loop over the directions?

Yes! Standard C++ loops will be executed sequentially by each thread.

delcmo commented 3 years ago

@artv3,

I was able to modify my code and add the RAJA syntax to get it to work on GPUs. I still have some testing to do but I am on the right track. In the mean time I have started to look at a different part of the code that is a little bit more complex:

  int a, offset1, offset2;  // direction index and linear offsets (declaration added so the snippet is self-contained)
  for (int z = sz; z <= ez; z++)
  {
    int zcpp = z - (sz - 1);
    for (int y = sy; y <= ey; y++)
    {
      for (int x = sx; x <= ex; x++)
      {
        // Shift indices
        int xcpp = x - (sx - 1);
        int ycpp = y - (sy - 1);

        a = 2 - 1;
        offset1 = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
        offset2 = xcpp + 1  + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
        f[offset1] = f[offset2];
        if (is_energy) g[offset1] = g[offset2];

        a = 4 - 1;
        offset1 = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
        offset2 = xcpp  + rowscpp * (ycpp + 1 + colscpp * (zcpp + depthcpp * a) );
        f[offset1] = f[offset2];
        if (is_energy) g[offset1] = g[offset2];

        a = 6 - 1;
        offset1 = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
        offset2 = xcpp  + rowscpp * (ycpp + colscpp * (zcpp + 1 + depthcpp * a) );
        f[offset1] = f[offset2];
        if (is_energy) g[offset1] = g[offset2];

        a = 10 - 1;
        offset1 = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
        offset2 = xcpp + 1  + rowscpp * (ycpp + 1  + colscpp * (zcpp + depthcpp * a) );
        f[offset1] = f[offset2];
        if (is_energy) g[offset1] = g[offset2];

        a = 14 - 1;
        offset1 = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
        offset2 = xcpp + 1  + rowscpp * (ycpp + colscpp * (zcpp + 1 + depthcpp * a) );
        f[offset1] = f[offset2];
        if (is_energy) g[offset1] = g[offset2];

        a = 18 - 1;
        offset1 = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
        offset2 = xcpp  + rowscpp * (ycpp + 1 + colscpp * (zcpp + 1 + depthcpp * a) );
        f[offset1] = f[offset2];
        if (is_energy) g[offset1] = g[offset2];

      }
    }
  }

The integer a denotes a direction. For each direction a, the solution is streamed from index offset2 to offset1. When running the nested for loops on the x, y and z coordinates, it is important to preserve the order in which the entries of the arrays f and g are updated: entry offset2 cannot be modified before entry offset1. Is there a way in RAJA to maintain that order when running the nested for loops on different threads?

The other question I have is related to the different directions a: the work being done on each direction is independent of each other (a = 18 -1 could be performed at the same time as a = 14 -1 on two different threads). Is there a way in RAJA to optimize the above syntax to take advantage of the thread architecture?

Thanks, Marco

artv3 commented 3 years ago

Hi @delcmo, if I understand correctly -- and we can iterate on this.

Regarding the first part, I think you can use the same global policies for the outer for loops (z, y, x); this will create unique threads, and the loop body will be executed sequentially by each thread.

For the second question, if you express the loop body as a for loop over 6 values where you compute the value of a, offset1, and offset2, then we can use the RAJA loop methods to assign that work to threads in the x-direction (which will get executed in parallel); the x loop could be assigned to threads in the y-direction, and the remaining 2 loops {y, z} could be assigned to blocks in the x and y directions -- basically hierarchical parallelism. The key, though, is being able to express the work as for loops. A sketch of this mapping is shown below.
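A hedged sketch of that mapping; block_x/block_y and thread_x/thread_y are hypothetical LoopPolicy aliases you would define from RAJA's CUDA block/thread policies, falling back to loop_exec on the host:

   // Inside the RAJA::expt::launch lambda:
   RAJA::expt::loop<block_y>(ctx, z_range, [&] (int z) {          // z -> blocks in y
     RAJA::expt::loop<block_x>(ctx, y_range, [&] (int y) {        // y -> blocks in x
       RAJA::expt::loop<thread_y>(ctx, x_range, [&] (int x) {     // x -> threads in y
         RAJA::expt::loop<thread_x>(ctx, RAJA::RangeSegment(0, 6), [&] (int i) {  // directions -> threads in x
           // compute a, offset1, offset2 for direction i, then stream f/g here
         });
       });
     });
   });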

delcmo commented 3 years ago

I was able to re-write that piece of code using a fourth nested loop as follows:

  int a, offset1, offset2;
  for (int z = sz; z <= ez; z++)
  {
    // Define local arrays to simplify implementation
    int nb_a = 6;
    int aa [] = {2-1, 4-1, 6-1, 10-1, 14-1, 18-1};
    int xcppa [] = {1, 0, 0, 1, 1, 0};
    int ycppa [] = {0, 1, 0, 1, 0, 1};
    int zcppa [] = {0, 0, 1, 0, 1, 1};
    // Shift indices
    int zcpp = z - (sz - 1);
    for (int y = sy; y <= ey; y++)
    {
      // Shift indices
      int ycpp = y - (sy - 1);
      for (int x = sx; x <= ex; x++)
      {

        // Shift indices
        int xcpp = x - (sx - 1);
        // Loop over directions stored in 'aa'
        for (int i = 0; i < nb_a; i++)
        {
          const int a = aa[i];
          offset1 = xcpp + rowscpp * (ycpp + colscpp * (zcpp + depthcpp * a) );
          offset2 = xcpp + xcppa[i]  + rowscpp * (ycpp + ycppa[i] + colscpp * (zcpp + zcppa[i] + depthcpp * a) );
          f[offset1] = f[offset2];
          if (is_energy) g[offset1] = g[offset2];
        }
      }
    }
}

The above implementation should be easier to convert using RAJA syntax. Would that be appropriate?

I have a question about one of the previous iterations of the piece of code

  const int NThreads = 8;   //Each team will have 8 threads per dim        
 //Calculate number of teams needed for iteration space  in each dimension                                                                                                             
  const int NTeams_x = RAJA_DIVIDE_CEILING_INT(x_range.size(),NThreads);   
  const int NTeams_y = RAJA_DIVIDE_CEILING_INT(y_range.size(),NThreads);                                                
  const int NTeams_z = RAJA_DIVIDE_CEILING_INT(z_range.size(),NThreads);   

  RAJA::expt::launch<launch_policy>(RAJA::expt::HOST, //Choose RAJA::expt::DEVICE or HOST                               
    RAJA::expt::Resources(RAJA::expt::Teams(NTeams_x,NTeams_y,NTeams_z),                                                
                          RAJA::expt::Threads(NThreads,NThreads,NThreads)),                                             
       [=] RAJA_HOST_DEVICE(RAJA::expt::LaunchContext ctx) {  

Under the current implementation, the code would use 8 threads. How does that relate to the number of GPUs on each node? I run on Summit and there are 6 GPUs per node. I tried to run on the GPUs after changing RAJA::expt::launch<launch_policy>(RAJA::expt::HOST to RAJA::expt::launch<launch_policy>(RAJA::expt::DEVICE and I get an error message:

terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDAassert

Program received signal SIGABRT: Process abort signal
artv3 commented 3 years ago

Hi @delcmo , yes the proposed code would work.

In regards to the error... I am actually not sure. Could you try running it through cuda-gdb and share a stack trace? I'm wondering if data is not on the GPU or the number of teams is too large. Could you share the sizes of NTeams_?

Additionally, RAJA only targets one GPU; it is by combining MPI + RAJA that we can target multiple GPUs.

Also, that configuration of threads corresponds to 8^3 = 512 threads per team (a CUDA thread block, which we call a team). The number of blocks is then calculated from the values of NTeams_. Here we are computing on a 3D compute grid.
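As a worked example, using the 20-element ranges mentioned later in this thread (a hedged illustration):

  const int N = 20, NThreads = 8;
  const int NTeams = RAJA_DIVIDE_CEILING_INT(N, NThreads);  // = ceil(20/8) = 3 teams per dimension
  // => Teams(3,3,3) with Threads(8,8,8): 512 threads per team, 27 teams total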

delcmo commented 3 years ago

The code works with NTeams = 8 but gives me the error message when I set NTeams = 10 or higher.

Backtrace for this error:
CUDAassert: invalid configuration argument
/gpfs/alpine/proj-shared/cfd136/raja/build-210412c/include/RAJA/policy/cuda/MemUtils_CUDA.hpp
210
terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDAassert

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
ERROR:  One or more process (first noticed rank 6) terminated with signal 6
(core dumped)


artv3 commented 3 years ago

To double check, are you choosing values for NThreads or NTeams_{x,y,z}? The max number of threads per CUDA block (RAJA team) is 1024, so NThreads = 10 should work... I'll see if I can reproduce locally.

delcmo commented 3 years ago

I set the number of threads as it is in the original code you provided in the previous email. The number of teams is calculated from

const int NTeams_x = RAJA_DIVIDE_CEILING_INT(x_range_teams.size(),NThreads);
const int NTeams_y = RAJA_DIVIDE_CEILING_INT(y_range_teams.size(),NThreads);
const int NTeams_z = RAJA_DIVIDE_CEILING_INT(z_range_teams.size(),NThreads);

where x_range_teams is 20 (the number of mesh elements in that direction); y_range_teams and z_range_teams are also set to 20.


delcmo commented 3 years ago

I ran some more tests and the piece of code runs fine with any value of NThreads <= 8. When the number of teams falls below 4 (NThreads >= 9), I get a CUDA error message. The number of teams is computed from

  const int NTeams_x = RAJA_DIVIDE_CEILING_INT(x_range_teams.size(),NThreads);
  const int NTeams_y = RAJA_DIVIDE_CEILING_INT(y_range_teams.size(),NThreads);
  const int NTeams_z = RAJA_DIVIDE_CEILING_INT(z_range_teams.size(),NThreads);
artv3 commented 3 years ago

Does it work if you try launching an empty kernel with the values used directly?

    RAJA::expt::launch<launch_policy>                                                                                                      
      (RAJA::expt::DEVICE,                                                                                                                 
       RAJA::expt::Resources(RAJA::expt::Teams(10, 10, 10), RAJA::expt::Threads(8, 8, 8)),                                                 
       [=] RAJA_HOST_DEVICE(RAJA::expt::LaunchContext ctx) {                                                                               

        printf("running cuda kernel \n");                                                                                                  
       });  // outer lambda     

The error hints that the compute grid is not being configured correctly, but I'm not sure what is going on.

delcmo commented 3 years ago

I added the above lines to the code and ran it to get the following:

running cuda kernel
...
running cuda kernel
ERROR:  One or more process (first noticed rank 3) terminated with signal 6
(core dumped)

It outputs running cuda kernel for a while and then stops with the above error message.


artv3 commented 3 years ago

Can you try running the raja-teams example under examples?

delcmo commented 3 years ago

I will have to try the example, but I ran into another problem lately that I can reproduce with the piece of code below:

  RAJA::expt::launch<launch_policy>(RAJA::expt::DEVICE,
    RAJA::expt::Resources(RAJA::expt::Teams(NTeams_x,NTeams_y,NTeams_z),
                          RAJA::expt::Threads(NThreads_x,NThreads_y,NThreads_z)),
       [=] RAJA_HOST_DEVICE(RAJA::expt::LaunchContext ctx) {

   RAJA::expt::loop<global_thread_z>(ctx, z_range_teams, [&] (int z) {
     RAJA::expt::loop<global_thread_y>(ctx, y_range_teams, [&] (int y) {
       RAJA::expt::loop<global_thread_x>(ctx, x_range_teams, [&] (int x) {
         // Shift indices
         int xcpp_ = x - (sev_(0) - 1); // sx = sev_(0)
         int ycpp_ = y - (sev_(1) - 1); // sy = sev_(1)
         int zcpp_ = z - (sev_(2) - 1); // sz = sev_(2)
         int offset_ = xcpp_ + dims_(0) * (ycpp_ + dims_(1) * zcpp_ );

         // Compute macro variables
         int rowscpp_ = dims_(0);
         int colscpp_ = dims_(1);
         int depthcpp_ = dims_(2);
         int ndir_ = sev_(3);

         // Fluid nodes
         if (obs_(offset_) == 0)
         {
           //printf("cuda test outside loop %d ", ndir_);
           // Loop over directions to compute equilibrium and collision kernels
           //printf("Before for loop");
           for (int a = 1; a < ndir_; a++)
           {
             printf("cuda test inside loop: %d \n", a);
           }
         }

       });
     });
   });

   });

The output I get is the following:

cuda test inside loop: 1

which I interpret as the for loop not being correctly evaluated. I am not sure I understand why all the integer values are not printed to the screen, since I have four nested loops and the fourth loop has to be performed sequentially.

When I look at the RAJA Teams examples, the C++ for loop syntax is not used but replaced with loop_icount. Should I be using loop_icount instead of the for() syntax for the fourth nested loop?

artv3 commented 3 years ago

Hi @delcmo, no, the loop_icount method is only needed when performing tiling. Where is ndir_ defined? If it is a member or global value, it could be an issue when it comes to lambda capturing. Something else to try is adding cudaDeviceSynchronize(); after your kernel.
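A minimal sketch of the member-capture pitfall mentioned above (the Solver struct is hypothetical): copying the member into a local before the kernel lets the device lambda capture the value instead of dereferencing this on the device.

struct Solver
{
  int ndir_;  // class member; using it directly in a device lambda captures `this`

  void run()
  {
    const int ndir = ndir_;  // local copy; safe to capture by value in the device lambda
    RAJA::forall<RAJA::cuda_exec<256>>(RAJA::RangeSegment(0, 10),
      [=] RAJA_DEVICE (int idx) {
        for (int a = 1; a < ndir; a++) {
          printf("direction %d for iterate %d \n", a, idx);
        }
      });
  }
};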

delcmo commented 3 years ago

Ok thanks for the clarification. I added cudaDeviceSynchronize(); and it seems to output the correct information. ndir_ is defined but I removed that line in the post.

I am still struggling with the error message related to

CUDAassert: too many resources requested for launch
/gpfs/alpine/proj-shared/cfd136/raja/build-210412c/include/RAJA/policy/cuda/MemUtils_CUDA.hpp
210
terminate called after throwing an instance of 'std::runtime_error'
  what():  CUDAassert

For some reason, it is triggered this time by adding a printf() statement to the kernel. Could it be related to saturated memory?

Also, when using the RAJA::View class, are there any specific rules to be aware of? If defining RAJA::View<double, RAJA::Layout<1>> f_(f, sizef); can I read and write the values of f_ from a kernel without limitation? Should I restrict the use of RAJA::View to a minimum in my C++ code? I am seeing weird behavior that I cannot explain when using objects like f_. If I try to read f_ and update it from a kernel I also get the above error message.


artv3 commented 3 years ago

Hi @delcmo, the only rule I can think of is that the data in the view has to be in the same memory space from which you are trying to access it. If you use unified memory, data transfers should be taken care of for you.

Regarding your first error, I just thought of something. Can you try reducing the number of threads per team, maybe to 6? I came across this article: https://stackoverflow.com/questions/26201172/cuda-too-many-resources-requested-for-launch

and it could be that your kernel uses too much register space. One way to check this would be to comment out the code inside the ::launch method; if it runs, it is probably maxing out register space.

delcmo commented 3 years ago

I reduced the number of threads and you are correct, it runs. I was previously using the maximum number of threads (NThreads_x = NThreads_y = 8 and NThreads_z = 16). My lack of experience with CUDA programming cost me a few days of work ... ;)

That being said, I now wonder what would be the correct way to set the number of threads?


artv3 commented 3 years ago

Choosing the number of threads is a bit of an art; it's hard to say. It depends on your kernel and how much local memory it needs. It is definitely something to play around with. Happy it is now working, though.

One suggestion I have, though, is to change your CUDA launch policy from RAJA::expt::cuda_launch_t<true> to RAJA::expt::cuda_launch_t<false>. This will perform synchronizations after CUDA kernels.
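For example, a launch policy along these lines (a hedged sketch; exact alias names may differ in your build) selects sequential execution on the host and a synchronous CUDA launch on the device:

using launch_policy = RAJA::expt::LaunchPolicy<RAJA::expt::seq_launch_t,
                                               RAJA::expt::cuda_launch_t<false>>;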

Additionally, the latest RAJA now has CUDA kernels running on a non-default CUDA stream, and if you allocate memory using cudaMallocManaged I think it might be the case that you can access the data on the host before the kernel is done with it. Ping @trws: is this right? Could we have race conditions if RAJA is on one stream but we don't specify which stream to use when using unified memory?

artv3 commented 3 years ago

One additional option in RAJA is to configure CAMP to use the CUDA default stream. That is done by defining the following variable at compile time: CAMP_USE_PLATFORM_DEFAULT_STREAM.

delcmo commented 3 years ago

I stopped using cudaMallocManaged and exclusively rely on the capabilities of the RAJA::View class. I will make the changes to the code based on your recommendations.

What does default stream mean?

How many teams can I use in my code? Am I limited to three, i.e. three nested loops, or can I declare a fourth team to have four nested loops?

artv3 commented 3 years ago

I stopped using cudaMallocManaged and exclusively rely on the capabilities of the RAJA::View class. I will make the changes to the code based on your recommendations.

What does default stream mean?

How many teams can I use in my code? Am I limited to three, i.e. three nested loop, or can I declare a fourth team to have four nested loops?

A CUDA stream is the queue in which instructions are stored. CUDA has multiple queues so kernels can be executed asynchronously. With unified memory (cudaMallocManaged) we have to make sure we sync before we touch the data on the CPU.

Teams/Threads support up to 3 dimensions (like the CUDA programming model).

Regarding RAJA::View, it only holds a pointer, so we are still responsible for moving the data to and from the device. Which machine are you running on? If you try to compute on the GPU using data from the CPU memory space, it would normally segfault.
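A hedged sketch of what that means in practice: since the view just wraps a pointer, with explicit device allocations the host/device copies are still on us (f and sizef reuse the names from the earlier question).

double* d_f = nullptr;
cudaMalloc((void**)&d_f, sizef * sizeof(double));                    // device allocation
cudaMemcpy(d_f, f, sizef * sizeof(double), cudaMemcpyHostToDevice);  // host -> device

RAJA::View<double, RAJA::Layout<1>> f_(d_f, sizef);  // view over the device pointer

// ... launch kernels that read/write f_ ...

cudaMemcpy(f, d_f, sizef * sizeof(double), cudaMemcpyDeviceToHost);  // device -> host
cudaFree(d_f);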

delcmo commented 3 years ago

I run on Summit at ORNL. I will double check the CUDA examples and my code because, based on your previous email, I clearly have something weird in my code.

artv3 commented 3 years ago

If it's like Sierra, it's probably moving the data for you. Try running your code with the following flag: --atsdisable with lalloc

delcmo commented 3 years ago

I will give it a try, but that would make sense because I did check that the code is running on the GPUs and I no longer get a CUDA error message. I assume that I will have to modify the code to make it compatible with any CUDA machine and not rely on the specifics of Summit.

delcmo commented 3 years ago

You are correct, the data are being moved.

I am doing some testing on both CPU and GPU to assess the speedup we now get from using the GPU over the CPU. One thing I am not clear on is what the number of threads means for CPUs when entering the RAJA loop methods. Is that ignored by the CPU, or do I have to be careful what value is used?


artv3 commented 3 years ago

On the CPU, the number of teams/threads is ignored, and RAJA loops get converted to standard for loops.

delcmo commented 3 years ago

Ok, thanks for the quick reply. I will make sure this is correct in my code by running on CPUs with different numbers of threads.

Thanks,
