CPU performance of RAJA::View passed with __restrict__

francoishamon commented 3 years ago

Hello, I am testing RAJA in a small finite-difference code, and I encountered a problem related to the use of RAJA::View with the __restrict__ keyword on CPU. The file that implements the different versions of the kernels is here. The standard version of my main kernel looks like:

for (llint i = x3; i < x4; ++i) {
   for (llint j = y3; j < y4; ++j) {
      for (llint k = z3; k < z4; ++k) {
         float const lap = LAP;
         VUPDATE
         PHIUPDATE
      }
   }
}

where the FD macros are defined at the top of this file. The pointers used in the macros LAP, VUPDATE, and PHIUPDATE are passed with the __restrict__ keyword.

For comparison, I also implemented in the same function another version of the kernel using RAJA::View and RAJA::kernel with RAJA::loop_exec that looks like:

RAJA::kernel<POLICY>( RAJA::make_tuple(XRange, YRange, ZRange),
    [=] ( RAJA::Index_type const i, RAJA::Index_type const j, RAJA::Index_type const k)
    {
      float const lap = LAPVIEW;
      VVIEWUPDATE
      PHIVIEWUPDATE
    });

where the ranges are defined as RAJA::RangeSegment const XRange(x3, x4);, etc., and the views used in the macros are defined as RAJA::View< const float, RAJA::Layout<1, RAJA::Index_type, 0> > uView( u, (nx+2*lx)*(ny+2*ly)*(nz+2*lz) );, etc. The standard for-loop version and this RAJA::kernel version have very similar performance on CPU.

My issue arose when I tried to create the RAJA::Views in one function and then pass them by reference with the __restrict__ keyword to another function, like this:

compute_pml_3d_restrict( RAJA::Index_type(ny), RAJA::Index_type(nz),
                          RAJA::Index_type(x3), RAJA::Index_type(x4),
                          RAJA::Index_type(y3), RAJA::Index_type(y4),
                          RAJA::Index_type(z3), RAJA::Index_type(z4),
                          RAJA::Index_type(lx), RAJA::Index_type(ly), RAJA::Index_type(lz),
                          hdx_2, hdy_2, hdz_2,
                          coefxView, coefyView, coefzView,
                          uView, vpView, etaView, vView, phiView );

The RAJA::kernel is located in this other function, compute_pml_3d_restrict, but is otherwise exactly the same as before (same range, same policy). Please see this version of the code here. In this case, it seems that the __restrict__ keyword is ignored, and my code is about twice as slow as the two other versions. Am I doing anything wrong here?

I did these tests in the GEOSX environment (this branch) on Quartz, compiling in release mode with both clang-10.0.0 and gcc-8.1.0. Let me know if you need more information.

trws commented 3 years ago

Do you see the same performance degradation without using __restrict__ references? It's possible that the alias analysis is working correctly within the same function but being lost through the function call. Adding __restrict__ to a reference to the view sadly can't really help: that tells the compiler that there is no other active pointer to the view, but it makes no guarantees about the data pointed to by the pointer the view contains. Part of the problem with restrict in general is that it is not meant to work on class or struct members, only on function parameters and local variables. One way to be completely sure the qualifier is passed through would be to pass the pointers, appropriately marked, and then construct the views in compute_pml_3d_restrict. Another option, which may work in some compilers but is not guaranteed to work by any language standard, is to apply the restrict qualifier to the pointer type parameter of the View.

There are also some utility wrapper types that attempt to work around this in RAJA/util/types.hpp in a few different ways, but they are all attempts at working around a general limitation of portable C++, and I can't speak to their effectiveness.

francoishamon commented 3 years ago

Thanks for the quick reply and the clear explanation. Yes, I confirm that I observe the same performance degradation without the __restrict__ keyword when I pass references to the RAJA::View objects. Also, I just tried passing restricted pointers to compute_pml_3d_restrict and creating the views there, and I recover the good performance of the original code based on for loops. I will have a look at RAJA/util/types.hpp as well. Thanks for your help.
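
For readers following along, the difference between the two approaches discussed in this thread can be sketched roughly as follows. This is only an illustration under stated assumptions: View1D is a hypothetical stand-in for a 1D RAJA::View (a struct holding a raw pointer), so the sketch compiles without RAJA, and the function names and the trivial scaling loop are invented for the example and are not the kernels from the linked code. It relies on the __restrict__ extension as accepted by GCC and Clang.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a 1D RAJA::View: a struct that holds a raw pointer.
// This is NOT RAJA's implementation; it only mimics the "pointer inside a class" shape.
template <typename T>
struct View1D {
  T* data;
  T& operator()(std::ptrdiff_t i) const { return data[i]; }
};

// Variant 1: __restrict__ applied to the references to the views.
// The qualifier refers to the view objects themselves, not to the memory that
// their `data` members point to, so the compiler may still have to assume that
// writes through `out` can alias reads through `in`.
void scale_view_refs(View1D<float>& __restrict__ out,
                     View1D<const float>& __restrict__ in,
                     float factor, std::ptrdiff_t n) {
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    out(i) = factor * in(i);
  }
}

// Variant 2: __restrict__ applied to the raw pointer parameters, with the
// views constructed inside the function (the workaround confirmed above).
void scale_restrict_ptrs(float* __restrict__ out,
                         const float* __restrict__ in,
                         float factor, std::ptrdiff_t n) {
  View1D<float> outView{out};
  View1D<const float> inView{in};
  for (std::ptrdiff_t i = 0; i < n; ++i) {
    outView(i) = factor * inView(i);
  }
}
```

Both variants compute the same values; they differ only in what the compiler is allowed to assume about aliasing. Whether the second variant is actually faster depends on the compiler and the kernel, so it is worth measuring rather than assuming.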

LLNL / RAJA

CPU performance of RAJA::View passed with __restrict__ #1002
