ROCm / HIP

HIP: C++ Heterogeneous-Compute Interface for Portability
https://rocmdocs.amd.com/projects/HIP/
MIT License
3.73k stars 529 forks source link

HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION triggered from SYCL kernel #2371

Closed jchlanda closed 2 years ago

jchlanda commented 3 years ago

While working with DPCPP (https://github.com/intel/llvm) compiling SYCL kernels targeting ROCm (gfx908) I came across multiple benchmarks failing with HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION error:

:0:rocdevice.cpp  :2533: 1732030588683 us: Device::callbackQueue aborting with error : HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION:  The agent attempted to execute an illegal shader instruction. code: 0x2a

for example:

Going by the error message: "The agent attempted to execute an illegal shader instruction." I took the s3d benchmark and used DPCPP's -save-temps mechanism to obtain the device assembly and verify the instructions. I found out that it is ratx2 (https://github.com/zjin-lcf/oneAPI-DirectProgramming/blob/master/s3d-sycl/ratx2.sycl) kernel that triggers the error. Working with save temps combined with replaying of the commands to run for this compilation pipeline, as obtained through -### switch, allowed me to by-hand modify and test the offending kernel's assembly. It would appear that the error is somewhat misleading, as I managed to preserve the erroneous behaviour, while removing all the original instructions and replacing them with s_nop (the number of nops was not a one to one match with instructions in the original kernel though). Moreover, for this particular benchmark I was able to establish such a number of nops in the kernel in question, that trigger the HSA_STATUS_ERROR_ILLEGAL_INSTRUCTION error, while removing a single nop would also remove the error. For what is worth, ratx2 kernel with only 1344 triggers the error. Unfortunately it's not quite as simple as the code size of the kernel, I tried compiling and executing a SYCL kernel with only 1344 nops in it, but the error did not reproduce.

I was wondering if there are any known issues that could be responsible for this behaviour, or if there is anything developers could do to debug it, or work around it.

Please do let me know if you'd like to see any of the generated files.

b-sumner commented 3 years ago

What version of ROCm are you running?

zjin-lcf commented 3 years ago

My version is 4.3.0-52. I don't know their version.

jchlanda commented 3 years ago

What version of ROCm are you running?

Sorry, I should have mentioned that, I see the same error on 4.2.0-21 and 4.3.0-52.

b-sumner commented 3 years ago

Thanks. I'm assuming you're using the AMDGCN back end from the LLVM within the ROCm release? If not, what are you using? Can you create a HIP example and which matches one of your failing examples as closely as possible and compare the IR from your front end to the IR from the ROCm HIP front end? Are you matching all the attributes, metadata, calling conventions, etc?

How are you loading and launching the code object? Using the HIP runtime or OpenCL runtime? The HIP runtime cannot always properly launch an OpenCL kernel and similarly for a HIP kernel launched by the OpenCL runtime.

This sounds to me like some plumbing is not set up correctly and so the GPU somehow is not executing the ISA bits you want it to.

jchlanda commented 3 years ago

Thanks. I'm assuming you're using the AMDGCN back end from the LLVM within the ROCm release?

Yes, that's correct dpcpp is on, or very close to LLVM's tip, so it stays up to date with AMDGCN backend.

This becomes a bit more problematic, since our servers have now defaulted to 4.3.0 and with that non of the benchmarks from https://github.com/zjin-lcf/oneAPI-DirectProgramming/ when run with save-temps reproduce the illegal instruction error. The benchmark that triggered this investigation now gives back HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION.

Following on form a comment here: https://github.com/intel/llvm/issues/4553#issuecomment-933817079 I was able to get the same error with a simple sin SYCL kernel (again, only when save-temps is not passed to the compiler):

#define TY cl_float                                                           
#define IN_VALUE 90.0f                                                        
#define OUT_VALUE 42.42f                                                      
//#define TY cl_double                                                        
//#define IN_VALUE 90.0                                                       
//#define OUT_VALUE 42.42                                                     

int main(int, char **) {                                                      
  TY in = IN_VALUE;                                                           
  TY out = OUT_VALUE;                                                         

  cl::sycl::gpu_selector device_selector;                                     
  cl::sycl::queue queue(device_selector);                                     
  {                                                                           
    cl::sycl::buffer<TY> in_sycl(&in, cl::sycl::range<1>(1));                 
    cl::sycl::buffer<TY> out_sycl(&out, cl::sycl::range<1>(1));               

    queue.submit([&](cl::sycl::handler &cgh) {                                
      auto in_acc = in_sycl.get_access<cl::sycl::access::mode::read>(cgh);    
      auto out_acc = out_sycl.get_access<cl::sycl::access::mode::write>(cgh); 
      cgh.parallel_for<class sin_cos>(cl::sycl::nd_range<1>(1, 1),            
                                      [=](cl::sycl::nd_item<1> item) {        
                                        out_acc[0] = cl::sycl::sin(in_acc[0]);
                                      });                                     
    });                                                                       
  }                                                                                                                                                    
  printf("Got:      %f\nExpected: %f\n", out, std::sin(in));                  

  return 0;                                                                   
}

An equivalent HIP program with HIP_CLANG_PATH pointing to dpcpp's LLVM build directory passes with no issues. I have noticed that hipcc's clang invocation sets some extra backend flags: -mllvm -amdgpu-early-inline-all=true -mllvm -amdgpu-function-calls=false, interestingly enough those flags make the sin SYCL kernel pass. Those flags are not the silver bullet though, there are larger, more complicated kernels that still fail even with the flags appended. And hipcc passes even with those removed. Looking at the assembly generated for sin, the compiler just did what it was asked for and inlined everything, mu guess is that for larger kernels it was not able to do that.

It looks like the codegen is off the hook now and more likely the bug is somewhere in the instrumentation of the kernel/kernel arguments, or perhaps calling convention of SYCL kernels/functions.

I've attached asm files for the simple sin kernel that triggers memory violation: sin_extract.log and the inlined variant that passes: sin_no_function_calls_extract.log

As I wasn't able to use save-temps I've hacked the LiveDebugValues pass to dump all the functions it runs on and extracted the relevant ones.

jchlanda commented 2 years ago

We've found the issue that was causing the illegal instruction error (https://github.com/intel/llvm/pull/4991). Closing.