ROCm / aomp

AOMP is an open source Clang/LLVM based compiler with added support for the OpenMP® API on Radeon™ GPUs. Use this repository for releases, issues, documentation, packaging, and examples.
https://github.com/ROCm/aomp
Apache License 2.0

unable to open hip GPU device (gfx1030) with latest AOMP #187

Open powderluv opened 3 years ago

powderluv commented 3 years ago

I have built the latest AOMP (SHA: e2f40a73975ccc0f453e8767a8016e6deb849782) from the amd-stg-open branch. However, it is unable to enumerate the HIP GPU device, even though rocminfo shows both the CPU and the GPU. I have a 6900XT (gfx1030) and am trying to get Tensile to work on it.

(I have this https://github.com/ROCm-Developer-Tools/HIP/pull/2219 locally to fix the clang_rt builtin issue on hosts).

See below:

I am running this code: https://gitlab.com/cscs-ci/ci-testing/ault-amdgpu/-/blob/master/helloworld.cpp and I get the error hipErrorNoDevice.

I verified that I am in the video group, and running with sudo doesn't help.

5950x:~/github/aomp$ /opt/rocm/bin/rocminfo 
ROCk module is loaded
Able to open /dev/kfd read-write
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (0xFFFFFFFFFFFFFFFF) (timestamp count)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    AMD Ryzen 9 5950X 16-Core Processor
  Uuid:                    CPU-XX                             
  Marketing Name:          AMD Ryzen 9 5950X 16-Core Processor
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0(0x0)                             
  Queue Min Size:          0(0x0)                             
  Queue Max Size:          0(0x0)                             
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768(0x8000) KB                   
  Chip ID:                 0(0x0)                             
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   3400                               

  Shader Engines:          0                                  
  Shader Arrs. per Eng.:   0                                  
  WatchPts on Addr. Ranges:1                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    131896948(0x7dc9674) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    131896948(0x7dc9674) KB            
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx1030                            
  Uuid:                    GPU-XX                             
  Marketing Name:          Device 73bf                        
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128(0x80)                          
  Queue Min Size:          4096(0x1000)                       
  Queue Max Size:          131072(0x20000)                    
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16(0x10) KB                        
  Chip ID:                 29631(0x73bf)                      
  Cacheline Size:          64(0x40)                           
  Max Clock Freq. (MHz):   2660                               
  BDFID:                   12544                              
  Internal Node ID:        1                                  
  Compute Unit:            80                                 
  SIMDs per CU:            4                                  
  Shader Engines:          8                                  
  Shader Arrs. per Eng.:   2                                  
  WatchPts on Addr. Ranges:4                                  
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          32(0x20)                           
  Workgroup Max Size:      1024(0x400)                        
  Workgroup Max Size per Dimension:
    x                        1024(0x400)                        
    y                        1024(0x400)                        
    z                        1024(0x400)                        
  Max Waves Per CU:        64(0x40)                           
  Max Work-item Per CU:    2048(0x800)                        
  Grid Max Size:           4294967295(0xffffffff)             
  Grid Max Size per Dimension:
    x                        4294967295(0xffffffff)             
    y                        4294967295(0xffffffff)             
    z                        4294967295(0xffffffff)             
  Max fbarriers/Workgrp:   32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    16760832(0xffc000) KB              
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Accessible by all:       FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64(0x40) KB                        
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Accessible by all:       FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    amdgcn-amd-amdhsa--gfx1030         
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Size:      1024(0x400)                        
      Workgroup Max Size per Dimension:
        x                        1024(0x400)                        
        y                        1024(0x400)                        
        z                        1024(0x400)                        
      Grid Max Size:           4294967295(0xffffffff)             
      Grid Max Size per Dimension:
        x                        4294967295(0xffffffff)             
        y                        4294967295(0xffffffff)             
        z                        4294967295(0xffffffff)             
      FBarrier Max Size:       32                                 
*** Done ***             

5950x:~/github/aomp$ $HIP_PATH/bin/hipconfig --full
HIP version  : 4.0.20496-4f163c68

== hipconfig
HIP_PATH     : /home/foo/rocm/aomp
ROCM_PATH    : /home/foo/rocm/aomp_13.0-2
HIP_COMPILER : clang
HIP_PLATFORM : hcc
HIP_RUNTIME  : ROCclr
CPP_CONFIG   :  -D__HIP_PLATFORM_HCC__=  -I/home/foo/rocm/aomp/include -I/home/foo/rocm/aomp/bin/../lib/clang/13.0.0 -I/home/foo/rocm/aomp_13.0-2/hsa/include -D__HIP_ROCclr__

== hip-clang
HSA_PATH         : /home/foo/rocm/aomp_13.0-2/hsa
HIP_CLANG_PATH   : /home/foo/rocm/aomp/bin
AOMP_STANDALONE_13.0-2 clang version 13.0.0 (https://github.com/ROCm-Developer-Tools/llvm-project 0e52e2879ab1bbfb75630b97aa25a28ec9e73a1e)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /home/foo/rocm/aomp/bin
AOMP-13.0-2 (http://github.com/ROCm-Developer-Tools/aomp):
 Source ID:13.0-2-0e52e2879ab1bbfb75630b97aa25a28ec9e73a1e
  LLVM version 13.0.0_AOMP_STANDALONE_13.0-2
  Optimized build with assertions.
  Default target: x86_64-unknown-linux-gnu
  Host CPU: znver3

  Registered Targets:
    amdgcn - AMD GCN GPUs
    r600   - AMD GPUs HD2XXX-HD6XXX
    x86    - 32-bit X86: Pentium-Pro and above
    x86-64 - 64-bit X86: EM64T and AMD64
hip-clang-cxxflags : -D__HIP_ROCclr__ -std=c++11 -isystem /home/foo/rocm/aomp_13.0-2/lib/clang/13.0.0/include/.. -isystem /home/foo/rocm/aomp_13.0-2/hsa/include -D__HIP_ROCclr__ -isystem /home/foo/rocm/aomp/include -O3
hip-clang-ldflags  :  -L/home/foo/rocm/aomp/lib -O3 -lgcc_s -lgcc -lpthread -lm

=== Environment Variables
PATH=/home/foo/anaconda3/bin:/home/foo/bin:/home/foo/lokal/bin/:/usr/local/cuda/bin:/usr/local/TensorRT/bin:/home/foo/bin:/home/foo/anaconda3/bin:/home/foo/bin:/home/foo/lokal/bin/:/usr/local/cuda/bin:/usr/local/TensorRT/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/cuda/bin:/usr/local/TensorRT/bin:/home/foo/rocm/aomp/bin:/usr/local/cuda/bin:/usr/local/TensorRT/bin:/home/foo/rocm/aomp/bin
LD_LIBRARY_PATH=/home/foo/github/mmperf/b_cuda/tvm-install/lib/:/home/foo/rocm/aomp/lib:/usr/local/cuda/lib64/:/usr/local/cuda/extras/CUPTI/lib64
HIP_PATH=/home/foo/rocm/aomp
CUDA_ROOT=/usr/local/cuda
CUDA=/usr/local/cuda

== Linux Kernel
Hostname     : 5950x
Linux 5950x 5.11.0-051100rc7-generic #202102072330 SMP Sun Feb 7 23:33:19 UTC 2021 x86_64 x86_64 x86_64 GNU/Linux
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.2 LTS
Release:        20.04
Codename:       focal

powderluv commented 3 years ago

I am in both video and render groups:

id
uid=1000(foo) gid=1000(foo) groups=1000(foo),4(adm),24(cdrom),27(sudo),30(dip),44(video),46(plugdev),109(render),120(lpadmin),131(lxd),132(sambashare)
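
Group membership matters here only because it controls access to the device nodes, so a more direct check is to try opening them. The following is a minimal C++ sketch, not part of AOMP or ROCm: the helper name `open_rw_errno` is made up for illustration, and it assumes a POSIX system. `/dev/kfd` is the node named in the rocminfo output above; any render node path (for example `/dev/dri/renderD128`) is an assumption that differs between machines.

```cpp
#include <cassert>
#include <cerrno>
#include <fcntl.h>
#include <string>
#include <unistd.h>

// Hypothetical helper: try to open `path` read-write.
// Returns 0 on success, otherwise the errno value reported by open().
int open_rw_errno(const std::string& path) {
  assert(!path.empty());
  int fd = open(path.c_str(), O_RDWR | O_CLOEXEC);
  if (fd < 0) {
    return errno;  // e.g. EACCES (permissions) or ENOENT (node missing)
  }
  close(fd);
  return 0;
}
```

Calling `open_rw_errno("/dev/kfd")` and getting `EACCES` would point at permissions, `ENOENT` would suggest the device node does not exist, and `0` (which the "Able to open /dev/kfd read-write" line already implies here) means the problem is probably higher up the stack.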

JonChesterfield commented 3 years ago

I suggested export ROCM_LLC_ARGS="--amdhsa-code-object-version=3" in a side channel. That tells amd-stg-open clang to use the same version that rocr expects. Unfortunately that doesn't appear to resolve this.

powderluv commented 3 years ago

Yup. I tried setting the env var before building my sample with hipcc, and that didn't help. Rebuilding all of aomp with that env var set doesn't help either.

Happy to gather any other debug information that is relevant. To make sure nothing in /opt/rocm interferes, I only have rocm-smi installed there:

ls -ltr /opt/rocm-4.0.0/lib/*
lrwxrwxrwx 1 root root 34 Dec 14 02:49 /opt/rocm-4.0.0/lib/librocm_smi64.so.2 -> ../rocm_smi/lib/librocm_smi64.so.2
lrwxrwxrwx 1 root root 32 Dec 14 02:49 /opt/rocm-4.0.0/lib/librocm_smi64.so -> ../rocm_smi/lib/librocm_smi64.so

If required I can rebuild that too but I doubt that could be the issue.

Thanks for your quick responses.

JonChesterfield commented 3 years ago

Ah. I didn't notice you were using hipcc. When I try to run hip code locally, I get a variant on 'no devices found', which seems to correlate with an invalid branch in the hip runtime. Running the host application under valgrind blames libamdhip64.so at least. Hopefully Greg has more information on that, I haven't tried to debug the hip runtime.

powderluv commented 3 years ago

Just getting familiar with runtimes. What other runtime can I use? I am trying to get Tensile going with gfx1030 which seems to require hipcc.

Yeah gdb points to libamdhip64.so.

JonChesterfield commented 3 years ago

The bottom of the stack on Linux is kfd (in the Linux kernel), then roct, which is roughly the userspace driver part of kfd. On top of that is rocr, an implementation of the HSA spec. Those have all been robust under my testing. The OpenMP implementation on amdgpu builds directly on top of rocr for that reason.

Depending on your use case, c++ compiled for amdgcn as freestanding and launched using the functions in hsa.h works well. OpenCL has its own runtime, but it looks like it's now built on the same foundation as HIP, so it may have the same bug reported here. Windows does some different things, and so does the graphics stack.

libamdhip64.so contains, as far as I can tell, roct, rocr, rocclr and hip. Something in that appears to be broken. There's a lot of code though, so it's not an easy fix. The HIP team mostly tracks errors through an internal Jira system.

Is Tensile the rocm library with that name? If so, an issue suggests it worked on a gfx1010 in November. You might therefore be able to get a working HIP installation by rolling back to a release made around then. I've added Siu Chi to this issue as he is much closer to the HIP development than me.

powderluv commented 3 years ago

Cool. Thanks for the clarity - just so many rocXX libs it was hard to understand the layering. I think c++ compiled for amdgcn and launched with hsa.h is best for us. I will look around for rocr samples as a starting point.

I was trying to get Tensile up and running on gfx1030 because those are the "baseline" GEMM routines for rocblas and I want to compare to that performance too. I filed a few issues about it:
https://github.com/ROCmSoftwarePlatform/Tensile/issues/1282
https://github.com/RadeonOpenCompute/ROCm/issues/1376
https://github.com/ROCmSoftwarePlatform/rocBLAS/issues/1185

powderluv commented 3 years ago

Unfortunately, it looks like the last release of rocr was 3.1.x and there are no 4.x or later branches: https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/111

Are you able to test with the open source rocr from https://github.com/RadeonOpenCompute/ROCR-Runtime? Any chance we can get an updated rocr, or is 3.1.x supposed to work for gfx10?

powderluv commented 3 years ago

OK, so rocr seems to be working. I have verified that with rocm_bandwidth_test (https://github.com/RadeonOpenCompute/rocm_bandwidth_test), since rocr-runtime doesn't have any tests.

So something is broken along the rocclr / hip path for gfx10.

./rocm-bandwidth-test
........
      RocmBandwidthTest Version: 2.5.1

      Launch Command is: ./rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)

      Device: 0,  AMD Ryzen 9 5950X 16-Core Processor
      Device: 1,  Device 73bf,  GPU-XX,  31:0.0

      Inter-Device Access
      D/D       0         1
      0         1         0
      1         1         1

      Inter-Device Numa Distance
      D/D       0         1
      0         0         N/A
      1         20        0

      Unidirectional copy peak bandwidth GB/s
      D/D       0           1
      0         N/A         7.030
      1         7.262       1259.078

      Bidirectional copy peak bandwidth GB/s
      D/D       0           1
      0         N/A         13.292
      1         13.292      N/A

Thanks for the pointers.
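
As a cross-check that both tools see the same device: rocminfo reports `BDFID: 12544` for the GPU, while rocm-bandwidth-test prints `31:0.0`. These appear to be the same PCI address. Assuming the conventional PCI packing (bus in bits 15..8, device in bits 7..3, function in bits 2..0), 12544 is 0x3100, which is bus 0x31, device 0, function 0. A minimal C++ sketch of that decoding follows; `PciAddress` and `decode_bdf` are illustrative names, not part of any ROCm API.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative type: a PCI bus/device/function triple.
struct PciAddress {
  unsigned bus;
  unsigned device;
  unsigned function;
};

// Assumes the usual 16-bit PCI layout: bus[15:8], device[7:3], function[2:0].
PciAddress decode_bdf(std::uint32_t bdfid) {
  assert(bdfid <= 0xFFFFu);
  return PciAddress{(bdfid >> 8) & 0xFFu, (bdfid >> 3) & 0x1Fu, bdfid & 0x7u};
}
```

For the value in this thread, `decode_bdf(12544)` gives bus 0x31, device 0, function 0, matching the `31:0.0` shown by the bandwidth test.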

powderluv commented 3 years ago

@JonChesterfield do you have any examples / tests that do the "c++ compiled for amdgcn as freestanding and launched using the functions in hsa.h"? I am trying to follow https://github.com/RadeonOpenCompute/rocminfo as an example, but I don't see gcn binaries in the final ELF file that goes into the rocr / hsa runtime.

Update: found https://github.com/ROCm-Developer-Tools/LLVM-AMDGPU-Assembler-Extra to play around with.

Update 2: I have been able to run simple code after updating to code object version 3. I pushed a fork: https://github.com/Powderluv/LLVM-AMDGPU-Assembler-Extra

JonChesterfield commented 3 years ago

Hey. I missed the above comments but saw this while looking at the tangentially related #193. I'm not clear what the status of the gfx10 cards is; the 4.1 release notes don't seem to mention them. Unofficially, some code does seem to run on them, and I believe rocr and the compiler backend are functional. OpenMP does not work on gfx10 yet; that is being worked on at present.

The code object format is currently transitioning from 3 to 4. I think the status is: ROCm 3.10 needs v3, ROCm 4.1 can use v4, and LLVM trunk has patches under review to bring v4 online.

Using raw C++ means trading the many conveniences of the high level languages for an increase in control. Documentation is sparse, your mileage may vary. Nevertheless, an example of going down that rabbit hole is https://github.com/jonChesterfield/hostrpc, which is a bare metal prototype that I'm hoping to implement libc on top of (thus getting away from freestanding for applications). You may find it interesting but it's not production code yet.

A freestanding compile invocation is along the lines of:

GFX=gfx906
clang -O2 -ffreestanding --target=amdgcn-amd-amdhsa -march=$GFX -mcpu=$GFX -nogpulib -emit-llvm

That will emit IR for a gfx906: functions, data and so forth.

To get something that can be launched, one currently needs to use opencl/hip/openmp/IR/asm as the kernel calling convention is not exposed to c++. That's somewhat annoying but the 'kernel' function only needs to contain a call to something written in C. E.g.:

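// Defined elsewhere in C: the code the kernel wrapper hands control to.
int __device_start_cast(int argc, __global void* argv);

// Index of the calling lane within its wavefront: with an all-ones mask,
// mbcnt_lo/mbcnt_hi count the lanes below the current one.
static unsigned get_lane_id(void)
{
  return __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u));
}

// OpenCL kernel wrapper: every lane calls the C function and stores its result.
kernel void __device_start(int argc, __global void* argv, __global int* res)
{
  res[get_lane_id()] = __device_start_cast(argc, argv);
}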
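
The `get_lane_id` helper above relies on the two `mbcnt` builtins, which count the set bits of a mask that belong to lanes below the current one (first the low 32 lanes, then the high 32). The host-side C++ model below is only a sketch of that behaviour for illustration, not device code: `mbcnt_lo`, `mbcnt_hi` and `emulated_lane_id` are hypothetical names, and the explicit `lane` parameter stands in for the hardware's implicit lane position.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Model of mbcnt_lo: bits of `mask` owned by lanes below `lane`,
// restricted to lanes 0..31, added to `base`.
std::uint32_t mbcnt_lo(std::uint32_t mask, std::uint32_t base, unsigned lane) {
  assert(lane < 64);
  std::uint32_t below = lane >= 32 ? 0xFFFFFFFFu : ((1u << lane) - 1u);
  return base + static_cast<std::uint32_t>(std::bitset<32>(mask & below).count());
}

// Model of mbcnt_hi: same idea for lanes 32..63.
std::uint32_t mbcnt_hi(std::uint32_t mask, std::uint32_t base, unsigned lane) {
  assert(lane < 64);
  std::uint32_t below = lane <= 32 ? 0u : ((1u << (lane - 32)) - 1u);
  return base + static_cast<std::uint32_t>(std::bitset<32>(mask & below).count());
}

// Mirrors the expression used in get_lane_id, with the lane made explicit.
unsigned emulated_lane_id(unsigned lane) {
  return mbcnt_hi(~0u, mbcnt_lo(~0u, 0u, lane), lane);
}
```

With an all-ones mask, every lane below the current one contributes exactly one bit, so the result equals the lane index: 0 to 31 in wave32 mode (the wavefront size rocminfo reports for gfx1030 above), and 0 to 63 on wave64 hardware.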

Given some IR that contains one or more kernel functions, llc can emit a code object, which the HSA loader can run on the GPU. The interface to that is RadeonOpenCompute/ROCR-Runtime/src/inc/hsa.h. It's verbose, but works broadly as the comments suggest.

powderluv commented 3 years ago

Thank you for this. hostrpc seems very useful. We will give it a spin and post issues here or on the hostrpc repo.

Also libc would be fantastic along with some utils for debugging and logging

gregrodgers commented 3 years ago

OpenMP team, what is the status of AOMP on gfx1030? Should we get a test machine in our AOMP lab?

ye-luo commented 3 years ago

FYI: https://github.com/RadeonOpenCompute/ROCm/issues/887#issuecomment-822222885. I hope that once the ROCm side enables RDNA, AOMP works out of the box. Right now, nailing the software on GFX9 is really critical.

sysmanalex commented 2 years ago

RocmBandwidthTest Version: 2.6.0 / rocm-5.1.2 gfx1030 / uname = 5.4.0-122-generic

      Launch Command is: ./rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)

      Device: 0,  11th Gen Intel(R) Core(TM) i5-11400F @ 2.60GHz
      Device: 1,  AMD Radeon RX 6800,  GPU-XX,  03:0.0

      Inter-Device Access
      D/D       0         1
      0         1         0
      1         1         1

      Inter-Device Numa Distance
      D/D       0         1
      0         0         N/A
      1         20        0

      Unidirectional copy peak bandwidth GB/s
      D/D       0           1
      0         N/A         26.662
      1         28.566      848.405

      Bidirectional copy peak bandwidth GB/s
      D/D       0           1
      0         N/A         31.417
      1         31.417      N/A

ppanchad-amd commented 2 months ago

@powderluv Do you still need assistance with this ticket? If not, please close the ticket. Thanks!