ROCm / ROCm-OpenCL-Runtime

ROCm OpenCL Runtime

Mcxcl App Slowdown for ROCm 1.8.3 vs amdgpu-pro 17.50 - Potential Compiler issue #43

Open fangq opened 6 years ago

fangq commented 6 years ago

mcxcl is a package my group developed for efficient photon transport simulations. It performs well on the latest Vega 64 GPU using the amdgpu-pro driver (see Fig. 2 of our recently published paper), and the kernel runs correctly on all tested OpenCL implementations from NVIDIA, AMD, and Intel.

However, when we recently installed ROCm on one of our Linux servers (Ubuntu 16.04) and ran this code on the Vega 64 GPU, all of our benchmarks failed with infinite loops.

To reproduce this issue, run the following commands:

git clone https://github.com/fangq/mcxcl.git 
cd mcxcl 
git checkout
cd src 
make clean all 
cd ../example/benchmark 
./run_benchmark1.sh -G 1 -n 1e6  # benchmark 1 failed
./run_benchmark2.sh -G 1 -n 1e6   # benchmark 2 failed
./run_benchmark2a.sh -G 1 -n 1e6   # benchmark 2a failed

We would like to know what causes this issue and how to make our code compatible with ROCm.

thanks

Srinivasuluch commented 6 years ago

Hi fangq - I tried the application (the MCXcl benchmarks) on a Vega10 64GB card (DID#6868) and found no issues with the latest AMD ROCm internal driver. We don't know what kind of failure you observed: performance, correctness, segfault, pagefault, etc.? Could you please provide more details? We also need the DID and configuration details of the system you tested: ROCm version, CPU, GPU, and any other relevant information.

Thanks

fangq commented 6 years ago

@Srinivasuluch, I am sorry for the delay; I was not notified of your reply for some reason.

Here is a summary of the system:

OS: Ubuntu 16.04.3
GPU: Vega 64 (DID#687f)
CPU: i7-4770k
kernel: 4.13.0-32-generic
gcc: 5.4.0 20160609
ROCm: 1.7

My student @3upperm2n also tried it (~Jan 28) on a Vega MI25 (DID#6860), with the same findings.

Overall, we have observed two issues:

  1. the benchmarks hang in most cases, although which ones hang varies: sometimes benchmark1/2/2a all hang; sometimes benchmark2/2a hang but 1 works; sometimes 1 works but 2/2a hang;
  2. when a benchmark does not hang, the simulation speed is about 10% of the speed obtained with the amdgpu-pro (17.50) driver.

The same software has no issues when using the amdgpu-pro driver or when run on other platforms (Intel/NVIDIA).

I just reinstalled ROCm (I had reverted to amdgpu-pro over the past weeks); right now, benchmark1 works, but 2/2a hang. I am not sure, though, whether ROCm was reinstalled properly. Previously, I remember a kfd kernel was used:

fangq@pangu:~/space/git/Temp$ uname -a
Linux pangu 4.11.0-kfd-compute-rocm-rel-1.6-180 #1 SMP Tue Oct 10 08:15:38 CDT 2017 x86_64 x86_64 x86_64 GNU/Linux

but after reinstalling ROCm, the uname output no longer shows the kfd kernel:

fangq@pangu:~/space/git/Temp$ uname -a
Linux pangu 4.13.0-32-generic #35~16.04.1-Ubuntu SMP Thu Jan 25 10:13:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

Nonetheless, rocm-dkms, rocm-opencl, and rocm-opencl-dev were all installed:

fangq@pangu:~/space/git/Project$ apt-cache madison rocm-dkms rocm-opencl rocm-opencl-dev
 rocm-dkms |     1.7.60 | http://repo.radeon.com/rocm/apt/debian xenial/main amd64 Packages
rocm-opencl | 1.2.0-2017121952 | http://repo.radeon.com/rocm/apt/debian xenial/main amd64 Packages
rocm-opencl-dev | 1.2.0-2017121952 | http://repo.radeon.com/rocm/apt/debian xenial/main amd64 Packages

Is there a way I can verify whether the ROCm driver was indeed used? The "verify your installation" link in the README file is broken.
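As a generic sanity check (not part of mcxcl), a small host-side probe like the sketch below prints which OpenCL platform and driver-version strings the ICD loader resolves; the reported strings differ between the ROCm and amdgpu-pro stacks, so this at least shows which runtime a binary is actually using. The paths in the build comment are typical ROCm locations and may need adjusting.

/* probe_cl.c -- list the OpenCL platforms and devices the ICD loader finds.
 * Typical build (adjust include/library paths for your install):
 *   gcc probe_cl.c -o probe_cl -I/opt/rocm/opencl/include -L/opt/rocm/opencl/lib/x86_64 -lOpenCL
 */
#include <stdio.h>
#include <CL/cl.h>

int main(void) {
    cl_platform_id plats[8];
    cl_uint nplat = 0;
    clGetPlatformIDs(8, plats, &nplat);
    if (nplat > 8) nplat = 8;
    for (cl_uint p = 0; p < nplat; p++) {
        char name[256], ver[256];
        clGetPlatformInfo(plats[p], CL_PLATFORM_NAME, sizeof(name), name, NULL);
        clGetPlatformInfo(plats[p], CL_PLATFORM_VERSION, sizeof(ver), ver, NULL);
        printf("Platform %u: %s (%s)\n", p, name, ver);

        cl_device_id devs[8];
        cl_uint ndev = 0;
        if (clGetDeviceIDs(plats[p], CL_DEVICE_TYPE_ALL, 8, devs, &ndev) != CL_SUCCESS)
            continue;
        if (ndev > 8) ndev = 8;
        for (cl_uint d = 0; d < ndev; d++) {
            char dname[256], drv[256];
            clGetDeviceInfo(devs[d], CL_DEVICE_NAME, sizeof(dname), dname, NULL);
            clGetDeviceInfo(devs[d], CL_DRIVER_VERSION, sizeof(drv), drv, NULL);
            printf("  Device %u: %s, driver %s\n", d, dname, drv);
        }
    }
    return 0;
}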

rocm-smi and rocminfo outputs are attached below:

fangq@pangu:~/space/git/Project$ /opt/rocm/bin/rocm-smi
====================    ROCm System Management Interface    ====================
================================================================================
 GPU  Temp    AvgPwr   SCLK     MCLK     Fan      Perf    SCLK OD
  1   40.0c   3.0W     852Mhz   167Mhz   13.73%   auto      0%       
Traceback (most recent call last):
  File "/opt/rocm/bin/rocm-smi", line 1058, in <module>
    showAllConcise(deviceList)
  File "/opt/rocm/bin/rocm-smi", line 728, in showAllConcise
    fan = str(getFanSpeed(device))
  File "/opt/rocm/bin/rocm-smi", line 358, in getFanSpeed
    fanLevel = int(getSysfsValue(device, 'fan'))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
fangq@pangu:~/space/git/Project$ /opt/rocm/bin/rocminfo
=====================    
HSA System Attributes    
=====================    
Runtime Version:         1.1
System Timestamp Freq.:  1000.000000MHz
Sig. Max Wait Duration:  18446744073709551615 (number of timestamp)
Machine Model:           LARGE                              
System Endianness:       LITTLE                             

==========               
HSA Agents               
==========               
*******                  
Agent 1                  
*******                  
  Name:                    Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
  Vendor Name:             CPU                                
  Feature:                 None specified                     
  Profile:                 FULL_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        0                                  
  Queue Min Size:          0                                  
  Queue Max Size:          0                                  
  Queue Type:              MULTI                              
  Node:                    0                                  
  Device Type:             CPU                                
  Cache Info:              
    L1:                      32768KB                            
  Chip ID:                 0                                  
  Cacheline Size:          64                                 
  Max Clock Frequency (MHz):3900                               
  BDFID:                   0                                  
  Compute Unit:            8                                  
  Features:                None
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: KERNARG, FINE GRAINED
      Size:                    7795692KB                          
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
    Pool 2                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    7795692KB                          
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        TRUE                               
  ISA Info:                
    N/A                      
*******                  
Agent 2                  
*******                  
  Name:                    gfx900                             
  Vendor Name:             AMD                                
  Feature:                 KERNEL_DISPATCH                    
  Profile:                 BASE_PROFILE                       
  Float Round Mode:        NEAR                               
  Max Queue Number:        128                                
  Queue Min Size:          4096                               
  Queue Max Size:          131072                             
  Queue Type:              MULTI                              
  Node:                    1                                  
  Device Type:             GPU                                
  Cache Info:              
    L1:                      16KB                               
  Chip ID:                 26751                              
  Cacheline Size:          64                                 
  Max Clock Frequency (MHz):1630                               
  BDFID:                   768                                
  Compute Unit:            64                                 
  Features:                KERNEL_DISPATCH 
  Fast F16 Operation:      FALSE                              
  Wavefront Size:          64                                 
  Workgroup Max Size:      1024                               
  Workgroup Max Size Per Dimension:
    Dim[0]:                  67109888                           
    Dim[1]:                  50332672                           
    Dim[2]:                  0                                  
  Grid Max Size:           4294967295                         
  Waves Per CU:            40                                 
  Max Work-item Per CU:    2560                               
  Grid Max Size per Dimension:
    Dim[0]:                  4294967295                         
    Dim[1]:                  4294967295                         
    Dim[2]:                  4294967295                         
  Max number Of fbarriers Per Workgroup:32                                 
  Pool Info:               
    Pool 1                   
      Segment:                 GLOBAL; FLAGS: COARSE GRAINED      
      Size:                    8372224KB                          
      Allocatable:             TRUE                               
      Alloc Granule:           4KB                                
      Alloc Alignment:         4KB                                
      Acessible by all:        FALSE                              
    Pool 2                   
      Segment:                 GROUP                              
      Size:                    64KB                               
      Allocatable:             FALSE                              
      Alloc Granule:           0KB                                
      Alloc Alignment:         0KB                                
      Acessible by all:        FALSE                              
  ISA Info:                
    ISA 1                    
      Name:                    AMD:AMDGPU:9:0:0                   
      Machine Models:          HSA_MACHINE_MODEL_LARGE            
      Profiles:                HSA_PROFILE_BASE                   
      Default Rounding Mode:   NEAR                               
      Default Rounding Mode:   NEAR                               
      Fast f16:                TRUE                               
      Workgroup Max Dimension: 
        Dim[0]:                  67109888                           
        Dim[1]:                  1024                               
        Dim[2]:                  16777217                           
      Workgroup Max Size:      1024                               
      Grid Max Dimension:      
        x                        4294967295                         
        y                        4294967295                         
        z                        4294967295                         
      Grid Max Size:           4294967295                         
      FBarrier Max Size:       32                                 
*** Done ***             
gstoner commented 6 years ago

@fangq Can you test our ROCm 1.7.1 beta release to see if it addresses your issue: http://repo.radeon.com/misc/archive/beta/rocm-1.7.1-beta.tar.gz

fangq commented 6 years ago

@gstoner, thanks for the link. I just downloaded the package and updated ROCm to 1.7.1 beta; however, the benchmarks give me the same issues:

  1. run_benchmark1.sh works, but the simulation speed is 4320.59 photons/ms, which is about 1/10 of the expected speed (44306.6 photons/ms, see below);
  2. run_benchmark2/2a/3.sh all hang.

Running the failed benchmarks (the same binary) on another machine with an R9 Nano and an RX 480 under amdgpu-pro worked fine.


In case the benchmarks do run on your ROCm/Vega setup, here are the expected speeds we got from the Vega 64 with amdgpu-pro by running ./run_benchmark{1,2,2a}.sh without any additional parameters (1e8 photons):

run_benchmark1:  44306.6 photons/ms 
run_benchmark2:  22461.81 photons/ms 
run_benchmark2a: 16126.43 photons/ms 

the speed value is printed at the end of the simulation.

gstoner commented 6 years ago

Ok, we will have the OpenCL team look at this. One thing to note: the two drivers use two different compilers. But ROCm also uses guard pages; can you double-check the code for out-of-bound memory references?

fangq commented 6 years ago

@gstoner, thanks a lot. Out-of-bound memory access is definitely a possibility.

Any recommendation for a tool to detect out-of-bound bugs? I tried the open-source oclgrind tool, but I still can't make it run properly with my kernel.
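The kind of bug being suggested is a simple out-of-range access like the illustrative kernel below (made up for this discussion, not taken from mcx_core.cl); some runtimes silently tolerate such a read, while ROCm's guard pages can turn it into a fault or an apparent hang.

// Illustrative only -- not code from mcx_core.cl.
__kernel void oob_example(__global const float *in,
                          __global float *out,
                          const uint n)
{
    const uint i = get_global_id(0);
    if (i <= n)                   /* BUG: should be i < n; i == n reads past `in` */
        out[i] = in[i] * 2.0f;    /* ...and writes past `out` as well */
}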

fangq commented 6 years ago

@gstoner, FYI, we ran oclgrind on our benchmark codes; it did not report any memory errors for either benchmark1 (which does not hang with ROCm but is 10x slower) or benchmark2 (which hangs with ROCm 1.7 and 1.7.1).

Below are the outputs for both tests:

fangq@pangu:~/mcxcl/example/benchmark$ /usr/bin/oclgrind ../../bin/mcxcl -A -f benchmark1.json -k ../../src/mcx_core.cl -b 0 -n 1000
==============================================================================
=                       Monte Carlo eXtreme (MCX) -- OpenCL                  =
=          Copyright (c) 2010-2018 Qianqian Fang <q.fang at neu.edu>         =
=                             http://mcx.space/                              =
=                                                                            =
= Computational Optics&Translational Imaging (COTI) Lab - http://fanglab.org =
=            Department of Bioengineering, Northeastern University           =
==============================================================================
=    The MCX Project is funded by the NIH/NIGMS under grant R01-GM114365     =
==============================================================================
$Rev::6e839e $ Last $Date::2017-07-20 12:46:23 -04$ by $Author::Qianqian Fang$
==============================================================================
- code name: [Vanilla MCXCL] compiled with OpenCL version [1]
- compiled with: [RNG] Logistic-Lattice [Seed Length] 5
initializing streams ...    init complete : 0 ms
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DMCX_SRC_PENCIL 
build program complete : 25 ms
- [device 0(1): Oclgrind Simulator] threadph=15 oddphotons=40 np=1000.0 nthread=64 nblock=64 repetition=1
set kernel arguments complete : 25 ms
lauching mcx_main_loop for time window [0.0ns 5.0ns] ...
simulation run# 1 ...   kernel complete:    6325 ms
retrieving flux ...     transfer complete:        6325 ms
normalizing raw data ...    normalization factor alpha=200000.000000
saving data to file ... 216000 1    saving data complete : 6336 ms

simulated 1000 photons (1000) with 1 devices (repeat x1)
MCX simulation speed: 0.16 photon/ms
total simulated energy: 1000.00 absorbed: 16.84878%
(loss due to initial specular reflection is excluded in the total)

fangq@pangu:~/mcxcl/example/benchmark$ /usr/bin/oclgrind ../../bin/mcxcl -A -f benchmark1.json -b 1 -P '{"Shapes":[{"Sphere":   {"Tag":2, "O":[30,30,30],"R":15}}]}' -s benchmark2 -k ../../src/mcx_core.cl  -n 1000
==============================================================================
=                       Monte Carlo eXtreme (MCX) -- OpenCL                  =
=          Copyright (c) 2010-2018 Qianqian Fang <q.fang at neu.edu>         =
=                             http://mcx.space/                              =
=                                                                            =
= Computational Optics&Translational Imaging (COTI) Lab - http://fanglab.org =
=            Department of Bioengineering, Northeastern University           =
==============================================================================
=    The MCX Project is funded by the NIH/NIGMS under grant R01-GM114365     =
==============================================================================
$Rev::6e839e $ Last $Date::2017-07-20 12:46:23 -04$ by $Author::Qianqian Fang$
==============================================================================
- code name: [Vanilla MCXCL] compiled with OpenCL version [1]
- compiled with: [RNG] Logistic-Lattice [Seed Length] 5
initializing streams ...    init complete : 0 ms
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DMCX_SRC_PENCIL  -D MCX_DO_REFLECTION
build program complete : 26 ms
- [device 0(1): Oclgrind Simulator] threadph=15 oddphotons=40 np=1000.0 nthread=64 nblock=64 repetition=1
set kernel arguments complete : 26 ms
lauching mcx_main_loop for time window [0.0ns 5.0ns] ...
simulation run# 1 ...   kernel complete:    13521 ms
retrieving flux ...     transfer complete:        13521 ms
normalizing raw data ...    normalization factor alpha=200000.000000
saving data to file ... 216000 1    saving data complete : 13533 ms

simulated 1000 photons (1000) with 1 devices (repeat x1)
MCX simulation speed: 0.07 photon/ms
total simulated energy: 1000.00 absorbed: 27.34936%
(loss due to initial specular reflection is excluded in the total)
gstoner commented 6 years ago

@fangq Can you try 1.7.1 Beta 3 http://repo.radeon.com/misc/archive/beta/rocm-1.7.1.beta.3.tar.bz2

fangq commented 6 years ago

Thanks @gstoner, I have some encouraging updates: after the upgrade, all benchmarks now run without hanging on 1.7.1 beta 3!

however, there are still two (or three) remaining problems:

  1. the simulation speeds for all 3 benchmarks are about 10x slower than with amdgpu-pro; the speed values I get on ROCm 1.7.1b3 are

    run_benchmark1.sh: 4293.69 photon/ms
    run_benchmark2.sh: 2241.15 photon/ms
    run_benchmark2a.sh: 2200.22 photon/ms

    the speeds for amdgpu-pro can be found in https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/issues/43#issuecomment-366480459; both the 1st and 2nd benchmarks run at about 1/10 of that speed.

  2. if I append -o 1 to any of the 3 benchmarks, the hang happens again. The only difference between -o 1 and the default (-o 3) is the JIT compilation flags:

    -o 1: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SRC_PENCIL
    -o 3: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DMCX_SRC_PENCIL

The goal of trying ROCm for mcxcl is to see whether we can accelerate our simulation using the half-precision hardware in the Vega 64. From what we learned on the AMD OpenCL forum, the only way to use the half-precision units is through ROCm, because amdgpu-pro does not support rapid packed math.

Nonetheless, when I run mcxcl with half-precision on ROCm 1.7.1b3, the speed still does not improve much. For run_benchmark1 (./run_benchmark1.sh -n 1e7 -J "-DUSE_HALF"), the speed is 4170.14 photons/ms; this is even lower than the single-precision speed with ROCm, not to mention the 10x higher single-precision speed with amdgpu-pro.

Just to confirm, is the half-precision unit supported by ROCm 1.7.1b3, or do I need to use some special flags?

thanks again

PS: I ran rocminfo on the system with the Vega 64 (DID: 687f) and ROCm 1.7.1b3, and I noticed the line "Fast F16 Operation: FALSE". So it looks like the fp16 hardware is not supported by ROCm; is there a way to enable it?
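For context, the usual OpenCL fp16 pattern is sketched below (a simplified stand-alone kernel, not the actual mcx_core.cl code): the cl_khr_fp16 pragma enables half arithmetic, and half2 math is what is expected to map onto gfx900's packed instructions.

// Simplified fp16 sketch (illustrative only, not mcx_core.cl).
// cl_khr_fp16 must be supported and enabled for half arithmetic;
// half2 operations are candidates for gfx900's packed (v_pk_*) instructions.
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void axpy_half2(__global half2 *y,
                         __global const half2 *x,
                         const float a)
{
    const size_t i  = get_global_id(0);
    const half2  ah = (half2)((half)a, (half)a);
    y[i] = ah * x[i] + y[i];    /* packed fp16 multiply-add */
}

On the host side, checking that CL_DEVICE_EXTENSIONS lists cl_khr_fp16 before building would confirm whether the runtime exposes fp16 at all.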

fangq commented 6 years ago

Also, I notice that running mcxcl with "-o 0" hangs benchmark1, but not benchmark2/2a. For benchmark1, the JIT flags for "-o 0" and the other options are compared below:

-o 0: -DMCX_SRC_PENCIL
-o 1: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SRC_PENCIL
-o 3: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DMCX_SRC_PENCIL

-o 0 disables all optimization options.
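These -D flags are passed to the OpenCL JIT through clBuildProgram, so each -o level hands the compiler a structurally different kernel. A rough illustration of the pattern, using made-up macro names rather than mcxcl's actual source:

// Illustration only (hypothetical macros, not mcxcl's source):
// build-time -D flags select between alternative code paths.
#ifdef USE_NATIVE_MATH
  #define FAST_EXP(x) native_exp(x)   /* fast, lower-precision hardware exp */
#else
  #define FAST_EXP(x) exp(x)          /* full-precision exp */
#endif

__kernel void attenuate(__global float *w,
                        __global const float *pathlen,
                        const float mua)
{
    const size_t i = get_global_id(0);
#ifdef SIMPLIFY_BRANCH
    /* branch-free (select/predicated) form */
    w[i] *= (pathlen[i] > 0.0f) ? FAST_EXP(-mua * pathlen[i]) : 1.0f;
#else
    /* explicit-branch form */
    if (pathlen[i] > 0.0f)
        w[i] *= FAST_EXP(-mua * pathlen[i]);
#endif
}

Which variant runs faster (or breaks) can therefore depend entirely on how a given compiler handles native math calls and branch predication.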

fangq commented 6 years ago

Hi @gstoner, it has been a while, but I would like to come back to this issue, as my collaborator Dr. Kaeli, PhD student Leiming (@3upperm2n), and I are trying to make progress on our study of half-precision Monte Carlo simulations.

With the new ROCm 1.8.3 released this morning, I have some updates.

First, the hanging issue seems to be gone for all 3 benchmarks with all of my optimization flags (-o 0/1/2/3). This is very encouraging progress.

Second, when running rocm-smi on my Vega 64 (gfx900, DID: 687f), it no longer shows the error message reported before, but it prints an extra GPU row with N/A in every column:

====================    ROCm System Management Interface    ====================
================================================================================
 GPU  Temp    AvgPwr   SCLK     MCLK     Fan      Perf    SCLK OD    MCLK OD
  1   31c     3.0W     852Mhz   167Mhz   0.0%     auto      0%         0%       
  0   N/A     N/A      N/A      N/A      0%       N/A       N/A        N/A      
================================================================================
====================           End of ROCm SMI Log          ====================

The rocminfo output is the same as before: "Fast F16 Operation" still shows FALSE for the Vega, even though the "Fast f16" field prints TRUE in the ISA 1 section. When running my code with and without the half-precision operations, the speed shows no noticeable change.

Right now, the biggest issue is that mcxcl runs 10-fold slower on ROCm than on the amdgpu-pro driver, the same as in my last report back in February.

we would like to get some help from your team to

  1. understand the cause of the 10-fold slowdown in mcxcl speed, and
  2. understand how to enable our half-precision code on Vega with ROCm.

We previously observed a similarly dramatic speed hit (a 10-fold slowdown on some NVIDIA drivers) for the CUDA version of our code:

https://devtalk.nvidia.com/default/topic/925630/cuda-programming-and-performance/cuda-7-5-on-maxwell-980ti-drops-performance-by-10x-versus-cuda-7-0-and-6-5/

It was fixed later in a new driver; the issue was caused by a compiler heuristic mis-predicating the complex kernel structure. We suspect it might be a similar scenario here.

gstoner commented 6 years ago

You're working on a system that has a GPU in it that is not an AMD GPU, which is the row showing N/A. On a server this would be the BMC GPU as well. So this is correct behavior for rocm-smi.

The AMDGPU-Pro driver is not using ROCm; on Vega10 it uses OpenCL on PAL (Platform Abstraction Layer, https://github.com/GPUOpen-Drivers/pal) with the LLVM-to-HSAIL shader compiler (the same compiler as the Windows driver), and the old OpenCL/VDI/ORCA path for older GFX generations. ROCm uses the AMDGPU LLVM compiler; here is documentation so you can learn more about this compiler: https://llvm.org/docs/AMDGPUUsage.html.

We will look into this.

fangq commented 6 years ago

Thanks Greg, you are right; I do have an Intel integrated GPU on that machine.

Also, thanks for the link to the LLVM compiler documentation; I will read more about it.

To better manage the two issues we are currently facing, I created a new tracker at https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/issues/55 to discuss the half-precision support issue, and will leave this tracker for understanding the 10-fold slowdown.

fangq commented 6 years ago

There are some significant speed improvements in 1.9; the details can be found in https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/issues/55 , and the speed comparison (in photons/ms) is below:

  Benchmark            1.8.x      1.9        1.9 w/ half   amdgpu-pro
  --------------------------------------------------------------------
  run_benchmark1.sh:   4293.69    15398.83   19406.17      44306.60
  run_benchmark2.sh:   2241.15    7352.94    13154.43      22461.81
  run_benchmark3.sh:   2200.22    8278.83    10364.84      --

Overall, the ROCm 1.9 driver is about 4x faster than 1.8 for this application, but still about 2x slower than the amdgpu-pro driver.

gstoner commented 6 years ago

Is it a driver or a compiler issue? They use two different compilers.

Can you try the AMDGPU-Pro driver for Linux, test with the AMDGPU driver on Windows, and report the numbers?

fangq commented 6 years ago

Hi Greg, can you let me know how to install the "amdgpu-pro" compiler? I assume you mean the OpenCL JIT compiler. In the past, I just installed the amdgpu-pro package, which removes all rocm* packages.

gstoner commented 6 years ago

I saw you ran AMDGPU-Pro; was the version of the Linux driver 18.20, or was it newer?

fangq commented 6 years ago

The previous amdgpu-pro version I used was 17.50-511655 on Linux (Ubuntu 16.04). Unfortunately, the hosts with AMD GPUs do not have Windows installed.

Maybe I misunderstood: were you suggesting that it is possible to mix amdgpu-pro and ROCm (i.e., use the amdgpu-pro compiler with the ROCm driver)? If so, is there any online documentation I can read to set up this environment? I am very curious to try. Thanks.

gstoner commented 6 years ago

Ok, this is the magic info I needed to isolate the issue: AMDGPU-Pro here means the 17.50 driver. We need to look at a compiler code-gen issue, or even a pattern in your algorithm that is steering Clang/LLVM away from the best code gen.

The ROCm driver uses only the CL frontend with the AMDGPU LLVM native GCN compiler, with the OpenCL runtime on ROCr/KFD.

Note that internally we can load the OpenCL-to-LLVM-to-HSAIL/SC compiler binary; it can just be dropped onto the platform to run a test on this.

For AMDGPU-Pro, here is the magic decoder ring for the compiler and its base.

fangq commented 5 years ago

Hi Greg, I recently upgraded a Linux workstation from Ubuntu 14.04 to 16.04, and also upgraded the amdgpu-pro driver from 16.30 (the last version supporting Ubuntu 14.04) to 18.40. The workstation has two AMD cards, an R9 Nano and an RX 480, both of which were used (with the 16.30 driver) for this paper: https://doi.org/10.1117/1.JBO.23.1.010504

After upgrading to the 18.40 driver, I noticed a similar slowdown (about 2-fold) compared with my previous benchmark results. Some previously helpful control-flow simplifications, like the one in this commit (the MCX_SIMPLIFY_BRANCH macro),

https://github.com/fangq/mcxcl/commit/f3a53f4e387c26b8322e3c336d5b4331ff83f7dd

seem to be responsible for such a big reduction. If I disable these flow-related optimizations (by using -o 0 or -o 1 on the command line), I can recover about 80% of the previous speed, but still not the full speed I got from the 16.30 driver.

I just want to mention these in case it helps pinpoint where the new driver has difficulty handling mcxcl's kernel (possibly array indexing and branch predication).