fangq opened this issue 6 years ago
Hi fangq - I tried the application (the MCXcl benchmarks) on a Vega10 64GB card (DID#6868) and found no issues with the latest AMD ROCm internal driver. We don't know what kind of failure you noticed: performance, correctness, segfault, page fault, etc. Could you please provide more details? We also need the DID and config details of the tested setup: ROCm version, CPU, GPU, and any other useful info.
Thanks
@Srinivasuluch, I am sorry for the delay; I was not notified of your reply for some reason.
Here is a summary of the system:
OS: Ubuntu 16.04.3
GPU: Vega 64 (DID#687f)
CPU: i7-4770k
kernel: 4.13.0-32-generic
gcc: 5.4.0 20160609
ROCm: 1.7
My student @3upperm2n also tried (~Jan 28) on a Vega MI25 (DID#6860), with the same findings.
Overall, we've observed two issues: benchmark1 runs but is about 10x slower than expected, and benchmark2/2a hang. The same software has no issues when using the amdgpu-pro driver or on other platforms (Intel/NVIDIA).
I just reinstalled rocm (I had reverted to amdgpu-pro in the past weeks). Right now, benchmark1 works but benchmark2/2a hang; I am not sure, though, whether my rocm was reinstalled properly. Previously, I remember a kfd kernel was used:
fangq@pangu:~/space/git/Temp$ uname -a
Linux pangu 4.11.0-kfd-compute-rocm-rel-1.6-180 #1 SMP Tue Oct 10 08:15:38 CDT 2017 x86_64 x86_64 x86_64 GNU/Linux
but after I reinstalled rocm, the uname output no longer shows the kfd kernel:
fangq@pangu:~/space/git/Temp$ uname -a
Linux pangu 4.13.0-32-generic #35~16.04.1-Ubuntu SMP Thu Jan 25 10:13:43 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
Nonetheless, rocm-dkms, rocm-opencl, and rocm-opencl-dev are all installed:
fangq@pangu:~/space/git/Project$ apt-cache madison rocm-dkms rocm-opencl rocm-opencl-dev
rocm-dkms | 1.7.60 | http://repo.radeon.com/rocm/apt/debian xenial/main amd64 Packages
rocm-opencl | 1.2.0-2017121952 | http://repo.radeon.com/rocm/apt/debian xenial/main amd64 Packages
rocm-opencl-dev | 1.2.0-2017121952 | http://repo.radeon.com/rocm/apt/debian xenial/main amd64 Packages
Is there a way I can verify whether the rocm driver is indeed being used? The "verify your installation" link in the README file is broken.
rocm-smi and rocminfo outputs are attached below:
fangq@pangu:~/space/git/Project$ /opt/rocm/bin/rocm-smi
==================== ROCm System Management Interface ====================
================================================================================
GPU Temp AvgPwr SCLK MCLK Fan Perf SCLK OD
1 40.0c 3.0W 852Mhz 167Mhz 13.73% auto 0%
Traceback (most recent call last):
File "/opt/rocm/bin/rocm-smi", line 1058, in <module>
showAllConcise(deviceList)
File "/opt/rocm/bin/rocm-smi", line 728, in showAllConcise
fan = str(getFanSpeed(device))
File "/opt/rocm/bin/rocm-smi", line 358, in getFanSpeed
fanLevel = int(getSysfsValue(device, 'fan'))
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
fangq@pangu:~/space/git/Project$ /opt/rocm/bin/rocminfo
=====================
HSA System Attributes
=====================
Runtime Version: 1.1
System Timestamp Freq.: 1000.000000MHz
Sig. Max Wait Duration: 18446744073709551615 (number of timestamp)
Machine Model: LARGE
System Endianness: LITTLE
==========
HSA Agents
==========
*******
Agent 1
*******
Name: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
Vendor Name: CPU
Feature: None specified
Profile: FULL_PROFILE
Float Round Mode: NEAR
Max Queue Number: 0
Queue Min Size: 0
Queue Max Size: 0
Queue Type: MULTI
Node: 0
Device Type: CPU
Cache Info:
L1: 32768KB
Chip ID: 0
Cacheline Size: 64
Max Clock Frequency (MHz):3900
BDFID: 0
Compute Unit: 8
Features: None
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: KERNARG, FINE GRAINED
Size: 7795692KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
Pool 2
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 7795692KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: TRUE
ISA Info:
N/A
*******
Agent 2
*******
Name: gfx900
Vendor Name: AMD
Feature: KERNEL_DISPATCH
Profile: BASE_PROFILE
Float Round Mode: NEAR
Max Queue Number: 128
Queue Min Size: 4096
Queue Max Size: 131072
Queue Type: MULTI
Node: 1
Device Type: GPU
Cache Info:
L1: 16KB
Chip ID: 26751
Cacheline Size: 64
Max Clock Frequency (MHz):1630
BDFID: 768
Compute Unit: 64
Features: KERNEL_DISPATCH
Fast F16 Operation: FALSE
Wavefront Size: 64
Workgroup Max Size: 1024
Workgroup Max Size Per Dimension:
Dim[0]: 67109888
Dim[1]: 50332672
Dim[2]: 0
Grid Max Size: 4294967295
Waves Per CU: 40
Max Work-item Per CU: 2560
Grid Max Size per Dimension:
Dim[0]: 4294967295
Dim[1]: 4294967295
Dim[2]: 4294967295
Max number Of fbarriers Per Workgroup:32
Pool Info:
Pool 1
Segment: GLOBAL; FLAGS: COARSE GRAINED
Size: 8372224KB
Allocatable: TRUE
Alloc Granule: 4KB
Alloc Alignment: 4KB
Acessible by all: FALSE
Pool 2
Segment: GROUP
Size: 64KB
Allocatable: FALSE
Alloc Granule: 0KB
Alloc Alignment: 0KB
Acessible by all: FALSE
ISA Info:
ISA 1
Name: AMD:AMDGPU:9:0:0
Machine Models: HSA_MACHINE_MODEL_LARGE
Profiles: HSA_PROFILE_BASE
Default Rounding Mode: NEAR
Default Rounding Mode: NEAR
Fast f16: TRUE
Workgroup Max Dimension:
Dim[0]: 67109888
Dim[1]: 1024
Dim[2]: 16777217
Workgroup Max Size: 1024
Grid Max Dimension:
x 4294967295
y 4294967295
z 4294967295
Grid Max Size: 4294967295
FBarrier Max Size: 32
*** Done ***
@fangq Can you test our ROCm 1.7.1 beta release to see if it addresses your issue: http://repo.radeon.com/misc/archive/beta/rocm-1.7.1-beta.tar.gz
@gstoner, thanks for the link. I just downloaded the package and updated rocm to 1.7.1 beta; however, the benchmarks give me the same issue.
Running the failed benchmarks (the same binaries) on another machine with an R9 Nano and an RX480 under amdgpu-pro worked fine.
In case the benchmarks do run on your rocm/Vega setup, here is the expected speed we got from a Vega 64 with amdgpu-pro, running ./run_benchmark{1,2,2a}.sh without any additional parameters (1e8 photons):
run_benchmark1: 44306.6 photons/ms
run_benchmark2: 22461.81 photons/ms
run_benchmark2a: 16126.43 photons/ms
the speed value is printed at the end of the simulation.
OK, we'll have the OpenCL team look at this. One thing: the two drivers use two different compilers. ROCm also uses guard pages; can you double-check the code for out-of-bound memory references?
@gstoner, thanks a lot. An out-of-bound memory access is definitely a possibility.
Any recommendation for a tool to detect out-of-bound bugs? I tried the open-source oclgrind tool, but I still can't make it run properly with my kernel.
@gstoner, FYI, we ran oclgrind with our benchmark codes; it did not capture any memory errors for either benchmark1 (which does not hang with rocm, but is 10x slower) or benchmark2 (which hangs with rocm 1.7 and 1.7.1).
Below are the outputs of both tests:
fangq@pangu:~/mcxcl/example/benchmark$ /usr/bin/oclgrind ../../bin/mcxcl -A -f benchmark1.json -k ../../src/mcx_core.cl -b 0 -n 1000
==============================================================================
= Monte Carlo eXtreme (MCX) -- OpenCL =
= Copyright (c) 2010-2018 Qianqian Fang <q.fang at neu.edu> =
= http://mcx.space/ =
= =
= Computational Optics&Translational Imaging (COTI) Lab - http://fanglab.org =
= Department of Bioengineering, Northeastern University =
==============================================================================
= The MCX Project is funded by the NIH/NIGMS under grant R01-GM114365 =
==============================================================================
$Rev::6e839e $ Last $Date::2017-07-20 12:46:23 -04$ by $Author::Qianqian Fang$
==============================================================================
- code name: [Vanilla MCXCL] compiled with OpenCL version [1]
- compiled with: [RNG] Logistic-Lattice [Seed Length] 5
initializing streams ... init complete : 0 ms
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DMCX_SRC_PENCIL
build program complete : 25 ms
- [device 0(1): Oclgrind Simulator] threadph=15 oddphotons=40 np=1000.0 nthread=64 nblock=64 repetition=1
set kernel arguments complete : 25 ms
lauching mcx_main_loop for time window [0.0ns 5.0ns] ...
simulation run# 1 ... kernel complete: 6325 ms
retrieving flux ... transfer complete: 6325 ms
normalizing raw data ... normalization factor alpha=200000.000000
saving data to file ... 216000 1 saving data complete : 6336 ms
simulated 1000 photons (1000) with 1 devices (repeat x1)
MCX simulation speed: 0.16 photon/ms
total simulated energy: 1000.00 absorbed: 16.84878%
(loss due to initial specular reflection is excluded in the total)
fangq@pangu:~/mcxcl/example/benchmark$ /usr/bin/oclgrind ../../bin/mcxcl -A -f benchmark1.json -b 1 -P '{"Shapes":[{"Sphere": {"Tag":2, "O":[30,30,30],"R":15}}]}' -s benchmark2 -k ../../src/mcx_core.cl -n 1000
==============================================================================
= Monte Carlo eXtreme (MCX) -- OpenCL =
= Copyright (c) 2010-2018 Qianqian Fang <q.fang at neu.edu> =
= http://mcx.space/ =
= =
= Computational Optics&Translational Imaging (COTI) Lab - http://fanglab.org =
= Department of Bioengineering, Northeastern University =
==============================================================================
= The MCX Project is funded by the NIH/NIGMS under grant R01-GM114365 =
==============================================================================
$Rev::6e839e $ Last $Date::2017-07-20 12:46:23 -04$ by $Author::Qianqian Fang$
==============================================================================
- code name: [Vanilla MCXCL] compiled with OpenCL version [1]
- compiled with: [RNG] Logistic-Lattice [Seed Length] 5
initializing streams ... init complete : 0 ms
Building kernel with option: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DMCX_SRC_PENCIL -D MCX_DO_REFLECTION
build program complete : 26 ms
- [device 0(1): Oclgrind Simulator] threadph=15 oddphotons=40 np=1000.0 nthread=64 nblock=64 repetition=1
set kernel arguments complete : 26 ms
lauching mcx_main_loop for time window [0.0ns 5.0ns] ...
simulation run# 1 ... kernel complete: 13521 ms
retrieving flux ... transfer complete: 13521 ms
normalizing raw data ... normalization factor alpha=200000.000000
saving data to file ... 216000 1 saving data complete : 13533 ms
simulated 1000 photons (1000) with 1 devices (repeat x1)
MCX simulation speed: 0.07 photon/ms
total simulated energy: 1000.00 absorbed: 27.34936%
(loss due to initial specular reflection is excluded in the total)
@fangq Can you try 1.7.1 Beta 3: http://repo.radeon.com/misc/archive/beta/rocm-1.7.1.beta.3.tar.bz2
Thanks @gstoner, I have some encouraging updates: after the update, all benchmarks now run without hanging on 1.7.1beta3!
However, there are still two (or three) remaining problems:
The simulation speed for all 3 benchmarks is about 10x slower than with amdgpu-pro; the speeds I get on rocm 1.7.1b3 are
run_benchmark1.sh: 4293.69 photon/ms
run_benchmark2.sh: 2241.15 photon/ms
run_benchmark2a.sh: 2200.22 photon/ms
The speeds for amdgpu-pro can be found in https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime/issues/43#issuecomment-366480459; both the 1st and 2nd benchmarks run at about 1/10 of that speed.
If I add -o 1 to any of the 3 benchmarks, the hanging happens again. The only difference between -o 1 and the default (-o 3) is the JIT compilation flags:
-o 1: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SRC_PENCIL
-o 3: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DMCX_SRC_PENCIL
Our goal in trying rocm for mcxcl is to see whether we can accelerate our simulation using the half-precision hardware in the Vega 64; from what we learned on the AMD OpenCL forum, the only way to use the half-precision units is rocm, because amdgpu-pro does not support rapid packed math.
Nonetheless, if I run mcxcl with half-precision on rocm 1.7.1b3, the speed still does not improve much. For run_benchmark1 (./run_benchmark1.sh -n 1e7 -J "-DUSE_HALF"), the speed is 4170.14 photons/ms, which is even less than the single-precision speed with rocm, not to mention the 10x higher single-precision speed with amdgpu-pro.
Just to confirm: is the half-precision unit supported by rocm 1.7.1b3, or do I need to use some special flags?
thanks again
PS: I ran rocminfo on the system with the Vega 64 (DID: 687f) and rocm 1.7.1b3, and I noticed the following line:
Fast F16 Operation: FALSE
So it looks like the fp16 hardware is not enabled under rocm; is there a way to enable it?
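For context, the kernel-side fp16 path we are experimenting with looks roughly like the fragment below (a simplified illustration, not the actual mcx_core.cl code); it depends on the cl_khr_fp16 extension, which is why the "Fast F16" capability matters:

```c
// Simplified OpenCL C illustration (not the actual mcx_core.cl source):
// packed half2 math only pays off when the device exposes cl_khr_fp16
// and has rapid-packed-math units (Vega/gfx900).
#pragma OPENCL EXTENSION cl_khr_fp16 : enable

__kernel void scale_half2(__global half2 *buf, const float s) {
    int i = get_global_id(0);
    half2 v = buf[i];
    buf[i] = v * (half2)((half)s, (half)s);  // one packed fp16 multiply
}
```

If the driver reports the extension but the hardware path is disabled, the compiler may fall back to scalar fp32 conversions, which would explain seeing no speedup.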
Also, I noticed that running mcxcl with -o 0 also hangs in benchmark1, but not in benchmark2/2a. For benchmark1, the JIT flags for -o 0 and the other options are compared below:
-o 0: -DMCX_SRC_PENCIL
-o 1: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SRC_PENCIL
-o 3: -cl-mad-enable -DMCX_USE_NATIVE -DMCX_SIMPLIFY_BRANCH -DMCX_VECTOR_INDEX -DMCX_SRC_PENCIL
-o 0 disables all optimization options.
Hi @gstoner, it has been a while, but I would like to come back to this issue, as my collaborator Dr. Kaeli, PhD student Leiming (@3upperm2n), and I are trying to make progress on our study of half-precision Monte Carlo simulations.
With the new rocm 1.8.3 released this morning, I do have some updates.
First, the hanging issue seems to be gone for all 3 benchmarks with all my optimization flags (-o 0/1/2/3). This is very encouraging progress.
Second, when running rocm-smi on my Vega 64 (gfx900, DID: 687f), it no longer prints the error message reported before, but it shows an extra GPU row with N/A in every column:
==================== ROCm System Management Interface ====================
================================================================================
GPU Temp AvgPwr SCLK MCLK Fan Perf SCLK OD MCLK OD
1 31c 3.0W 852Mhz 167Mhz 0.0% auto 0% 0%
0 N/A N/A N/A N/A 0% N/A N/A N/A
================================================================================
==================== End of ROCm SMI Log ====================
The rocminfo output is the same as before: "Fast F16 Operation" still shows FALSE for the Vega, even though the "Fast f16" field prints TRUE in the ISA 1 section. When running my code with and without the half-precision operations, the speed shows no noticeable change.
Right now, the biggest issue is that the mcxcl speed on rocm is 10-fold slower than on the amdgpu-pro driver, the same as in my last report back in Feb. We would like to get some help from your team in tracking this down.
We previously observed a similarly dramatic speed hit (a 10-fold slow-down on some NVIDIA drivers) with the CUDA version of this code; it was fixed later in a new driver, and that issue was caused by a compiler heuristic mispredicting the complex kernel structure. We suspect it might be a similar scenario here.
You're working on a system that has a GPU in it that is not an AMD GPU; that is the row of N/A values. On a server this could be the BMC's GPU as well. So this is correct behavior for ROCm-SMI.
The AMDGPU-Pro driver is not using ROCm: it uses OpenCL on PAL (Platform Abstraction Layer, https://github.com/GPUOpen-Drivers/pal) with the LLVM-to-HSAIL shader compiler (the same compiler as the Windows driver) on Vega10, and the older OpenCL/VDI/ORCA path for earlier GFX generations. ROCm uses the AMDGPU LLVM compiler; here is documentation so you can learn more about it: https://llvm.org/docs/AMDGPUUsage.html.
We'll look into this.
Thanks Greg, you are right: I do have the Intel integrated GPU on that machine.
Also, thanks for the link to the LLVM compiler documentation; I will read it further.
To better manage the two issues we are currently facing, I created a new tracker at https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/issues/55 to discuss the half-precision support issue, leaving this tracker for understanding the 10-fold slow-down.
There are some significant speed improvements in 1.9; the details can be found in https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/issues/55, and the speed comparison (photons/ms) is below:
Benchmark            1.8.x      1.9        1.9 w/ half   amdgpu
______________________________________________________________________
run_benchmark1.sh:   4293.69    15398.83   19406.17      44306.60
run_benchmark2.sh:   2241.15    7352.94    13154.43      22461.81
run_benchmark3.sh:   2200.22    8278.83    10364.84      --
Overall, the rocm 1.9 driver is roughly 3-4x faster than 1.8 for this application, and still about 2-3x slower than the amdgpu-pro driver.
Is it a driver or a compiler issue? The two drivers use two different compilers.
Can you try the AMDGPU-Pro driver for Linux, test the AMDGPU driver on Windows, and report the numbers?
Hi Greg, can you let me know how to install the "amdgpu-pro" compiler? I assume you meant the OpenCL JIT compiler. In the past, I just installed the amdgpu-pro package, which removes all rocm* packages.
I saw you ran AMDGPU-Pro; which version of the Linux driver was it, 18.20 or newer?
The previous amdgpu-pro version I used was 17.50-511655 on Linux (Ubuntu 16.04). Unfortunately, the hosts with AMD GPUs do not have Windows installed.
Maybe I misunderstood: were you suggesting that it is possible to mix amdgpu-pro and rocm (i.e., use the amdgpu-pro compiler with the rocm driver)? If yes, is there any online documentation I can read to set up this environment? I am very curious to try. Thanks.
OK, this is the magic info I needed to isolate the issue: your AMDGPU-Pro is the 17.50 driver. We need to look at a compiler code-gen issue, or even a pattern in your algorithm that is steering CLANG/LLVM away from the best code gen.
The ROCm driver only uses the CL frontend with the AMDGPU LLVM native GCN compiler, using the OpenCL runtime on ROCr/KFD.
Note: internally we can load the OpenCL-to-LLVM-to-HSAIL/SC compiler binary; we can just drop it onto the platform to run a test with it.
For AMDGPU-Pro, here is the magic decoder ring for the compiler and its base:
17.40 driver
17.50 driver
18.20 driver
Hi Greg, I recently upgraded a Linux workstation from Ubuntu 14.04 to 16.04, and also upgraded the amdgpu-pro driver from 16.30 (the last version supporting Ubuntu 14.04) to 18.40. The workstation has two AMD cards, an R9 Nano and an RX480, both of which were used (with the 16.30 driver) for this paper: https://doi.org/10.1117/1.JBO.23.1.010504
After upgrading to the 18.40 driver, I noticed a similar 2-fold slow-down from my previous benchmark results. Some previously helpful control-flow simplifications, like those in this commit (the MCX_SIMPLIFY_BRANCH macro)
https://github.com/fangq/mcxcl/commit/f3a53f4e387c26b8322e3c336d5b4331ff83f7dd
seem to be responsible for much of the reduction. If I disable these flow-related optimizations (by using -o 0 or -o 1 on the command line), I can recover about 80% of the previous speed, but still not the full speed I got from the 16.30 driver.
I just want to mention these in case it helps pinpoint where the new driver has difficulty handling mcxcl's kernel (possibly array indexing and branch predication).
mcxcl is a package my group developed for efficient photon transport simulations. It has good performance on the latest Vega 64 GPU using the amdgpu-pro driver (see Fig. 2 of our recently published paper), and the kernel works fine on all tested OpenCL implementations from NVIDIA, AMD, and Intel.
However, when we recently installed ROCm on one of our Linux servers (Ubuntu 16.04) and tried to run this code on the Vega 64 GPU, all of our benchmarks failed with infinite loops.
To reproduce this issue, here are the commands
We want to know what caused this issue and how to make our code compatible with ROCm.
thanks