Open powderluv opened 3 years ago
I am in both video and render groups:
id
uid=1000(foo) gid=1000(foo) groups=1000(foo),4(adm),24(cdrom),27(sudo),30(dip),44(video),46(plugdev),109(render),120(lpadmin),131(lxd),132(sambashare)
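The same membership check can be done from inside a program, which is handy when a launcher or service may run with a different group set than the interactive shell. The following is only a minimal sketch using the POSIX calls `getgrnam`, `getegid` and `getgroups`; the function name `process_in_group` is made up for illustration and is not part of ROCm or HIP.

```cpp
#include <cstddef>
#include <grp.h>
#include <unistd.h>
#include <vector>

// Hypothetical helper: true if the calling process has the named group as
// its effective group or as one of its supplementary groups.
bool process_in_group(const char* name) {
    const struct group* entry = getgrnam(name);  // nullptr if no such group
    if (entry == nullptr) {
        return false;
    }
    if (entry->gr_gid == getegid()) {
        return true;
    }
    int count = getgroups(0, nullptr);           // query the list size first
    if (count <= 0) {
        return false;
    }
    std::vector<gid_t> ids(static_cast<std::size_t>(count));
    count = getgroups(count, ids.data());        // then fetch the group ids
    for (int i = 0; i < count; ++i) {
        if (ids[i] == entry->gr_gid) {
            return true;
        }
    }
    return false;
}
```

Calling `process_in_group("video")` and `process_in_group("render")` from the failing application would confirm that the process itself, and not just the login shell, carries both groups.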
I suggested export ROCM_LLC_ARGS="--amdhsa-code-object-version=3"
in a side channel. That tells amd-stg-open clang to use the same version that rocr expects. Unfortunately that doesn't appear to resolve this.
Yup, I tried both. Setting the env var before building my sample with hipcc didn't help, and rebuilding all of aomp with that env var set didn't help either.
Happy to gather any other debug information that is relevant. To make sure there is nothing in /opt/rocm I only have rocm-smi there
ls -ltr /opt/rocm-4.0.0/lib/*
lrwxrwxrwx 1 root root 34 Dec 14 02:49 /opt/rocm-4.0.0/lib/librocm_smi64.so.2 -> ../rocm_smi/lib/librocm_smi64.so.2
lrwxrwxrwx 1 root root 32 Dec 14 02:49 /opt/rocm-4.0.0/lib/librocm_smi64.so -> ../rocm_smi/lib/librocm_smi64.so
If required I can rebuild that too but I doubt that could be the issue.
Thanks for your quick responses.
Ah. I didn't notice you were using hipcc. When I try to run hip code locally, I get a variant on 'no devices found', which seems to correlate with an invalid branch in the hip runtime. Running the host application under valgrind blames libamdhip64.so at least. Hopefully Greg has more information on that, I haven't tried to debug the hip runtime.
Just getting familiar with runtimes. What other runtime can I use? I am trying to get Tensile going with gfx1030 which seems to require hipcc.
Yeah gdb points to libamdhip64.so.
The bottom of the stack on linux is kfd (in the linux kernel), then roct which is roughly the userspace driver part of kfd. On top of that is an implementation of the HSA spec, rocr. Those have all been robust under my testing. The OpenMP implementation on amdgpu builds directly on top of rocr for that reason.
Depending on your use case, c++ compiled for amdgcn as freestanding and launched using the functions in hsa.h works well. Opencl has its own runtime, but it looks like it's now built on the same foundation as hip so may have the same bug reported here. Windows does some different things, and so does the graphics stack.
libamdhip64.so contains, as far as I can tell, roct, rocr, rocclr and hip. Something in that appears to be broken. There's a lot of code though, so it's not an easy fix. HIP mostly tracks errors through an internal Jira system.
Is Tensile the rocm library with that name? If so, an issue suggests it worked on a gfx1010 in November. You might therefore be able to get a working HIP installation by rolling back to a release made around then. I've added Siu Chi to this issue as he is much closer to the HIP development than me.
Cool. Thanks for the clarity - just so many rocXX libs it was hard to understand the layering. I think c++ compiled for amdgcn and launched with hsa.h is best for us. I will look around for rocr samples as a starting point.
I was trying to get Tensile up and running on gfx1030 because those are the "baseline" GEMM routines for rocblas and want to compare to that performance too. I filed a few issues about it https://github.com/ROCmSoftwarePlatform/Tensile/issues/1282 https://github.com/RadeonOpenCompute/ROCm/issues/1376 https://github.com/ROCmSoftwarePlatform/rocBLAS/issues/1185
Unfortunately looks like the last release of rocr was 3.1.x and there is no 4.x or later branches https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/111
Are you able to test with the opensource rocr from https://github.com/RadeonOpenCompute/ROCR-Runtime ? Any chance we can get an updated rocr or is 3.1.x supposed to work for gfx10 ?
ok so rocr seems to be working. I have verified that with rocm_bandwidth_test (https://github.com/RadeonOpenCompute/rocm_bandwidth_test) since rocr-runtime doesn't have any tests.
so something is broken along rocclr / hip for gfx10.
./rocm-bandwidth-test ........ RocmBandwidthTest Version: 2.5.1
Launch Command is: ./rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)
Device: 0, AMD Ryzen 9 5950X 16-Core Processor
Device: 1, Device 73bf, GPU-XX, 31:0.0
Inter-Device Access
D/D 0 1
0 1 0
1 1 1
Inter-Device Numa Distance
D/D 0 1
0 0 N/A
1 20 0
Unidirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 7.030
1 7.262 1259.078
Bidirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 13.292
1 13.292 N/A
Thanks for the pointers.
@JonChesterfield do you have any examples / tests that do the "c++ compiled for amdgcn as freestanding and launched using the functions in hsa.h"? I am trying to follow https://github.com/RadeonOpenCompute/rocminfo as an example, but I don't see gcn binaries in the final elf file that goes into the rocr / hsa runtime.
update: found https://github.com/ROCm-Developer-Tools/LLVM-AMDGPU-Assembler-Extra to play around with.
Update 2: I have been able to run simple code after updating to code object version 3. Pushed a fork: https://github.com/Powderluv/LLVM-AMDGPU-Assembler-Extra
Hey. I missed the above comments but saw this while looking at the tangentially related #193. I'm not clear what the status of the gfx10 cards is - the 4.1 release notes don't seem to mention it. Unofficially some code does seem to run on them, and I believe rocr and the compiler backend are functional. OpenMP does not work on gfx10 yet, working on that at present.
The code object format is currently transitioning from 3 to 4. I think the status is rocm 3.10 needs v3, rocm 4.1 can use v4, llvm trunk is reviewing patches to bring v4 online.
Using raw C++ means trading the many conveniences of the high level languages for an increase in control. Documentation is sparse, your mileage may vary. Nevertheless, an example of going down that rabbit hole is https://github.com/jonChesterfield/hostrpc, which is a bare metal prototype that I'm hoping to implement libc on top of (thus getting away from freestanding for applications). You may find it interesting but it's not production code yet.
A freestanding compile invocation is along the lines of:
GFX=gfx906
clang -O2 -ffreestanding --target=amdgcn-amd-amdhsa -march=$GFX -mcpu=$GFX -nogpulib -emit-llvm
That will emit IR for a gfx906. Functions, data and so forth.
To get something that can be launched, one currently needs to use opencl/hip/openmp/IR/asm as the kernel calling convention is not exposed to c++. That's somewhat annoying but the 'kernel' function only needs to contain a call to something written in C. E.g.:
int __device_start_cast(int argc, __global void* argv);

static unsigned get_lane_id(void)
{
    return __builtin_amdgcn_mbcnt_hi(~0u, __builtin_amdgcn_mbcnt_lo(~0u, 0u));
}

kernel void __device_start(int argc, __global void* argv, __global int* res)
{
    res[get_lane_id()] = __device_start_cast(argc, argv);
}
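To see why `get_lane_id` above yields the lane index, it can help to emulate the two builtins on the host. This is only a sketch, assuming a 64-wide wavefront and the usual reading of mbcnt (add to the source operand the number of mask bits belonging to lanes below the current one, with `lo` covering lanes 0-31 and `hi` covering lanes 32-63). The `emulated_*` and `lanes_below` names are made up for illustration and are not part of any AMD header.

```cpp
#include <bitset>
#include <cassert>
#include <cstdint>

// Bit i is set for every lane i strictly below `lane` (64-wide wavefront assumed).
static std::uint64_t lanes_below(unsigned lane) {
    assert(lane < 64);
    return lane == 0 ? 0 : (~std::uint64_t{0} >> (64 - lane));
}

// Host-side stand-in for __builtin_amdgcn_mbcnt_lo as seen by one lane.
static unsigned emulated_mbcnt_lo(unsigned lane, std::uint32_t mask, unsigned src) {
    const std::uint32_t below = static_cast<std::uint32_t>(lanes_below(lane));
    return src + static_cast<unsigned>(std::bitset<32>(mask & below).count());
}

// Host-side stand-in for __builtin_amdgcn_mbcnt_hi as seen by one lane.
static unsigned emulated_mbcnt_hi(unsigned lane, std::uint32_t mask, unsigned src) {
    const std::uint32_t below = static_cast<std::uint32_t>(lanes_below(lane) >> 32);
    return src + static_cast<unsigned>(std::bitset<32>(mask & below).count());
}

// Same composition as get_lane_id in the kernel source: hi(~0, lo(~0, 0)).
static unsigned emulated_lane_id(unsigned lane) {
    return emulated_mbcnt_hi(lane, ~0u, emulated_mbcnt_lo(lane, ~0u, 0u));
}
```

With an all-ones mask the count of lower lanes is the lane's own index, so each lane writes to its own slot of `res`.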
Given some IR that contains one or more kernel functions, llc can emit a code object which the hsa loader can run on the gpu. The interface to that is RadeonOpenCompute/ROCR-Runtime/src/inc/hsa.h. It's verbose, but works broadly as the comments suggest.
Thank you for this. hostrpc seems very useful. We will give it a spin and post issues here or on the hostrpc repo.
Also libc would be fantastic along with some utils for debugging and logging
OpenMP team, what is status of AOMP on gfx1030? Should we get a test machine in our AOMP lab?
FYI, see https://github.com/RadeonOpenCompute/ROCm/issues/887#issuecomment-822222885. I hope that once the ROCm side enables RDNA, AOMP works out of the box. Right now, nailing the software on GFX9 is really critical.
RocmBandwidthTest Version: 2.6.0 / rocm-5.1.2 gfx1030 / uname = 5.4.0-122-generic
Launch Command is: ./rocm-bandwidth-test (rocm_bandwidth -a + rocm_bandwidth -A)
Device: 0, 11th Gen Intel(R) Core(TM) i5-11400F @ 2.60GHz
Device: 1, AMD Radeon RX 6800, GPU-XX, 03:0.0
Inter-Device Access
D/D 0 1
0 1 0
1 1 1
Inter-Device Numa Distance
D/D 0 1
0 0 N/A
1 20 0
Unidirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 26.662
1 28.566 848.405
Bidirectional copy peak bandwidth GB/s
D/D 0 1
0 N/A 31.417
1 31.417 N/A
@powderluv Do you still need assistance with this ticket? If not, please close the ticket. Thanks!
I have built latest AOMP (SHA: e2f40a73975ccc0f453e8767a8016e6deb849782) with the amd-stg-open branch. However it is unable to enumerate the HIP GPU device though rocminfo shows both cpu and gpu. I have a 6900XT (gfx1030) and am trying to get Tensile to work on it.
(I have this https://github.com/ROCm-Developer-Tools/HIP/pull/2219 locally to fix the clang_rt builtin issue on hosts).
See below:
I am running this code: https://gitlab.com/cscs-ci/ci-testing/ault-amdgpu/-/blob/master/helloworld.cpp and it fails with the error hipErrorNoDevice.
I verified I am in the video group and sudo doesn't help.