ROCm / rocprofiler

ROC profiler library. Profiling with perf-counters and derived metrics.
https://rocm.docs.amd.com/projects/rocprofiler/en/latest/
MIT License

[HSA_STATUS_ERROR] A generic error has occurred #10

Closed camierjs closed 4 years ago

camierjs commented 4 years ago

Hi,

I am trying to profile some CEED benchmarks. I'm using a gfx906 card with rocm 2.10.

I'm using hipcc and the compilation/run seem fine, but I can't get any output from the profiler.

I tried different options, but I keep getting this generic error:

rcprof -C  ./bp3 -o 2 -l 8 -d hip
Options used:
   --mesh-dimension 3
   --refinement-level 8
   --order 2
   --device hip
Radeon Compute Profiler V5.6.7262 is enabled
No counter file specified. Only counters that will fit into a single pass will be enabled.
Device configuration: hip,cpu
Processor partitioning: 1 1 1
Mesh dimensions: 8 8 4
Total number of elements: 256
Number of finite element unknowns: 2601
aqlprofile API table load failed: HSA_STATUS_ERROR: A generic error has occurred.
[corona90:mpi_rank_0][error_sighandler] Caught error: Aborted (signal 6)
Failed to generate profile result /g/g91/camier1/Session1.csv.

Have you seen this kind of error?

Thank you for your help,

Jean-Sylvain

nwolfey21 commented 4 years ago

Hi Jean-Sylvain. I think what you are looking for is the rocprofiler. Easy mistake ;)

https://github.com/ROCm-Developer-Tools/rocprofiler

camierjs commented 4 years ago

Hi Noah,

You're right, I was first trying to use the rocprof shipped with the rocm/2.10 software toolchain.

I switched to the rocprofiler GitHub version, compiled it, and ran it with the --hip-trace option.

Here is the output:

corona_hip/mfem4_bps> ~/usr/local/bin/rocprof --hip-trace ./bp3 -o 2 -l 8 -d hip
RPL: on '191218_154552' from '/g/g91/camier1/usr/local/rocprofiler' in '~/home/benchmarks_corona/builds/corona_hip/mfem4_bps'
RPL: profiling '"./bp3" "-o" "2" "-l" "8" "-d" "hip"'
RPL: input file ''
RPL: output dir '/tmp/rpl_data_191218_154552_114455'
RPL: result dir '/tmp/rpl_data_191218_154552_114455/input_results_191218_154552'
Tool lib "~/usr/local/roctracer/tool/libtracer_tool.so" failed to load.
Options used:
   --mesh-dimension 3
   --refinement-level 8
   --order 2
   --device hip
Device configuration: hip,cpu
Processor partitioning: 1 1 1
Mesh dimensions: 8 8 4
Total number of elements: 256
Number of finite element unknowns: 2601
   Iteration :   0  (B r, r) = 0.000604889 ...
   Iteration :  56  (B r, r) = 3.78127e-28
Average reduction factor = 0.607985

Total CG time:    0.0621552 (0.0621552) sec.
Time per CG step: 0.00110991 (0.00110991) sec.

"DOFs/sec" in CG: 2.34342 (2.34342) million.

One results.db file is output, but the libtracer_tool message suggests that I need to build the roctracer tool. Building it from the master branch fails with the following error:

In file included from /g/g91/camier1/home/roctracer/src/core/roctracer.cpp:30:
/g/g91/camier1/home/roctracer/inc/roctracer_kfd.h:30:10: fatal error: inc/kfd_ostream_ops.h: No such file or directory
 #include "inc/kfd_ostream_ops.h"
          ^~~~~~~~~~~~~~~~~~~~~~~
compilation terminated.

I'd like to get the JSON file, since I see it is possible to generate one.

Thank you,

Jean-Sylvain

nwolfey21 commented 4 years ago

Ahh yes, your system is missing roctracer. Do you have sudo access on the system? If so, then the easiest method is to sudo apt install roctracer-dev.

nwolfey21 commented 4 years ago

Otherwise, if you don't have sudo access and need to build from source, perhaps try building the rocm-2.10.x branch of roctracer since you're using ROCm 2.10. The master branch has already pulled in many changes for ROCm 3.0. The software stack is moving fast.

camierjs commented 4 years ago

Ok, thank you for your answer. I don't have sudo access, I'll try to rebuild it.

camierjs commented 4 years ago

Hi,

I've rebuilt both roc-2.10.x branches of rocprofiler and roctracer, but I'm still hitting the same error:

Tool lib "/g/g91/camier1/usr/local/rocprofiler/roctracer/tool/libtracer_tool.so" failed to load.

ldd looks good, LD_LIBRARY_PATH too.

My command line looks like this: ~/usr/local/rocprofiler/bin/rocprof -i counterfile_HSA_Vega.txt --stats --hip-trace -t tmp -d data ./bp1 -o 2 -l 8 -d hip

Thank you for any suggestion,

Jean-Sylvain

eshcherb commented 4 years ago

Hi Camierjs, could you share what system and which compiler you use?

camierjs commented 4 years ago

I'm on the Corona cluster, with the 2.10 ROCm stack installed on the system. I tried on MI25 and MI60.

skyreflectedinmirrors commented 4 years ago

@camierjs:

1) There was a bug in the roctracer library paths for RHEL + ROCm 2.10. It can be worked around by setting export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/roctracer/lib in your environment. This may resolve your tool lib error.

2) Regardless, I believe that the hsa-amd-aqlprofile.x86_64 package must be installed via sudo yum install hsa-amd-aqlprofile.x86_64, as this library isn't open-sourced yet. On my system it lives in /opt/rocm/hsa-amd-aqlprofile/lib/libhsa-amd-aqlprofile64.so.1.0.0.
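A minimal sketch of the environment change from point 1, assuming a POSIX shell. The helper name append_ld_path is made up for illustration; the roctracer path is the one quoted above and may differ on your install. Note that a shell variable is expanded as $LD_LIBRARY_PATH or ${LD_LIBRARY_PATH}; the $(...) form would try to run a command instead.

```shell
# Hypothetical helper: append a directory to LD_LIBRARY_PATH without
# leaving an empty entry (the loader treats an empty entry as the
# current directory) and without adding the same directory twice.
append_ld_path() {
    case ":${LD_LIBRARY_PATH:-}:" in
        *":$1:"*) ;;  # already present, nothing to do
        *) LD_LIBRARY_PATH="${LD_LIBRARY_PATH:+$LD_LIBRARY_PATH:}$1" ;;
    esac
    export LD_LIBRARY_PATH
}

# Path taken from the comment above; adjust it to your ROCm install.
append_ld_path /opt/rocm/roctracer/lib
```

Put the call in the shell start-up file, or run it in the session from which rocprof is launched, so the profiler process inherits the variable.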

eshcherb commented 4 years ago

ROCr runtime failed to dlopen '/g/g91/camier1/usr/local/rocprofiler/roctracer/tool/libtracer_tool.so' library.

If, as the message above says, "ldd looks good" and all dependencies were resolved, then it might be a symbol-linking problem.

@camierjs: Could you try a simple test, for example from '/opt/rocm/hip/samples/2_Cookbook/0_MatrixTranspose', and use the LD_DEBUG environment variable to check whether some symbols were not found by the dynamic linker: $ rocprof --cmd-qts off LD_DEBUG=all ./MatrixTranspose

or, to debug symbols only: $ rocprof --cmd-qts off LD_DEBUG=symbols ./MatrixTranspose

A link for LD_DEBUG description: http://www.bnikolic.co.uk/blog/linux-ld-debug.html
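LD_DEBUG output is long, so it can help to keep only the lines that report a failure. The following is a small sketch under that assumption; the helper name ld_debug_errors and the two patterns it matches are illustrative choices, not part of rocprof or glibc.

```shell
# Hypothetical helper: read dynamic-linker debug output on stdin and
# keep only the lines that look like load or lookup failures.
ld_debug_errors() {
    grep -E 'error:|cannot open shared object file' || true
}

# Example use (commented out). LD_DEBUG writes to stderr, hence 2>&1.
# The rocprof invocation is the one suggested above.
# rocprof --cmd-qts off LD_DEBUG=symbols ./MatrixTranspose 2>&1 | ld_debug_errors
```

Each surviving line names the library that failed to load and the reason the linker gave, which is usually enough to tell a missing file from a symbol or version mismatch.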

camierjs commented 4 years ago

/opt/rocm/hsa-amd-aqlprofile/lib> ls
lrwxrwxrwx 1 root root     28 Mar 25  2019 libhsa-amd-aqlprofile64.so -> libhsa-amd-aqlprofile64.so.1
lrwxrwxrwx 1 root root     32 Mar 25  2019 libhsa-amd-aqlprofile64.so.1 -> libhsa-amd-aqlprofile64.so.1.0.0
-rwxr-xr-x 1 root root 220064 May  6  2018 libhsa-amd-aqlprofile64.so.1.0.0

We do see the error with rocprof:

    102080: /lib64/libstdc++.so.6: error: version lookup error: version `GLIBCXX_3.4.20' not found (required by /g/g91/camier1/usr/local/rocprofiler/roctracer/tool/libtracer_tool.so) (fatal)
    102080: file=/g/g91/camier1/usr/local/rocprofiler/roctracer/tool/libtracer_tool.so [0];  destroying link map
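The lookup error above says that libtracer_tool.so needs the GLIBCXX_3.4.20 version tag and that /lib64/libstdc++.so.6 does not provide it. One way to confirm this is to list the highest GLIBCXX tag a given libstdc++ exports. This is a sketch only: the helper name max_glibcxx is made up, and it assumes GNU sort (for -V) and the strings tool from binutils.

```shell
# Hypothetical helper: read text on stdin and print the highest
# GLIBCXX_x.y.z version tag found in it. sort -V orders version
# numbers numerically, so 3.4.19 sorts after 3.4.9.
max_glibcxx() {
    grep -oE 'GLIBCXX_[0-9]+(\.[0-9]+)*' | sort -uV | tail -n 1
}

# Example use (commented out; needs strings from binutils):
# strings /lib64/libstdc++.so.6 | max_glibcxx
```

If the tag printed for /lib64/libstdc++.so.6 is lower than GLIBCXX_3.4.20, that library is too old for this build of libtracer_tool.so. The usual ways out are to make the loader find a newer libstdc++ (for example the one that ships with the compiler used for the build) or to rebuild against the system toolchain.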

eshcherb commented 4 years ago

Which OS do you have and which compiler do you use?

camierjs commented 4 years ago

Linux corona141 3.10.0-1062.7.1.1chaos.ch6.x86_64, and the module list is:

Currently Loaded Modules:
  1) texlive/2016   2) StdEnv   3) opt   4) gcc/8.1.0   5) rocm/2.10   6) mvapich2/2.3

eshcherb commented 4 years ago

Could you try to enable devtoolset-7 according to the link below? https://www.softwarecollections.org/en/scls/rhscl/devtoolset-7/

According to ROCm GitHub https://github.com/RadeonOpenCompute/ROCm#supported-operating-systems

• CentOS v7.7 (Using devtoolset-7 runtime support)
• RHEL v7.7 (Using devtoolset-7 runtime support)

camierjs commented 4 years ago

Thank you, I've forwarded your request to the admins, I'll let you know their answer.

eshcherb commented 4 years ago

@camierjs: Thank you! I would also appreciate it if you could send me the output of the following commands in your current environment:

$ gcc --version
$ ldd --version

camierjs commented 4 years ago

$ gcc --version:

Reading specs from /usr/tce/packages/gcc/gcc-8.1.0/lib64/gcc/x86_64-pc-linux-gnu/8.1.0/specs
COLLECT_GCC=gcc
COLLECT_LTO_WRAPPER=/usr/tce/packages/gcc/gcc-8.1.0/libexec/gcc/x86_64-pc-linux-gnu/8.1.0/lto-wrapper
Target: x86_64-pc-linux-gnu
Configured with: /builddir/build/BUILD/gccspack/spack/var/spack/stage/gcc-8.1.0-yf4dn5leietjepntgrnkv4syhgmb2nmm/gcc-8.1.0/configure --prefix=/usr/tce/packages/gcc/gcc-8.1.0 --libdir=/usr/tce/packages/gcc/gcc-8.1.0/lib64 --disable-multilib --enable-languages=c,obj-c++,c++,fortran,objc,go,lto --with-mpfr=/ --with-gmp=/usr --enable-lto --with-quad --with-sysroot=/ --with-stage1-ldflags='-Wl,-rpath,/usr/tce/packages/gcc/gcc-8.1.0/lib -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.1.0/lib64 -Wl,-rpath,/lib -Wl,-rpath,/builddir/build/BUILD/gccspack/spack/opt/spack/linux-rhel7-x86_64/gcc-4.8.5/isl-0.18-a6bgwfhlamdrd6tbb7l6oonhnxruvlfh/lib -Wl,-rpath,/usr/lib -Wl,-rpath,/usr/tce/packages/binutils/binutils-2.30/lib -Wl,-rpath,/lib -Wl,-rpath,/lib64 -Wl,-rpath,/usr/lib64 -Wl,-rpath,/usr/tce/packages/binutils/binutils-2.30/lib64 -Wl,-rpath,/lib64 -static-libstdc++ -static-libgcc' --with-boot-ldflags='-Wl,-rpath,/usr/tce/packages/gcc/gcc-8.1.0/lib -Wl,-rpath,/usr/tce/packages/gcc/gcc-8.1.0/lib64 -Wl,-rpath,/lib -Wl,-rpath,/builddir/build/BUILD/gccspack/spack/opt/spack/linux-rhel7-x86_64/gcc-4.8.5/isl-0.18-a6bgwfhlamdrd6tbb7l6oonhnxruvlfh/lib -Wl,-rpath,/usr/lib -Wl,-rpath,/usr/tce/packages/binutils/binutils-2.30/lib -Wl,-rpath,/lib -Wl,-rpath,/lib64 -Wl,-rpath,/usr/lib64 -Wl,-rpath,/usr/tce/packages/binutils/binutils-2.30/lib64 -Wl,-rpath,/lib64 -static-libstdc++ -static-libgcc' --with-gnu-ld --with-gnu-as --with-ld=/usr/tce/packages/gcc/gcc-8.1.0/bin/ld --with-as=/usr/tce/packages/gcc/gcc-8.1.0/bin/as --with-mpc=/ --with-isl=/builddir/build/BUILD/gccspack/spack/opt/spack/linux-rhel7-x86_64/gcc-4.8.5/isl-0.18-a6bgwfhlamdrd6tbb7l6oonhnxruvlfh
Thread model: posix
gcc version 8.1.0 (GCC) 

$ ldd --version:

ldd (GNU libc) 2.17
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.

eshcherb commented 4 years ago

Could you try just enabling 'devtoolset-7'? It might already be installed, and it seems you don't need privileged access just to enable it: $ scl enable devtoolset-7 bash

camierjs commented 4 years ago

Unable to open /etc/scl/conf/devtoolset-7! There is no devtoolset-7 in the directory.

eshcherb commented 4 years ago

I see, so please just wait for your admins to respond. And thank you very much for trying!

eshcherb commented 4 years ago

You might also consider asking the admins to install the 'roctracer-dev' Linux package.

camierjs commented 4 years ago

Ok, I'll do that: thank you for looking into this!

eshcherb commented 4 years ago

No problem, thank you!

camierjs commented 4 years ago

Hi, I waited for the software stack to be updated; it is now at rocm/3.0. Unfortunately, I'm getting the same issue:

/opt/rocm/profiler/bin/rcprof -A ./MatrixTranspose
Radeon Compute Profiler V5.6.7262 is enabled
Device name Vega 20
aqlprofile API table load failed: HSA_STATUS_ERROR: A generic error has occurred.

I'll try rebuilding it from source as we did with 2.10.

eshcherb commented 4 years ago

Hi, 'roctracer' has to be installed manually. You can ask your admins to install the 'roctracer-dev' Linux package, or compile it from GitHub. To compile it you need 'devtoolset-7' on SLES/RHEL platforms. You also need the Python modules CppHeaderParser and argparse. To install them: sudo pip install CppHeaderParser argparse
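Before starting a roctracer build it can be useful to check that the two Python modules are importable. The sketch below assumes the build uses the interpreter named by PYTHON (python3 by default, which is an assumption; set PYTHON to whatever interpreter your build actually invokes). The helper name have_pymod is hypothetical.

```shell
# Hypothetical helper: succeed if the given Python module can be imported.
# PYTHON defaults to python3; override it to match the build's interpreter.
have_pymod() {
    "${PYTHON:-python3}" -c "import $1" >/dev/null 2>&1
}

# Report any module the roctracer build needs that is missing.
for mod in CppHeaderParser argparse; do
    have_pymod "$mod" || echo "missing Python module: $mod"
done
```

On a system without sudo access, pip install --user CppHeaderParser installs the module into the home directory instead of the system location. argparse has been part of the Python standard library since 2.7 and 3.2, so it normally needs no separate install.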

Installing 'roctracer-dev' by default is planned for a future ROCm release.

camierjs commented 4 years ago

Thank you for all the answers, it's now more on our installation side to get up to date.

crr0004 commented 4 years ago

In case anyone else is bumping against the error aqlprofile API table load failed: installing the hsa-amd-aqlprofile package seemed to solve it. This issue pops up when you search for aqlprofile API table load failed. I only realised you can install that package because calling https://github.com/ROCm-Developer-Tools/rocprofiler/blob/207458f251f223803dbbce64821dde15107a1781/test/util/hsa_rsrc_factory.cpp#L131 in debug shows the name of the library that isn't found.

Mine was failing because ctrl was falling back to trying to load HSA_EXTENSION_AMD_AQLPROFILE and then failing on that. That is where the HSA_STATUS_ERROR was coming from; it is raised at https://github.com/ROCm-Developer-Tools/rocprofiler/blob/207458f251f223803dbbce64821dde15107a1781/test/util/hsa_rsrc_factory.cpp#L133
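A quick way to check for the missing library before running the profiler is to look for the file in the directories where it is expected. This is a sketch under assumptions: the helper name find_lib_dir is made up, and the directory in the example is the one reported earlier in this thread, which may differ on other systems.

```shell
# Hypothetical helper: print the first directory from the arguments that
# contains the named library file; return non-zero if none does.
find_lib_dir() {
    lib=$1
    shift
    for dir in "$@"; do
        if [ -e "$dir/$lib" ]; then
            echo "$dir"
            return 0
        fi
    done
    return 1
}

# Example use (commented out); the path comes from an earlier comment.
# find_lib_dir libhsa-amd-aqlprofile64.so /opt/rocm/hsa-amd-aqlprofile/lib \
#     || echo "libhsa-amd-aqlprofile64.so not found"
```

If the file is missing, installing the hsa-amd-aqlprofile package as described above is the fix reported here. If it is present, the directory still has to be visible to the dynamic loader, for example through LD_LIBRARY_PATH.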