ROCm / rocminfo

ROCm Application for Reporting System Info

Rocminfo Fails #38

Open BemusedCat opened 4 years ago

BemusedCat commented 4 years ago

I have tried everything to run rocm-tensorflow, but nothing works. This is my rocminfo output:

ROCk module is loaded
Unable to open /dev/kfd read-write: Bad address
abhigyan is member of render group
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.

fxkamd commented 4 years ago

Sounds like an installation or configuration problem. Can you post dmesg output and "ls -l /dev/kfd" for a start?

skeelyamd commented 4 years ago

The problem is permission to access the device driver interface file, as indicated by "Unable to open /dev/kfd". The permissions and group membership needed depend on your specific environment. Running 'ls -l /dev/kfd' will show the owning group and its permissions. Ensure that you are a member of that group.
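The check described above can be sketched as a small shell function. This is a minimal sketch, assuming GNU coreutils `stat` and `id` are available; the function name `check_dev_access` is hypothetical and not part of ROCm. A passing result only confirms the file permissions; it does not guarantee that the driver will accept the open call.

```shell
# check_dev_access [PATH]: hypothetical helper, not part of ROCm.
# Prints the owning group and mode of PATH (default /dev/kfd), then reports
# whether the current user is in that group and has read-write permission.
check_dev_access() {
    dev="${1:-/dev/kfd}"
    if [ ! -e "$dev" ]; then
        echo "$dev does not exist"
        return 2
    fi
    # GNU stat: %G = owning group name, %a = octal permission bits
    group="$(stat -c '%G' "$dev")"
    mode="$(stat -c '%a' "$dev")"
    echo "$dev: group=$group mode=$mode"
    # id -nG prints the names of all groups the current user belongs to
    if id -nG | tr ' ' '\n' | grep -qxF "$group"; then
        echo "current user is a member of $group"
    else
        echo "current user is NOT a member of $group"
    fi
    # -r and -w test permission for the current user (owner, group and other bits)
    if [ -r "$dev" ] && [ -w "$dev" ]; then
        echo "read-write access: yes"
        return 0
    fi
    echo "read-write access: no"
    return 1
}
```

Calling `check_dev_access` with no argument inspects /dev/kfd; a different device path can be passed as the first argument.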

BemusedCat commented 4 years ago

This is the output of 'ls -l /dev/kfd':

crw-rw-rw- 1 root render 237, 0 Aug 25 00:11 /dev/kfd

When I run TensorFlow I get this error:

2020-08-25 02:39:00.789159: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libhip_hcc.so'; dlerror: libhip_hcc.so: cannot open shared object file: No such file or directory
2020-08-25 02:39:00.789209: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Failed precondition: Could not load dynamic library 'libhip_hcc.so'; dlerror: libhip_hcc.so: cannot open shared object file: No such file or directory
Aborted (core dumped)

skeelyamd commented 4 years ago

That looks like HIP is not installed. Did you install the complete set of rocm packages? What is in /opt/rocm/lib and /opt/rocm/bin?
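One quick way to answer that question is to search the ROCm library directory for the file the loader asked for. The following is a minimal sketch; the function name `find_rocm_lib` is hypothetical and not part of ROCm, and the default root /opt/rocm is simply the path mentioned in this thread.

```shell
# find_rocm_lib NAME [ROOT]: hypothetical helper, not part of ROCm.
# Lists files directly under ROOT/lib (default /opt/rocm/lib) whose names start with NAME.
find_rocm_lib() {
    name="$1"
    root="${2:-/opt/rocm}"
    if [ ! -d "$root/lib" ]; then
        echo "no such directory: $root/lib"
        return 2
    fi
    # -maxdepth 1 keeps the search out of subdirectories such as cmake/
    matches="$(find "$root/lib" -maxdepth 1 -name "${name}*" 2>/dev/null)"
    if [ -z "$matches" ]; then
        echo "$name not found in $root/lib"
        return 1
    fi
    echo "$matches"
}
```

For example, `find_rocm_lib libhip_hcc.so` looks for the library named in the TensorFlow error; a return status of 1 means no file with that prefix exists in the directory.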

BemusedCat commented 4 years ago

Yes, I installed them. I even did a fresh install of Ubuntu and repeated every step.

These are the files in /opt/rocm/lib:

cmake libhsakmt.so.1 librocalution.so librocsolver.so oclc_isa_version_1012.amdgcn.bc hc.amdgcn.bc libhsakmt.so.1.0.30700 librocalution.so.0 librocsolver.so.0 oclc_isa_version_700.amdgcn.bc hip.amdgcn.bc libhsa-runtime64.so librocalution.so.0.1.30700 librocsolver.so.0.1.30700 oclc_isa_version_701.amdgcn.bc libamd_comgr.so libhsa-runtime64.so.1 librocblas.so librocsparse.so oclc_isa_version_702.amdgcn.bc libamd_comgr.so.1 libhsa-runtime64.so.1.2.30700 librocblas.so.0 librocsparse.so.0 oclc_isa_version_801.amdgcn.bc libamd_comgr.so.1.6.30700 libmiopengemm.so librocblas.so.0.1.30700 librocsparse.so.0.1.30700 oclc_isa_version_802.amdgcn.bc libamdhip64.so libmiopengemm.so.1 librocfft-device.so libroctracer64.so oclc_isa_version_803.amdgcn.bc libamdhip64.so.3 libmiopengemm.so.1.0.30700 librocfft-device.so.0 libroctracer64.so.1 oclc_isa_version_810.amdgcn.bc libamdhip64.so.3.7.30700 libMIOpen.so librocfft-device.so.0.1.30700 libroctracer64.so.1.0.30700 oclc_isa_version_900.amdgcn.bc libCXLActivityLogger.so libMIOpen.so.1 librocfft.so libroctx64.so oclc_isa_version_902.amdgcn.bc libhipblas.so libMIOpen.so.1.0.30700 librocfft.so.0 libroctx64.so.1 oclc_isa_version_904.amdgcn.bc libhipblas.so.0 libOpenCL.so librocfft.so.0.1.30700 libroctx64.so.1.0.30700 oclc_isa_version_906.amdgcn.bc libhipblas.so.0.1.30700 libOpenCL.so.1 librocm-dbgapi.so ockl.amdgcn.bc oclc_isa_version_908.amdgcn.bc libhiprand.so libOpenCL.so.1.2 librocm-dbgapi.so.0 oclc_correctly_rounded_sqrt_off.amdgcn.bc oclc_unsafe_math_off.amdgcn.bc libhiprand.so.1 library librocm-dbgapi.so.0.30.0 oclc_correctly_rounded_sqrt_on.amdgcn.bc oclc_unsafe_math_on.amdgcn.bc libhiprand.so.1.1.30700 librccl.so librocm_smi64.so oclc_daz_opt_off.amdgcn.bc oclc_wavefrontsize64_off.amdgcn.bc libhipsparse.so librccl.so.1 librocm_smi64.so.2 oclc_daz_opt_on.amdgcn.bc oclc_wavefrontsize64_on.amdgcn.bc libhipsparse.so.0 librccl.so.1.0.30700 librocprofiler64.so oclc_finite_only_off.amdgcn.bc ocml.amdgcn.bc libhipsparse.so.0.1.30700 
librocalution_hip.so librocrand.so oclc_finite_only_on.amdgcn.bc opencl.amdgcn.bc libhsa-amd-aqlprofile64.so librocalution_hip.so.0 librocrand.so.1 oclc_isa_version_1010.amdgcn.bc libhsakmt.so librocalution_hip.so.0.1.30700 librocrand.so.1.1.30700 oclc_isa_version_1011.amdgcn.bc

/opt/rocm/bin

ca findcode.sh hipcc_cmake_linker_helper hipconvertinplace.sh hipexamine.sh lpl rocminfo rocprof clang-ocl finduncodep.sh hipconfig hipdemangleatp hipify-cmakefile rocgdb rocm-smi extractkernel hipcc hipconvertinplace-perl.sh hipexamine-perl.sh hipify-perl rocm_agent_enumerator rocm_smi.py

zyzzyxdonta commented 3 years ago

Hello, is there any news on this issue?

I ran into the same problem using the rocm/dev-ubuntu-18.04 docker image inside a CI machine. It took me quite a bit of digging to find this because rocminfo is called by rocm_agent_enumerator which, instead of reporting this error when rocminfo fails, just gets stuck.
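In a CI job, a hang like this can at least be turned into a visible failure by running the command under a time limit. This is a minimal sketch using the coreutils `timeout` command; the wrapper name `run_with_timeout` is hypothetical and not part of ROCm. Whether the default SIGTERM is enough to end a stuck rocminfo process is not established in this thread.

```shell
# run_with_timeout SECONDS COMMAND [ARGS...]: hypothetical helper, not part of ROCm.
# Runs COMMAND under coreutils `timeout` and reports a hang or failure on stderr.
run_with_timeout() {
    limit="$1"
    shift
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        # coreutils timeout exits with status 124 when the time limit is reached
        echo "timed out after ${limit}s: $*" >&2
    elif [ "$status" -ne 0 ]; then
        echo "failed with exit status $status: $*" >&2
    fi
    return "$status"
}
```

For example, `run_with_timeout 30 /opt/rocm/bin/rocminfo` returns 124 if rocminfo has not finished after 30 seconds, so the calling script can stop instead of waiting indefinitely.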

I noticed another thing, though I'm not sure where to report this: rocm_agent_enumerator uses lspci which is not installed inside the docker image.

$ ls -l /dev/kfd
crw-rw-rw- 1 root video 240, 0 Oct  8 13:37 /dev/kfd
$ /opt/rocm-3.8.0/bin/rocminfo &
$ ROCINFO_PID="$!"
$ sleep 30
ROCk module is loaded
Unable to open /dev/kfd read-write: Resource temporarily unavailable
Failed to get user name to check for video group membership
hsa api call failure at: /src/rocminfo/rocminfo.cc:1142
Call returned HSA_STATUS_ERROR_OUT_OF_RESOURCES: The runtime failed to allocate the necessary resources. This error may also occur when the core runtime library needs to spawn threads or create internal OS-specific events.
$ ps -eF --forest
UID          PID    PPID  C    SZ   RSS PSR STIME TTY          TIME CMD
root           1       0  0  4635  3276  14 13:37 ?        00:00:00 /bin/bash
root           7       1  0  4660  2560  29 13:37 ?        00:00:00 /bin/bash
root        1137       7  0     0     0  27 13:38 ?        00:00:00  \_ [rocminfo]
root        1139    1137  0     0     0  24 13:38 ?        00:00:00  |   \_ [sh] <defunct>
root        1142       7  0  8602  3000   7 13:39 ?        00:00:00  \_ ps -eF --forest

ppanchad-amd commented 3 weeks ago

@BemusedCat @zyzzyxdonta Apologies for the lack of response. Can you please test with the latest ROCm 6.2? If the issue is resolved, please close the ticket. Thanks!