ROCm / ROCR-Runtime

ROCm Platform Runtime: ROCr, an HPC-market-enhanced HSA-based runtime
https://rocm.docs.amd.com/projects/ROCR-Runtime/en/latest/
Other

`sample/vector_copy` fails: 'Create the program failed.' #21

Closed: patricklauer closed this 6 years ago

patricklauer commented 7 years ago

I'm trying to get HSA/ROC working on an A10-7700K. Building and installing ROCK and ROCT works. With a stock 4.10 kernel, initializing the HSA runtime fails. With the patched ROCK kernel, things fail a bit later:

 # ./vector_copy
Initializing the hsa runtime succeeded.
Checking finalizer 1.0 extension support succeeded.
Generating function table for finalizer succeeded.
Getting a gpu agent succeeded.
Querying the agent name succeeded.
The agent name is gfx700.
Querying the agent maximum queue size succeeded.
The maximum queue size is 131072.
Creating the queue succeeded.
"Obtaining machine model" succeeded.
"Getting agent profile" succeeded.
Create the program failed.

strace says:

write(1, "Creating the queue succeeded.\n", 30Creating the queue succeeded.
) = 30
write(1, "\"Obtaining machine model\" succee"..., 37"Obtaining machine model" succeeded.
) = 37
write(1, "\"Getting agent profile\" succeede"..., 35"Getting agent profile" succeeded.
) = 35
open("vector_copy_full.brig", O_RDONLY) = 5
fstat(5, {st_mode=S_IFREG|0644, st_size=3456, ...}) = 0
fstat(5, {st_mode=S_IFREG|0644, st_size=3456, ...}) = 0
lseek(5, 0, SEEK_SET)                   = 0
read(5, "HSA BRIG\1\0\0\0\0\0\0\0\200\r\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 3456) = 3456
lseek(5, 3456, SEEK_SET)                = 3456
close(5)                                = 0
write(1, "Create the program failed.\n", 27Create the program failed.
) = 27
exit_group(1)                           = ?
+++ exited with 1 +++
patricklauer commented 7 years ago

I tried both amdgpu and radeon as the kernel driver; both fail the same way. ROCR-Runtime and ROCT-Thunk-Interface are both version 1.4.

gstoner commented 7 years ago

Kaveri is not officially supported by the ROCm Platform; ROCm's primary focus is server-based computing. We recommend AMD Ryzen CPUs, Haswell or newer Intel Core i3, i5 and i7, and Intel Xeon E3 and Xeon E5 CPUs. On the GPU side, we recommend our GFX8 GPUs based on Fiji and Polaris.

Note that Kaveri was used by the AMD HSA development team only as a development vehicle to bring up part of the base stack before HSA 1.0-enabled devices became available. Kaveri has a number of architectural limitations; a big one is how the GPU and CPU are interconnected if you try to use the coherent interconnect.

On Linux kernel 4.9 support: ROCm is currently moving to Linux kernel 4.9, and support should be part of the next release.

Please do not mix the old radeon driver and the ROCm driver; they are not compatible. ROCm needs the new base Linux stack from the amdgpu driver.

Thanks

(Sent by email in reply to patricklauer's comment of Feb 26, 2017, 10:04 AM.)

rwvo commented 7 years ago

I'm running into the same issue on an Intel Core i7-6700K / Radeon R9 Nano (Fiji) with Ubuntu 16.04. I got a working ROCm stack using the AMD ROCm apt repositories, but I want to build from source.

Any suggestions?

Further info:

# uname -a
Linux nano 4.9.0-kfd+ #1 SMP Tue Jun 20 10:33:36 CDT 2017 x86_64 x86_64 x86_64 GNU/Linux
# lsmod | grep amd
amdkfd                225280  1
amd_iommu_v2           20480  1 amdkfd
amdgpu               2437120  48
i2c_algo_bit           16384  1 amdgpu
ttm                   102400  1 amdgpu
drm_kms_helper        155648  1 amdgpu
drm                   360448  6 amdgpu,ttm,drm_kms_helper
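Since gstoner warns against mixing the radeon driver with the ROCm driver, and the lsmod output above shows amdgpu and amdkfd loaded, a quick look at the module list can rule out a driver mix-up before digging further. The following is a minimal Python sketch, not part of ROCm; the helpers `loaded_modules` and `check_driver_stack` are hypothetical. It reads /proc/modules (the same data lsmod prints) and assumes amdkfd is a separate module, as in the 4.9.0-kfd+ kernel shown here. Newer kernels may build KFD into amdgpu, and drivers compiled into the kernel do not appear in /proc/modules at all.

```python
"""Hypothetical helper: check which GPU kernel modules appear in /proc/modules.

/proc/modules has one module per line; the first whitespace-separated
field is the module name (the same name `lsmod` prints).
"""
from __future__ import annotations

from pathlib import Path

# Assumption for this sketch: the ROCm stack needs amdgpu plus a separate
# amdkfd module, and the legacy radeon driver should not be loaded.
ROCM_MODULES = ("amdgpu", "amdkfd")
LEGACY_MODULES = ("radeon",)


def loaded_modules(text: str) -> set[str]:
    """Return the set of module names listed in /proc/modules-style text."""
    return {line.split()[0] for line in text.splitlines() if line.strip()}


def check_driver_stack(text: str) -> list[str]:
    """Return human-readable problems found in the module list (empty if none)."""
    mods = loaded_modules(text)
    problems = [f"missing {m}" for m in ROCM_MODULES if m not in mods]
    problems += [f"legacy driver {m} is loaded" for m in LEGACY_MODULES if m in mods]
    return problems


if __name__ == "__main__":
    proc = Path("/proc/modules")
    if proc.exists():
        issues = check_driver_stack(proc.read_text())
        print("driver stack looks OK" if not issues else "; ".join(issues))
    else:
        print("/proc/modules not found (not Linux?)")
```

On a machine like rwvo's, this would report no problems; on a setup where radeon is bound instead, it would flag both the missing ROCm modules and the legacy driver.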
rwvo commented 7 years ago

Found the issue: there's a call to core::ExtensionEntryPoints::LoadFinalizer with library_name set to libhsa-ext-finalize64.so.1, and there's no such library on my machine. On my other system (ROCm installed from the AMD *.deb repositories), the library belongs to hsa-ext-rocr-dev. What is the corresponding source package? Apparently I would have to build and install that before running the vector_copy sample.
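The failure rwvo describes comes down to the runtime being unable to load libhsa-ext-finalize64.so.1 by name. Below is a minimal Python sketch for checking that outside the runtime. It assumes the runtime resolves the library through the platform's dynamic loader (dlopen on Linux); ctypes.CDLL also calls dlopen, so both should follow the same search rules (LD_LIBRARY_PATH, the ldconfig cache, default directories). The `try_load` helper is hypothetical, not a ROCm API.

```python
"""Sketch: can the dynamic loader resolve the HSA finalizer extension library?

Assumption: the runtime loads the finalizer by name through the platform's
dynamic loader (dlopen on Linux). ctypes.CDLL also calls dlopen, so it should
follow the same search rules (LD_LIBRARY_PATH, ldconfig cache, default dirs).
"""
from __future__ import annotations

import ctypes

FINALIZER_LIB = "libhsa-ext-finalize64.so.1"  # library name quoted in this thread


def try_load(name: str) -> tuple[bool, str]:
    """Attempt to dlopen `name`; return (loaded, detail message)."""
    try:
        ctypes.CDLL(name)
    except OSError as err:  # raised when dlopen fails; text comes from dlerror()
        return False, str(err)
    return True, f"{name} loaded"


if __name__ == "__main__":
    ok, detail = try_load(FINALIZER_LIB)
    print(detail)
    if not ok:
        print("Finalizer not found; per this thread it ships in the "
              "hsa-ext-rocr-dev package.")
```

Run it with the same user and environment as vector_copy. A failure message naming libhsa-ext-finalize64.so.1 itself points to the problem rwvo hit; a message naming a different library would suggest a dependency of the finalizer is missing instead.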

rwvo commented 7 years ago

It turns out the finalizer is a closed-source component. Installing it with "sudo apt install hsa-ext-rocr-dev" made vector_copy succeed.

insujang commented 7 years ago

I had the same issue as @rwvo and solved it the same way. Thank you for your analysis!

Additional comment: you need to add the ROCm apt repository before you can install hsa-ext-rocr-dev. This can be done by following these instructions.

PhilipDeegan commented 6 years ago

@gstoner

Is there a plan to release the source for libhsa-ext-finalize64.so?

ref: https://github.com/RadeonOpenCompute/ROCR-Runtime/issues/33

gstoner commented 6 years ago

@Dekken libhsa-ext-finalize64.so was a proprietary compiler that we could not release as source. We have replaced it with a new native open-source LLVM compiler.