Closed by 0x0f0f0f 3 years ago
And on my system I get:
```julia
julia> ROCArray(rand(4,4)) * ROCArray(rand(4,4))
4×4 ROCMatrix{Float64}:
 8.0e-323  8.0e-323  8.0e-323  8.0e-323
 8.0e-323  8.0e-323  8.0e-323  8.0e-323
 8.0e-323  8.0e-323  8.0e-323  8.0e-323
 8.0e-323  8.0e-323  8.0e-323  8.0e-323
```
This seems similar to https://github.com/JuliaGPU/AMDGPU.jl/issues/92. It also appears to depend on which version of ROCm is installed: on the newest releases I get these memory access faults, while on older versions (e.g. 3.5) I simply get wrong answers.
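For anyone reproducing this, a quick sanity check is to compare the GPU result against the CPU BLAS result for the same inputs. This is only a sketch assuming a working AMDGPU.jl install; `ROCArray` and the `Array` copy-back are the standard conversion API:

```julia
using AMDGPU

A = rand(4, 4); B = rand(4, 4)
gpu = Array(ROCArray(A) * ROCArray(B))  # multiply on device, copy result to host
cpu = A * B                             # reference result from CPU BLAS
isapprox(gpu, cpu)                      # true on a correctly working setup
```

On an affected RX 500 card this comparison fails, with the GPU result full of denormals like `8.0e-323`, which looks like uninitialized or garbage memory rather than a numerical error.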
@0x0f0f0f, @jpsamaroo, could you let me know which version of ROCm you are using when testing on the RX 500 series? I am seeing conflicting suggestions on the TensorFlow and ROCm support forums and am uncertain what best practice is. I would like to debug this in more depth, but I feel I should be careful about which ROCm version I use for that debugging.
Also, has this ever worked on an RX 500 card? I am a bit out of the loop and do not have a good sense of whether this bug makes the library unusable, or whether it only affects an old GPU that was never really supported.
I would guess that it's a bug in AMDGPU.jl, not in ROCm. I ran CI on an RX 480 very recently, which is essentially just a lower-clocked RX 500. I doubt RX 400/500 support will disappear entirely from ROCm for another few years.
It seems to be working now.
Seems to be working on my hardware as well (on the current master).
Well, that's confusingly convenient :smile: I'm going to keep this open, because right now we're not ensuring correct ordering between raw kernel launches and HIP-derived launches (raw kernels go through HSA queues, while HIP-derived libraries use their own streams). So results are likely to be unreliable until that's fixed.
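Until that ordering is enforced internally, a manual workaround is to synchronize explicitly between the two kinds of launches. This is only a sketch: `@roc` and `workitemIdx` are the real AMDGPU.jl raw-kernel API, but treating `AMDGPU.synchronize()` as a sufficient barrier between queue work and stream work is an assumption here, not something this issue confirms:

```julia
using AMDGPU

# A trivial raw kernel: double every element in place.
function scale!(A)
    i = workitemIdx().x
    A[i] *= 2f0
    return
end

A = ROCArray(rand(Float32, 16))
@roc groupsize=16 scale!(A)    # raw kernel, dispatched on an HSA queue
AMDGPU.synchronize()           # assumed barrier before any HIP-stream work
B = reshape(A, 4, 4) * reshape(A, 4, 4)  # rocBLAS matmul, runs on a HIP stream
```

Without some barrier like this, the rocBLAS call may read `A` before the raw kernel has finished writing it.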
Running on glibc Void Linux with a Sapphire AMD RX 570 8 GB. Using my fork that pulls the HSA artifacts from Yggdrasil: https://github.com/0x0f0f0f/AMDGPU.jl/tree/artifacts