Open vchuravy opened 8 years ago
And the only function that really calls clGetCommandQueueInfo
https://github.com/clMathLibraries/clBLAS/blob/8b5f7a0e6f800b9597319d71f70fbf67e410b004/src/library/blas/generic/solution_seq_make.c#L374 and https://github.com/clMathLibraries/clBLAS/blob/8b5f7a0e6f800b9597319d71f70fbf67e410b004/src/library/blas/generic/solution_seq_make.c#L503-L504
in https://github.com/clMathLibraries/clBLAS/blob/8b5f7a0e6f800b9597319d71f70fbf67e410b004/src/library/blas/generic/common.c#L311 and https://github.com/clMathLibraries/clBLAS/blob/8b5f7a0e6f800b9597319d71f70fbf67e410b004/src/library/blas/generic/common.c#L296
so my idea is that it probably is not related to alignment, but to something funky going on with regard to queues (or that, because of misalignment, we are overwriting queue information).
@dfdx When building with debug information I got it to happen with Float64
and since we also get an invalid value error, it seems to have more to do with the interaction between OpenCL.jl and the CLBLAS runtime. Do you know if there are any asynchronous operations happening?
Currently all high-level functions in CLBLAS add a task to a queue and return the corresponding event. However, in tests we immediately call cl.wait(), so there's always only one task in the queue.
BTW, and coming back a little bit, I think sizeof() lets us "guess" alignment pretty accurately:
type T1 x::Int8 end; sizeof(T1) # => 1
type T2 x::Int8; y::Int8 end; sizeof(T2) # => 2
type T3 x::Int8; y::Int16 end; sizeof(T3) # => 4
So any type larger than 8 bytes is aligned to 16 bytes. Complex64 has size 8 bytes, i.e. it's maximally packed. I don't really know what aligned(8) means for struct { float; float }, but I'm pretty sure it has the same 8 bytes, which confirms your analysis.
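To make the sizeof-versus-alignment point concrete on the C side, here is a minimal sketch. The struct names are hypothetical stand-ins (not taken from cl_platform.h), and the `__attribute__((aligned(8)))` syntax assumes GCC or Clang. It mirrors T1, T2 and T3 from the Julia snippet and then shows that a pair of floats keeps its 8-byte size whether or not the alignment is forced to 8, so sizeof alone cannot distinguish the two layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* C analogues of T1, T2 and T3 from the Julia snippet. */
struct t1 { int8_t x; };
struct t2 { int8_t x; int8_t y; };
struct t3 { int8_t x; int16_t y; };   /* one padding byte after x */

/* Two floats with natural alignment and with a forced 8-byte alignment.
   Hypothetical stand-ins; GCC/Clang attribute syntax is assumed. */
struct pair_plain   { float re; float im; };
struct pair_aligned { float re; float im; } __attribute__((aligned(8)));

/* Alignment requirement of each pair layout, in bytes. */
size_t pair_plain_align(void)   { return _Alignof(struct pair_plain); }
size_t pair_aligned_align(void) { return _Alignof(struct pair_aligned); }

/* Sizes match the Julia results; both pair layouts occupy 8 bytes. */
void check_sizes(void)
{
    assert(sizeof(struct t1) == 1);
    assert(sizeof(struct t2) == 2);
    assert(sizeof(struct t3) == 4);
    assert(sizeof(struct pair_plain) == 8);
    assert(sizeof(struct pair_aligned) == 8);
}
```

On a typical x86-64 toolchain the two pair layouts differ only in alignment (4 versus 8 bytes), which is the property an FFI caller has to match in addition to the size.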
@dfdx Here is a Travis build that builds clBLAS from scratch with debug information enabled, and it clearly tells us where the segfault comes from: https://travis-ci.org/JuliaGPU/CLBLAS.jl/jobs/103249068#L1672
That's on the branch https://github.com/JuliaGPU/CLBLAS.jl/tree/vc/complex32
Maybe we are encountering something similar to https://github.com/clMathLibraries/clBLAS/issues/187
@dfdx I won't have time this week to look into this, feel free to continue the bug hunt or ping me in a week or so.
Got it. I think I'll have some spare time later this week to debug it.
For convenience of debugging, here's a short test calling clBLAS directly without all of CLBLAS.jl's wrappers:
import OpenCL: CLArray, CL_float
const cl = OpenCL
import CLBLAS: CL_float2
const libCLBLAS = "libclBLAS"
dev, ctx, q = cl.create_compute_context()
CLBLAS.setup()
num_queues = cl.CL_uint(1)
queues = Ptr{Void}[q.id]
num_events = cl.cl_uint(0)
events = Ptr{Void}[]
ret_event = Array(cl.CL_event,1)
N = 10
DX = Complex64(2.0)
X = cl.ones(Complex64, q, 10)
err = ccall((:clblasCscal, libCLBLAS), cl.CL_int,
(Csize_t, CL_float, Ptr{Void}, Csize_t, Cint,
cl.CL_uint, Ptr{Ptr{Void}}, cl.CL_uint, Ptr{Ptr{Void}},
Ptr{Ptr{Void}}),
N, DX, pointer(X), Csize_t(0), Cint(1),
cl.CL_uint(1), pointer(queues), cl.CL_uint(0),
pointer(events), pointer(ret_event))
if err != cl.CL_SUCCESS
throw(cl.CLError(err))
end
Which gives CL_INVALID_EVENT_WAIT_LIST (code=-57), while the same thing for Complex128 and clblasZscal produces no error.
If the event list is empty you need to pass in C_NULL and not a pointer to an empty list.
--edit:
Never mind: you are hitting https://github.com/clMathLibraries/clBLAS/blob/master/src/library/blas/xaxpy.c#L91-L94, where numEventsInWaitList should be equal to zero but it isn't...
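For reference, the OpenCL specification treats a wait list as invalid when the count and the pointer disagree: a NULL list with a non-zero count, or a non-NULL list with a zero count. The sketch below models only that rule. The names are hypothetical and this is not the clBLAS CHECK_EVENTS macro, which may test only part of the condition.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_SUCCESS                   0
#define SKETCH_INVALID_EVENT_WAIT_LIST (-57)  /* same code as reported above */

typedef void *sketch_event;  /* hypothetical stand-in for cl_event */

/* The count and the pointer must agree: both "empty" or both "non-empty". */
int sketch_check_wait_list(unsigned int num_events, const sketch_event *wait_list)
{
    if (num_events == 0 && wait_list != NULL)
        return SKETCH_INVALID_EVENT_WAIT_LIST;
    if (num_events > 0 && wait_list == NULL)
        return SKETCH_INVALID_EVENT_WAIT_LIST;
    return SKETCH_SUCCESS;
}

/* The combinations that came up in this thread. */
void sketch_wait_list_demo(void)
{
    sketch_event ev = NULL;
    /* Empty list passed correctly: zero count, NULL pointer. */
    assert(sketch_check_wait_list(0, NULL) == SKETCH_SUCCESS);
    /* Zero count with a pointer to an empty array is rejected. */
    assert(sketch_check_wait_list(0, &ev) == SKETCH_INVALID_EVENT_WAIT_LIST);
    /* A garbage count with a NULL pointer is rejected as well. */
    assert(sketch_check_wait_list(4103052928u, NULL) == SKETCH_INVALID_EVENT_WAIT_LIST);
}
```

Under this rule, passing C_NULL for an empty list is necessary, but it cannot help while the count itself arrives corrupted.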
so running the script under lldb
lldb julia -- clblas.jl
(lldb) breakpoint set -n clblasCscal
(lldb) r
Process 10392 launched: '/usr/bin/julia' (x86_64)
1 location added to breakpoint 1
error: libclBLAS.so 0x00059975: DW_TAG_member 's' refers to type 0x0005999a which extends beyond the bounds of 0x00059904
Process 10392 stopped
* thread #1: tid = 10392, 0x00007ffdc891cf31 libclBLAS.so`::clblasCscal(N=10, alpha=cl_float2 at 0x00007fffffffc510, X=0x0000000001ac3530, offx=0, incx=1, numCommandQueues=1, commandQueues=0x00007ffdf4920d10, numEventsInWaitList=4103052928, eventWaitList=0x00007ffdf49213f0, events=0x00007ffd40000000) + 52 at xscal.cc:156, name = 'julia', stop reason = breakpoint 1.1
frame #0: 0x00007ffdc891cf31 libclBLAS.so`::clblasCscal(N=10, alpha=cl_float2 at 0x00007fffffffc510, X=0x0000000001ac3530, offx=0, incx=1, numCommandQueues=1, commandQueues=0x00007ffdf4920d10, numEventsInWaitList=4103052928, eventWaitList=0x00007ffdf49213f0, events=0x00007ffd40000000) + 52 at xscal.cc:156
153 const cl_event *eventWaitList,
154 cl_event *events)
155 {
-> 156 CHECK_QUEUES(numCommandQueues, commandQueues);
157 CHECK_EVENTS(numEventsInWaitList, eventWaitList);
158 CHECK_VECTOR_X(TYPE_COMPLEX_FLOAT, N, X, offx, incx);
159
(lldb) po numCommandQueues
1
(lldb) po numEventsInWaitList
4103052928
The first thing I noticed is that you have an error in your ccall: the second argument type should be cl_float2 and not cl_float.
err = ccall((:clblasCscal, libCLBLAS), cl.CL_int,
(Csize_t, CL_float2, Ptr{Void}, Csize_t, Cint,
cl.CL_uint, Ptr{Ptr{Void}}, cl.CL_uint, Ptr{Ptr{Void}},
Ptr{Ptr{Void}}),
N, DX, pointer(X), Csize_t(0), Cint(1),
cl.CL_uint(1), pointer(queues), cl.CL_uint(0),
pointer(events), pointer(ret_event))
But even after that:
(lldb) p alpha
(cl_float2) $2 = {
s = {}
= (x = 0, y = 0)
= (s0 = 0, s1 = 0)
= (lo = 0, hi = 0)
v2 = (0, 0)
}
(lldb) po numEventsInWaitList
4103052928
I'm not really familiar with lldb (or gdb). Do I understand correctly that just after ccalling clblasCscal, the value of the input parameter numEventsInWaitList is equal to 4103052928?
Yes, and alpha (i.e. DX) is not passed through correctly. So I am back to thinking about misalignment or misrepresentation of cl_float2.
I just created a small C example to see if I could figure out the correct way of working with Complex64 and cl_float2. If I compile the library with gcc and the test program with clang, I get the same problem at the C level.
Right now the only way of solving this for users is to recommend that they compile clBLAS with clang/clang++.
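The C example itself is not shown in the thread. As a rough sketch of what such a probe could look like, the function below takes an 8-byte union by value in the same position as alpha in clblasCscal and reports what it received for the arguments after it. All names are hypothetical, the union is a simplified stand-in for cl_float2, and GCC/Clang attribute syntax is assumed.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical stand-in for cl_float2: 8 bytes, 8-byte aligned. */
typedef union {
    float s[2] __attribute__((aligned(8)));
    struct { float x, y; } xy;
} probe_float2;

/* What the callee saw for alpha and for the counts that follow it. */
typedef struct {
    float re, im;
    unsigned int num_queues;
    unsigned int num_events;
} probe_result;

/* Same argument shape as the start of clblasCscal: size, union by value, ... */
probe_result probe_scal(size_t n, probe_float2 alpha, void *x, size_t offx, int incx,
                        unsigned int num_queues, unsigned int num_events)
{
    probe_result r = { alpha.s[0], alpha.s[1], num_queues, num_events };
    (void)n; (void)x; (void)offx; (void)incx;
    return r;
}

/* Caller side: with a matching ABI every value arrives intact. */
void probe_demo(void)
{
    probe_float2 alpha;
    alpha.s[0] = 2.0f;
    alpha.s[1] = 0.0f;
    probe_result r = probe_scal(10, alpha, NULL, 0, 1, 1u, 0u);
    assert(r.re == 2.0f && r.im == 0.0f);
    assert(r.num_queues == 1u && r.num_events == 0u);
}
```

Built as a single program with one compiler, the asserts hold. To look for the mismatch described above, probe_scal would have to live in a library built with one compiler and probe_demo in a program built with another; whether the values then diverge probably depends on the exact union definition and compiler versions.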
Sounds unpleasant, but reasonable. I'll add a corresponding note to the README.
@dfdx Maybe just maybe we could use https://strpackjl.readthedocs.org/en/latest/
StrPack seems to be quite outdated (lots of deprecation warnings), but worth trying anyway. I will check in the next couple of days, thanks for the tip.
You need to use the current master. Then there should be no warnings.
Seems like data packed using StrPack doesn't play well with OpenCL.Buffer:
import OpenCL
import CLBLAS: CLArray
import StrPack: @struct, pack
@struct type MyCL_double2
f1::Float64
f2::Float64
end
hX = [MyCL_double2(rand(Float64), rand(Float64)) for i=1:10]
packedX = [pack(x).data for x in hX]
dev, ctx, q = OpenCL.create_compute_context()
CLBLAS.setup()
X = CLArray(q, packedX)
gives:
ERROR: type does not have a canonical binary representation
in Buffer at /home/dfdx/.julia/v0.4/OpenCL/src/buffer.jl:137
in Buffer at /home/dfdx/.julia/v0.4/OpenCL/src/buffer.jl:86
in Buffer at /home/dfdx/.julia/v0.4/OpenCL/src/buffer.jl:52
Obviously, packedX has type Array{Array{UInt8,1},1} and OpenCL doesn't know how to handle such a type, but I'm not sure we can fix it on our side.
What happens if you do
@struct immutable MyCL_double2
...
end
Actually, it's not about mutability: the packed data is Array{Array{UInt8}} anyway, and this is what OpenCL doesn't know how to handle.
However, I found out that using immutable (with or without the @struct annotation) fixes some errors, e.g. the following code works fine (note that I switched back to float2 instead of double2):
import OpenCL
const cl = OpenCL
import CLBLAS: CLArray
const libCLBLAS = "libclBLAS"
immutable MyCL_float2 # or @struct immutable MyCL_float2
f1::Float32
f2::Float32
end
hX = [MyCL_float2(rand(Float64), rand(Float64)) for i=1:10]
dev, ctx, q = OpenCL.create_compute_context()
CLBLAS.setup()
X = CLArray(q, hX)
# sample call
num_queues = cl.CL_uint(1)
queues = Ptr{Void}[q.id]
num_events = cl.cl_uint(0)
events = Ptr{Void}[]
ret_event = Array(cl.CL_event,1)
N = 10
DX = MyCL_float2(2.0, 0.0)
err = ccall((:clblasCscal, libCLBLAS), cl.CL_int,
(Csize_t, MyCL_float2, Ptr{Void}, Csize_t, Cint,
cl.CL_uint, Ptr{Ptr{Void}}, cl.CL_uint, Ptr{Ptr{Void}},
Ptr{Ptr{Void}}),
N, DX, pointer(X), Csize_t(0), Cint(1),
cl.CL_uint(1), pointer(queues), cl.CL_uint(0),
C_NULL, pointer(ret_event))
If we change immutable to type, the code throws ReadOnlyMemoryError(), which is quite intuitive.
But what is not intuitive is that changing definition to:
typealias MyCL_float2 Complex64
leads to a segmentation fault:
signal (11): Segmentation fault
unknown function (ip: 0x7efd3caeab24)
unknown function (ip: 0x7efd3cae34bd)
unknown function (ip: 0x7efd3cae388c)
unknown function (ip: 0x7efd3810fa51)
_ZN25clblasCscalFunctorGeneric7executeERN18clblasXscalFunctorI9cl_float2S1_E4ArgsE at /usr/lib/x86_64-linux-gnu/libclBLAS.so (unknown line)
clblasCscal at /usr/lib/x86_64-linux-gnu/libclBLAS.so (unknown line)
anonymous at no file:0
unknown function (ip: 0x7eff597289d3)
unknown function (ip: 0x7eff597295ec)
jl_load at /opt/julia/bin/../lib/julia/libjulia.so (unknown line)
include at ./boot.jl:261
jl_apply_generic at /opt/julia/bin/../lib/julia/libjulia.so (unknown line)
include_from_node1 at ./loading.jl:304
jl_apply_generic at /opt/julia/bin/../lib/julia/libjulia.so (unknown line)
process_options at ./client.jl:284
_start at ./client.jl:378
unknown function (ip: 0x7eff5618d8f9)
jl_apply_generic at /opt/julia/bin/../lib/julia/libjulia.so (unknown line)
unknown function (ip: 0x401c47)
unknown function (ip: 0x40182f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401875)
Segmentation fault (core dumped)
and the only difference between our MyCL_float2 and Complex64 is that the latter gets its base type (Float32) through type parameters:
immutable Complex{T<:Real} <: Number
re::T
im::T
end
Ok, it's not even about type parameters, but about inheritance. This works fine:
immutable MyCL_float2{T}
f1::T
f2::T
end
but the following modification makes it break with the same segfault again:
immutable MyCL_float2{T} <: Number
f1::T
f2::T
end
And, by the way, none of these options work with the gemm function.
Hey, I think I have found a solution to this problem for Linux. I have developed a solution that works on Windows 7 64 bit:
Initially, the zGEMM function on Windows 7 x64 was throwing segmentation faults. However, I found that the ccall function to clblasZgemm() would stop throwing segmentation faults if I changed the argument type of variables alpha and beta to Ref{cl_double2} or Ptr{cl_double2}. This is reflected in changing variables alpha and beta to 1-element cl_double2 arrays. Here is the function I used in my code while working on my project:
function clblasZgemm(o,tA,tB,M,N,K,alpha,A,offA,lda,B,offB,ldb,beta,C,offC,ldc,ncq,cq,ne,wle,e)
return ccall((:clblasZgemm, libclblas), cl_int, (clblasOrder,
clblasTranspose,
clblasTranspose,
Csize_t,
Csize_t,
Csize_t,
Ref{clblasDoubleComplex},#treating this as a pointer fixed a segmentation fault
cl_mem,
Csize_t,
Csize_t,
cl_mem,
Csize_t,
Csize_t,
Ref{clblasDoubleComplex},#treating this as a pointer fixed a segmentation fault
cl_mem,
Csize_t,
Csize_t,
cl_uint,
Ref{cl_command_queue},
cl_uint,
Ref{cl_event},
Ptr{cl_event}),
o,tA,tB,M,N,K,alpha,A,offA,lda,B,offB,ldb,beta,C,offC,ldc,ncq,cq,ne,wle,e)
end
My suspicion is that the libclBLAS library is treating the alpha and beta arguments as pointers like the event variable.
Can you guys check if this fixes the segmentation faults on Linux or OSX?
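One possible explanation, not confirmed anywhere in this thread: the Microsoft x64 calling convention passes any struct or union whose size is not 1, 2, 4 or 8 bytes as a pointer to a caller-made copy, so a 16-byte cl_double2 declared "by value" really does arrive as a pointer on Windows. The System V x86-64 ABI used on Linux passes two doubles in registers instead, which would make Ref the wrong choice there. A compiler-independent alternative is a thin C shim, built with the same compiler as the library, that accepts pointers and forwards by value. The sketch below uses hypothetical names, and lib_scale merely stands in for a by-value library routine.

```c
#include <assert.h>

/* Hypothetical stand-in for cl_double2 / clblasDoubleComplex. */
typedef struct { double re, im; } shim_double2;

/* Stand-in for a library routine that takes its scalar by value.
   It returns the complex product alpha * x so the result can be checked. */
shim_double2 lib_scale(shim_double2 alpha, shim_double2 x)
{
    shim_double2 r = { alpha.re * x.re - alpha.im * x.im,
                       alpha.re * x.im + alpha.im * x.re };
    return r;
}

/* Shim: pointers in, pointer out. A foreign caller only has to pass
   addresses and never has to match the struct-by-value convention. */
void lib_scale_by_ptr(const shim_double2 *alpha, const shim_double2 *x,
                      shim_double2 *out)
{
    *out = lib_scale(*alpha, *x);
}

/* Usage: (10 + 0i) * (1 + 2i) = 10 + 20i. */
void shim_demo(void)
{
    shim_double2 alpha = { 10.0, 0.0 };
    shim_double2 x = { 1.0, 2.0 };
    shim_double2 out;
    lib_scale_by_ptr(&alpha, &x, &out);
    assert(out.re == 10.0 && out.im == 20.0);
}
```

From Julia, such a shim would be called with Ref arguments on every platform, at the cost of shipping one extra compiled file.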
I won't have access to GPU-enabled laptop till the end of this week, but I think you can test your code on Travis. The easiest way to go should be to:
CLBLAS.jl
.clblasZgemm
there. Note that you may need to set up your own Travis account and add Mac OS X to .travis.yml, so if it's too much trouble for you, just leave it till the end of the week.
I've just checked it on Complex64. The call doesn't fail, but the result is incorrect.
@mikhail-j: could you please provide full code you used for testing? Just to be on the same page.
The following code I used was copied from test_zgemm.jl. I have aliased clblasDoubleComplex to Complex{Float64} in clblas_typedef.jl as CLBLAS.jl does too.
const libclblas = Libdl.find_library(["clBLAS","libclBLAS"],["C:\\AMD\\clBLA-2.10.0\\bin","C:\\AMD\\acml6.1.0.33\\ifort64\\lib\\"])
#const libopencl = Libdl.find_library(["libOpenCL","OpenCL"],["."])
const libopencl = Libdl.find_library(["OpenCL64","OpenCL"],["C:\\Program Files\\NVIDIA Corporation\\OpenCL\\","C:\\Program Files (x86)\\AMD APP SDK\\2.9-1\\bin\\x86_64"])
if (isempty(libclblas))
print("clBLAS can't be found!")
end
include("cl_typedef.jl")
include("clblas_typedef.jl")
include("cl_functions.jl")
include("clblas_functions.jl")
#ccall((:function, "library"), return_type, (argtype,), arg)
function clblasZgemm(o,tA,tB,M,N,K,alpha,A,offA,lda,B,offB,ldb,beta,C,offC,ldc,ncq,cq,ne,wle,e)
return ccall((:clblasZgemm, libclblas), cl_int, (clblasOrder,
clblasTranspose,
clblasTranspose,
Csize_t,
Csize_t,
Csize_t,
Ref{clblasDoubleComplex},#treating this as a pointer fixed a segmentation fault
#clblasDoubleComplex,
cl_mem,
Csize_t,
Csize_t,
cl_mem,
Csize_t,
Csize_t,
Ref{clblasDoubleComplex},#treating this as a pointer fixed a segmentation fault
#clblasDoubleComplex,
#Base.cconvert(Ptr{Void}, Ref{cl_mem}),
#Ref{cl_mem},
cl_mem,
Csize_t,
Csize_t,
cl_uint,
Ref{cl_command_queue},
cl_uint,
#Ref{cl_event},
#AMD's OpenCL driver (Windows 7 x64) throws invalid event if argument type is Ref{cl_event}
Ptr{cl_event},
Ptr{cl_event}),
#Ptr{cl_event_info},
#Ptr{cl_event_info}),
o,tA,tB,M,N,K,alpha,A,offA,lda,B,offB,ldb,beta,C,offC,ldc,ncq,cq,ne,wle,e)
end
function main()
local props = vec(convert(Array{cl_context_properties, 2}, [CL_CONTEXT_PLATFORM 0 0]))
devs = Array(cl_device_id, 1)
devs[1] = clGetFirstGPU()
local platform = clGetGPUPlatform(devs[1])
println(string("Selected GPU: ",clGetDeviceVendor(devs[1])), " ", clGetDeviceName(devs[1]))
props[2] = Base.cconvert(cl_context_properties,platform)
err = Array(cl_int, 1)
local ctx = clCreateContext(props,1,devs[1],C_NULL,C_NULL,err)
statusCheck(err[1])
err = Array(cl_int, 1)
local queue = Array(cl_command_queue, 1)
queue[1] = clCreateCommandQueue(ctx, devs[1], cl_command_queue_properties(0), err)
statusCheck(err[1])
################################ create arrays
A = convert(Array{clblasDoubleComplex,2}, [[11, 12, 13, 14, 15]';[21, 22, 23, 24, 25]';[31, 32, 33, 34, 35]';[41, 42, 43, 44, 45]'])
B = convert(Array{clblasDoubleComplex,2}, [[11, 12, 13]';[21, 22, 23]';[31, 32, 33]';[41, 42, 43]';[51, 52, 53]'])
C = convert(Array{clblasDoubleComplex,2}, [[11, 12, 13]';[21, 22, 23]';[31, 32, 33]';[41, 42, 43]'])
##A = convert(Array{cl_double2,2}, convert(Array{clblasDoubleComplex,2}, [[11, 12, 13, 14, 15]';[21, 22, 23, 24, 25]';[31, 32, 33, 34, 35]';[41, 42, 43, 44, 45]']))
##B = convert(Array{cl_double2,2}, convert(Array{clblasDoubleComplex,2}, [[11, 12, 13]';[21, 22, 23]';[31, 32, 33]';[41, 42, 43]';[51, 52, 53]']))
##C = convert(Array{cl_double2,2}, convert(Array{clblasDoubleComplex,2}, [[11, 12, 13]';[21, 22, 23]';[31, 32, 33]';[41, 42, 43]']))
A1 = vec(A)
B1 = vec(B)
C1 = vec(C)
M = Csize_t(length(A[:,1]))
K = Csize_t(length(B[:,1]))
N = Csize_t(length(B[1,:]))
order = clblasColumnMajor ##julia uses column major
alpha = Array(clblasDoubleComplex, 1)
alpha[1] = convert(clblasDoubleComplex, 10)
#println(string("alpha: ",alpha))
beta = Array(clblasDoubleComplex, 1)
beta[1] = convert(clblasDoubleComplex, 20)
#println(string("beta: ",beta))
transA = clblasNoTrans;
transB = clblasNoTrans;
off = convert(Csize_t, 0)
offA = convert(Csize_t, 0)
offB = convert(Csize_t, 0)
offC = convert(Csize_t, 0)
#Now initialize OpenCLBLAS and buffers
statusCheck(clblasSetup())
statusCheck(clFlush(queue[1]))
err = Array(cl_int, 1)
bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY, M * K * sizeof(clblasDoubleComplex), C_NULL, err)
statusCheck(err[1])
err = Array(cl_int, 1)
bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY, K * N * sizeof(clblasDoubleComplex), C_NULL, err)
statusCheck(err[1])
err = Array(cl_int, 1)
bufC = clCreateBuffer(ctx, CL_MEM_READ_WRITE, M * N * sizeof(clblasDoubleComplex), C_NULL, err)
statusCheck(err[1])
statusCheck(clFlush(queue[1]))
event = Array(cl_event, 1)
event[1] = C_NULL
statusCheck(clEnqueueWriteBuffer(queue[1], bufA, CL_TRUE, Csize_t(0), M * K * sizeof(clblasDoubleComplex), A1, cl_uint(0), C_NULL, event))
statusCheck(clWaitForEvents(1,event))
statusCheck(clReleaseEvent(event[1])) #free the memory
event[1] = C_NULL
statusCheck(clEnqueueWriteBuffer(queue[1], bufB, CL_TRUE, Csize_t(0), K * N * sizeof(clblasDoubleComplex), B1, cl_uint(0), C_NULL, event))
statusCheck(clWaitForEvents(1,event))
statusCheck(clReleaseEvent(event[1])) #free the memory
event[1] = C_NULL
statusCheck(clEnqueueWriteBuffer(queue[1], bufC, CL_TRUE, Csize_t(0), M * N * sizeof(clblasDoubleComplex), C1, cl_uint(0), C_NULL, event))
statusCheck(clWaitForEvents(1,event))
statusCheck(clReleaseEvent(event[1])) #free the memory
#=================Check respective buffer sizes in GPU
ref_count = Array(Csize_t, 1)
statusCheck(clGetMemObjectInfo(bufA, CL_MEM_SIZE, Csize_t(sizeof(ref_count)), ref_count, C_NULL))
println(string("bufA memory object size: ", Int32(ref_count[1])))
ref_count = 0
ref_count = Array(Csize_t, 1)
statusCheck(clGetMemObjectInfo(bufB, CL_MEM_SIZE, Csize_t(sizeof(ref_count)), ref_count, C_NULL))
println(string("bufB memory object size: ", Int32(ref_count[1])))
ref_count = 0
ref_count = Array(Csize_t, 1)
statusCheck(clGetMemObjectInfo(bufC, CL_MEM_SIZE, Csize_t(sizeof(ref_count)), ref_count, C_NULL))
println(string("bufC memory object size: ", Int32(ref_count[1])))
ref_count = 0
=====#
event[1] = C_NULL
#=
statusCheck(clblasSgemm(clblasRowMajor, clblasNoTrans, clblasNoTrans, M, N, K,
alpha, bufA, 0, K,
bufB, 0, N, beta,
bufC, 0, N,
1, queue, 0, C_NULL, event))
=#
statusCheck(clblasZgemm(clblasColumnMajor, clblasNoTrans, clblasNoTrans, M, N, K,
alpha, bufA, 0, M,
bufB, 0, K, beta,
bufC, 0, M,
1, queue, 0, C_NULL, event))
statusCheck(clFlush(queue[1]))
statusCheck(clWaitForEvents(1,event))
statusCheck(clReleaseEvent(event[1])) #free the memory
C2=Array(clblasDoubleComplex,length(C1))
event[1] = C_NULL
statusCheck(clEnqueueReadBuffer(queue[1], bufC, CL_TRUE, Csize_t(0), length(C1)*sizeof(clblasDoubleComplex), C2, cl_uint(0), C_NULL, event))
statusCheck(clWaitForEvents(1,event))
statusCheck(clReleaseEvent(event[1])) #free the memory
statusCheck(clFlush(queue[1]))
statusCheck(clReleaseMemObject(bufC))
statusCheck(clFlush(queue[1]))
statusCheck(clReleaseMemObject(bufB))
statusCheck(clFlush(queue[1]))
statusCheck(clReleaseMemObject(bufA))
statusCheck(clFlush(queue[1]))
#statusCheck(clGetMemObjectInfo(bufA, CL_MEM_REFERENCE_COUNT, Csize_t(sizeof(ref_count)), ref_count, C_NULL))
#bufA = C_NULL
#bufB = C_NULL
#bufC = C_NULL
clblasTeardown()
statusCheck(clFlush(queue[1]))
statusCheck(clReleaseCommandQueue(queue[1]))
statusCheck(clReleaseContext(ctx))
bufC = C_NULL
bufB = C_NULL
bufA = C_NULL
queue[1] = C_NULL
event[1] = C_NULL
ctx = C_NULL
devs[1] = C_NULL
Base.gc() ##not sure if julia has been garbage collecting, now is a good time though
return reshape(C2, Int(M), Int(N))
end
if (!isempty(libclblas) && !isempty(libopencl))
main()
end
I'm afraid this doesn't fix the error for me (Ubuntu 15.10, NVIDIA GeForce GT 630M):
$ julia test_zgemm.jl
Selected GPU: NVIDIA Corporation GeForce GT 630M
WARNING: OpenCL Error:
in statusCheck at /home/dfdx/work/playground/OpenCLBLAS.jl/src/cl_functions.jl:96
in main at /home/dfdx/work/playground/OpenCLBLAS.jl/src/test_zgemm.jl:166
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:280
in _start at ./client.jl:378
ERROR: LoadError: "CL_INVALID_COMMAND_QUEUE"
in statusCheck at /home/dfdx/work/playground/OpenCLBLAS.jl/src/cl_functions.jl:98
in main at /home/dfdx/work/playground/OpenCLBLAS.jl/src/test_zgemm.jl:166
in include at ./boot.jl:261
in include_from_node1 at ./loading.jl:304
in process_options at ./client.jl:280
in _start at ./client.jl:378
while loading /home/dfdx/work/playground/OpenCLBLAS.jl/src/test_zgemm.jl, in expression starting on line 209
Yet I'm curious: what was your idea when you tried to pass a pointer to Complex{Float64} instead of the number itself?
I came across the possible solution when I started writing these wrapper ccall functions myself. I found that some functions threw a segmentation fault if I passed a normal variable rather than a pointer.
So, I tweaked my clblas<type>gemm functions to accept pointers, and now the function works without segmentation faults.
@dfdx, I noticed that you had changed the line numbers in the code when the error occurs on line 166.
If the message is CL_INVALID_COMMAND_QUEUE, could you change the Ref{cl_command_queue} in the wrapper to Ptr{cl_command_queue}?
or
Do a git pull for the revised version (and then add your path to the libraries)?
@mikhail-j: I only changed code for finding libraries, the rest of the code is the same.
I'm using another laptop right now, so will check your suggestion in the evening (~10 hours from now).
@mikhail-j: nope, changing Ref{cl_command_queue} to Ptr{cl_command_queue} didn't help either.
Just for reference, on what CPU/GPU do you test it?
I've tested my code on Windows 7 x64 with a NVIDIA GTX 780 Ti GPU (CUDA 7.5) and AMD R9 390 GPU (Crimson 14.2 hotfix).
As for the CPU, I used an Intel Core i7-3930K.
@mikhail-j May I ask which compiler you are using for CLBLAS? I found that different compilers have different alignments and as such influence which call works and which doesn't.
@vchuravy I used MinGW-w64 on Windows 7 x64.
However, I recently tested the cGEMM and zGEMM functions on SUSE SLES 11 SP3 Linux (customized kernel version 3.18.36). At first, libclBLAS.so refused to load because my glibc version was too old for its liking (I had 2.11.3). After updating my glibc version to 2.23, libclBLAS.so finally loaded into julia (I compiled julia v0.4.6 with gcc 4.8.5 x86_64).
I found that Complex{Float64} functioned properly without Ptr{T}/Ref{T}.
When I tested the Complex{Float32} function, it threw a segmentation fault as you noted earlier.
This was tested on a NVIDIA GTX 780 Ti GPU:
julia> include("test_cgemm.jl")
Selected GPU: NVIDIA Corporation GeForce GTX 780 Ti
signal (11): Segmentation fault
_Z10clblasGemmI9cl_float2E13clblasStatus_12clblasOrder_16clblasTranspose_S3_mmmT_P7_cl_memmmS6_mmS4_S6_mmjPP17_cl_command_queuejPKP9_cl_eventPSB_ at ../clBLAS-2.10.0-Hawaii-Linux-x64-CL2.0/lib64/libclBLAS.so (unknown line)
clblasCgemm at ~/OpenCLBLAS.jl/src/test_cgemm.jl:38
main at ~/OpenCLBLAS.jl/src/test_cgemm.jl:173
jlcall_main_21183 at (unknown line)
jl_apply_generic at~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 0x7fe4a04ec0f3)
unknown function (ip: 0x7fe4a04eb527)
unknown function (ip: 0x7fe4a04ec988)
unknown function (ip: 0x7fe4a04ea84d)
unknown function (ip: 0x7fe4a050094f)
unknown function (ip: 0x7fe4a05011c9)
jl_load at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
include at ./boot.jl:261
jl_apply_generic at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
include_from_node1 at ./loading.jl:320
jl_apply_generic at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 0x7fe4a04ec0f3)
unknown function (ip: 0x7fe4a04eb527)
unknown function (ip: 0x7fe4a05004d8)
jl_toplevel_eval_in at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
eval_user_input at REPL.jl:62
jlcall_eval_user_input_21160 at (unknown line)
jl_apply_generic at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
anonymous at REPL.jl:92
unknown function (ip: 0x7fe4a04f252c)
unknown function (ip: (nil))
Segmentation fault
I wonder if a fresh compilation of libclBLAS.so would generate better behavior with complex GEMM.
As discussed in #21, Complex64 is currently not working and we are getting a segfault when passing it to clblas, whereas Complex128 works without issue. Complex64 maps to cl_float2 and Complex128 maps to cl_double2. The definitions of both types in cl_platform.h are:
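For reference, an abridged sketch of the two unions, reconstructed from the Khronos header rather than copied from this thread, so check it against your installed cl_platform.h. Only the GCC/Clang branch of CL_ALIGNED is shown. The real header uses several anonymous structs (x/y, s0/s1, lo/hi) and an optional vector member v2, as the lldb output earlier in the thread shows; here a single named struct stands in for them.

```c
#include <assert.h>
#include <stddef.h>

typedef float  cl_float;
typedef double cl_double;

/* GCC/Clang branch only; the real header has other branches as well. */
#define CL_ALIGNED(n) __attribute__((aligned(n)))

typedef union {
    cl_float CL_ALIGNED(8) s[2];
    struct { cl_float x, y; } xy;    /* named here; anonymous in the header */
} cl_float2;

typedef union {
    cl_double CL_ALIGNED(16) s[2];
    struct { cl_double x, y; } xy;   /* named here; anonymous in the header */
} cl_double2;

/* Size and alignment of each union as this sketch defines them. */
void check_cl_vector_layout(void)
{
    assert(sizeof(cl_float2) == 8);
    assert(_Alignof(cl_float2) == 8);
    assert(sizeof(cl_double2) == 16);
    assert(_Alignof(cl_double2) == 16);
}
```

The aligned(n) attribute counts bytes, so the two requirements are 8 and 16 bytes respectively.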
The only difference I can see is that cl_float2 is using 8-byte alignment and cl_double2 is using 16-byte alignment, and if I remember correctly Julia uses 16-byte alignment for nearly everything. Complex is defined here: https://github.com/JuliaLang/julia/blob/02aeb44299d090d50d2c58e004f58a8b8d4f3da6/base/complex.jl