JuliaGPU / CLBLAS.jl

CLBLAS integration for Julia
Apache License 2.0

Complex64 not working #23

Open vchuravy opened 8 years ago

vchuravy commented 8 years ago

As discussed in #21, Complex64 is currently not working: we get a segfault when passing it to clBLAS, whereas Complex128 works without issue.

Complex64 maps to cl_float2 and Complex128 maps to cl_double2.

The definitions of both types in cl_platform.h are:

typedef union
{
    cl_float  CL_ALIGNED(8) s[2];
#if __CL_HAS_ANON_STRUCT__
   __CL_ANON_STRUCT__ struct{ cl_float  x, y; };
   __CL_ANON_STRUCT__ struct{ cl_float  s0, s1; };
   __CL_ANON_STRUCT__ struct{ cl_float  lo, hi; };
#endif
#if defined( __CL_FLOAT2__) 
    __cl_float2     v2;
#endif
}cl_float2;

typedef union
{
    cl_double  CL_ALIGNED(16) s[2];
#if __CL_HAS_ANON_STRUCT__
   __CL_ANON_STRUCT__ struct{ cl_double  x, y; };
   __CL_ANON_STRUCT__ struct{ cl_double s0, s1; };
   __CL_ANON_STRUCT__ struct{ cl_double lo, hi; };
#endif
#if defined( __CL_DOUBLE2__) 
    __cl_double2     v2;
#endif
}cl_double2;

The only difference I can see is that cl_float2 uses 8-byte alignment and cl_double2 uses 16-byte alignment, and if I remember correctly Julia uses 16-byte alignment for nearly everything.

Complex is defined here: https://github.com/JuliaLang/julia/blob/02aeb44299d090d50d2c58e004f58a8b8d4f3da6/base/complex.jl

vchuravy commented 8 years ago

On v0.4 https://travis-ci.org/JuliaGPU/CLBLAS.jl/jobs/103056335 on nightly: 0.5 https://travis-ci.org/JuliaGPU/CLBLAS.jl/jobs/103056337

Trace of calls https://github.com/clMathLibraries/clBLAS/blob/8b5f7a0e6f800b9597319d71f70fbf67e410b004/src/library/blas/xaxpy.c#L116

https://github.com/clMathLibraries/clBLAS/blob/8b5f7a0e6f800b9597319d71f70fbf67e410b004/src/library/blas/generic/solution_seq_make.c#L332

vchuravy commented 8 years ago

And the only function that really calls clGetCommandQueueInfo https://github.com/clMathLibraries/clBLAS/blob/8b5f7a0e6f800b9597319d71f70fbf67e410b004/src/library/blas/generic/solution_seq_make.c#L374 and https://github.com/clMathLibraries/clBLAS/blob/8b5f7a0e6f800b9597319d71f70fbf67e410b004/src/library/blas/generic/solution_seq_make.c#L503-L504

in https://github.com/clMathLibraries/clBLAS/blob/8b5f7a0e6f800b9597319d71f70fbf67e410b004/src/library/blas/generic/common.c#L311 and https://github.com/clMathLibraries/clBLAS/blob/8b5f7a0e6f800b9597319d71f70fbf67e410b004/src/library/blas/generic/common.c#L296

So my idea is that it is probably not related to alignment, but to something funky going on with the queues (or that, because of misalignment, we are overwriting queue information).

vchuravy commented 8 years ago

@dfdx When building with debug information I got it to happen with Float64, and since we also get an invalid value error it seems to have more to do with the interaction between OpenCL.jl and the CLBLAS runtime. Do you know if there are any asynchronous operations happening?

dfdx commented 8 years ago

Currently all high-level functions in CLBLAS add a task to a queue and return the corresponding event. However, in tests we immediately call cl.wait(), so there's always only one task in the queue.

BTW, coming back a little, I think sizeof() lets us "guess" alignment pretty accurately:

type T1 x::Int8 end; sizeof(T1)                # => 1
type T2 x::Int8; y::Int8 end; sizeof(T2)    # => 2
type T3 x::Int8; y::Int16 end; sizeof(T3)  # => 4

So any type larger than 8 bytes is aligned to 16 bytes. Complex64 has a size of 8 bytes, i.e. it's maximally packed. I don't really know what aligned(8) means for struct { float; float }, but I'm pretty sure it occupies the same 8 bytes, which confirms your analysis.

vchuravy commented 8 years ago

@dfdx Here is a Travis build that builds clBLAS from scratch with debug information enabled, and it clearly tells us where the segfault comes from: https://travis-ci.org/JuliaGPU/CLBLAS.jl/jobs/103249068#L1672

That's on the branch https://github.com/JuliaGPU/CLBLAS.jl/tree/vc/complex32

vchuravy commented 8 years ago

Maybe we are encountering something similar to https://github.com/clMathLibraries/clBLAS/issues/187

vchuravy commented 8 years ago

@dfdx I won't have time this week to look into this, feel free to continue the bug hunt or ping me in a week or so.

dfdx commented 8 years ago

Got it. I think I'll have some spare time later this week to debug it.

dfdx commented 8 years ago

For convenience of debugging, here's a short test calling clBLAS directly without all CLBLAS.jl's wrappers:

import OpenCL: CLArray, CL_float
const cl = OpenCL
import CLBLAS: CL_float2

const libCLBLAS = "libclBLAS"

dev, ctx, q = cl.create_compute_context()
CLBLAS.setup()

num_queues = cl.CL_uint(1)
queues = Ptr{Void}[q.id]
num_events = cl.cl_uint(0)
events = Ptr{Void}[]       
ret_event = Array(cl.CL_event,1)

N = 10
DX = Complex64(2.0)
X = cl.ones(Complex64, q, 10)

err = ccall((:clblasCscal, libCLBLAS), cl.CL_int,
            (Csize_t, CL_float, Ptr{Void}, Csize_t, Cint,
             cl.CL_uint, Ptr{Ptr{Void}}, cl.CL_uint, Ptr{Ptr{Void}},
             Ptr{Ptr{Void}}),
            N, DX, pointer(X), Csize_t(0), Cint(1),
            cl.CL_uint(1), pointer(queues), cl.CL_uint(0),
            pointer(events), pointer(ret_event))

if err != cl.CL_SUCCESS
    throw(cl.CLError(err))
end

Which gives CL_INVALID_EVENT_WAIT_LIST (code=-57), while same thing for Complex128 and clblasZscal produces no error.

vchuravy commented 8 years ago

If the event list is empty you need to pass in C_NULL and not a pointer to an empty list.

--edit: Never mind: You are hitting https://github.com/clMathLibraries/clBLAS/blob/master/src/library/blas/xaxpy.c#L91-L94 while numEventsInWaitList should be equal to zero but it isn't...

vchuravy commented 8 years ago

So, running the script under lldb:

lldb julia -- clblas.jl
(lldb) breakpoint set -n clblasCscal
(lldb) r
Process 10392 launched: '/usr/bin/julia' (x86_64)
1 location added to breakpoint 1
error: libclBLAS.so 0x00059975: DW_TAG_member 's' refers to type 0x0005999a which extends beyond the bounds of 0x00059904
Process 10392 stopped
* thread #1: tid = 10392, 0x00007ffdc891cf31 libclBLAS.so`::clblasCscal(N=10, alpha=cl_float2 at 0x00007fffffffc510, X=0x0000000001ac3530, offx=0, incx=1, numCommandQueues=1, commandQueues=0x00007ffdf4920d10, numEventsInWaitList=4103052928, eventWaitList=0x00007ffdf49213f0, events=0x00007ffd40000000) + 52 at xscal.cc:156, name = 'julia', stop reason = breakpoint 1.1
    frame #0: 0x00007ffdc891cf31 libclBLAS.so`::clblasCscal(N=10, alpha=cl_float2 at 0x00007fffffffc510, X=0x0000000001ac3530, offx=0, incx=1, numCommandQueues=1, commandQueues=0x00007ffdf4920d10, numEventsInWaitList=4103052928, eventWaitList=0x00007ffdf49213f0, events=0x00007ffd40000000) + 52 at xscal.cc:156
   153      const cl_event *eventWaitList,
   154      cl_event *events)
   155  {
-> 156    CHECK_QUEUES(numCommandQueues, commandQueues);
   157    CHECK_EVENTS(numEventsInWaitList, eventWaitList);
   158    CHECK_VECTOR_X(TYPE_COMPLEX_FLOAT, N, X, offx, incx);
   159  
(lldb) po numCommandQueues
1

(lldb) po numEventsInWaitList
4103052928

The first thing I noticed is that you have an error in your ccall: the second argument should be cl_float2 and not cl_float.

err = ccall((:clblasCscal, libCLBLAS), cl.CL_int,
            (Csize_t, CL_float2, Ptr{Void}, Csize_t, Cint,
             cl.CL_uint, Ptr{Ptr{Void}}, cl.CL_uint, Ptr{Ptr{Void}},
             Ptr{Ptr{Void}}),
            N, DX, pointer(X), Csize_t(0), Cint(1),
            cl.CL_uint(1), pointer(queues), cl.CL_uint(0),
            pointer(events), pointer(ret_event))

But even after that:

(lldb) p alpha
(cl_float2) $2 = {
  s = {}
   = (x = 0, y = 0)
   = (s0 = 0, s1 = 0)
   = (lo = 0, hi = 0)
  v2 = (0, 0)
} 

(lldb) po numEventsInWaitList
4103052928

dfdx commented 8 years ago

I'm not really familiar with lldb (or gdb). Do I understand correctly that right after ccalling clblasCscal, the value of the input parameter numEventsInWaitList is 4103052928?

vchuravy commented 8 years ago

Yes, and alpha, i.e. DX, is not passed through correctly. So I am back to thinking about misalignment or misrepresentation of cl_float2.

vchuravy commented 8 years ago

I just created a small C example to see if I could figure out the correct way of working with Complex64 and cl_float2. If I compile the library with gcc and the test program with clang, I get the same problem at the C level.

Right now the only fix we can offer users is to recommend compiling clBLAS with clang/clang++.

dfdx commented 8 years ago

Sounds unpleasant, but reasonable. I'll add a corresponding note to the README.

vchuravy commented 8 years ago

@dfdx Maybe just maybe we could use https://strpackjl.readthedocs.org/en/latest/

dfdx commented 8 years ago

StrPack seems to be quite outdated (lots of deprecation warnings), but worth trying anyway. I will check in the next couple of days, thanks for the tip.

vchuravy commented 8 years ago

You need to use the current master. Then there should be no warnings.

dfdx commented 8 years ago

Seems like data packed using StrPack doesn't play well with OpenCL.Buffer:

import OpenCL
import CLBLAS: CLArray
import StrPack: @struct, pack

@struct type MyCL_double2
    f1::Float64
    f2::Float64
end

hX = [MyCL_double2(rand(Float64), rand(Float64)) for i=1:10]
packedX = [pack(x).data for x in hX]

dev, ctx, q = OpenCL.create_compute_context()
CLBLAS.setup()
X = CLArray(q, packedX)

gives:

ERROR: type does not have a canonical binary representation 
  in Buffer at /home/dfdx/.julia/v0.4/OpenCL/src/buffer.jl:137
  in Buffer at /home/dfdx/.julia/v0.4/OpenCL/src/buffer.jl:86
  in Buffer at /home/dfdx/.julia/v0.4/OpenCL/src/buffer.jl:52

Obviously, packedX has type Array{Array{UInt8,1},1} and OpenCL doesn't know how to handle such a type, but I'm not sure we can fix it on our side.

vchuravy commented 8 years ago

What happens if you do

@struct immutable MyCL_double2
...
end

dfdx commented 8 years ago

Actually, it's not about mutability - packed data is Array{Array{UInt8}} anyway, and this is what OpenCL doesn't know how to handle.

However, I found out that using immutable (with or without the @struct annotation) fixes some errors, e.g. the following code works fine (note that I switched back to float2 instead of double2):

import OpenCL
const cl = OpenCL
import CLBLAS: CLArray

const libCLBLAS = "libclBLAS"

immutable MyCL_float2  # or @struct immutable MyCL_float2
    f1::Float32
    f2::Float32
end

hX = [MyCL_float2(rand(Float64), rand(Float64)) for i=1:10]

dev, ctx, q = OpenCL.create_compute_context()
CLBLAS.setup()
X = CLArray(q, hX)

# sample call
num_queues = cl.CL_uint(1)
queues = Ptr{Void}[q.id]
num_events = cl.cl_uint(0)
events = Ptr{Void}[]
ret_event = Array(cl.CL_event,1)

N = 10
DX = MyCL_float2(2.0, 0.0)
err = ccall((:clblasCscal, libCLBLAS), cl.CL_int,
            (Csize_t, MyCL_float2, Ptr{Void}, Csize_t, Cint,
             cl.CL_uint, Ptr{Ptr{Void}}, cl.CL_uint, Ptr{Ptr{Void}},
             Ptr{Ptr{Void}}),
            N, DX, pointer(X), Csize_t(0), Cint(1),
            cl.CL_uint(1), pointer(queues), cl.CL_uint(0),
            C_NULL, pointer(ret_event))

If we change immutable to type, the code throws ReadOnlyMemoryError(), which is quite intuitive.

But what is not intuitive is that changing definition to:

typealias MyCL_float2 Complex64

leads to a segmentation fault:

signal (11): Segmentation fault
unknown function (ip: 0x7efd3caeab24)
unknown function (ip: 0x7efd3cae34bd)
unknown function (ip: 0x7efd3cae388c)
unknown function (ip: 0x7efd3810fa51)
_ZN25clblasCscalFunctorGeneric7executeERN18clblasXscalFunctorI9cl_float2S1_E4ArgsE at /usr/lib/x86_64-linux-gnu/libclBLAS.so (unknown line)
clblasCscal at /usr/lib/x86_64-linux-gnu/libclBLAS.so (unknown line)
anonymous at no file:0
unknown function (ip: 0x7eff597289d3)
unknown function (ip: 0x7eff597295ec)
jl_load at /opt/julia/bin/../lib/julia/libjulia.so (unknown line)
include at ./boot.jl:261
jl_apply_generic at /opt/julia/bin/../lib/julia/libjulia.so (unknown line)
include_from_node1 at ./loading.jl:304 
jl_apply_generic at /opt/julia/bin/../lib/julia/libjulia.so (unknown line)
process_options at ./client.jl:284
_start at ./client.jl:378
unknown function (ip: 0x7eff5618d8f9)
jl_apply_generic at /opt/julia/bin/../lib/julia/libjulia.so (unknown line)
unknown function (ip: 0x401c47)
unknown function (ip: 0x40182f)
__libc_start_main at /lib/x86_64-linux-gnu/libc.so.6 (unknown line)
unknown function (ip: 0x401875)
Segmentation fault (core dumped)

and the only difference between our MyCL_float2 and Complex64 is that the latter gets its base type (Float32) through type parameters:

immutable Complex{T<:Real} <: Number
    re::T
    im::T
end

dfdx commented 8 years ago

Ok, it's not even about type parameters, but about inheritance. This works fine:

immutable MyCL_float2{T}
    f1::T
    f2::T
end

but the following modification makes it break with the same segfault again:

immutable MyCL_float2{T} <: Number
    f1::T
    f2::T
end

And, by the way, none of these options work with the gemm function.

mikhail-j commented 8 years ago

Hey, I think I have found a solution to this problem for Linux. I have developed a solution that works on Windows 7 64 bit:

Initially, the zGEMM function on Windows 7 x64 was throwing segmentation faults. However, I found that the ccall function to clblasZgemm() would stop throwing segmentation faults if I changed the argument type of variables alpha and beta to Ref{cl_double2} or Ptr{cl_double2}. This is reflected in changing variables alpha and beta to 1-element cl_double2 arrays. Here is the function I used in my code while working on my project:

function clblasZgemm(o,tA,tB,M,N,K,alpha,A,offA,lda,B,offB,ldb,beta,C,offC,ldc,ncq,cq,ne,wle,e)  
    return ccall((:clblasZgemm, libclblas), cl_int, (clblasOrder,  
        clblasTranspose,  
        clblasTranspose,  
        Csize_t,  
        Csize_t,  
        Csize_t,  
        Ref{clblasDoubleComplex},#treating this as a pointer fixed a segmentation fault  
        cl_mem,  
        Csize_t,  
        Csize_t,  
        cl_mem,  
        Csize_t,  
        Csize_t,  
        Ref{clblasDoubleComplex},#treating this as a pointer fixed a segmentation fault  
        cl_mem,  
        Csize_t,  
        Csize_t,  
        cl_uint,  
        Ref{cl_command_queue},  
        cl_uint,  
        Ref{cl_event},  
        Ptr{cl_event}),  
        o,tA,tB,M,N,K,alpha,A,offA,lda,B,offB,ldb,beta,C,offC,ldc,ncq,cq,ne,wle,e)  
end  

My suspicion is that the libclBLAS library is treating the alpha and beta arguments as pointers like the event variable.

Can you guys check if this fixes the segmentation faults on Linux or OSX?

dfdx commented 8 years ago

I won't have access to a GPU-enabled laptop until the end of this week, but I think you can test your code on Travis. The easiest way to go should be to:

  1. Clone CLBLAS.jl.
  2. Create a branch and modify clblasZgemm there.
  3. Push code to Github and let Travis run tests.

Note that you may need to set up your own Travis account and add Mac OS X to .travis.yml, so if it's too much trouble for you, just leave it until the end of the week.

dfdx commented 8 years ago

I've just checked it on Complex64. The call doesn't fail, but the result is incorrect.

@mikhail-j: could you please provide full code you used for testing? Just to be on the same page.

mikhail-j commented 8 years ago

The following code was copied from test_zgemm.jl. I have aliased clblasDoubleComplex to Complex{Float64} in clblas_typedef.jl, as CLBLAS.jl does too.

const libclblas = Libdl.find_library(["clBLAS","libclBLAS"],["C:\\AMD\\clBLA-2.10.0\\bin","C:\\AMD\\acml6.1.0.33\\ifort64\\lib\\"])
#const libopencl = Libdl.find_library(["libOpenCL","OpenCL"],["."])
const libopencl = Libdl.find_library(["OpenCL64","OpenCL"],["C:\\Program Files\\NVIDIA Corporation\\OpenCL\\","C:\\Program Files (x86)\\AMD APP SDK\\2.9-1\\bin\\x86_64"])
if (isempty(libclblas))
    print("clBLAS can't be found!")
end
include("cl_typedef.jl")
include("clblas_typedef.jl")
include("cl_functions.jl")
include("clblas_functions.jl")
#ccall((:function, "library"), return_type, (argtype,), arg)

function clblasZgemm(o,tA,tB,M,N,K,alpha,A,offA,lda,B,offB,ldb,beta,C,offC,ldc,ncq,cq,ne,wle,e)
    return ccall((:clblasZgemm, libclblas), cl_int, (clblasOrder,
        clblasTranspose,
        clblasTranspose,
        Csize_t,
        Csize_t,
        Csize_t,
        Ref{clblasDoubleComplex},#treating this as a pointer fixed a segmentation fault
        #clblasDoubleComplex,
        cl_mem,
        Csize_t,
        Csize_t,
        cl_mem,
        Csize_t,
        Csize_t,
        Ref{clblasDoubleComplex},#treating this as a pointer fixed a segmentation fault
        #clblasDoubleComplex,
        #Base.cconvert(Ptr{Void}, Ref{cl_mem}),
        #Ref{cl_mem},
        cl_mem,
        Csize_t,
        Csize_t,
        cl_uint,
        Ref{cl_command_queue},
        cl_uint,
        #Ref{cl_event},
        #AMD's OpenCL driver (Windows 7 x64) throws invalid event if argument type is Ref{cl_event}
        Ptr{cl_event},
        Ptr{cl_event}),
        #Ptr{cl_event_info},
        #Ptr{cl_event_info}),
        o,tA,tB,M,N,K,alpha,A,offA,lda,B,offB,ldb,beta,C,offC,ldc,ncq,cq,ne,wle,e)
end

function main()

    local props = vec(convert(Array{cl_context_properties, 2}, [CL_CONTEXT_PLATFORM 0 0]))
    devs = Array(cl_device_id, 1)
    devs[1] = clGetFirstGPU()
    local platform = clGetGPUPlatform(devs[1])

    println(string("Selected GPU: ",clGetDeviceVendor(devs[1])), " ", clGetDeviceName(devs[1]))
    props[2] = Base.cconvert(cl_context_properties,platform)
    err = Array(cl_int, 1)
    local ctx = clCreateContext(props,1,devs[1],C_NULL,C_NULL,err)
    statusCheck(err[1])
    err = Array(cl_int, 1)
    local queue = Array(cl_command_queue, 1)
    queue[1] = clCreateCommandQueue(ctx, devs[1], cl_command_queue_properties(0), err)
    statusCheck(err[1])
    ################################    create arrays
    A = convert(Array{clblasDoubleComplex,2}, [[11, 12, 13, 14, 15]';[21, 22, 23, 24, 25]';[31, 32, 33, 34, 35]';[41, 42, 43, 44, 45]'])
    B = convert(Array{clblasDoubleComplex,2}, [[11, 12, 13]';[21, 22, 23]';[31, 32, 33]';[41, 42, 43]';[51, 52, 53]'])
    C = convert(Array{clblasDoubleComplex,2}, [[11, 12, 13]';[21, 22, 23]';[31, 32, 33]';[41, 42, 43]'])
    ##A =   convert(Array{cl_double2,2}, convert(Array{clblasDoubleComplex,2}, [[11, 12, 13, 14, 15]';[21, 22, 23, 24, 25]';[31, 32, 33, 34, 35]';[41, 42, 43, 44, 45]']))
    ##B = convert(Array{cl_double2,2}, convert(Array{clblasDoubleComplex,2}, [[11, 12, 13]';[21, 22, 23]';[31, 32, 33]';[41, 42, 43]';[51, 52, 53]']))
    ##C = convert(Array{cl_double2,2}, convert(Array{clblasDoubleComplex,2}, [[11, 12, 13]';[21, 22, 23]';[31, 32, 33]';[41, 42, 43]']))
    A1 = vec(A)
    B1 = vec(B)
    C1 = vec(C)
    M = Csize_t(length(A[:,1]))
    K = Csize_t(length(B[:,1]))
    N = Csize_t(length(B[1,:]))

    order = clblasColumnMajor       ##julia uses column major
    alpha = Array(clblasDoubleComplex, 1)
    alpha[1] = convert(clblasDoubleComplex, 10)
    #println(string("alpha: ",alpha))
    beta = Array(clblasDoubleComplex, 1)
    beta[1] = convert(clblasDoubleComplex, 20)
    #println(string("beta: ",beta))
    transA = clblasNoTrans;
    transB = clblasNoTrans;
    off =  convert(Csize_t, 0)
    offA = convert(Csize_t, 0)
    offB = convert(Csize_t, 0)
    offC = convert(Csize_t, 0)
    #Now initialize OpenCLBLAS and buffers
    statusCheck(clblasSetup())
    statusCheck(clFlush(queue[1]))
    err = Array(cl_int, 1)
    bufA = clCreateBuffer(ctx, CL_MEM_READ_ONLY, M * K * sizeof(clblasDoubleComplex), C_NULL, err)
    statusCheck(err[1])
    err = Array(cl_int, 1)
    bufB = clCreateBuffer(ctx, CL_MEM_READ_ONLY, K * N * sizeof(clblasDoubleComplex), C_NULL, err)
    statusCheck(err[1])
    err = Array(cl_int, 1)
    bufC = clCreateBuffer(ctx, CL_MEM_READ_WRITE, M * N * sizeof(clblasDoubleComplex), C_NULL, err)
    statusCheck(err[1])
    statusCheck(clFlush(queue[1]))

    event = Array(cl_event, 1)

    event[1] = C_NULL
    statusCheck(clEnqueueWriteBuffer(queue[1], bufA, CL_TRUE, Csize_t(0), M * K * sizeof(clblasDoubleComplex), A1, cl_uint(0), C_NULL, event))
    statusCheck(clWaitForEvents(1,event))
    statusCheck(clReleaseEvent(event[1]))       #free the memory
    event[1] = C_NULL
    statusCheck(clEnqueueWriteBuffer(queue[1], bufB, CL_TRUE, Csize_t(0), K * N * sizeof(clblasDoubleComplex), B1, cl_uint(0), C_NULL, event))
    statusCheck(clWaitForEvents(1,event))
    statusCheck(clReleaseEvent(event[1]))       #free the memory

    event[1] = C_NULL
    statusCheck(clEnqueueWriteBuffer(queue[1], bufC, CL_TRUE, Csize_t(0), M * N * sizeof(clblasDoubleComplex), C1, cl_uint(0), C_NULL, event))
    statusCheck(clWaitForEvents(1,event))
    statusCheck(clReleaseEvent(event[1]))       #free the memory

#=================Check respective buffer sizes in GPU
    ref_count = Array(Csize_t, 1)
    statusCheck(clGetMemObjectInfo(bufA, CL_MEM_SIZE, Csize_t(sizeof(ref_count)), ref_count, C_NULL))
    println(string("bufA memory object size: ", Int32(ref_count[1])))
    ref_count = 0
    ref_count = Array(Csize_t, 1)
    statusCheck(clGetMemObjectInfo(bufB, CL_MEM_SIZE, Csize_t(sizeof(ref_count)), ref_count, C_NULL))
    println(string("bufB memory object size: ", Int32(ref_count[1])))
    ref_count = 0
    ref_count = Array(Csize_t, 1)
    statusCheck(clGetMemObjectInfo(bufC, CL_MEM_SIZE, Csize_t(sizeof(ref_count)), ref_count, C_NULL))
    println(string("bufC memory object size: ", Int32(ref_count[1])))
    ref_count = 0
=====#
    event[1] = C_NULL
    #=
    statusCheck(clblasSgemm(clblasRowMajor, clblasNoTrans, clblasNoTrans, M, N, K,
                             alpha, bufA, 0, K,
                             bufB, 0, N, beta,
                             bufC, 0, N,
                             1, queue, 0, C_NULL, event))
=#
    statusCheck(clblasZgemm(clblasColumnMajor, clblasNoTrans, clblasNoTrans, M, N, K,
                             alpha, bufA, 0, M,
                             bufB, 0, K, beta,
                             bufC, 0, M,
                             1, queue, 0, C_NULL, event))

    statusCheck(clFlush(queue[1]))
    statusCheck(clWaitForEvents(1,event))
    statusCheck(clReleaseEvent(event[1]))       #free the memory

    C2=Array(clblasDoubleComplex,length(C1))
    event[1] = C_NULL
    statusCheck(clEnqueueReadBuffer(queue[1], bufC, CL_TRUE, Csize_t(0), length(C1)*sizeof(clblasDoubleComplex), C2, cl_uint(0), C_NULL, event))
    statusCheck(clWaitForEvents(1,event))
    statusCheck(clReleaseEvent(event[1]))       #free the memory

    statusCheck(clFlush(queue[1]))
    statusCheck(clReleaseMemObject(bufC))
    statusCheck(clFlush(queue[1]))
    statusCheck(clReleaseMemObject(bufB))
    statusCheck(clFlush(queue[1]))
    statusCheck(clReleaseMemObject(bufA))
    statusCheck(clFlush(queue[1]))
    #statusCheck(clGetMemObjectInfo(bufA, CL_MEM_REFERENCE_COUNT, Csize_t(sizeof(ref_count)), ref_count, C_NULL))
    #bufA = C_NULL
    #bufB = C_NULL
    #bufC = C_NULL
    clblasTeardown()
    statusCheck(clFlush(queue[1]))
    statusCheck(clReleaseCommandQueue(queue[1]))
    statusCheck(clReleaseContext(ctx))
    bufC = C_NULL
    bufB = C_NULL
    bufA = C_NULL
    queue[1] = C_NULL
    event[1] = C_NULL
    ctx = C_NULL
    devs[1] = C_NULL
    Base.gc()       ##not sure if julia has been garbage collecting, now is a good time though
    return reshape(C2, Int(M), Int(N))
end

if (!isempty(libclblas) && !isempty(libopencl))
    main()
end

dfdx commented 8 years ago

I'm afraid this doesn't fix the error for me (Ubuntu 15.10, NVIDIA GeForce GT 630M):

$ julia test_zgemm.jl 
Selected GPU: NVIDIA Corporation GeForce GT 630M
WARNING: OpenCL Error:

 in statusCheck at /home/dfdx/work/playground/OpenCLBLAS.jl/src/cl_functions.jl:96
 in main at /home/dfdx/work/playground/OpenCLBLAS.jl/src/test_zgemm.jl:166
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:304
 in process_options at ./client.jl:280
 in _start at ./client.jl:378
ERROR: LoadError: "CL_INVALID_COMMAND_QUEUE"
 in statusCheck at /home/dfdx/work/playground/OpenCLBLAS.jl/src/cl_functions.jl:98
 in main at /home/dfdx/work/playground/OpenCLBLAS.jl/src/test_zgemm.jl:166
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:304
 in process_options at ./client.jl:280
 in _start at ./client.jl:378
while loading /home/dfdx/work/playground/OpenCLBLAS.jl/src/test_zgemm.jl, in expression starting on line 209

Still, I'm curious: what was your idea when you tried to pass a pointer to Complex{Float64} instead of the number itself?

mikhail-j commented 8 years ago

I came across the possible solution when I started writing these wrapper ccall functions myself. I found that some functions threw a segmentation fault if I passed a normal variable rather than a pointer.

So, I tweaked my clblas<type>gemm functions to accept pointers, and now the function works without segmentation faults.

@dfdx, I noticed that the line numbers in your copy of the code have changed, given that the error occurs on line 166.

If the message is CL_INVALID_COMMAND_QUEUE, could you change the Ref{cl_command_queue} in the wrapper to Ptr{cl_command_queue}?

or

Do a git pull for the revised version (and then add your path to the libraries)?

dfdx commented 8 years ago

@mikhail-j: I only changed code for finding libraries, the rest of the code is the same.

I'm using another laptop right now, so will check your suggestion in the evening (~10 hours from now).

dfdx commented 8 years ago

@mikhail-j: nope, changing Ref{cl_command_queue} to Ptr{cl_command_queue} didn't help either.

Just for reference, on what CPU/GPU do you test it?

mikhail-j commented 8 years ago

I've tested my code on Windows 7 x64 with an NVIDIA GTX 780 Ti GPU (CUDA 7.5) and an AMD R9 390 GPU (Crimson 14.2 hotfix).

As for the CPU, I used an Intel Core i7-3930K.

vchuravy commented 8 years ago

@mikhail-j May I ask which compiler you are using for CLBLAS? I found that different compilers have different alignments and as such influence which call works and which doesn't.

mikhail-j commented 8 years ago

@vchuravy I used MinGW-w64 on Windows 7 x64.

However, I recently tested the cGEMM and zGEMM functions on SUSE SLES 11 SP3 Linux (customized kernel version 3.18.36). At first, libclBLAS.so refused to load because my glibc version was too old for its liking (I had 2.11.3). After updating my glibc version to 2.23, libclBLAS.so finally loaded into julia (I compiled julia v0.4.6 with gcc 4.8.5 x86_64).

I found that Complex{Float64} functioned properly without Ptr{T}/Ref{T}.

When I tested the Complex{Float32} function, it threw a segmentation fault as you noted earlier.

This was tested on an NVIDIA GTX 780 Ti GPU:

julia> include("test_cgemm.jl")
Selected GPU: NVIDIA Corporation GeForce GTX 780 Ti

signal (11): Segmentation fault
_Z10clblasGemmI9cl_float2E13clblasStatus_12clblasOrder_16clblasTranspose_S3_mmmT_P7_cl_memmmS6_mmS4_S6_mmjPP17_cl_command_queuejPKP9_cl_eventPSB_ at ../clBLAS-2.10.0-Hawaii-Linux-x64-CL2.0/lib64/libclBLAS.so (unknown line)
clblasCgemm at ~/OpenCLBLAS.jl/src/test_cgemm.jl:38
main at ~/OpenCLBLAS.jl/src/test_cgemm.jl:173
jlcall_main_21183 at  (unknown line)
jl_apply_generic at~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 0x7fe4a04ec0f3)
unknown function (ip: 0x7fe4a04eb527)
unknown function (ip: 0x7fe4a04ec988)
unknown function (ip: 0x7fe4a04ea84d)
unknown function (ip: 0x7fe4a050094f)
unknown function (ip: 0x7fe4a05011c9)
jl_load at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
include at ./boot.jl:261
jl_apply_generic at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
include_from_node1 at ./loading.jl:320
jl_apply_generic at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
unknown function (ip: 0x7fe4a04ec0f3)
unknown function (ip: 0x7fe4a04eb527)
unknown function (ip: 0x7fe4a05004d8)
jl_toplevel_eval_in at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
eval_user_input at REPL.jl:62
jlcall_eval_user_input_21160 at  (unknown line)
jl_apply_generic at ~/julia/0.4.5/usr/bin/../lib/libjulia.so (unknown line)
anonymous at REPL.jl:92
unknown function (ip: 0x7fe4a04f252c)
unknown function (ip: (nil))
Segmentation fault

I wonder if a fresh compilation of libclBLAS.so would generate better behavior with complex GEMM.