clMathLibraries / clBLAS

a software library containing BLAS functions written in OpenCL
Apache License 2.0
843 stars 237 forks source link

Seqmentation fault of Zaxpy on 32-bit Windows #42

Closed myurkin closed 10 years ago

myurkin commented 10 years ago

I am trying to use clAmdBlas 1.10 package on Windows 7 (64 bit) notebook with Nvidia GTX540M. I am using MinGW (with gcc 4.8.1) to compile the programs both in native (64-bit) and in 32-bit mode. I managed to link my program against clAmdBlas and run it in 64-bit, but had issues in 32-bit mode. After investigating it boiled down to double_complex functions.

To test it I took the example example_saxpy.c from the package and modified into all other three modes (D,C, and Z). Then I compiled and tested all 4 (with -m32). Compilation was fine for all 4. All ran fine except Zaxpy. The latter produced segmentation fault when calling clAmdBlasZaxpy. I attach corresponding test in the end.

I have also tested the same executables (without recompiling) on another Windows7 computer with AMD7950 GPU with the same conclusions.

As additional information, in my code I called several complex functions: Zaxpy and Zscal failed, while Zdotu and Dznrm2 worked fine. However, I haven't tested them in details as I did for Zaxpy.

I report a problem for rather old version of the package, because I do not see at easy way to obtain a more recent DLL for Windows.

#include <sys/types.h>
#include <stdio.h>
#include <string.h>
#include <math.h>

/* Include CLBLAS header. It automatically includes needed OpenCL header,
 * so we can drop out explicit inclusion of cl.h header.
 */
#include <clAmdBlas.h>

/* This example uses predefined matrices and their characteristics for
 * simplicity purpose.
 */
static const size_t N = 2;
static const cl_double2 alpha = {{10,0}};
static cl_double2 X[] = {
    {{11,0}},
    {{21,0}},
};
static const int incx = 1;

static cl_double2 Y[] = {
    {{15,0}},
    {{11,0}},
};
static const int incy = 1;

static void
printResult(void)
{
    size_t i;
    printf("\nResult:\n");

    printf("Y\n");
    for (i = 0; i < N; i++) {
             printf("\t%f + %fi\n",Y[i].s[0],Y[i].s[1]);
    }
}

int
main(void)
{
    cl_int err;
    cl_platform_id platform = 0;
    cl_device_id device = 0;
    cl_context_properties props[3] = { CL_CONTEXT_PLATFORM, 0, 0 };
    cl_context ctx = 0;
    cl_command_queue queue = 0;
    cl_mem bufX, bufY;
    cl_event event = NULL;
    int ret = 0;
    int lenX = 1 + (N-1)*abs(incx);
    int lenY = 1 + (N-1)*abs(incy);

    /* Setup OpenCL environment. */
    err = clGetPlatformIDs(1, &platform, NULL);
    if (err != CL_SUCCESS) {
        printf( "clGetPlatformIDs() failed with %d\n", err );
        return 1;
    }

    err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    if (err != CL_SUCCESS) {
        printf( "clGetDeviceIDs() failed with %d\n", err );
        return 1;
    }

    props[1] = (cl_context_properties)platform;
    ctx = clCreateContext(props, 1, &device, NULL, NULL, &err);
    if (err != CL_SUCCESS) {
        printf( "clCreateContext() failed with %d\n", err );
        return 1;
    }

    queue = clCreateCommandQueue(ctx, device, 0, &err);
    if (err != CL_SUCCESS) {
        printf( "clCreateCommandQueue() failed with %d\n", err );
        clReleaseContext(ctx);
        return 1;
    }

    /* Setup clAmdBlas. */
    err = clAmdBlasSetup();
    if (err != CL_SUCCESS) {
        printf("clAmdBlasSetup() failed with %d\n", err);
        clReleaseCommandQueue(queue);
        clReleaseContext(ctx);
        return 1;
    }

    /* Prepare OpenCL memory objects and place matrices inside them. */
    bufX = clCreateBuffer(ctx, CL_MEM_READ_ONLY, (lenX*sizeof(cl_double2)), NULL, &err);
    bufY = clCreateBuffer(ctx, CL_MEM_READ_WRITE, (lenY*sizeof(cl_double2)), NULL, &err);

    err = clEnqueueWriteBuffer(queue, bufX, CL_TRUE, 0, (lenX*sizeof(cl_double2)), X, 0, NULL, NULL);
    err = clEnqueueWriteBuffer(queue, bufY, CL_TRUE, 0, (lenY*sizeof(cl_double2)), Y, 0, NULL, NULL);

    /* Call clAmdBlas function. */
    err = clAmdBlasZaxpy( N, alpha, bufX, 0, incx, bufY, 0, incy, 1, &queue, 0, NULL, &event); 
    if (err != CL_SUCCESS) {
        printf("clAmdBlasZaxpy() failed with %d\n", err);
        ret = 1;
    }
    else {
        /* Wait for calculations to be finished. */
        err = clWaitForEvents(1, &event);

        /* Fetch results of calculations from GPU memory. */
        err = clEnqueueReadBuffer(queue, bufY, CL_TRUE, 0, (lenY*sizeof(cl_double2)),
                                    Y, 0, NULL, NULL);

        /* At this point you will get the result of SAXPY placed in vector Y. */
        printResult();
    }

    /* Release OpenCL memory objects. */
    clReleaseMemObject(bufY);
    clReleaseMemObject(bufX);

    /* Finalize work with clAmdBlas. */
    clAmdBlasTeardown();

    /* Release OpenCL working objects. */
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);

    return ret;
}
kknox commented 10 years ago

Hi @myurkin A binary release of clBLAS is available on the releases page: https://github.com/clMathLibraries/clBLAS/releases

Given that you are mixing mingw with visual studio, and given your questions in #43, I suspect that this is an artifact calling between mingw and msvc binaries. Can you retest with the latest binary given above? Are you capable of recompiling clBLAS with mingw and using 1 compiler for everything?

Kent

kknox commented 10 years ago

@myurkin Just to confirm, i compiled everything with msvc 2012 and your example program above worked.

kknox commented 10 years ago

Going to close this issue soon and assume it is resolved.

myurkin commented 10 years ago

Sorry for delay. The issue is not resolved. I will provide more information within a few days.

myurkin commented 10 years ago

I've tried the latest clBLAS - no changes. I haven't tried recompiling clBLAS since I am lazy to install all dependencies. However, I studied the issue somewhat deeper. The functions that are affected (two mentioned above) accept cl_double2 argument, while those unaffected (two mentioned above) - do not have such arguments. Then I found out that the problem is related to the "__cl_double2 v2" part of the union cl_double2 in cl_platform.h. (I am using version 1.1 of the header, but I checked that the relevant part are not changed up to version 2.0). If I comment out this part of the union, zaxpy.c compiles and works fine. The same hack fixed my original (larger) code.

Then I've tried to understand the problem further and looked at definition of __cl_double2 in the same header. There are several branches, but none of them helps. It is probably related to the fact that I am anyway using emmintrin.h from gcc. So all definitions boil down to something like double __attribute__ ((__vector_size__ (16))) or double __attribute__ ((__vector_size__ (16), __may_alias__))

There are two things I do not understand at the moment: 1) How cl_double2 is represented, when you compile the dll with visual studio? Is __cl_double2 used at all (i.e. is __SSE2__ defined)? If yes, can you say anything about the definition of __m128d (e.g. from some system header)? Is it compatible with the above definitions?

2) Why adding (or removing) a member of the union changes anything at all? It seems that both size and alignment of the union doesn't change. I suspect that mentioning a "vector" variable in the union somehow changes the way it is passed to the function, but I really lack knowledge on this subject. Do you know, if there is an easy way to see how a function is called?

Searching the Internet I stumbled upon the following seemingly relevant discussion http://stackoverflow.com/q/21743144 but I do not got any specific answers from it. In particular, it says that Microsoft compiler is aligning stack at only 4 bytes in 32bit mode, but it doesn't answer the question 2 above.

myurkin commented 10 years ago

Concerning workarounds. Hacking system headers is definitely not a good idea. So two other options are: 1) compile with -mno-sse. The simplest solution, but will most probably affect the performance of the remaining code. 2) include cl.h, then redefine cl_double2 to a corrected type (without __cl_double2), then include clAmdBlas.h (or clBLAS.h). Should be fine if cl_double2 is not used in some large arrays or something like that (and its v2 part is never addressed).

kknox commented 10 years ago

Hi @myurkin Sorry for they delay; when I look at cl_double2 in vs2013, nothing is defined in the union except for cl_double CL_ALIGNED(16) s[2]; However, on MSVC there is a comment in the header wrt CL_ALIGNED:

Alignment keys neutered on windows because MSVC can't swallow function arguments with alignment requirements

Also, I don't see the SSE2 defined in MSVC, so cl_double2 does not have __attribute__((vector_size(16))) defined.

So, I imagine that the clBLAS library I built with MSVC does not expect the cl_double2 parameters to be aligned, or that it is a packed data type. The application that you are compiling with mingw could be aligning the cl_double2 parameter in the call site, or pass the entire cl_double2 datatype through XMM register. MSVC would probably pass arrays as a pointer, I imagine. With the way that the header is conditionally compiled with pre-processor defines, mixing compilers would break the ABI between the app and the library.

I think the solution would be to compile everything with the same compiler, so that the ABI between the components is consistent.

myurkin commented 10 years ago

Unfortunately, compiling everything with the same compiler is not a convenient solution, as it somewhat contradicts the whole notion of dll. In particular, I want to be responsible only for the Windows executable of my code. While the user can get clBlas.dll from any source (and potentially a newer version then was available when I was compiling my code).

Concerning __SSE2__ - can it be enabled for release version of the dll? I guess it won't harm. At least, we can test if this solves the issue. However, this may in turn cause problems with the programs that do not use sse2 - again, we can test this.

kknox commented 10 years ago

I do understand that this is an inconvenient issue, but I believe a consequence of the preprocessor defines in cl.h. If the #defines are different with different compilers, the interface risks being different.

You might want to check out this stackoverflow issue wrt SSE2 defines in visual studio.

I don't believe that this is an issue with clBLAS anymore, and will be closing it soon unless you can convince me that there is an action item for us.

myurkin commented 10 years ago

I agree, that little can be done on clBLAS side. You may, however, want to mention this issue in some readme supplied with 32-bit Windows dll. Then the code using clBLAS should take care of the compatible definition of the cl_double2.