CNugteren / CLBlast

Tuned OpenCL BLAS

Tests fail on Debian stretch with beignet #231

Closed vi closed 6 years ago

vi commented 6 years ago

With beignet 1.3.2-1 and CLBlast v1.2.0, multiple tests fail:

Total Test time (real) = 397.13 sec

The following tests FAILED:
      5 - clblast_test_xdot (Failed)
      6 - clblast_test_xdotu (Failed)
      7 - clblast_test_xdotc (Failed)
      8 - clblast_test_xnrm2 (Failed)
      9 - clblast_test_xasum (Failed)
     12 - clblast_test_xgbmv (OTHER_FAULT)
     34 - clblast_test_xgemm (OTHER_FAULT)
     37 - clblast_test_xsyrk (Failed)
     38 - clblast_test_xherk (Failed)
     39 - clblast_test_xsyr2k (Failed)
     40 - clblast_test_xher2k (Failed)
     46 - clblast_test_xgemmbatched (OTHER_FAULT)

Additionally, a matmul program built against the Netlib API of CLBlast produces a wrong result once the matrix is big enough:

$ ./matmul_cl -n 191 -a 6
...
Central cell: 42.8968
$ ./matmul_cl -n 192 -a 6
...
Central cell: 0

It also fails on the master branch.
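
To find the size at which the result turns to zero, the two runs above can be generalized into a scan over `-n`. This is only a minimal sketch: `first_failing_n` is a hypothetical helper, and it assumes that `matmul_cl` accepts its flags in any order and prints a `Central cell:` line exactly as shown above.

```shell
# Print the first n in [lo, hi] for which the given command reports
# "Central cell: 0". Returns non-zero if no such n is found.
# Assumption: the command takes "-n <size>" and prints "Central cell: <value>".
first_failing_n() {
  lo="$1"; hi="$2"; shift 2
  n="$lo"
  while [ "$n" -le "$hi" ]; do
    if "$@" -n "$n" 2>/dev/null | grep -Eq '^Central cell: 0[[:space:]]*$'; then
      echo "$n"
      return 0
    fi
    n=$((n + 1))
  done
  return 1
}
```

Possible usage: `first_failing_n 180 200 ./matmul_cl -a 6`.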

CNugteren commented 6 years ago

Thanks for reporting. Could you give a bit more info though? Which device are you testing on? And can you post the results of the failing test runs?

vi commented 6 years ago

Lenovo ThinkPad X230.

Linux vi-notebook 4.9.33-grsec-64+ #85 SMP PREEMPT Sat Jul 15 00:47:47 +03 2017 x86_64 GNU/Linux
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)

How do I get those results? Is it just the terminal output?

CNugteren commented 6 years ago

It would be helpful to run clinfo or the included clblast_test_diagnostics tool to get the name of your device (e.g. HD Graphics Haswell Ultrabook GT2 Mobile).

CMake just runs the test executables, but stores the output somewhere else. You can probably find that in subfolders on disk. Consult the CMake/CTest documentation to get more info. Otherwise, you can just manually run the test executables, e.g. ./clblast_test_xaxpy.
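
As a concrete starting point: CTest normally writes the output of the last run to `Testing/Temporary/LastTest.log` inside the build directory. The sketch below prints that file if it exists; `show_ctest_log` is a hypothetical helper name, and the build directory location is an assumption.

```shell
# Print the CTest log of the last test run from a build directory.
# Assumption: the tests were run with ctest (or "make test" / "ninja test")
# from that directory, so Testing/Temporary/LastTest.log exists.
show_ctest_log() {
  build_dir="${1:-.}"
  log="$build_dir/Testing/Temporary/LastTest.log"
  if [ -f "$log" ]; then
    cat "$log"
  else
    echo "no CTest log found in $build_dir" >&2
    return 1
  fi
}

# Alternative: re-run a single test and print its output when it fails:
#   ctest --output-on-failure -R clblast_test_xdot
```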

vi commented 6 years ago

Number of platforms                               1
  Platform Name                                   Intel Gen OCL Driver
  Platform Vendor                                 Intel
  Platform Version                                OpenCL 2.0 beignet 1.3
  Platform Profile                                FULL_PROFILE
  Platform Extensions                             cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing
  Platform Extensions function suffix             Intel

  Platform Name                                   Intel Gen OCL Driver
Number of devices                                 1
  Device Name                                     Intel(R) HD Graphics IvyBridge M GT2
  Device Vendor                                   Intel
  Device Vendor ID                                0x8086
  Device Version                                  OpenCL 1.2 beignet 1.3
  Driver Version                                  1.3
  Device OpenCL C Version                         OpenCL C 1.2 beignet 1.3
  Device Type                                     GPU
  Device Profile                                  FULL_PROFILE
  Max compute units                               16
  Max clock frequency                             1000MHz
  Device Partition                                (core)
    Max number of sub-devices                     1
    Supported partition types                     None, None, None
  Max work item dimensions                        3
  Max work item sizes                             512x512x512
  Max work group size                             512
  Preferred work group size multiple              16
  Preferred / native vector sizes                 
    char                                                16 / 8       
    short                                                8 / 8       
    int                                                  4 / 4       
    long                                                 2 / 2       
    half                                                 0 / 8        (n/a)
    float                                                4 / 4       
    double                                               0 / 2        (n/a)
  Half-precision Floating-point support           (n/a)
  Single-precision Floating-point support         (core)
    Denormals                                     No
    Infinity and NANs                             Yes
    Round to nearest                              Yes
    Round to zero                                 No
    Round to infinity                             No
    IEEE754-2008 fused multiply-add               No
    Support is emulated in software               No
    Correctly-rounded divide and sqrt operations  No
  Double-precision Floating-point support         (n/a)
  Address bits                                    32, Little-Endian
  Global memory size                              2147483648 (2GiB)
  Error Correction support                        No
  Max memory allocation                           1610612736 (1.5GiB)
  Unified memory for Host and Device              Yes
  Minimum alignment for any data type             128 bytes
  Alignment of base address                       1024 bits (128 bytes)
  Global Memory cache type                        Read/Write
  Global Memory cache size                        8192
  Global Memory cache line                        64 bytes
  Image support                                   Yes
    Max number of samplers per kernel             16
    Max size for 1D images from buffer            65536 pixels
    Max 1D or 2D image array size                 2048 images
    Base address alignment for 2D image buffers   4096 bytes
    Pitch alignment for 2D image buffers          1 bytes
    Max 2D image size                             8192x8192 pixels
    Max 3D image size                             8192x8192x2048 pixels
    Max number of read image args                 128
    Max number of write image args                8
  Local memory type                               Local
  Local memory size                               65536 (64KiB)
  Max constant buffer size                        134217728 (128MiB)
  Max number of constant args                     8
  Max size of kernel argument                     1024
  Queue properties                                
    Out-of-order execution                        No
    Profiling                                     Yes
  Prefer user sync for interop                    Yes
  Profiling timer resolution                      80ns
  Execution capabilities                          
    Run OpenCL kernels                            Yes
    Run native kernels                            Yes
    SPIR versions                                 1.2
  printf() buffer size                            1048576 (1024KiB)
  Built-in kernels                                __cl_copy_region_align4;__cl_copy_region_align16;__cl_cpy_region_unalign_same_offset;__cl_copy_region_unalign_dst_offset;__cl_copy_region_unalign_src_offset;__cl_copy_buffer_rect;__cl_copy_image_1d_to_1d;__cl_copy_image_2d_to_2d;__cl_copy_image_3d_to_2d;__cl_copy_image_2d_to_3d;__cl_copy_image_3d_to_3d;__cl_copy_image_2d_to_buffer;__cl_copy_image_3d_to_buffer;__cl_copy_buffer_to_image_2d;__cl_copy_buffer_to_image_3d;__cl_fill_region_unalign;__cl_fill_region_align2;__cl_fill_region_align4;__cl_fill_region_align8_2;__cl_fill_region_align8_4;__cl_fill_region_align8_8;__cl_fill_region_align8_16;__cl_fill_region_align128;__cl_fill_image_1d;__cl_fill_image_1d_array;__cl_fill_image_2d;__cl_fill_image_2d_array;__cl_fill_image_3d;block_motion_estimate_intel;
  Device Available                                Yes
  Compiler Available                              Yes
  Linker Available                                Yes
  Device Extensions                               cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing cl_intel_motion_estimation

NULL platform behavior
  clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...)  Intel Gen OCL Driver
  clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...)   Success [Intel]
  clCreateContext(NULL, ...) [default]            Success [Intel]
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU)  Success (1)
    Platform Name                                 Intel Gen OCL Driver
    Device Name                                   Intel(R) HD Graphics IvyBridge M GT2
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM)  No devices found in platform
  clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL)  Success (1)
    Platform Name                                 Intel Gen OCL Driver
    Device Name                                   Intel(R) HD Graphics IvyBridge M GT2

ICD loader properties
  ICD loader Name                                 OpenCL ICD Loader
  ICD loader Vendor                               OCL Icd free software
  ICD loader Version                              2.2.11
  ICD loader Profile                              OpenCL 2.1

* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]

 --- OpenCL device naming:
* Device type                   GPU
* Device name                   Intel(R) HD Graphics IvyBridge M GT2
* Platform vendor               Intel
* Platform version              OpenCL 2.0 beignet 1.3

 --- CLBlast device naming:
* Device type                   GPU
* Device name                   Intel(R) HD Graphics IvyBridge M GT2
* Device vendor                 Intel
* Device architecture           

 --- OpenCL device properties:
* Max work group size           512
* Max work item dimensions      3
* - Max work item size #0       512
* - Max work item size #1       512
* - Max work item size #2       512
* Local memory size             65536KB
* Extensions:
cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing cl_intel_motion_estimation

 --- Some OpenCL library benchmarks (functions from clpp11.h):
* queue.GetContext()            0.0003 ms
* queue.GetDevice()             0.0002 ms
* device.Name()                 0.0002 ms
* device.Vendor()               0.0001 ms
* device.Version()              0.0002 ms
* device.Platform()             0.0001 ms
* Buffer<float>(context, 1024)  0.0071 ms

$ DISPLAY= ./clblast_test_xaxpy

* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -full_test [false]
    -verbose [false]
    -cblas 1 [=default]

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'SAXPY' routine. Legend:
   : -> Test produced correct results
   . -> Test returned the correct error code
   X -> Test produced incorrect results
   / -> Test returned an incorrect error code
   \ -> Test not executed: OpenCL-kernel compilation error
   o -> Test not executed: Unsupported precision
   - -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for 'default':
   ::::::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 36 passed / 0 skipped / 0 failed
* Completed all test-cases for this routine. Results:
   36 test(s) passed
   0 test(s) skipped
   0 test(s) failed

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'DAXPY' routine.
* All tests skipped: Unsupported precision

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'CAXPY' routine. Legend:
   : -> Test produced correct results
   . -> Test returned the correct error code
   X -> Test produced incorrect results
   / -> Test returned an incorrect error code
   \ -> Test not executed: OpenCL-kernel compilation error
   o -> Test not executed: Unsupported precision
   - -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for 'default':
   ::::::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 36 passed / 0 skipped / 0 failed
* Completed all test-cases for this routine. Results:
   36 test(s) passed
   0 test(s) skipped
   0 test(s) failed

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'ZAXPY' routine.
* All tests skipped: Unsupported precision

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'HAXPY' routine.
* All tests skipped: Unsupported precision

$ DISPLAY= ./clblast_test_xdot

* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -full_test [false]
    -verbose [false]
    -cblas 1 [=default]

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'SDOT' routine. Legend:
   : -> Test produced correct results
   . -> Test returned the correct error code
   X -> Test produced incorrect results
   / -> Test returned an incorrect error code
   \ -> Test not executed: OpenCL-kernel compilation error
   o -> Test not executed: Unsupported precision
   - -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for 'default':
   XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
   Error rate 100.00%: n=7 incx=1 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=1 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=1 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=2 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=2 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=2 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=7 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=7 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=7 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=1 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=1 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=1 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=2 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=2 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=2 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=7 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=7 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=7 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=1 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=1 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=1 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=2 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=2 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=2 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=7 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=7 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=7 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=1 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=1 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=1 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=2 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=2 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=2 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=7 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=7 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=7 incy=7 offx=0 offy=0 offdot=0 
   Pass rate   0.0%: 0 passed / 0 skipped / 36 failed
* Completed all test-cases for this routine. Results:
   0 test(s) passed
   0 test(s) skipped
   36 test(s) failed

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'DDOT' routine.
* All tests skipped: Unsupported precision

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'HDOT' routine.
* All tests skipped: Unsupported precision

vi commented 6 years ago

Can CLBlast fall back to usual OpenBLAS for unsupported operations by the way?

CNugteren commented 6 years ago

OK thanks, probably Intel(R) HD Graphics IvyBridge M GT2 was sufficient info though ;-) Could you post the output of all failing tests? Either as attachment or as Gist/Pastebin, because otherwise this issue becomes a bit unreadable.

> Can CLBlast fall back to usual OpenBLAS for unsupported operations by the way?

What do you mean exactly? And which routines? Are you using the Netlib API by the way instead of the OpenCL API? That's not recommended for speed as you might already know, definitely not on small GPUs such as the Intel GPU you have.

But anyway, let's fix the tests first. One thing you can try is to run the tuners (see README), because perhaps the defaults are not suitable for your particular GPU. I've tested on other Intel GPUs with Beignet with success.

vi commented 6 years ago

> What do you mean exactly? And which routines?

I see messages in the tests about failed operations because of missing features in the GPU ("Unsupported precision", such as double or half floats). With the Netlib API I expect all GPU details to be fully abstracted from the user application, but this dependence on GPU features, with exceptions when something is missing, is an abstraction leak. A proper way would be to fall back to a CPU implementation if the GPU can't do something. I don't know which exact routines (I have never programmed for BLAS so far), but I expect CLBlast with Netlib API to be a drop-in replacement for OpenBLAS (or something like that). It may even include cblas.h or even be ABI-compatible with OpenBLAS, so that existing programs may LD_PRELOAD CLBlast and receive the GPU speedup even without recompilation.


> Could you post the output of all failing tests?

for i in clblast_test_*; do echo $i; DISPLAY= ./$i; echo $?; done &> log ?
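
A slightly more defensive variant of that loop, as a sketch only: `collect_test_logs` is a hypothetical helper, and it assumes the test executables sit in one directory and share the `clblast_test_` prefix seen in the CTest summary.

```shell
# Run every executable named clblast_test_* in a directory and append its
# output and exit status to a single log file.
collect_test_logs() {
  dir="$1"; out="$2"
  : > "$out"                          # start with an empty log
  for t in "$dir"/clblast_test_*; do
    [ -x "$t" ] || continue           # skip non-executables and unmatched globs
    echo "=== $t" >> "$out"
    DISPLAY= "$t" >> "$out" 2>&1      # unset DISPLAY only for this run
    echo "=== exit status: $?" >> "$out"
  done
}
```

Possible usage from the build directory: `collect_test_logs . all_tests.log`.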

vi commented 6 years ago

Tried tuning (log and jsons), but the python script fails afterwards:

$ python ../scripts/database/database.py . ..
[database] Downloading database from 'https://raw.githubusercontent.com/CNugteren/CLBlast-database/master/database.json'...
[database] Loading database from '../scripts/database/database.json'
[database] Processing './clblast_copy_32.json' with 128 new items
[database] Processing './clblast_routine_gemm_32.json' with 31 new items
[database] Processing './clblast_xger_32.json' with 108 new items
[database] Processing './clblast_xdot_2_32.json' with 5 new items
[database] Processing './clblast_padtranspose_32.json' with 14 new items
[database] Processing './clblast_xgemm_direct_2_32.json' with 125 new items
[database] Processing './clblast_xaxpy_32.json' with 64 new items
[database] Processing './clblast_xgemm_2_32.json' with 229 new items
[database] Processing './clblast_xgemv_fast_32.json' with 30 new items
[database] Processing './clblast_xdot_1_32.json' with 5 new items
[database] Processing './clblast_xgemv_fast_rot_32.json' with 68 new items
[database] Processing './clblast_xgemm_direct_1_32.json' with 45 new items
[database] Processing './clblast_xgemv_32.json' with 12 new items
[database] Processing './clblast_pad_32.json' with 72 new items
[database] Processing './clblast_transpose_32.json' with 48 new items
[database] Processing './clblast_xgemm_1_32.json' with 560 new items
[database] Saving database to '../scripts/database/database.json'
[database] Calculating the best results per device/kernel...
[database] Calculating the default values...
[database] Producing a C++ database in '../src/database/kernels'...
Traceback (most recent call last):
  File "../scripts/database/database.py", line 154, in <module>
    main(sys.argv[1:])
  File "../scripts/database/database.py", line 148, in main
    clblast.print_cpp_database(database_best_results, cpp_database_path)
  File "/mnt/src/git/CLBlast/scripts/database/database/clblast.py", line 177, in print_cpp_database
    assert len(kernel_database) == 1
AssertionError

sivagnanamn commented 6 years ago

@vi To get rid of the python error, delete the ../scripts/database/database.json file and re-run the python script. It'll download a new copy of database.json again & should work without any errors.

CNugteren commented 6 years ago

Thanks for sharing the output of the tests! A quick glance shows that it might be just the reduce and matrix-multiplication kernels failing, which are used in quite a few routines. Let's first see if the tuning can fix them.

> To get rid of the python error, delete the ../scripts/database/database.json file and re-run the python script. It'll download a new copy of database.json again & should work without any errors.

Don't think that's going to work, since he just got a fresh copy anyway.

> Tried tuning (log and jsons), but the python script fails afterwards.

OK, thanks for sharing the JSONs. I'll take a look myself this weekend at what's going wrong and I'll try to fix it and make the error message more meaningful for future cases. I'll report back as soon as I have something for you.

> I expect CLBlast with Netlib API to be a drop-in replacement for OpenBLAS (or something like that). It may even include cblas.h or even be ABI-compatible with OpenBLAS, so that existing programs may LD_PRELOAD CLBlast and receive the GPU speedup even without recompilation.

I understand your point, but that might be less trivial to implement than you suggest. First of all, you'll have to query OpenCL to see what's supported and what's not. Then, you'll have to call a BLAS routine, which is not trivial to do since CLBlast also uses cblas.h for that case. But then you'll need all the extra logic to make sure this runs on any platform with any CPU BLAS, and you'll need proper testing. However, even after doing all that, the Netlib API of CLBlast is only useful if you think about speed before using it, e.g. an AXPY operation will be slower due to memory copying overhead. Even a GEMM operation might be slower. So ideally you'll want decision making per machine/routine/parameters. My conclusion here is that it will take too much effort to make the drop-in Netlib API of CLBlast useful in the sense that you describe. And since it's not the main/recommended API, I will not be able to work on this in the foreseeable future, but I'll accept pull requests of course.
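
For the first step (finding out what the device supports), a rough check can be done from the command line before touching any code. Double precision in OpenCL is advertised through the `cl_khr_fp64` extension and half precision through `cl_khr_fp16`; neither appears in the extension list posted above, which matches the "Unsupported precision" messages. The helper name `has_extension` is hypothetical, and the sketch only inspects text on stdin.

```shell
# Succeeds if the text on stdin (e.g. the output of clinfo) mentions the
# given OpenCL extension name.
has_extension() {
  grep -q -- "$1"
}

# Example (requires clinfo to be installed):
#   if clinfo | has_extension cl_khr_fp64; then
#     echo "double precision available"
#   else
#     echo "no double precision: a CPU BLAS would be needed for D/Z routines"
#   fi
```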

CNugteren commented 6 years ago

OK, I've just tested with your JSON files (thanks again for sharing), and I didn't encounter any issue. So it is likely that some things changed in the database in the meantime since the release of CLBlast v1.2.0. So there are two things you could do:

vi commented 6 years ago

  1. Checked out the master (7aabeb44ccbaac2b467a585ec11a79a85fdd7e34)
  2. Re-run the script: [database] All done.
  3. Checked git diff - 26 chunks
  4. Re-compile: ninja - 195 targets
  5. Re-test: ninja test
The following tests FAILED:
      5 - clblast_test_xdot (Failed)
      6 - clblast_test_xdotu (Failed)
      7 - clblast_test_xdotc (Failed)
      8 - clblast_test_xnrm2 (Failed)
      9 - clblast_test_xasum (Failed)
     12 - clblast_test_xgbmv (OTHER_FAULT)
     34 - clblast_test_xgemm (OTHER_FAULT)
     37 - clblast_test_xsyrk (Failed)
     38 - clblast_test_xherk (Failed)
     39 - clblast_test_xsyr2k (Failed)
     40 - clblast_test_xher2k (Failed)
     46 - clblast_test_xgemmbatched (OTHER_FAULT)
     48 - clblast_test_preprocessor (OTHER_FAULT)

  6. Revert what database.py did: git reset --hard
  7. Re-build: ninja - 53 targets
  8. Re-test: ninja test

Same running time, same Faileds, same OTHER_FAULTs.

CNugteren commented 6 years ago

OK, thanks for testing. Let's first try to resolve the OTHER_FAULT issues, and then later the ones that report regular correctness issues. For three of those (GBMV, GEMM, and GEMMBATCHED), I see quite curious output, e.g.:

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'SGBMV' routine. Legend:
   : -> Test produced correct results
   . -> Test returned the correct error code
   X -> Test produced incorrect results
   / -> Test returned an incorrect error code
   \ -> Test not executed: OpenCL-kernel compilation error
   o -> Test not executed: Unsupported precision
   - -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '101 (row-major) 112 (transposed)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '102 (col-major) 111 (regular)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '102 (col-major) 112 (transposed)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
terminate called after throwing an instance of 'clblast::CLCudaAPIError'
  what():  OpenCL error: clGetDeviceIDs: -1
Aborted

This error message is from Beignet, not from CLBlast. It suddenly cannot find your device anymore, which is strange. This seems to suggest something is wrong with your OpenCL set-up, or there is perhaps a bug in beignet? I could not find any issue on the Beignet Bugzilla tracker, but perhaps you can search and file one? First also check if it is reproducible, i.e. does it always fail at exactly the same test?

Then there is another issue with clblast_test_preprocessor it seems, but this seems to be a linker issue. I can't reproduce that myself on a Debian 9 / beignet system, tried with Clang and GCC, but both seem to work fine. Which compiler are you using?

vi commented 6 years ago

Shall I try downgrading beignet to v1.2?

vi commented 6 years ago

gcc (Debian 6.3.0-18) 6.3.0 20170516

vi commented 6 years ago

Is the system supposed to be unused during tuning/testing, or is it OK to browse around (ignoring graphics lags)?

CNugteren commented 6 years ago

> Shall I try downgrading beignet to v1.2?

Not sure, you could try perhaps. I believe there was at least one other CLBlast user with your GPU, since there were already tuning results, so it must have worked correctly on some system at some point.

> gcc (Debian 6.3.0-18) 6.3.0 20170516

On my Debian 9 test system I get exactly the same output when I run g++ --version, and there everything compiles and links correctly. Anyway, let's not spend too much time on that, it's just a small issue of linking one specific, unimportant test.

> Is the system supposed to be unused during tuning/testing, or is it OK to browse around (ignoring graphics lags)?

When I test with Beignet I also run X at the same time and I don't see any issues.

So I would look for the issue in Beignet or in the GPU drivers. Try other versions perhaps, or otherwise report the issue to the Beignet developers. It could well be that the other failing tests are related to this as well...

CNugteren commented 6 years ago

So, did you have any luck with another version of Beignet? Or did you report this issue to the developers of Beignet?

vi commented 6 years ago

Not yet. And I'm not sure which steps I should take to report it. Is there some minimal failing case which is supposed to work, but doesn't?

CNugteren commented 6 years ago

I'm not sure we need a minimal failing case here. If you look at the error you are getting returned from clGetDeviceIDs it means it cannot find your GPU:

beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)

However, it did use your GPU in the tests just moments before that. So it seems to be some time-related instability. First also check if it is reproducible, i.e. does it always fail at exactly the same test instance?
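
One way to check reproducibility is to run the failing test several times and compare the complete output of each run. A minimal sketch, assuming the test output contains no timestamps or timings that change between runs; `check_reproducible` is a hypothetical helper.

```shell
# Run a command several times and compare a checksum of its combined
# stdout/stderr. Returns 0 if every run produced identical output.
check_reproducible() {
  runs="$1"; shift
  first=""
  i=1
  while [ "$i" -le "$runs" ]; do
    sum=$("$@" 2>&1 | cksum)
    echo "run $i: $sum"
    if [ -z "$first" ]; then
      first="$sum"
    elif [ "$sum" != "$first" ]; then
      echo "outputs differ"
      return 1
    fi
    i=$((i + 1))
  done
  echo "all $runs runs identical"
}
```

Possible usage: `check_reproducible 5 env DISPLAY= ./clblast_test_xgbmv`. Identical checksums would suggest the abort always happens at the same test instance.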

Beignet bugs can be filed here. You could refer to this issue perhaps? First also double-check the list of existing Beignet bugs

vi commented 6 years ago

Tried running utest_run, it seems to work...

$ /usr/lib/x86_64-linux-gnu/beignet/utest_run
...
summary:
----------
  total: 1000
  run: 959
  pass: 959
  fail: 0
  pass rate: 1.000000

vi commented 6 years ago

Notes:

  1. If I don't unset DISPLAY, I often get spammed with a repeated "Maximum number of clients reached:" message, from the test as well as from other unrelated apps.

  2. ./clblast_test_xgbmv seems to fail always at the same place with the same OpenCL error: clGetDeviceIDs: -1 error.

I thought about reducing clblast_test_xgbmv to something smaller, but the testing system is too complicated and I stopped trying after observing this.

How, for example, do I move the 'regular behaviour' for '102 (col-major) 112 (transposed)' case to the first place? Which test is running when the 156th ':' is printed? What if I duplicate the first test ('regular behaviour' for '101 (row-major) 111 (regular)') 4 times instead of going on to further tests?

CNugteren commented 6 years ago

OK, thanks for trying. In general I don't think anything can be done from the CLBlast side. Because if calling clGetDeviceIDs works the first few hundred times and then doesn't work anymore, that means something strange is going on. So honestly I believe it is a bug in Beignet.

But you are right that trying to pinpoint whether it is always the same test that fails is a good idea. What you can indeed do first is to test only that particular case. I'll help you out. Let's try two steps:

  1. First run only the col-major & transposed case. For that, remove Transpose::kNo, from testblas.cpp#L26 and remove Layout::kRowMajor, from testblas.hpp#L147.

  2. Then, if it still fails, you can run the test with the -verbose option on the command-line. That should show you the values for m,n,kl,ku,lda,incx,incy it is testing for. You can adjust those values around testblas.hpp#L129 by changing kIncrements, kMatrixVectorDims, and kBandSizes.
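
Since the abort happens in the middle of a run, only the last lines of the verbose output matter for step 2. A small sketch that keeps just those lines; `tail_of_run` is a hypothetical helper, and the exact format of the `-verbose` output is not assumed here.

```shell
# Run a command and print only the last N lines of its combined
# stdout/stderr, i.e. what was printed just before it stopped.
tail_of_run() {
  n="$1"; shift
  "$@" 2>&1 | tail -n "$n"
}
```

Possible usage: `tail_of_run 20 env DISPLAY= ./clblast_test_xgbmv -verbose`.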

vi commented 6 years ago

I instead tried this:

diff --git a/test/correctness/testblas.cpp b/test/correctness/testblas.cpp
index aa4b478..be28ed3 100644
--- a/test/correctness/testblas.cpp
+++ b/test/correctness/testblas.cpp
@@ -23,7 +23,7 @@ namespace clblast {

 // The transpose configurations to test with: template parameter dependent
 template <> const std::vector<Transpose> TestBlas<half,half>::kTransposes = {Transpose::kNo, Transpose::kYes};
-template <> const std::vector<Transpose> TestBlas<float,float>::kTransposes = {Transpose::kNo, Transpose::kYes};
+template <> const std::vector<Transpose> TestBlas<float,float>::kTransposes = {Transpose::kNo, Transpose::kNo};
 template <> const std::vector<Transpose> TestBlas<double,double>::kTransposes = {Transpose::kNo, Transpose::kYes};
 template <> const std::vector<Transpose> TestBlas<float2,float2>::kTransposes = {Transpose::kNo, Transpose::kYes, Transpose::kConjugate};
 template <> const std::vector<Transpose> TestBlas<double2,double2>::kTransposes = {Transpose::kNo, Transpose::kYes, Transpose::kConjugate};
diff --git a/test/correctness/testblas.hpp b/test/correctness/testblas.hpp
index 4e02fd2..9c0830b 100644
--- a/test/correctness/testblas.hpp
+++ b/test/correctness/testblas.hpp
@@ -144,7 +144,7 @@ template <typename T, typename U> const std::vector<size_t> TestBlas<T,U>::kMatS
 template <typename T, typename U> const std::vector<size_t> TestBlas<T,U>::kVecSizes = {0, kBufferSize - 1, kBufferSize};

 // The layout/triangle options to test with
-template <typename T, typename U> const std::vector<Layout> TestBlas<T,U>::kLayouts = {Layout::kRowMajor, Layout::kColMajor};
+template <typename T, typename U> const std::vector<Layout> TestBlas<T,U>::kLayouts = {Layout::kRowMajor, Layout::kRowMajor};
 template <typename T, typename U> const std::vector<Triangle> TestBlas<T,U>::kTriangles = {Triangle::kUpper, Triangle::kLower};
 template <typename T, typename U> const std::vector<Side> TestBlas<T,U>::kSides = {Side::kLeft, Side::kRight};
 template <typename T, typename U> const std::vector<Diagonal> TestBlas<T,U>::kDiagonals = {Diagonal::kUnit, Diagonal::kNonUnit};

and got this:

* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::
   Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
terminate called after throwing an instance of 'clblast::CLCudaAPIError'
  what():  OpenCL error: clGetDeviceIDs: -1
Aborted

Is there a simple program that just calls clGetDeviceIDs in an endless loop? (I'm not familiar with the OpenCL/CUDA/GPU world in general yet.)

CNugteren commented 6 years ago

Hmm, interesting, so you think it would perhaps always happen at the n-th call to that function?

Something like this could help you perhaps:

#include <CL/opencl.h>
#include <cstdio>
#include <vector>

#define NUM_RUNS 50
#define PLATFORM_ID 0

int main() {
  for (int i = 0; i < NUM_RUNS; ++i) {
    printf("Test %d\n", i);

    // Query how many OpenCL platforms are available
    cl_uint num_platforms = 0;
    cl_int status = clGetPlatformIDs(0, NULL, &num_platforms);
    if (status != CL_SUCCESS) { printf("Error %d in clGetPlatformIDs #1\n", status); return 1; }
    if (num_platforms <= PLATFORM_ID) { printf("Platform %d not found\n", PLATFORM_ID); return 1; }

    // Retrieve the platform IDs (the vector cleans up after itself, also on early return)
    std::vector<cl_platform_id> platforms(num_platforms);
    status = clGetPlatformIDs(num_platforms, platforms.data(), NULL);
    if (status != CL_SUCCESS) { printf("Error %d in clGetPlatformIDs #2\n", status); return 1; }

    // Query the number of devices: this is the call that fails in the CLBlast tests
    cl_uint num_devices = 0;
    status = clGetDeviceIDs(platforms[PLATFORM_ID], CL_DEVICE_TYPE_ALL, 0, NULL, &num_devices);
    if (status != CL_SUCCESS) { printf("Error %d in clGetDeviceIDs\n", status); return 1; }
  }
  return 0;
}
vi commented 6 years ago

This does not fail (increased NUM_RUNS and removed the main printf), even if I also start clblast_test_xgbmv in parallel.

Running the test under valgrind (snippet):

* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   ::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
   :::::::::::::::::::::::::::==6005== Warning: invalid file descriptor 1031 in syscall open()
==6005== Warning: invalid file descriptor 1031 in syscall open()
==6005== Warning: invalid file descriptor 1031 in syscall open()
:==6005== Warning: invalid file descriptor 1031 in syscall open()
==6005== Warning: invalid file descriptor 1031 in syscall open()
beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
terminate called after throwing an instance of 'clblast::CLCudaAPIError'
  what():  OpenCL error: clGetDeviceIDs: -1
==6005== 
$ ulimit -n
1024

There are a lot of open file descriptors for /dev/dri/renderD128.

After ulimit -n 4096 the test succeeds.
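For reference, the same limit can also be inspected and raised from inside a process rather than from the shell. The sketch below is not part of CLBlast or Beignet: the helper names soft_fd_limit and raise_fd_limit are made up for illustration, and it assumes a POSIX system that provides getrlimit/setrlimit.

```cpp
#include <sys/resource.h>

// Hypothetical helper: returns the current soft limit on open file
// descriptors (what `ulimit -n` prints), or 0 if the query fails.
rlim_t soft_fd_limit() {
  struct rlimit lim;
  if (getrlimit(RLIMIT_NOFILE, &lim) != 0) { return 0; }
  return lim.rlim_cur;
}

// Hypothetical helper: sets the soft limit to `wanted`, capped at the hard
// limit (an unprivileged process cannot go beyond that). Returns true on success.
bool raise_fd_limit(rlim_t wanted) {
  struct rlimit lim;
  if (getrlimit(RLIMIT_NOFILE, &lim) != 0) { return false; }
  if (lim.rlim_max != RLIM_INFINITY && wanted > lim.rlim_max) { wanted = lim.rlim_max; }
  lim.rlim_cur = wanted;
  return setrlimit(RLIMIT_NOFILE, &lim) == 0;
}
```

Calling raise_fd_limit(4096) early in a program would roughly correspond to running ulimit -n 4096 beforehand. Like the shell command, it only postpones the failure if descriptors keep leaking.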

$ ulimit -n 10
$ ./clblast_test_xgbmv 
...
* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'SGBMV' routine. Legend:
...
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
   ::::::beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
terminate called after throwing an instance of 'clblast::CLCudaAPIError'
  what():  OpenCL error: clGetDeviceIDs: -1
Aborted
CNugteren commented 6 years ago

In your last example it fails at a different place than before, so the location is not deterministic?

vi commented 6 years ago

It fails when file descriptors run out. The CLBlast test (or some dependency) opens them, but does not close them properly.

ulimit -n sets the maximum number of file descriptors, so the fewer open files are allowed, the sooner it fails.
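One way to check for this kind of leak is to count the entries in /proc/self/fd before and after a piece of work. The following is a minimal, Linux-only sketch. All names in it (count_open_fds, leaked_fds, open_and_leak, open_and_close) are hypothetical, and the two demo functions just open /dev/null as a stand-in for whatever call is suspected of leaving /dev/dri/renderD128 open.

```cpp
#include <dirent.h>
#include <fcntl.h>
#include <unistd.h>

// Counts this process's open file descriptors by listing /proc/self/fd
// (Linux-specific). Returns -1 if the directory cannot be opened.
int count_open_fds() {
  DIR* dir = opendir("/proc/self/fd");
  if (dir == nullptr) { return -1; }
  int count = 0;
  while (struct dirent* entry = readdir(dir)) {
    if (entry->d_name[0] != '.') { ++count; }  // skip "." and ".."
  }
  closedir(dir);
  return count - 1;  // exclude the descriptor that opendir() itself held
}

// Runs `work` a number of times and returns how many descriptors it left open.
template <typename F>
int leaked_fds(F work, int iterations) {
  const int before = count_open_fds();
  for (int i = 0; i < iterations; ++i) { work(); }
  return count_open_fds() - before;
}

// Demo stand-ins: one forgets to close its descriptor, the other does not.
void open_and_leak() { int fd = open("/dev/null", O_RDONLY); (void)fd; }
void open_and_close() { int fd = open("/dev/null", O_RDONLY); if (fd >= 0) { close(fd); } }
```

With these helpers, leaked_fds(open_and_leak, 5) should report 5 leaked descriptors and leaked_fds(open_and_close, 5) should report 0. Wrapping a suspect OpenCL call the same way would show whether the count grows per call.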

CNugteren commented 6 years ago

OK, never heard of those. CLBlast doesn't open any files while testing. Must be Beignet related then I guess?

Can you then re-run all the tests with that 'fix' applied and report which ones still have open issues?

vi commented 6 years ago
$ git describe --always --dirty
1.2.0_rc1-152-g37c5e8f

$ for i in clblast_test_*; do DISPLAY= ./$i ; echo RESULT $i $?; done &> log 

$ grep RESULT log
RESULT clblast_test_diagnostics 0
RESULT clblast_test_override_parameters 0
RESULT clblast_test_preprocessor 133
RESULT clblast_test_retrieve_parameters 0
RESULT clblast_test_xamax 0
RESULT clblast_test_xasum 1
RESULT clblast_test_xaxpy 0
RESULT clblast_test_xaxpybatched 0
RESULT clblast_test_xcopy 0
RESULT clblast_test_xdot 1
RESULT clblast_test_xdotc 1
RESULT clblast_test_xdotu 0
RESULT clblast_test_xgbmv 0
RESULT clblast_test_xgemm 1
RESULT clblast_test_xgemmbatched 0
RESULT clblast_test_xgemmstridedbatched 0
RESULT clblast_test_xgemv 0
RESULT clblast_test_xger 0
RESULT clblast_test_xgerc 0
RESULT clblast_test_xgeru 0
RESULT clblast_test_xhbmv 0
RESULT clblast_test_xhemm 0
RESULT clblast_test_xhemv 0
RESULT clblast_test_xher 0
RESULT clblast_test_xher2 0
RESULT clblast_test_xher2k 1
RESULT clblast_test_xherk 1
RESULT clblast_test_xhpmv 0
RESULT clblast_test_xhpr 0
RESULT clblast_test_xhpr2 0
RESULT clblast_test_xim2col 0
RESULT clblast_test_xnrm2 1
RESULT clblast_test_xomatcopy 0
RESULT clblast_test_xsbmv 0
RESULT clblast_test_xscal 0
RESULT clblast_test_xspmv 0
RESULT clblast_test_xspr 0
RESULT clblast_test_xspr2 0
RESULT clblast_test_xswap 0
RESULT clblast_test_xsymm 0
RESULT clblast_test_xsymv 0
RESULT clblast_test_xsyr 0
RESULT clblast_test_xsyr2 0
RESULT clblast_test_xsyr2k 1
RESULT clblast_test_xsyrk 1
RESULT clblast_test_xtbmv 0
RESULT clblast_test_xtpmv 0
RESULT clblast_test_xtrmm 0
RESULT clblast_test_xtrmv 0
RESULT clblast_test_xtrsm 0
RESULT clblast_test_xtrsv 0

Full log: https://gist.github.com/vi/768d8d06915b58dc57c4c9a41802ddd9

CNugteren commented 6 years ago

OK, thank you for running. So in summary the remaining failing tests are:

RESULT clblast_test_xasum 1
RESULT clblast_test_xdot 1
RESULT clblast_test_xdotc 1
RESULT clblast_test_xnrm2 1
RESULT clblast_test_xgemm 1
RESULT clblast_test_xher2k 1
RESULT clblast_test_xherk 1
RESULT clblast_test_xsyr2k 1
RESULT clblast_test_xsyrk 1

The first 4 all use the dot/reduce kernels, while the other 5 all use the GEMM kernels. So it seems perhaps there are only 2 real failures. Let's start with the first one (the dot kernel). Could you run two things:

./clblast_tuner_xdot
./clblast_test_xdot --verbose

And then perhaps fiddle with the parameters for your device for that kernel (single precision only first): src/database/kernels/xdot/xdot_32.hpp#L84 and see if you can get that working with different parameters.

vi commented 6 years ago

./clblast_tuner_xdot
./clblast_test_xdot --verbose

* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -precision 32 (single) [=default]
    -n 2097152 [=default]
    -fraction 1.00 [=default]
    -runs 10 [=default]
    -max_l2_norm 0.00 [=default]

* Found 6 configuration(s)
* Parameters explored: WGS1 

|   ID | total |param |       compiles |         time |   GB/s |            status |
x------x-------x------x----------------x--------------x--------x-------------------x
|  ref |     - |    - |             OK |      1.27 ms |      - |      reference OK |
x------x-------x------x----------------x--------------x--------x-------------------x
|    1 |     6 |   32 |   OK    365 ms |      1.28 ms |   13.1 |     results match |
|    2 |     6 |   64 |   OK    355 ms |      1.22 ms |   13.7 |     results match |
|    3 |     6 |  128 |   OK    375 ms |      1.18 ms |   14.3 |     results match |
|    4 |     6 |  256 |   OK    348 ms |      1.28 ms |   13.2 |     results match |
|    5 |     6 |  512 |   OK    344 ms |      1.27 ms |   13.2 |     results match |
|    6 |     6 | 1024 |   OK    353 ms |  error -55   |      - |   invalid config. | <-- skipping
x------x-------x------x----------------x--------------x--------x-------------------x

* Found best result 1.18 ms: 14.3 GB/s
* Best parameters: PRECISION=32 WGS1=128

* Writing a total of 5 results to 'clblast_xdot_1_32.json'
* Completed tuning process

* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -precision 32 (single) [=default]
    -n 2097152 [=default]
    -fraction 1.00 [=default]
    -runs 10 [=default]
    -max_l2_norm 0.00 [=default]

* Found 6 configuration(s)
* Parameters explored: WGS2 

|   ID | total |param |       compiles |         time |    N/A |            status |
x------x-------x------x----------------x--------------x--------x-------------------x
|  ref |     - |    - |             OK |      0.14 ms |      - |      reference OK |
x------x-------x------x----------------x--------------x--------x-------------------x
|    1 |     6 |   32 |   OK    368 ms |      0.11 ms |    0.0 |     results match |
|    2 |     6 |   64 |   OK    363 ms |      0.14 ms |    0.0 |     results match |
|    3 |     6 |  128 |   OK    385 ms |      0.14 ms |    0.0 |     results match |
|    4 |     6 |  256 |   OK    346 ms |      0.11 ms |    0.0 |     results match |
|    5 |     6 |  512 |   OK    352 ms |      0.17 ms |    0.0 |     results match |
|    6 |     6 | 1024 |   OK    350 ms |  error -55   |      - |   invalid config. | <-- skipping
x------x-------x------x----------------x--------------x--------x-------------------x

* Found best result 0.11 ms: 0.0 N/A
* Best parameters: PRECISION=32 WGS2=32

* Writing a total of 5 results to 'clblast_xdot_2_32.json'
* Completed tuning process

* Options given/available:
    -platform 0 [=default]
    -device 0 [=default]
    -full_test [false]
    -verbose [true]
    -cblas 1 [=default]
    -clblas 0 [=default]

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'SDOT' routine. Legend:
   : -> Test produced correct results
   . -> Test returned the correct error code
   X -> Test produced incorrect results
   / -> Test returned an incorrect error code
   \ -> Test not executed: OpenCL-kernel compilation error
   o -> Test not executed: Unsupported precision
   - -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for 'default':
   Testing: n=7 incx=1 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  -0.84 (CLBlast)
   Combined average L2 error: 7.09e-01
   X
   Testing: n=7 incx=1 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  2.82 (CLBlast)
   Combined average L2 error: 7.94e+00
   X
   Testing: n=7 incx=1 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  4.17 (CLBlast)
   Combined average L2 error: 1.74e+01
   X
   Testing: n=7 incx=2 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
   Testing: n=7 incx=2 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  -2.42 (CLBlast)
   Combined average L2 error: 5.86e+00
   X
   Testing: n=7 incx=2 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
   Testing: n=7 incx=7 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  -2.50 (CLBlast)
   Combined average L2 error: 6.26e+00
   X
   Testing: n=7 incx=7 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
   Testing: n=7 incx=7 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
   Testing: n=93 incx=1 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  1.93 (CLBlast)
   Combined average L2 error: 3.72e+00
   X
   Testing: n=93 incx=1 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
   Testing: n=93 incx=1 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
   Testing: n=93 incx=2 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  7.03 (CLBlast)
   Combined average L2 error: 4.95e+01
   X
   Testing: n=93 incx=2 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  -10.82 (CLBlast)
   Combined average L2 error: 1.17e+02
   X
   Testing: n=93 incx=2 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  12.52 (CLBlast)
   Combined average L2 error: 1.57e+02
   X
   Testing: n=93 incx=7 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  9.62 (CLBlast)
   Combined average L2 error: 9.26e+01
   X
   Testing: n=93 incx=7 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  -6.64 (CLBlast)
   Combined average L2 error: 4.41e+01
   X
   Testing: n=93 incx=7 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  6.36 (CLBlast)
   Combined average L2 error: 4.05e+01
   X
   Testing: n=144 incx=1 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  2.93 (CLBlast)
   Combined average L2 error: 8.59e+00
   X
   Testing: n=144 incx=1 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  -3.51 (CLBlast)
   Combined average L2 error: 1.23e+01
   X
   Testing: n=144 incx=1 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
   Testing: n=144 incx=2 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
   Testing: n=144 incx=2 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
   Testing: n=144 incx=2 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  -1.49 (CLBlast)
   Combined average L2 error: 2.21e+00
   X
   Testing: n=144 incx=7 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
   Testing: n=144 incx=7 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
   Testing: n=144 incx=7 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
   Testing: n=4096 incx=1 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
   Testing: n=4096 incx=1 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  -41.80 (CLBlast)
   Combined average L2 error: 1.75e+03
   X
   Testing: n=4096 incx=1 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  51.62 (CLBlast)
   Combined average L2 error: 2.66e+03
   X
   Testing: n=4096 incx=2 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  82.72 (CLBlast)
   Combined average L2 error: 6.84e+03
   X
   Testing: n=4096 incx=2 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  -214.22 (CLBlast)
   Combined average L2 error: 4.59e+04
   X
   Testing: n=4096 incx=2 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  13.29 (CLBlast)
   Combined average L2 error: 1.77e+02
   X
   Testing: n=4096 incx=7 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  -28.65 (CLBlast)
   Combined average L2 error: 8.21e+02
   X
   Testing: n=4096 incx=7 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> 
   Error at index 0:  0.00 (reference) versus  22.19 (CLBlast)
   Combined average L2 error: 4.92e+02
   X
   Testing: n=4096 incx=7 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
   Error rate 100.00%: n=7 incx=1 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=1 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=1 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=2 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=7 incx=7 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=1 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=2 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=2 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=2 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=7 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=7 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=93 incx=7 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=1 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=1 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=144 incx=2 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=1 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=1 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=2 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=2 incy=2 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=2 incy=7 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=7 incy=1 offx=0 offy=0 offdot=0 
   Error rate 100.00%: n=4096 incx=7 incy=2 offx=0 offy=0 offdot=0 
   Pass rate  38.9%: 14 passed / 0 skipped / 22 failed
* Completed all test-cases for this routine. Results:
   14 test(s) passed
   0 test(s) skipped
   22 test(s) failed

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'DDOT' routine.
* All tests skipped: Unsupported precision

* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'HDOT' routine.
* All tests skipped: Unsupported precision
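For readers unfamiliar with the parameters in the log above: n is the number of elements, incx/incy are the strides through x and y, and offx/offy are starting offsets. A minimal CPU sketch of the quantity being checked follows. The function names are hypothetical, and how CLBlast combines the 0.5% relative and 0.001 absolute margins is an assumption here (a value is accepted if it is within either margin); only the margin values themselves come from the log.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Reference strided dot product: sum over i of x[offx + i*incx] * y[offy + i*incy].
// Accumulates in double to keep the reference itself accurate.
float reference_dot(std::size_t n,
                    const std::vector<float>& x, std::size_t offx, std::size_t incx,
                    const std::vector<float>& y, std::size_t offy, std::size_t incy) {
  double sum = 0.0;
  for (std::size_t i = 0; i < n; ++i) {
    sum += static_cast<double>(x[offx + i * incx]) * static_cast<double>(y[offy + i * incy]);
  }
  return static_cast<float>(sum);
}

// Assumed acceptance rule: pass if the difference is within the absolute
// margin OR within the relative margin of the reference value.
bool within_margin(float reference, float result,
                   float relative = 0.005f, float absolute = 0.001f) {
  const float diff = std::fabs(reference - result);
  return diff <= absolute || diff <= relative * std::fabs(reference);
}
```

Under this rule, a reference of 0.00 against a result of -0.84 (the first failure in the log) is rejected, since the difference exceeds the absolute margin and the relative margin of a zero reference is zero.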
vi commented 6 years ago

For completeness: the GPU may have felt a little bit sick at the time of the test. At least the graphical scaling glitch is still here.

CNugteren commented 6 years ago

Sorry, I forgot about this issue. Thanks for testing. The tuner result looks OK. I'm not sure how to continue though: I can't reproduce the issue, so I can't test or debug it myself.

Perhaps this other issue #149 might help you out. It seems it is also the same GPU, but a different OpenCL (not Beignet but Apple OpenCL).

vi commented 6 years ago

If needed I can run special modified versions for debugging or maybe give access for remote debugging on my laptop.

But maybe I should "play" with Beignet versions first. I've already built one from source code, but not sure yet how to install it into Debian (or can it be used without installation).

CNugteren commented 6 years ago

But maybe I should "play" with Beignet versions first. I've already built one from source code, but not sure yet how to install it into Debian (or can it be used without installation).

You can make install it into a directory specified when you ran CMake (-DCMAKE_INSTALL_PREFIX=/path/to/install). If none is specified, it will just install in your system's path and will overwrite any existing OpenCL afaik. Otherwise you'll have multiple OpenCL platforms and you'll need to select the right one.

vi commented 6 years ago

select the right one

How do I select the right one while ensuring no pieces of the wrong one are in the way, and also without disruptive changes to the system from root? Is it just LD_LIBRARY_PATH or LD_PRELOAD or something trickier?

CNugteren commented 6 years ago

Not sure, I'm not an expert on that... But what I meant was 'select' in the OpenCL platform sense. If you do it right, you might have both Beignets co-existing on your system, and clinfo will show both. But you can of course also try to set the library path.

CNugteren commented 6 years ago

Any updates here? Or should we conclude it is not CLBlast-related?

vi commented 6 years ago

I haven't experimented with other Beignet versions yet. Not sure if it is appropriate to report bugs there without trying a fresher build...

Maybe CLBlast is doing things OK, but it could also contain workarounds for broken platforms...

If/when I come back to experimenting with OpenCL in general and Beignet and/or CLBlast in particular, I'll comment.

CNugteren commented 6 years ago

Intel now has a new open-source implementation that is replacing Beignet. Perhaps it is time to try the new Intel NEO?

vi commented 6 years ago

Gen8 (Broadwell) and beyond

Is it something new-ish? Unlikely that it would work on my laptop.

CNugteren commented 6 years ago

Indeed, it seems that your hardware is not supported. Neo is new indeed, Beignet is now discontinued, so that won't lead to solving this issue either it seems.

How do you suggest we proceed? Do you still have time to test things? We could also close this issue and say that older hardware is not properly supported in all cases...

vi commented 6 years ago

Do you still have time to test things?

Yes, I'm constantly trying various things (CLBlast being a detour from experimenting with various deep learning toys and thinking "what if I can work around missing OpenCL support for ... by using CLBlast instead of the usual BLAS library").

How do you suggest we proceed?

Maybe like previously: me trying an updated Beignet (or just waiting until an updated Beignet eventually comes to Debian Stable), then maybe reporting additional issues to Beignet.

CNugteren commented 6 years ago

Any updates from your side?

vi commented 6 years ago

Not yet.

Is it something urgent, or do you just not want a dangling open issue? I'll report results here if/when I resume experimentation, regardless of whether this issue is closed.

For now I just treat my laptop as Not Ready For GPU Computing.

CNugteren commented 6 years ago

OK. Yes, I see this as a list of things I have to work on :-) I can also add your setup to the list of known issues, close this issue, and we can follow up later with you and/or Intel when you have time to see if anything can be fixed?

vi commented 6 years ago

Got round to installing beignet from master.

The 3 passing tests in ./clblast_test_xdot disappeared and all of them now fail, although ./clblast_test_xdotc keeps on working. The FD leak persists.

Beignet's own tests almost succeed: pass rate: 0.999005.

Using clblast dda1e567f872d3d89f2f7cd890fb5b29ff98537c and beignet 591d387327ce35f03a6152d4c823415729e221f2.

vi commented 6 years ago

Tried beignet 1.2.1 (097365ed1a79cd03dc689b37b03552e455eb3854) and I'm seeing more successful tests.

vi commented 6 years ago

Tests now look much better:

$ ninja test
[0/1] Running tests...
Test project /home/vi/src/git/CLBlast/build
      Start  1: clblast_test_xswap
 1/51 Test  #1: clblast_test_xswap .................   Passed    1.58 sec
      Start  2: clblast_test_xscal
 2/51 Test  #2: clblast_test_xscal .................   Passed    1.16 sec
      Start  3: clblast_test_xcopy
 3/51 Test  #3: clblast_test_xcopy .................   Passed    1.56 sec
      Start  4: clblast_test_xaxpy
 4/51 Test  #4: clblast_test_xaxpy .................   Passed    1.59 sec
      Start  5: clblast_test_xdot
 5/51 Test  #5: clblast_test_xdot ..................   Passed    0.82 sec
      Start  6: clblast_test_xdotu
 6/51 Test  #6: clblast_test_xdotu .................   Passed    0.88 sec
      Start  7: clblast_test_xdotc
 7/51 Test  #7: clblast_test_xdotc .................   Passed    0.89 sec
      Start  8: clblast_test_xnrm2
 8/51 Test  #8: clblast_test_xnrm2 .................   Passed    1.16 sec
      Start  9: clblast_test_xasum
 9/51 Test  #9: clblast_test_xasum .................   Passed    1.13 sec
      Start 10: clblast_test_xamax
10/51 Test #10: clblast_test_xamax .................   Passed    1.12 sec
      Start 11: clblast_test_xgemv
11/51 Test #11: clblast_test_xgemv .................   Passed    5.75 sec
      Start 12: clblast_test_xgbmv
12/51 Test #12: clblast_test_xgbmv .................   Passed   53.21 sec
      Start 13: clblast_test_xhemv
13/51 Test #13: clblast_test_xhemv .................   Passed    1.98 sec
      Start 14: clblast_test_xhbmv
14/51 Test #14: clblast_test_xhbmv .................   Passed    4.82 sec
      Start 15: clblast_test_xhpmv
15/51 Test #15: clblast_test_xhpmv .................   Passed    1.97 sec
      Start 16: clblast_test_xsymv
16/51 Test #16: clblast_test_xsymv .................   Passed    1.72 sec
      Start 17: clblast_test_xsbmv
17/51 Test #17: clblast_test_xsbmv .................   Passed    3.97 sec
      Start 18: clblast_test_xspmv
18/51 Test #18: clblast_test_xspmv .................   Passed    1.80 sec
      Start 19: clblast_test_xtrmv
19/51 Test #19: clblast_test_xtrmv .................   Passed   28.67 sec
      Start 20: clblast_test_xtbmv
20/51 Test #20: clblast_test_xtbmv .................   Passed   78.94 sec
      Start 21: clblast_test_xtpmv
21/51 Test #21: clblast_test_xtpmv .................   Passed   20.33 sec
      Start 22: clblast_test_xtrsv
22/51 Test #22: clblast_test_xtrsv .................   Passed   30.57 sec
      Start 23: clblast_test_xger
23/51 Test #23: clblast_test_xger ..................   Passed    1.49 sec
      Start 24: clblast_test_xgeru
24/51 Test #24: clblast_test_xgeru .................   Passed    1.73 sec
      Start 25: clblast_test_xgerc
25/51 Test #25: clblast_test_xgerc .................   Passed    1.68 sec
      Start 26: clblast_test_xher
26/51 Test #26: clblast_test_xher ..................   Passed    0.95 sec
      Start 27: clblast_test_xhpr
27/51 Test #27: clblast_test_xhpr ..................   Passed    0.77 sec
      Start 28: clblast_test_xher2
28/51 Test #28: clblast_test_xher2 .................   Passed    1.74 sec
      Start 29: clblast_test_xhpr2
29/51 Test #29: clblast_test_xhpr2 .................   Passed    1.33 sec
      Start 30: clblast_test_xsyr
30/51 Test #30: clblast_test_xsyr ..................   Passed    0.89 sec
      Start 31: clblast_test_xspr
31/51 Test #31: clblast_test_xspr ..................   Passed    0.72 sec
      Start 32: clblast_test_xsyr2
32/51 Test #32: clblast_test_xsyr2 .................   Passed    1.62 sec
      Start 33: clblast_test_xspr2
33/51 Test #33: clblast_test_xspr2 .................   Passed    1.28 sec
      Start 34: clblast_test_xgemm
34/51 Test #34: clblast_test_xgemm .................   Passed   39.85 sec
      Start 35: clblast_test_xsymm
35/51 Test #35: clblast_test_xsymm .................   Passed    5.64 sec
      Start 36: clblast_test_xhemm
36/51 Test #36: clblast_test_xhemm .................   Passed    2.36 sec
      Start 37: clblast_test_xsyrk
37/51 Test #37: clblast_test_xsyrk .................   Passed    3.41 sec
      Start 38: clblast_test_xherk
38/51 Test #38: clblast_test_xherk .................   Passed    1.78 sec
      Start 39: clblast_test_xsyr2k
39/51 Test #39: clblast_test_xsyr2k ................   Passed    5.20 sec
      Start 40: clblast_test_xher2k
40/51 Test #40: clblast_test_xher2k ................   Passed    2.59 sec
      Start 41: clblast_test_xtrmm
41/51 Test #41: clblast_test_xtrmm .................   Passed   59.26 sec
      Start 42: clblast_test_xtrsm
42/51 Test #42: clblast_test_xtrsm .................   Passed   74.42 sec
      Start 43: clblast_test_xhad
43/51 Test #43: clblast_test_xhad ..................   Passed    1.01 sec
      Start 44: clblast_test_xomatcopy
44/51 Test #44: clblast_test_xomatcopy .............   Passed    1.11 sec
      Start 45: clblast_test_xim2col
45/51 Test #45: clblast_test_xim2col ...............   Passed    3.37 sec
      Start 46: clblast_test_xaxpybatched
46/51 Test #46: clblast_test_xaxpybatched ..........   Passed    4.14 sec
      Start 47: clblast_test_xgemmbatched
47/51 Test #47: clblast_test_xgemmbatched ..........   Passed   65.15 sec
      Start 48: clblast_test_xgemmstridedbatched
48/51 Test #48: clblast_test_xgemmstridedbatched ...   Passed   63.62 sec
      Start 49: clblast_test_override_parameters
49/51 Test #49: clblast_test_override_parameters ...   Passed    9.00 sec
      Start 50: clblast_test_retrieve_parameters
50/51 Test #50: clblast_test_retrieve_parameters ...   Passed    0.16 sec
      Start 51: clblast_test_preprocessor
51/51 Test #51: clblast_test_preprocessor ..........***Exception: Other  7.65 sec

98% tests passed, 1 tests failed out of 51

Total Test time (real) = 609.63 sec

The following tests FAILED:
     51 - clblast_test_preprocessor (OTHER_FAULT)
Errors while running CTest
FAILED: CMakeFiles/test.util 

Open file handles for /dev/dri/renderD128 keep accumulating during long-running tests.

Shall I run the tuning process?