Closed vi closed 6 years ago
Thanks for reporting. Could you give a bit more info though? Which device are you testing on? And can you post the results of the failing test runs?
Lenovo Thinkpad X230. Linux vi-notebook 4.9.33-grsec-64+ #85 SMP PREEMPT Sat Jul 15 00:47:47 +03 2017 x86_64 GNU/Linux
00:02.0 VGA compatible controller: Intel Corporation 3rd Gen Core processor Graphics Controller (rev 09)
How do I get those results? Is it just the terminal output?
It would be helpful to run clinfo or the included clblast_test_diagnostics tool to get the name of your device (e.g. HD Graphics Haswell Ultrabook GT2 Mobile).
CMake just runs the test executables, but stores the output somewhere else. You can probably find that in subfolders on disk. Consult the CMake/CTest documentation to get more info. Otherwise, you can just manually run the test executables, e.g. ./clblast_test_xaxpy.
Number of platforms 1
Platform Name Intel Gen OCL Driver
Platform Vendor Intel
Platform Version OpenCL 2.0 beignet 1.3
Platform Profile FULL_PROFILE
Platform Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing
Platform Extensions function suffix Intel
Platform Name Intel Gen OCL Driver
Number of devices 1
Device Name Intel(R) HD Graphics IvyBridge M GT2
Device Vendor Intel
Device Vendor ID 0x8086
Device Version OpenCL 1.2 beignet 1.3
Driver Version 1.3
Device OpenCL C Version OpenCL C 1.2 beignet 1.3
Device Type GPU
Device Profile FULL_PROFILE
Max compute units 16
Max clock frequency 1000MHz
Device Partition (core)
Max number of sub-devices 1
Supported partition types None, None, None
Max work item dimensions 3
Max work item sizes 512x512x512
Max work group size 512
Preferred work group size multiple 16
Preferred / native vector sizes
char 16 / 8
short 8 / 8
int 4 / 4
long 2 / 2
half 0 / 8 (n/a)
float 4 / 4
double 0 / 2 (n/a)
Half-precision Floating-point support (n/a)
Single-precision Floating-point support (core)
Denormals No
Infinity and NANs Yes
Round to nearest Yes
Round to zero No
Round to infinity No
IEEE754-2008 fused multiply-add No
Support is emulated in software No
Correctly-rounded divide and sqrt operations No
Double-precision Floating-point support (n/a)
Address bits 32, Little-Endian
Global memory size 2147483648 (2GiB)
Error Correction support No
Max memory allocation 1610612736 (1.5GiB)
Unified memory for Host and Device Yes
Minimum alignment for any data type 128 bytes
Alignment of base address 1024 bits (128 bytes)
Global Memory cache type Read/Write
Global Memory cache size 8192
Global Memory cache line 64 bytes
Image support Yes
Max number of samplers per kernel 16
Max size for 1D images from buffer 65536 pixels
Max 1D or 2D image array size 2048 images
Base address alignment for 2D image buffers 4096 bytes
Pitch alignment for 2D image buffers 1 bytes
Max 2D image size 8192x8192 pixels
Max 3D image size 8192x8192x2048 pixels
Max number of read image args 128
Max number of write image args 8
Local memory type Local
Local memory size 65536 (64KiB)
Max constant buffer size 134217728 (128MiB)
Max number of constant args 8
Max size of kernel argument 1024
Queue properties
Out-of-order execution No
Profiling Yes
Prefer user sync for interop Yes
Profiling timer resolution 80ns
Execution capabilities
Run OpenCL kernels Yes
Run native kernels Yes
SPIR versions 1.2
printf() buffer size 1048576 (1024KiB)
Built-in kernels __cl_copy_region_align4;__cl_copy_region_align16;__cl_cpy_region_unalign_same_offset;__cl_copy_region_unalign_dst_offset;__cl_copy_region_unalign_src_offset;__cl_copy_buffer_rect;__cl_copy_image_1d_to_1d;__cl_copy_image_2d_to_2d;__cl_copy_image_3d_to_2d;__cl_copy_image_2d_to_3d;__cl_copy_image_3d_to_3d;__cl_copy_image_2d_to_buffer;__cl_copy_image_3d_to_buffer;__cl_copy_buffer_to_image_2d;__cl_copy_buffer_to_image_3d;__cl_fill_region_unalign;__cl_fill_region_align2;__cl_fill_region_align4;__cl_fill_region_align8_2;__cl_fill_region_align8_4;__cl_fill_region_align8_8;__cl_fill_region_align8_16;__cl_fill_region_align128;__cl_fill_image_1d;__cl_fill_image_1d_array;__cl_fill_image_2d;__cl_fill_image_2d_array;__cl_fill_image_3d;block_motion_estimate_intel;
Device Available Yes
Compiler Available Yes
Linker Available Yes
Device Extensions cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing cl_intel_motion_estimation
NULL platform behavior
clGetPlatformInfo(NULL, CL_PLATFORM_NAME, ...) Intel Gen OCL Driver
clGetDeviceIDs(NULL, CL_DEVICE_TYPE_ALL, ...) Success [Intel]
clCreateContext(NULL, ...) [default] Success [Intel]
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CPU) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics IvyBridge M GT2
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ACCELERATOR) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_CUSTOM) No devices found in platform
clCreateContextFromType(NULL, CL_DEVICE_TYPE_ALL) Success (1)
Platform Name Intel Gen OCL Driver
Device Name Intel(R) HD Graphics IvyBridge M GT2
ICD loader properties
ICD loader Name OpenCL ICD Loader
ICD loader Vendor OCL Icd free software
ICD loader Version 2.2.11
ICD loader Profile OpenCL 2.1
* Options given/available:
-platform 0 [=default]
-device 0 [=default]
--- OpenCL device naming:
* Device type GPU
* Device name Intel(R) HD Graphics IvyBridge M GT2
* Platform vendor Intel
* Platform version OpenCL 2.0 beignet 1.3
--- CLBlast device naming:
* Device type GPU
* Device name Intel(R) HD Graphics IvyBridge M GT2
* Device vendor Intel
* Device architecture
--- OpenCL device properties:
* Max work group size 512
* Max work item dimensions 3
* - Max work item size #0 512
* - Max work item size #1 512
* - Max work item size #2 512
* Local memory size 65536KB
* Extensions:
cl_khr_global_int32_base_atomics cl_khr_global_int32_extended_atomics cl_khr_local_int32_base_atomics cl_khr_local_int32_extended_atomics cl_khr_byte_addressable_store cl_khr_3d_image_writes cl_khr_image2d_from_buffer cl_khr_depth_images cl_khr_spir cl_khr_icd cl_intel_accelerator cl_intel_subgroups cl_intel_subgroups_short cl_khr_gl_sharing cl_intel_motion_estimation
--- Some OpenCL library benchmarks (functions from clpp11.h):
* queue.GetContext() 0.0003 ms
* queue.GetDevice() 0.0002 ms
* device.Name() 0.0002 ms
* device.Vendor() 0.0001 ms
* device.Version() 0.0002 ms
* device.Platform() 0.0001 ms
* Buffer<float>(context, 1024) 0.0071 ms
$ DISPLAY= ./clblast_test_xaxpy
* Options given/available:
-platform 0 [=default]
-device 0 [=default]
-full_test [false]
-verbose [false]
-cblas 1 [=default]
* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'SAXPY' routine. Legend:
: -> Test produced correct results
. -> Test returned the correct error code
X -> Test produced incorrect results
/ -> Test returned an incorrect error code
\ -> Test not executed: OpenCL-kernel compilation error
o -> Test not executed: Unsupported precision
- -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for 'default':
::::::::::::::::::::::::::::::::::::
Pass rate 100.0%: 36 passed / 0 skipped / 0 failed
* Completed all test-cases for this routine. Results:
36 test(s) passed
0 test(s) skipped
0 test(s) failed
* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'DAXPY' routine.
* All tests skipped: Unsupported precision
* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'CAXPY' routine. Legend:
: -> Test produced correct results
. -> Test returned the correct error code
X -> Test produced incorrect results
/ -> Test returned an incorrect error code
\ -> Test not executed: OpenCL-kernel compilation error
o -> Test not executed: Unsupported precision
- -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for 'default':
::::::::::::::::::::::::::::::::::::
Pass rate 100.0%: 36 passed / 0 skipped / 0 failed
* Completed all test-cases for this routine. Results:
36 test(s) passed
0 test(s) skipped
0 test(s) failed
* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'ZAXPY' routine.
* All tests skipped: Unsupported precision
* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'HAXPY' routine.
* All tests skipped: Unsupported precision
$ DISPLAY= ./clblast_test_xdot
* Options given/available:
-platform 0 [=default]
-device 0 [=default]
-full_test [false]
-verbose [false]
-cblas 1 [=default]
* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'SDOT' routine. Legend:
: -> Test produced correct results
. -> Test returned the correct error code
X -> Test produced incorrect results
/ -> Test returned an incorrect error code
\ -> Test not executed: OpenCL-kernel compilation error
o -> Test not executed: Unsupported precision
- -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for 'default':
XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
Error rate 100.00%: n=7 incx=1 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=7 incx=1 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=7 incx=1 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=7 incx=2 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=7 incx=2 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=7 incx=2 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=7 incx=7 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=7 incx=7 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=7 incx=7 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=1 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=1 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=1 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=2 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=2 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=2 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=7 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=7 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=7 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=144 incx=1 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=144 incx=1 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=144 incx=1 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=144 incx=2 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=144 incx=2 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=144 incx=2 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=144 incx=7 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=144 incx=7 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=144 incx=7 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=1 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=1 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=1 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=2 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=2 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=2 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=7 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=7 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=7 incy=7 offx=0 offy=0 offdot=0
Pass rate 0.0%: 0 passed / 0 skipped / 36 failed
* Completed all test-cases for this routine. Results:
0 test(s) passed
0 test(s) skipped
36 test(s) failed
* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'DDOT' routine.
* All tests skipped: Unsupported precision
* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'HDOT' routine.
* All tests skipped: Unsupported precision
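As an aside on the "error margins of 0.5% (relative) and 0.001 (absolute)" line in the output above, here is a minimal sketch of how such a check could look. It assumes a value passes when it is within either margin of the reference; that combination rule and the name WithinMargins are assumptions made for illustration and are not taken from the CLBlast test code.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper (not CLBlast's actual implementation): accepts a result
// that is close to the reference either in absolute terms or relative to the
// magnitude of the reference.
bool WithinMargins(double result, double reference,
                   double relative_margin = 0.005,   // 0.5%
                   double absolute_margin = 0.001) {
  assert(relative_margin >= 0.0 && absolute_margin >= 0.0);
  const double difference = std::fabs(result - reference);
  // Small absolute differences pass regardless of the reference's magnitude
  if (difference <= absolute_margin) return true;
  // Otherwise the difference must be small relative to the reference.
  // A NaN result makes both comparisons false, so it is rejected.
  return difference <= relative_margin * std::fabs(reference);
}
```

Under these assumptions a result of 100.4 against a reference of 100.0 would be accepted, while 101.0 would be flagged, which corresponds to an X in the legend.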
Can CLBlast fall back to usual OpenBLAS for unsupported operations by the way?
OK thanks, probably Intel(R) HD Graphics IvyBridge M GT2 was sufficient info though ;-) Could you post the output of all failing tests? Either as an attachment or as a Gist/Pastebin, because otherwise this issue becomes a bit unreadable.
Can CLBlast fall back to usual OpenBLAS for unsupported operations by the way?
What do you mean exactly? And which routines? Are you using the Netlib API by the way instead of the OpenCL API? That's not recommended for speed as you might already know, definitely not on small GPUs such as the Intel GPU you have.
But anyway, let's fix the tests first. One thing you can try is to run the tuners (see README), because perhaps the defaults are not suitable for your particular GPU? I've tested on other Intel GPUs with Beignet with success.
What do you mean exactly? And which routines?
I see messages in the tests about failed operations because of missing features in the GPU ("Unsupported precision", such as double or half floats). With the Netlib API I expect all GPU details to be fully abstracted from the user application, but this dependence on GPU features, with exceptions in case of missing things, is an abstraction leak. A proper way would be to fall back to a CPU implementation if the GPU can't do something. I don't know which exact routines (never programmed for BLAS so far), but I expect CLBlast with the Netlib API to be a drop-in replacement for OpenBLAS (or something like that). It may even include cblas.h or even be ABI-compatible with OpenBLAS, so that existing programs may LD_PRELOAD CLBlast and receive the GPU speedup even without recompilation.
Could you post the output of all failing tests?
for i in clblast_test_*; do echo $i; DISPLAY= ./$i; echo $?; done &> log ?
Tried tuning (log and jsons), but the python script fails afterwards:
$ python ../scripts/database/database.py . ..
[database] Downloading database from 'https://raw.githubusercontent.com/CNugteren/CLBlast-database/master/database.json'...
[database] Loading database from '../scripts/database/database.json'
[database] Processing './clblast_copy_32.json' with 128 new items
[database] Processing './clblast_routine_gemm_32.json' with 31 new items
[database] Processing './clblast_xger_32.json' with 108 new items
[database] Processing './clblast_xdot_2_32.json' with 5 new items
[database] Processing './clblast_padtranspose_32.json' with 14 new items
[database] Processing './clblast_xgemm_direct_2_32.json' with 125 new items
[database] Processing './clblast_xaxpy_32.json' with 64 new items
[database] Processing './clblast_xgemm_2_32.json' with 229 new items
[database] Processing './clblast_xgemv_fast_32.json' with 30 new items
[database] Processing './clblast_xdot_1_32.json' with 5 new items
[database] Processing './clblast_xgemv_fast_rot_32.json' with 68 new items
[database] Processing './clblast_xgemm_direct_1_32.json' with 45 new items
[database] Processing './clblast_xgemv_32.json' with 12 new items
[database] Processing './clblast_pad_32.json' with 72 new items
[database] Processing './clblast_transpose_32.json' with 48 new items
[database] Processing './clblast_xgemm_1_32.json' with 560 new items
[database] Saving database to '../scripts/database/database.json'
[database] Calculating the best results per device/kernel...
[database] Calculating the default values...
[database] Producing a C++ database in '../src/database/kernels'...
Traceback (most recent call last):
File "../scripts/database/database.py", line 154, in <module>
main(sys.argv[1:])
File "../scripts/database/database.py", line 148, in main
clblast.print_cpp_database(database_best_results, cpp_database_path)
File "/mnt/src/git/CLBlast/scripts/database/database/clblast.py", line 177, in print_cpp_database
assert len(kernel_database) == 1
AssertionError
@vi To get rid of the python error, delete the ../scripts/database/database.json file and re-run the python script. It'll download a new copy of database.json again & should work without any errors.
Thanks for sharing the output of the tests! A quick glance shows that it might be just the reduce and matrix-multiplication kernels failing, they are used in quite a few cases. Let's first see if the tuning can fix them.
To get rid of the python error, delete the ../scripts/database/database.json file and re-run the python script. It'll download a new copy of database.json again & should work without any errors.
Don't think that's going to work, since he just got a fresh copy anyway.
Tried tuning (log and jsons), but the python script fails afterwards.
OK, thanks for sharing the JSONs. I'll take a look myself this weekend at what's going wrong and I'll try to fix it and make the error message more meaningful for future cases. I'll report back as soon as I have something for you.
I expect CLBlast with Netlib API to be drop-in replacement for OpenBLAS (or something like that). It may even include cblas.h or even be ABI-compatible with OpenBLAS, so that existing programs may LD_PRELOAD CLBlast and receive the GPU speedup even without recompilation.
I understand your point, but that might be less trivial to implement than you suggest. First of all, you'll have to query OpenCL to see what's supported and what's not. Then, you'll have to call a BLAS routine, which is not trivial to do since CLBlast also uses cblas.h for that case. But then you'll need all the extra logic to make sure this runs on any platform with any CPU BLAS, and you'll need proper testing. However, even after doing all that, the Netlib API of CLBlast is only useful if you think about speed before using it, e.g. an AXPY operation will be slower due to memory copying overhead. Even a GEMM operation might be slower. So ideally you'll want to make that decision per machine/routine/parameters. My conclusion here is that it will take too much effort to make the drop-in Netlib API of CLBlast useful in the sense that you describe. And since it's not the main/recommended API, I will not be able to work on this in the foreseeable future, but I'll accept pull requests of course.
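To make the first step ("query OpenCL to see what's supported") a bit more concrete, here is a rough sketch that decides between the GPU and a CPU BLAS based only on the device extension string, such as the one in the clinfo output earlier in this thread. It assumes that double and half support are signalled by the cl_khr_fp64 and cl_khr_fp16 extensions. The names Precision, Backend, HasExtension and ChooseBackend are hypothetical and not part of CLBlast, and the sketch ignores the speed considerations mentioned above entirely.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical types for illustration only (not CLBlast API)
enum class Precision { kHalf, kSingle, kDouble };
enum class Backend { kGpu, kCpuBlas };

// Returns true if `name` appears as a whole token in the space-separated
// extension string as reported by the OpenCL device.
bool HasExtension(const std::string& extensions, const std::string& name) {
  assert(!name.empty());
  std::istringstream stream(extensions);
  std::string token;
  while (stream >> token) {
    if (token == name) return true;
  }
  return false;
}

// Assumption: fp64/fp16 support is advertised through these two extensions.
bool DeviceSupports(const std::string& extensions, Precision precision) {
  switch (precision) {
    case Precision::kSingle: return true;
    case Precision::kDouble: return HasExtension(extensions, "cl_khr_fp64");
    case Precision::kHalf:   return HasExtension(extensions, "cl_khr_fp16");
  }
  return false;
}

// Picks the CPU BLAS whenever the device cannot handle the precision.
Backend ChooseBackend(const std::string& extensions, Precision precision) {
  return DeviceSupports(extensions, precision) ? Backend::kGpu
                                               : Backend::kCpuBlas;
}
```

With the extension list of the IvyBridge GPU from this issue, which contains neither cl_khr_fp64 nor cl_khr_fp16, such a check would send DAXPY and HAXPY to the CPU and keep SAXPY on the GPU. The remaining steps (calling the CPU BLAS, deciding per routine and problem size) are the hard part and are not covered here.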
OK, I've just tested with your JSON files (thanks again for sharing), and I didn't encounter any issue. So it is likely that some things changed in the database since the release of CLBlast v1.2.0. So there are two things you could do:
1. Replace scripts/database/database.json with a corresponding v1.2.0 version from the CLBlast-database repository, direct download link here.
2. Use the master branch of CLBlast. For your convenience, I have just added the tuning results as well, so with the latest master you can directly re-compile and re-run the tests; hopefully more will now pass. Not sure though, but it is worth trying first before going on.
[database] All done.
git diff - 26 chunks
ninja - 195 targets
ninja test
The following tests FAILED:
5 - clblast_test_xdot (Failed)
6 - clblast_test_xdotu (Failed)
7 - clblast_test_xdotc (Failed)
8 - clblast_test_xnrm2 (Failed)
9 - clblast_test_xasum (Failed)
12 - clblast_test_xgbmv (OTHER_FAULT)
34 - clblast_test_xgemm (OTHER_FAULT)
37 - clblast_test_xsyrk (Failed)
38 - clblast_test_xherk (Failed)
39 - clblast_test_xsyr2k (Failed)
40 - clblast_test_xher2k (Failed)
46 - clblast_test_xgemmbatched (OTHER_FAULT)
48 - clblast_test_preprocessor (OTHER_FAULT)
6. Revert what database.py has done: git reset --hard.
7. Re-build: ninja - 53 targets
8. Re-test: ninja test
Same running time, same Failed tests, same OTHER_FAULT tests.
OK, thanks for testing. Let's first try to resolve the OTHER_FAULT issues, and then later the ones that report regular correctness issues. For three of those (GBMV, GEMM, and GEMMBATCHED), I see quite curious output, e.g.:
* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'SGBMV' routine. Legend:
: -> Test produced correct results
. -> Test returned the correct error code
X -> Test produced incorrect results
/ -> Test returned an incorrect error code
\ -> Test not executed: OpenCL-kernel compilation error
o -> Test not executed: Unsupported precision
- -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::
Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '101 (row-major) 112 (transposed)':
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::
Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '102 (col-major) 111 (regular)':
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::
Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '102 (col-major) 112 (transposed)':
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
terminate called after throwing an instance of 'clblast::CLCudaAPIError'
what(): OpenCL error: clGetDeviceIDs: -1
Aborted
This error message is from Beignet, not from CLBlast. It suddenly cannot find your device anymore, which is strange. This seems to suggest something is wrong with your OpenCL set-up, or there is perhaps a bug in beignet? I could not find any issue on the Beignet Bugzilla tracker, but perhaps you can search and file one? First also check if it is reproducible, i.e. does it always fail at exactly the same test?
Then there is another issue with clblast_test_preprocessor, it seems, but this looks like a linker issue. I can't reproduce that myself on a Debian 9 / beignet system; I tried with Clang and GCC, and both seem to work fine. Which compiler are you using?
Shall I try downgrading beignet to v1.2?
gcc (Debian 6.3.0-18) 6.3.0 20170516
Is the system supposed to be unused during tuning/testing, or is it OK to browse around (ignoring graphics lags)?
Shall I try downgrading beignet to v1.2?
Not sure, you could try perhaps. I believe there was at least one other CLBlast user with your GPU, since there were already tuning results, so it must have worked correctly on some system at some point.
gcc (Debian 6.3.0-18) 6.3.0 20170516
On my Debian 9 test system I get exactly the same output when I run g++ --version, and there everything compiles and links correctly. Anyway, let's not spend too much time on that, it's just a small issue of linking a specific, unimportant test.
Is the system supposed to be unused during tuning/testing, or is it OK to browse around (ignoring graphics lags)?
When I test with Beignet I also run X at the same time and I don't see any issues.
So I would look for the cause in Beignet or in the GPU drivers. Try other versions perhaps, or otherwise report the issue to Beignet. It could well be that the other failing tests are related to this as well...
So, did you have any luck with another version of Beignet? Or did you report this issue to the developers of Beignet?
Not yet. And I'm not sure what steps I should take for the reporting. Is there some minimal failing case which is supposed to work, but doesn't?
I'm not sure we need a minimal failing case here. If you look at the error you are getting returned from clGetDeviceIDs, it means it cannot find your GPU:
beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
However, it did use your GPU in the tests just moments before that. So it seems to be some time-related instability. First also check if it is reproducible, i.e. does it always fail at exactly the same test instance?
Beignet bugs can be filed here. You could refer to this issue perhaps? First also double-check the list of existing Beignet bugs.
Tried running utest_run, it seems to work...
$ /usr/lib/x86_64-linux-gnu/beignet/utest_run
...
summary:
----------
total: 1000
run: 959
pass: 959
fail: 0
pass rate: 1.000000
Notes:
If I don't unset DISPLAY, I often get spammed with repeated "Maximum number of clients reached:" messages, from the test as well as from other unrelated apps.
./clblast_test_xgbmv seems to always fail at the same place with the same OpenCL error: clGetDeviceIDs: -1 error.
I thought about reducing clblast_test_xgbmv to something smaller, but the testing system is too complicated and I stopped trying after observing this.
How, for example, do I move the 'regular behaviour' for '102 (col-major) 112 (transposed)' to the first place? Which test is running when the 156th : is being printed? What if I duplicate the first test ('regular behaviour' for '101 (row-major) 111 (regular)') 4 times instead of going on to further tests?
OK, thanks for trying. In general I don't think anything can be done from the CLBlast side. Because if calling clGetDeviceIDs works at first a few hundred times and then doesn't work anymore, that means something strange is going on. So honestly I believe it is a bug in Beignet.
But you are right that trying to pinpoint whether it is always the same test that fails is a good idea. What you can indeed do first is to test only that particular case. I'll help you out. Let's try two steps:
First run only the col-major & transposed case. For that, remove Transpose::kNo, from testblas.cpp#L26 and remove Layout::kRowMajor, from testblas.hpp#L147.
Then, if it still fails, you can run the test with the -verbose option on the command-line. That should show you the values for m,n,kl,ku,lda,incx,incy it is testing for. You can adjust those values around testblas.hpp#L129 by changing kIncrements, kMatrixVectorDims, and kBandSizes.
I instead tried this:
diff --git a/test/correctness/testblas.cpp b/test/correctness/testblas.cpp
index aa4b478..be28ed3 100644
--- a/test/correctness/testblas.cpp
+++ b/test/correctness/testblas.cpp
@@ -23,7 +23,7 @@ namespace clblast {
// The transpose configurations to test with: template parameter dependent
template <> const std::vector<Transpose> TestBlas<half,half>::kTransposes = {Transpose::kNo, Transpose::kYes};
-template <> const std::vector<Transpose> TestBlas<float,float>::kTransposes = {Transpose::kNo, Transpose::kYes};
+template <> const std::vector<Transpose> TestBlas<float,float>::kTransposes = {Transpose::kNo, Transpose::kNo};
template <> const std::vector<Transpose> TestBlas<double,double>::kTransposes = {Transpose::kNo, Transpose::kYes};
template <> const std::vector<Transpose> TestBlas<float2,float2>::kTransposes = {Transpose::kNo, Transpose::kYes, Transpose::kConjugate};
template <> const std::vector<Transpose> TestBlas<double2,double2>::kTransposes = {Transpose::kNo, Transpose::kYes, Transpose::kConjugate};
diff --git a/test/correctness/testblas.hpp b/test/correctness/testblas.hpp
index 4e02fd2..9c0830b 100644
--- a/test/correctness/testblas.hpp
+++ b/test/correctness/testblas.hpp
@@ -144,7 +144,7 @@ template <typename T, typename U> const std::vector<size_t> TestBlas<T,U>::kMatS
template <typename T, typename U> const std::vector<size_t> TestBlas<T,U>::kVecSizes = {0, kBufferSize - 1, kBufferSize};
// The layout/triangle options to test with
-template <typename T, typename U> const std::vector<Layout> TestBlas<T,U>::kLayouts = {Layout::kRowMajor, Layout::kColMajor};
+template <typename T, typename U> const std::vector<Layout> TestBlas<T,U>::kLayouts = {Layout::kRowMajor, Layout::kRowMajor};
template <typename T, typename U> const std::vector<Triangle> TestBlas<T,U>::kTriangles = {Triangle::kUpper, Triangle::kLower};
template <typename T, typename U> const std::vector<Side> TestBlas<T,U>::kSides = {Side::kLeft, Side::kRight};
template <typename T, typename U> const std::vector<Diagonal> TestBlas<T,U>::kDiagonals = {Diagonal::kUnit, Diagonal::kNonUnit};
and got this:
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::
Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::
Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::
Pass rate 100.0%: 288 passed / 0 skipped / 0 failed
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
terminate called after throwing an instance of 'clblast::CLCudaAPIError'
what(): OpenCL error: clGetDeviceIDs: -1
Aborted
Is there a simple program that just calls clGetDeviceIDs in endless loop? (I'm not familiar with OpenCL/Cuda/GPU world in general yet).
Hmm, interesting, so you think it would perhaps always happen at the n-th call to that function?
Something like this could help you perhaps:
#include <CL/opencl.h>
#include <cstdio>

#define NUM_RUNS 50
#define PLATFORM_ID 0

int main() {
  cl_int status;
  for (int i = 0; i < NUM_RUNS; ++i) {
    printf("Test %d\n", i);

    // First call: ask only for the number of available platforms
    cl_uint num_platforms = 0;
    status = clGetPlatformIDs(0, NULL, &num_platforms);
    if (status != CL_SUCCESS) { printf("Error in clGetPlatformIDs #1\n"); return 1; }

    // Second call: retrieve the platform IDs themselves
    cl_platform_id* platforms = new cl_platform_id[num_platforms];
    status = clGetPlatformIDs(num_platforms, platforms, NULL);
    if (status != CL_SUCCESS) { printf("Error in clGetPlatformIDs #2\n"); return 1; }

    // Query the number of devices on the selected platform (the call that fails in the tests)
    cl_uint result = 0;
    cl_platform_id platform = platforms[PLATFORM_ID];
    status = clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &result);
    if (status != CL_SUCCESS) { printf("Error in clGetDeviceIDs\n"); return 1; }

    delete[] platforms;
  }
  return 0;
}
This does not fail (increased NUM_RUNS and removed the main printf), even if I also start clblast_test_xgbmv in parallel.
Running the test under valgrind (snippet):
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
:::::::::::::::::::::::::::==6005== Warning: invalid file descriptor 1031 in syscall open()
==6005== Warning: invalid file descriptor 1031 in syscall open()
==6005== Warning: invalid file descriptor 1031 in syscall open()
:==6005== Warning: invalid file descriptor 1031 in syscall open()
==6005== Warning: invalid file descriptor 1031 in syscall open()
beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
terminate called after throwing an instance of 'clblast::CLCudaAPIError'
what(): OpenCL error: clGetDeviceIDs: -1
==6005==
$ ulimit -n
1024
There are a lot of open /dev/dri/renderD128 files.
After ulimit -n 4096 the test succeeds.
$ ulimit -n 10
$ ./clblast_test_xgbmv
...
* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'SGBMV' routine. Legend:
...
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for '101 (row-major) 111 (regular)':
::::::beignet-opencl-icd: no supported GPU found, this is probably the wrong opencl-icd package for this hardware
(If you have multiple ICDs installed and OpenCL works, you can ignore this message)
terminate called after throwing an instance of 'clblast::CLCudaAPIError'
what(): OpenCL error: clGetDeviceIDs: -1
Aborted
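For reference, the ulimit -n experiments above can also be done from inside a process. The following is only a minimal sketch using the POSIX getrlimit/setrlimit calls; the helper names fd_soft_limit and set_fd_soft_limit are made up for illustration and are not part of CLBlast or Beignet.

```cpp
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_NOFILE

// Returns the current soft limit on open file descriptors
// (the value that `ulimit -n` prints), or 0 if it cannot be queried.
// Hypothetical helper, for illustration only.
rlim_t fd_soft_limit() {
  struct rlimit lim;
  if (getrlimit(RLIMIT_NOFILE, &lim) != 0) { return 0; }
  return lim.rlim_cur;
}

// Tries to set the soft limit to `wanted`, clamped to the hard limit,
// because an unprivileged process cannot raise the soft limit above it.
// Returns true on success. Hypothetical helper, for illustration only.
bool set_fd_soft_limit(rlim_t wanted) {
  struct rlimit lim;
  if (getrlimit(RLIMIT_NOFILE, &lim) != 0) { return false; }
  if (lim.rlim_max != RLIM_INFINITY && wanted > lim.rlim_max) {
    wanted = lim.rlim_max;
  }
  lim.rlim_cur = wanted;
  return setrlimit(RLIMIT_NOFILE, &lim) == 0;
}
```

Calling set_fd_soft_limit(4096) at the start of a test program would roughly correspond to running ulimit -n 4096 in the shell first. It only hides a descriptor leak for longer, it does not fix it.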
In your last example it fails at a different place than before, so the location is not deterministic?
It fails when file descriptors run out. The CLBlast test (or some dependency) opens them, but does not close them properly.
ulimit -n sets the maximum number of open file descriptors, so the fewer open files allowed, the sooner it fails.
OK, never heard of those. CLBlast doesn't open any files while testing. Must be Beignet related then I guess?
Can you then re-run all the tests with that 'fix' applied and report which ones still have open issues?
$ git describe --always --dirty
1.2.0_rc1-152-g37c5e8f
$ for i in clblast_test_*; do DISPLAY= ./$i ; echo RESULT $i $?; done &> log
$ grep RESULT log
RESULT clblast_test_diagnostics 0
RESULT clblast_test_override_parameters 0
RESULT clblast_test_preprocessor 133
RESULT clblast_test_retrieve_parameters 0
RESULT clblast_test_xamax 0
RESULT clblast_test_xasum 1
RESULT clblast_test_xaxpy 0
RESULT clblast_test_xaxpybatched 0
RESULT clblast_test_xcopy 0
RESULT clblast_test_xdot 1
RESULT clblast_test_xdotc 1
RESULT clblast_test_xdotu 0
RESULT clblast_test_xgbmv 0
RESULT clblast_test_xgemm 1
RESULT clblast_test_xgemmbatched 0
RESULT clblast_test_xgemmstridedbatched 0
RESULT clblast_test_xgemv 0
RESULT clblast_test_xger 0
RESULT clblast_test_xgerc 0
RESULT clblast_test_xgeru 0
RESULT clblast_test_xhbmv 0
RESULT clblast_test_xhemm 0
RESULT clblast_test_xhemv 0
RESULT clblast_test_xher 0
RESULT clblast_test_xher2 0
RESULT clblast_test_xher2k 1
RESULT clblast_test_xherk 1
RESULT clblast_test_xhpmv 0
RESULT clblast_test_xhpr 0
RESULT clblast_test_xhpr2 0
RESULT clblast_test_xim2col 0
RESULT clblast_test_xnrm2 1
RESULT clblast_test_xomatcopy 0
RESULT clblast_test_xsbmv 0
RESULT clblast_test_xscal 0
RESULT clblast_test_xspmv 0
RESULT clblast_test_xspr 0
RESULT clblast_test_xspr2 0
RESULT clblast_test_xswap 0
RESULT clblast_test_xsymm 0
RESULT clblast_test_xsymv 0
RESULT clblast_test_xsyr 0
RESULT clblast_test_xsyr2 0
RESULT clblast_test_xsyr2k 1
RESULT clblast_test_xsyrk 1
RESULT clblast_test_xtbmv 0
RESULT clblast_test_xtpmv 0
RESULT clblast_test_xtrmm 0
RESULT clblast_test_xtrmv 0
RESULT clblast_test_xtrsm 0
RESULT clblast_test_xtrsv 0
Full log: https://gist.github.com/vi/768d8d06915b58dc57c4c9a41802ddd9
OK, thank you for running. So in summary the remaining failing tests are:
RESULT clblast_test_xasum 1
RESULT clblast_test_xdot 1
RESULT clblast_test_xdotc 1
RESULT clblast_test_xgemm 1
RESULT clblast_test_xher2k 1
RESULT clblast_test_xherk 1
RESULT clblast_test_xsyr2k 1
RESULT clblast_test_xsyrk 1
The first 3 all use the dot/reduce kernels, while the other 5 all use the GEMM kernels. So it seems there are perhaps only 2 real failures. Let's start with the first one (the dot kernel). Could you run two things:
./clblast_tuner_xdot
./clblast_test_xdot --verbose
And then perhaps fiddle with the parameters for your device for that kernel (single precision only first): src/database/kernels/xdot/xdot_32.hpp#L84 and see if you can get that working with different parameters.
./clblast_tuner_xdot ./clblast_test_xdot --verbose
* Options given/available:
-platform 0 [=default]
-device 0 [=default]
-precision 32 (single) [=default]
-n 2097152 [=default]
-fraction 1.00 [=default]
-runs 10 [=default]
-max_l2_norm 0.00 [=default]
* Found 6 configuration(s)
* Parameters explored: WGS1
| ID | total |param | compiles | time | GB/s | status |
x------x-------x------x----------------x--------------x--------x-------------------x
| ref | - | - | OK | 1.27 ms | - | reference OK |
x------x-------x------x----------------x--------------x--------x-------------------x
| 1 | 6 | 32 | OK 365 ms | 1.28 ms | 13.1 | results match |
| 2 | 6 | 64 | OK 355 ms | 1.22 ms | 13.7 | results match |
| 3 | 6 | 128 | OK 375 ms | 1.18 ms | 14.3 | results match |
| 4 | 6 | 256 | OK 348 ms | 1.28 ms | 13.2 | results match |
| 5 | 6 | 512 | OK 344 ms | 1.27 ms | 13.2 | results match |
| 6 | 6 | 1024 | OK 353 ms | error -55 | - | invalid config. | <-- skipping
x------x-------x------x----------------x--------------x--------x-------------------x
* Found best result 1.18 ms: 14.3 GB/s
* Best parameters: PRECISION=32 WGS1=128
* Writing a total of 5 results to 'clblast_xdot_1_32.json'
* Completed tuning process
* Options given/available:
-platform 0 [=default]
-device 0 [=default]
-precision 32 (single) [=default]
-n 2097152 [=default]
-fraction 1.00 [=default]
-runs 10 [=default]
-max_l2_norm 0.00 [=default]
* Found 6 configuration(s)
* Parameters explored: WGS2
| ID | total |param | compiles | time | N/A | status |
x------x-------x------x----------------x--------------x--------x-------------------x
| ref | - | - | OK | 0.14 ms | - | reference OK |
x------x-------x------x----------------x--------------x--------x-------------------x
| 1 | 6 | 32 | OK 368 ms | 0.11 ms | 0.0 | results match |
| 2 | 6 | 64 | OK 363 ms | 0.14 ms | 0.0 | results match |
| 3 | 6 | 128 | OK 385 ms | 0.14 ms | 0.0 | results match |
| 4 | 6 | 256 | OK 346 ms | 0.11 ms | 0.0 | results match |
| 5 | 6 | 512 | OK 352 ms | 0.17 ms | 0.0 | results match |
| 6 | 6 | 1024 | OK 350 ms | error -55 | - | invalid config. | <-- skipping
x------x-------x------x----------------x--------------x--------x-------------------x
* Found best result 0.11 ms: 0.0 N/A
* Best parameters: PRECISION=32 WGS2=32
* Writing a total of 5 results to 'clblast_xdot_2_32.json'
* Completed tuning process
* Options given/available:
-platform 0 [=default]
-device 0 [=default]
-full_test [false]
-verbose [true]
-cblas 1 [=default]
-clblas 0 [=default]
* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'SDOT' routine. Legend:
: -> Test produced correct results
. -> Test returned the correct error code
X -> Test produced incorrect results
/ -> Test returned an incorrect error code
\ -> Test not executed: OpenCL-kernel compilation error
o -> Test not executed: Unsupported precision
- -> Test not completed: Reference CBLAS doesn't output error codes
* Testing with error margins of 0.5% (relative) and 0.001 (absolute)
* Testing 'regular behaviour' for 'default':
Testing: n=7 incx=1 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus -0.84 (CLBlast)
Combined average L2 error: 7.09e-01
X
Testing: n=7 incx=1 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus 2.82 (CLBlast)
Combined average L2 error: 7.94e+00
X
Testing: n=7 incx=1 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus 4.17 (CLBlast)
Combined average L2 error: 1.74e+01
X
Testing: n=7 incx=2 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
Testing: n=7 incx=2 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus -2.42 (CLBlast)
Combined average L2 error: 5.86e+00
X
Testing: n=7 incx=2 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
Testing: n=7 incx=7 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus -2.50 (CLBlast)
Combined average L2 error: 6.26e+00
X
Testing: n=7 incx=7 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
Testing: n=7 incx=7 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
Testing: n=93 incx=1 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus 1.93 (CLBlast)
Combined average L2 error: 3.72e+00
X
Testing: n=93 incx=1 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
Testing: n=93 incx=1 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
Testing: n=93 incx=2 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus 7.03 (CLBlast)
Combined average L2 error: 4.95e+01
X
Testing: n=93 incx=2 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus -10.82 (CLBlast)
Combined average L2 error: 1.17e+02
X
Testing: n=93 incx=2 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus 12.52 (CLBlast)
Combined average L2 error: 1.57e+02
X
Testing: n=93 incx=7 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus 9.62 (CLBlast)
Combined average L2 error: 9.26e+01
X
Testing: n=93 incx=7 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus -6.64 (CLBlast)
Combined average L2 error: 4.41e+01
X
Testing: n=93 incx=7 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus 6.36 (CLBlast)
Combined average L2 error: 4.05e+01
X
Testing: n=144 incx=1 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus 2.93 (CLBlast)
Combined average L2 error: 8.59e+00
X
Testing: n=144 incx=1 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus -3.51 (CLBlast)
Combined average L2 error: 1.23e+01
X
Testing: n=144 incx=1 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
Testing: n=144 incx=2 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
Testing: n=144 incx=2 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
Testing: n=144 incx=2 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus -1.49 (CLBlast)
Combined average L2 error: 2.21e+00
X
Testing: n=144 incx=7 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
Testing: n=144 incx=7 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
Testing: n=144 incx=7 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
Testing: n=4096 incx=1 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
Testing: n=4096 incx=1 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus -41.80 (CLBlast)
Combined average L2 error: 1.75e+03
X
Testing: n=4096 incx=1 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus 51.62 (CLBlast)
Combined average L2 error: 2.66e+03
X
Testing: n=4096 incx=2 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus 82.72 (CLBlast)
Combined average L2 error: 6.84e+03
X
Testing: n=4096 incx=2 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus -214.22 (CLBlast)
Combined average L2 error: 4.59e+04
X
Testing: n=4096 incx=2 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus 13.29 (CLBlast)
Combined average L2 error: 1.77e+02
X
Testing: n=4096 incx=7 incy=1 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus -28.65 (CLBlast)
Combined average L2 error: 8.21e+02
X
Testing: n=4096 incx=7 incy=2 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] ->
Error at index 0: 0.00 (reference) versus 22.19 (CLBlast)
Combined average L2 error: 4.92e+02
X
Testing: n=4096 incx=7 incy=7 offx=0 offy=0 offdot=0 [CLBlast] [CPU BLAS] -> :
Error rate 100.00%: n=7 incx=1 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=7 incx=1 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=7 incx=1 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=7 incx=2 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=7 incx=7 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=1 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=2 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=2 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=2 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=7 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=7 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=93 incx=7 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=144 incx=1 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=144 incx=1 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=144 incx=2 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=1 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=1 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=2 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=2 incy=2 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=2 incy=7 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=7 incy=1 offx=0 offy=0 offdot=0
Error rate 100.00%: n=4096 incx=7 incy=2 offx=0 offy=0 offdot=0
Pass rate 38.9%: 14 passed / 0 skipped / 22 failed
* Completed all test-cases for this routine. Results:
14 test(s) passed
0 test(s) skipped
22 test(s) failed
* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'DDOT' routine.
* All tests skipped: Unsupported precision
* Running on OpenCL device 'Intel(R) HD Graphics IvyBridge M GT2'.
* Starting tests for the 'HDOT' routine.
* All tests skipped: Unsupported precision
For completeness: the GPU may have felt a little bit sick at the time of the test. At least the graphical scaling glitch is still here.
Sorry, I forgot about this issue. Thanks for testing. The tuner result looks OK. I'm not sure how to continue though: I can't reproduce the issue, so I can't test or debug it myself.
Perhaps this other issue #149 might help you out. It seems to be the same GPU, but with a different OpenCL implementation (Apple OpenCL rather than Beignet).
If needed I can run special modified versions for debugging or maybe give access for remote debugging on my laptop.
But maybe I should "play" with Beignet versions first. I've already built one from source code, but not sure yet how to install it into Debian (or can it be used without installation).
You can make install it into a directory specified when you ran CMake (-DCMAKE_INSTALL_PREFIX=/path/to/install). If none is specified, it will just install in your system's path and will overwrite any existing OpenCL afaik. Otherwise you'll have multiple OpenCL platforms and you'll need to select the right one.
select the right one
How do I select the right one, ensuring that no pieces of the wrong one are in the way, and without disruptive changes to the system as root? Is it just LD_LIBRARY_PATH or LD_PRELOAD or something trickier?
Not sure, I'm not an expert on that... But what I meant was 'select' in the OpenCL platform sense. If you do it right, you might have both Beignets co-existing on your system, and clinfo will show both. But you can of course also try to set the library path.
Any updates here? Or should we conclude it is not CLBlast-related?
I haven't experimented with other Beignet versions yet. Not sure if it is appropriate to report bugs there without trying a fresher build...
Maybe CLBlast is doing things OK, but it could also contain workarounds for broken platforms...
If/when I come back to experimenting with OpenCL in general and Beignet and/or CLBlast in particular, I'll comment.
Intel now has a new open-source implementation that is replacing Beignet. Perhaps it is time to try the new Intel NEO?
Gen8 (Broadwell) and beyond
Is it something new-ish? Unlikely that it would work on my laptop.
Indeed, it seems that your hardware is not supported. Neo is new indeed, Beignet is now discontinued, so that won't lead to solving this issue either it seems.
How do you suggest we proceed? Do you still have time to test things? We could also close this issue and say that older hardware is not properly supported in all cases...
Do you still have time to test things?
Yes, I'm constantly trying various things (CLBlast being a detour from experimenting with various deep learning toys and thinking "what if I can work around missing OpenCL support for ... by using CLBlast instead of the usual BLAS library").
How do you suggest we proceed?
Maybe like previously: me trying an updated Beignet (or just waiting until an updated Beignet eventually comes to Debian Stable), then maybe reporting additional issues to Beignet.
Any updates from your side?
Not yet.
Is it something urgent, or do you just not want a dangling open issue? I'll report results here if/when I resume experimentation, regardless of whether this issue is closed.
For now I just treat my laptop as Not Ready For GPU Computing.
OK. Yes, I see this as a list of things I have to work on :-) I can also add your setup to the list of known issues, close this issue, and we can follow-up later with you and/or Intel when you have time to see if anything can be fixed?
Got round to it and installed Beignet from master.
The 3 passing tests in ./clblast_test_xdot disappeared and all of them fail now, although ./clblast_test_xdotc keeps on working. The FD leak persists.
Beignet's own tests almost succeed: pass rate: 0.999005.
Using clblast dda1e567f872d3d89f2f7cd890fb5b29ff98537c and beignet 591d387327ce35f03a6152d4c823415729e221f2.
Tried beignet 1.2.1 (097365ed1a79cd03dc689b37b03552e455eb3854) and seeing more successful tests.
Tests now look much better:
$ ninja test
[0/1] Running tests...
Test project /home/vi/src/git/CLBlast/build
Start 1: clblast_test_xswap
1/51 Test #1: clblast_test_xswap ................. Passed 1.58 sec
Start 2: clblast_test_xscal
2/51 Test #2: clblast_test_xscal ................. Passed 1.16 sec
Start 3: clblast_test_xcopy
3/51 Test #3: clblast_test_xcopy ................. Passed 1.56 sec
Start 4: clblast_test_xaxpy
4/51 Test #4: clblast_test_xaxpy ................. Passed 1.59 sec
Start 5: clblast_test_xdot
5/51 Test #5: clblast_test_xdot .................. Passed 0.82 sec
Start 6: clblast_test_xdotu
6/51 Test #6: clblast_test_xdotu ................. Passed 0.88 sec
Start 7: clblast_test_xdotc
7/51 Test #7: clblast_test_xdotc ................. Passed 0.89 sec
Start 8: clblast_test_xnrm2
8/51 Test #8: clblast_test_xnrm2 ................. Passed 1.16 sec
Start 9: clblast_test_xasum
9/51 Test #9: clblast_test_xasum ................. Passed 1.13 sec
Start 10: clblast_test_xamax
10/51 Test #10: clblast_test_xamax ................. Passed 1.12 sec
Start 11: clblast_test_xgemv
11/51 Test #11: clblast_test_xgemv ................. Passed 5.75 sec
Start 12: clblast_test_xgbmv
12/51 Test #12: clblast_test_xgbmv ................. Passed 53.21 sec
Start 13: clblast_test_xhemv
13/51 Test #13: clblast_test_xhemv ................. Passed 1.98 sec
Start 14: clblast_test_xhbmv
14/51 Test #14: clblast_test_xhbmv ................. Passed 4.82 sec
Start 15: clblast_test_xhpmv
15/51 Test #15: clblast_test_xhpmv ................. Passed 1.97 sec
Start 16: clblast_test_xsymv
16/51 Test #16: clblast_test_xsymv ................. Passed 1.72 sec
Start 17: clblast_test_xsbmv
17/51 Test #17: clblast_test_xsbmv ................. Passed 3.97 sec
Start 18: clblast_test_xspmv
18/51 Test #18: clblast_test_xspmv ................. Passed 1.80 sec
Start 19: clblast_test_xtrmv
19/51 Test #19: clblast_test_xtrmv ................. Passed 28.67 sec
Start 20: clblast_test_xtbmv
20/51 Test #20: clblast_test_xtbmv ................. Passed 78.94 sec
Start 21: clblast_test_xtpmv
21/51 Test #21: clblast_test_xtpmv ................. Passed 20.33 sec
Start 22: clblast_test_xtrsv
22/51 Test #22: clblast_test_xtrsv ................. Passed 30.57 sec
Start 23: clblast_test_xger
23/51 Test #23: clblast_test_xger .................. Passed 1.49 sec
Start 24: clblast_test_xgeru
24/51 Test #24: clblast_test_xgeru ................. Passed 1.73 sec
Start 25: clblast_test_xgerc
25/51 Test #25: clblast_test_xgerc ................. Passed 1.68 sec
Start 26: clblast_test_xher
26/51 Test #26: clblast_test_xher .................. Passed 0.95 sec
Start 27: clblast_test_xhpr
27/51 Test #27: clblast_test_xhpr .................. Passed 0.77 sec
Start 28: clblast_test_xher2
28/51 Test #28: clblast_test_xher2 ................. Passed 1.74 sec
Start 29: clblast_test_xhpr2
29/51 Test #29: clblast_test_xhpr2 ................. Passed 1.33 sec
Start 30: clblast_test_xsyr
30/51 Test #30: clblast_test_xsyr .................. Passed 0.89 sec
Start 31: clblast_test_xspr
31/51 Test #31: clblast_test_xspr .................. Passed 0.72 sec
Start 32: clblast_test_xsyr2
32/51 Test #32: clblast_test_xsyr2 ................. Passed 1.62 sec
Start 33: clblast_test_xspr2
33/51 Test #33: clblast_test_xspr2 ................. Passed 1.28 sec
Start 34: clblast_test_xgemm
34/51 Test #34: clblast_test_xgemm ................. Passed 39.85 sec
Start 35: clblast_test_xsymm
35/51 Test #35: clblast_test_xsymm ................. Passed 5.64 sec
Start 36: clblast_test_xhemm
36/51 Test #36: clblast_test_xhemm ................. Passed 2.36 sec
Start 37: clblast_test_xsyrk
37/51 Test #37: clblast_test_xsyrk ................. Passed 3.41 sec
Start 38: clblast_test_xherk
38/51 Test #38: clblast_test_xherk ................. Passed 1.78 sec
Start 39: clblast_test_xsyr2k
39/51 Test #39: clblast_test_xsyr2k ................ Passed 5.20 sec
Start 40: clblast_test_xher2k
40/51 Test #40: clblast_test_xher2k ................ Passed 2.59 sec
Start 41: clblast_test_xtrmm
41/51 Test #41: clblast_test_xtrmm ................. Passed 59.26 sec
Start 42: clblast_test_xtrsm
42/51 Test #42: clblast_test_xtrsm ................. Passed 74.42 sec
Start 43: clblast_test_xhad
43/51 Test #43: clblast_test_xhad .................. Passed 1.01 sec
Start 44: clblast_test_xomatcopy
44/51 Test #44: clblast_test_xomatcopy ............. Passed 1.11 sec
Start 45: clblast_test_xim2col
45/51 Test #45: clblast_test_xim2col ............... Passed 3.37 sec
Start 46: clblast_test_xaxpybatched
46/51 Test #46: clblast_test_xaxpybatched .......... Passed 4.14 sec
Start 47: clblast_test_xgemmbatched
47/51 Test #47: clblast_test_xgemmbatched .......... Passed 65.15 sec
Start 48: clblast_test_xgemmstridedbatched
48/51 Test #48: clblast_test_xgemmstridedbatched ... Passed 63.62 sec
Start 49: clblast_test_override_parameters
49/51 Test #49: clblast_test_override_parameters ... Passed 9.00 sec
Start 50: clblast_test_retrieve_parameters
50/51 Test #50: clblast_test_retrieve_parameters ... Passed 0.16 sec
Start 51: clblast_test_preprocessor
51/51 Test #51: clblast_test_preprocessor ..........***Exception: Other 7.65 sec
98% tests passed, 1 tests failed out of 51
Total Test time (real) = 609.63 sec
The following tests FAILED:
51 - clblast_test_preprocessor (OTHER_FAULT)
Errors while running CTest
FAILED: CMakeFiles/test.util
Open file handles for /dev/dri/renderD128 keep accumulating during tests that take a long time.
Shall I run the tuning process?
With beignet 1.3.2-1 and CLBlast v1.2.0 it fails multiple tests:
Additionally, matmul built with NETLIB CLBlast fails multiplication if the matrix is big enough:
On the master branch it also fails.