AvtechScientific / ASL

Advanced Simulation Library - hardware accelerated multiphysics simulation platform.
http://asl.org.il
GNU Affero General Public License v3.0

Segmentation Faults in tests with POCL OpenCL driver #16

Open AvtechScientific opened 9 years ago

opoplawski commented 7 years ago

Fedora build errors with pocl:

/builddir/build/BUILD/ASL-248ad2de1fcddffd38405008779fb1ee292184e6/build-x86-64.fc26/test/testACL/testPrivateVar
Test of "Simple kernel" function.../builddir/build/BUILD/ASL-248ad2de1fcddffd38405008779fb1ee292184e6/build-x86-64.fc26/test/testACL/testPrivateVar: symbol lookup error: /builddir/.cache/pocl/kcache/PG/ODNNMOOCFAMCGHOHACMHAHGLIHDIALKHNEKIC/compute_0/1-1-1/compute_0.so: undefined symbol: _Z16_cl_native_rsqrtd
[the same message appears seven times, interleaved, in the raw output]

and

# /builddir/build/BUILD/ASL-248ad2de1fcddffd38405008779fb1ee292184e6/build-x86-64.fc26/test/testMath/testReductionFunction
testSum...Segmentation fault (core dumped)

This is with 248ad2de1fcddffd38405008779fb1ee292184e6
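
For readers decoding the first log: the undefined symbol uses Itanium C++ name mangling, and it demangles to `_cl_native_rsqrt(double)`. That suggests the cached kernel object expects a double-precision `_cl_native_rsqrt` that the loaded POCL runtime does not export, though that reading is an inference from the log rather than something confirmed in this thread. A minimal sketch of the demangling step, assuming a GCC or Clang toolchain that provides `<cxxabi.h>`; the helper name `demangle` is hypothetical:

```cpp
#include <cstdlib>
#include <cxxabi.h>
#include <string>

// Hypothetical helper (not part of ASL or POCL): returns the demangled form
// of an Itanium-ABI symbol, or the input unchanged if demangling fails.
std::string demangle(const std::string& mangled)
{
    int status = 0;
    // __cxa_demangle allocates the result with malloc; the caller frees it.
    char* raw = abi::__cxa_demangle(mangled.c_str(), nullptr, nullptr, &status);
    if (status != 0 || raw == nullptr)
        return mangled;
    std::string result(raw);
    std::free(raw);
    return result;
}
```

Calling `demangle("_Z16_cl_native_rsqrtd")` yields `_cl_native_rsqrt(double)`; the `c++filt` command-line tool gives the same answer.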

AvtechScientific commented 7 years ago

Dear @opoplawski ,

Thank you for the report! Could you please specify which version of POCL you are using? According to the ASL deployment info, the issue should have been fixed by the POCL team in v0.11.

Do you see a similar issue when working with other OpenCL drivers?

Thank you!

opoplawski commented 7 years ago

This is with POCL 0.14-0.1.fc26. Tests run fine with nvidia binary drivers on my local machine.

opoplawski commented 7 years ago

Additional failures on aarch64, see https://kojipkgs.fedoraproject.org//work/tasks/5670/16375670/build.log

The following tests FAILED:
      4 - testKernel (OTHER_FAULT)
      7 - testASLData (OTHER_FAULT)
      8 - testDistanceFunction (OTHER_FAULT)

AvtechScientific commented 7 years ago

> This is with POCL 0.14-0.1.fc26. Tests run fine with nvidia binary drivers on my local machine.

Then it wasn't fixed, and it looks like a POCL issue. Do you have detailed error output that could be used to file an issue report with the POCL team at https://github.com/pocl/pocl/issues ? If you do report it there, please let us know so that we can provide our input.

> Additional failures on aarch64
>
> 4 - testKernel (OTHER_FAULT)
> 8 - testDistanceFunction (OTHER_FAULT)

Those two might fail due to the lack of double-precision support.

As for '7 - testASLData (OTHER_FAULT)', we need a more informative error message to try to find the cause of the problem.

Thank you!
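
The double-precision hypothesis can be checked on a given device by looking for `cl_khr_fp64` in its extension list, which an OpenCL host program obtains from `clGetDeviceInfo` with `CL_DEVICE_EXTENSIONS`. The sketch below covers only the string check, so it runs without an OpenCL runtime. The helper names `hasExtension` and `supportsDouble` are hypothetical and are not taken from ASL:

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: true if `name` appears as a whole token in a
// space-separated OpenCL extension string. Matching whole tokens avoids
// false positives from extensions that merely share a prefix.
bool hasExtension(const std::string& extensions, const std::string& name)
{
    std::istringstream tokens(extensions);
    std::string token;
    while (tokens >> token)
        if (token == name)
            return true;
    return false;
}

// Hypothetical helper: true if the device advertises double-precision
// support through the cl_khr_fp64 extension.
bool supportsDouble(const std::string& extensions)
{
    return hasExtension(extensions, "cl_khr_fp64");
}
```

A test suite could use a check like this to skip or downgrade kernels that declare `double` variables on devices where `supportsDouble` returns false, rather than letting the kernel build fail.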

ghisvail commented 7 years ago

Has pocl 0.14 been released already? The milestone is not complete and it does not appear to be tagged.

It would be nice to try with 0.13 first, which is the latest tagged release.

opoplawski commented 7 years ago

More verbose output with pocl 0.14 on 32-bit arm:

test 4
    Start 4: testKernel
4: Test command: /builddir/build/BUILD/ASL-0.1.7/build-armv7hl-32.fc26/test/testACL/testKernel
4: Test timeout computed to be: 9.99988e+06
4: Test of "copy" function... Ok
4: Test of Kernel with double...1 warning and 4 errors generated.
4:  
4:          BUILD LOG
4:  ************************************************
4: error: /builddir/.cache/pocl/kcache/temp_v4yQlu.cl:8:7: subscript of pointer to incomplete type '__global error_undefined_type_double' (aka '__global struct error_undefined_type_double')
4: error: /builddir/.cache/pocl/kcache/temp_v4yQlu.cl:9:7: subscript of pointer to incomplete type '__global error_undefined_type_double' (aka '__global struct error_undefined_type_double')
4: error: /builddir/.cache/pocl/kcache/temp_v4yQlu.cl:9:28: subscript of pointer to incomplete type '__global error_undefined_type_double' (aka '__global struct error_undefined_type_double')
4: error: /builddir/.cache/pocl/kcache/temp_v4yQlu.cl:10:7: subscript of pointer to incomplete type '__global error_undefined_type_double' (aka '__global struct error_undefined_type_double')
4: warning: /builddir/.cache/pocl/kcache/temp_v4yQlu.cl:1:26: unsupported OpenCL extension 'cl_khr_fp64' - ignoring
4: 
4: 
4:  ************************************************
4:  
4:          KERNEL SOURCE CODE
4:  ------------------------------------------------
4: #pragma OPENCL EXTENSION cl_khr_fp64 : enable
4: 
4: __kernel void compute_0(__global double *a_d1,
4:                         __global double *a_d2,
4:                         __global double *a_d3)
4: {
4:  uint index = get_global_id(0);
4:  (a_d3[index]=2.);
4:  (a_d1[index]=(2.+pown(a_d3[index], 3)));
4:  (a_d2[index]=index);
4: }
4: terminate called after throwing an instance of 'std::logic_error'
4:   what():  ASL ERROR: Program::build() (-11).
4:  ------------------------------------------------
4/8 Test #4: testKernel .......................***Exception: Other  1.47 sec

    Start 7: testASLData
7: Test command: /builddir/build/BUILD/ASL-0.1.7/build-armv7hl-32.fc26/test/testMath/testASLData
7: Test timeout computed to be: 9.99988e+06
7: 1 warning and 2 errors generated.
7:  
7:          BUILD LOG
7:  ************************************************
7: error: /builddir/.cache/pocl/kcache/temp_UBKSin.cl:7:7: subscript of pointer to incomplete type '__global error_undefined_type_double' (aka '__global struct error_undefined_type_double')
7: error: /builddir/.cache/pocl/kcache/temp_UBKSin.cl:8:7: subscript of pointer to incomplete type '__global error_undefined_type_double' (aka '__global struct error_undefined_type_double')
7: warning: /builddir/.cache/pocl/kcache/temp_UBKSin.cl:1:26: unsupported OpenCL extension 'cl_khr_fp64' - ignoring
7: 
7: 
7:  ************************************************
7:  
7:          KERNEL SOURCE CODE
7:  ------------------------------------------------
7: #pragma OPENCL EXTENSION cl_khr_fp64 : enable
7: 
7: __kernel void compute_0(__global double *a_d1,
7:                         __global double *a_d2)
7: {
7:  uint index = get_global_id(0);
7:  (a_d1[index]=0.);
7:  (a_d2[index]=0.);
7: }
7: terminate called after throwing an instance of 'std::logic_error'
7:   what():  ASL ERROR: Program::build() (-11).
7:  ------------------------------------------------
7/8 Test #7: testASLData ......................***Exception: Other  1.20 sec

test 8
    Start 8: testDistanceFunction
8: Test command: /builddir/build/BUILD/ASL-0.1.7/build-armv7hl-32.fc26/test/testMath/testDistanceFunction
8: Test timeout computed to be: 9.99988e+06
8: 2 warnings and 5 errors generated.
8:  
8:          BUILD LOG
8:  ************************************************
8: error: /builddir/.cache/pocl/kcache/temp_5n5UQU.cl:6:9: variable has incomplete type 'error_undefined_type_double' (aka 'struct error_undefined_type_double')
8: error: /builddir/.cache/pocl/kcache/temp_5n5UQU.cl:7:9: variable has incomplete type 'error_undefined_type_double' (aka 'struct error_undefined_type_double')
8: error: /builddir/.cache/pocl/kcache/temp_5n5UQU.cl:8:9: cast to incomplete type 'error_undefined_type_double' (aka 'struct error_undefined_type_double')
8: error: /builddir/.cache/pocl/kcache/temp_5n5UQU.cl:9:9: cast to incomplete type 'error_undefined_type_double' (aka 'struct error_undefined_type_double')
8: error: /builddir/.cache/pocl/kcache/temp_5n5UQU.cl:10:6: subscript of pointer to incomplete type '__global error_undefined_type_double' (aka '__global struct error_undefined_type_double')
8: warning: /builddir/.cache/pocl/kcache/temp_5n5UQU.cl:1:26: unsupported OpenCL extension 'cl_khr_fp64' - ignoring
8: warning: /builddir/.cache/pocl/kcache/temp_5n5UQU.cl:8:23: implicit declaration of function 'convert_double' is invalid in C99
8: 
8: 
8:  ************************************************
8:  
8:          KERNEL SOURCE CODE
8:  ------------------------------------------------
8: #pragma OPENCL EXTENSION cl_khr_fp64 : enable
8: 
8: __kernel void compute_0(__global double *a_d1)
8: {
8:  uint index = get_global_id(0);
8:  double pv_d1;
8:  double pv_d2;
8:  (pv_d1=(double)((-1.+convert_double((index/102)))));
8:  (pv_d2=(double)((-1.+convert_double((index%102)))));
8:  a_d1[index] = max(min((sqrt(((pv_d1 - 50.)*(pv_d1 - 50.)+(pv_d2 - 50.)*(pv_d2 - 50.))) - 10.), (sqrt(((pv_d1 - 40.)*(pv_d1 - 40.)+(pv_d2 - 40.)*(pv_d2 - 40.))) - 10.)), (sqrt(((pv_d1 - 50.)*(pv_d1 - 50.)+(pv_d2 - 50.)*(pv_d2 - 50.))) - 20.));
8: }
8: terminate called after throwing an instance of 'std::logic_error'
8:   what():  ASL ERROR: Program::build() (-11).
8:  ------------------------------------------------
8/8 Test #8: testDistanceFunction .............***Exception: Other  2.53 sec

The testDistanceFunction error is new with pocl 0.14, but the other two tests fail with 0.13 as well, though in different ways:

test 4
    Start 4: testKernel
4: Test command: /builddir/build/BUILD/ASL-0.1.7/build-armv7hl-32.fc25/test/testACL/testKernel
4: Test timeout computed to be: 9.99988e+06
4: Test of "copy" function... Ok
4: Test of Kernel with double... Ok
4: Test of KernelSIMD... Ok
4: Test of KernelSIMDUA... Ok
4: Test of kernel with PrivateVariable... Ok
4: Test of kernel with PrivateArray... Ok
4: Test of Variable functionality... Ok
4: Test of VariableReference functionality... Ok
4: Test of select function... Ok
4: Test of Swap functionality... Ok
4: Test of LocalArray and syncCopy with barrier()...6 errors generated.
4:  
4:          BUILD LOG
4:  ************************************************
4: error: /builddir/.cache/pocl/kcache/temp_JTb38I.cl:10:20: used type 'event_t' where arithmetic or pointer type is required
4: error: /builddir/.cache/pocl/kcache/temp_JTb38I.cl:11:106: used type 'event_t' where arithmetic or pointer type is required
4: error: /builddir/.cache/pocl/kcache/temp_JTb38I.cl:13:20: used type 'event_t' where arithmetic or pointer type is required
4: error: /builddir/.cache/pocl/kcache/temp_JTb38I.cl:14:106: used type 'event_t' where arithmetic or pointer type is required
4: error: /builddir/.cache/pocl/kcache/temp_JTb38I.cl:19:20: used type 'event_t' where arithmetic or pointer type is required
4: error: /builddir/.cache/pocl/kcache/temp_JTb38I.cl:20:106: used type 'event_t' where arithmetic or pointer type is required
4: 
4: 
4:  ************************************************
4:  
4:          KERNEL SOURCE CODE
4:  ------------------------------------------------
4: #pragma OPENCL EXTENSION cl_khr_fp64 : disable
4: 
4: __kernel void compute_8(__global float *a_f14,
4:                         __global float *a_f13)
4: {
4:  uint index = get_local_id(0);
4:  uint groupID = get_group_id(0);
4:  __local float la_f2[2];
4:  __local float la_f1[2];
4:  event_t event_0 = (event_t)0;
4:  event_0 = async_work_group_copy(&((__local float *)la_f1)[0], &((__global float *)a_f13)[2*groupID], 2, (event_t)0);
4:  wait_group_events (1, &event_0);
4:  event_t event_1 = (event_t)0;
4:  event_1 = async_work_group_copy(&((__local float *)la_f2)[0], &((__global float *)a_f14)[2*groupID], 2, (event_t)0);
4:  wait_group_events (1, &event_1);
4:  barrier(CLK_LOCAL_MEM_FENCE);
4:  (la_f2[index]=(la_f2[index] - la_f1[index]));
4:  barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
4:  event_t event_2 = (event_t)0;
4:  event_2 = async_work_group_copy(&((__global float *)a_f14)[2*groupID], &((__local float *)la_f2)[0], 2, (event_t)0);
4:  wait_group_events (1, &event_2);
4: }
4: terminate called after throwing an instance of 'std::logic_error'
4:   what():  ASL ERROR: Program::build() (-11).
4:  ------------------------------------------------
4/8 Test #4: testKernel .......................***Exception: Other 22.44 sec

7: #pragma OPENCL EXTENSION cl_khr_fp64 : enable
7: 
7: __kernel void compute_0(__global double *a_d1,
7:                         __global double *a_d2)
7: {
7:  uint index = get_global_id(0);
7:  (a_d1[index]=0.);
7:  (a_d2[index]=0.);
7: }
7: Test of UploadToLocalMem()...4 errors generated.
7:  
7:          BUILD LOG
7:  ************************************************
7: error: /builddir/.cache/pocl/kcache/temp_H7TiiK.cl:26:20: used type 'event_t' where arithmetic or pointer type is required
7: error: /builddir/.cache/pocl/kcache/temp_H7TiiK.cl:27:109: used type 'event_t' where arithmetic or pointer type is required
7: error: /builddir/.cache/pocl/kcache/temp_H7TiiK.cl:29:20: used type 'event_t' where arithmetic or pointer type is required
7: error: /builddir/.cache/pocl/kcache/temp_H7TiiK.cl:30:109: used type 'event_t' where arithmetic or pointer type is required
7: 
7: 
7:  ************************************************
7:  
7:          KERNEL SOURCE CODE
7:  ------------------------------------------------
7: #pragma OPENCL EXTENSION cl_khr_fp64 : disable
7: 
7: __kernel void compute_2(__global float *a_f2,
7:                         __global float *a_f1,
7:                         __global float *a_f3,
7:                         __global float *a_f4)
7: {
7:  uint index = get_local_id(0);
7:  uint groupID = get_group_id(0);
7:  __local float la_f1[125];
7:  __local float la_f2[125];
7:  (la_f1[index]=a_f1[((((index/25)+(groupID/4)*5)*100+(((index%25)/5)+((groupID%4)/2)*5)*10)+(((index%25)%5)+((groupID%4)%2)*5))]);
7:  (la_f2[index]=a_f2[((((index/25)+(groupID/4)*5)*100+(((index%25)/5)+((groupID%4)/2)*5)*10)+(((index%25)%5)+((groupID%4)%2)*5))]);
7:  (la_f1[(27+index)]=a_f1[(((((27+index)/25)+(groupID/4)*5)*100+((((27+index)%25)/5)+((groupID%4)/2)*5)*10)+((((27+index)%25)%5)+((groupID%4)%2)*5))]);
7:  (la_f2[(27+index)]=a_f2[(((((27+index)/25)+(groupID/4)*5)*100+((((27+index)%25)/5)+((groupID%4)/2)*5)*10)+((((27+index)%25)%5)+((groupID%4)%2)*5))]);
7:  (la_f1[(54+index)]=a_f1[(((((54+index)/25)+(groupID/4)*5)*100+((((54+index)%25)/5)+((groupID%4)/2)*5)*10)+((((54+index)%25)%5)+((groupID%4)%2)*5))]);
7:  (la_f2[(54+index)]=a_f2[(((((54+index)/25)+(groupID/4)*5)*100+((((54+index)%25)/5)+((groupID%4)/2)*5)*10)+((((54+index)%25)%5)+((groupID%4)%2)*5))]);
7:  (la_f1[(81+index)]=a_f1[(((((81+index)/25)+(groupID/4)*5)*100+((((81+index)%25)/5)+((groupID%4)/2)*5)*10)+((((81+index)%25)%5)+((groupID%4)%2)*5))]);
7:  (la_f2[(81+index)]=a_f2[(((((81+index)/25)+(groupID/4)*5)*100+((((81+index)%25)/5)+((groupID%4)/2)*5)*10)+((((81+index)%25)%5)+((groupID%4)%2)*5))]);
7:  if ((index<17))
7:  {
7:      (la_f1[(108+index)]=a_f1[(((((108+index)/25)+(groupID/4)*5)*100+((((108+index)%25)/5)+((groupID%4)/2)*5)*10)+((((108+index)%25)%5)+((groupID%4)%2)*5))]);
7:      (la_f2[(108+index)]=a_f2[(((((108+index)/25)+(groupID/4)*5)*100+((((108+index)%25)/5)+((groupID%4)/2)*5)*10)+((((108+index)%25)%5)+((groupID%4)%2)*5))]);
7:  };
7:  barrier(CLK_LOCAL_MEM_FENCE | CLK_GLOBAL_MEM_FENCE);
7:  event_t event_0 = (event_t)0;
7:  event_0 = async_work_group_copy(&((__global float *)a_f3)[125*groupID], &((__local float *)la_f1)[0], 125, (event_t)0);
7:  wait_group_events (1, &event_0);
7:  event_t event_1 = (event_t)0;
7:  event_1 = async_work_group_copy(&((__global float *)a_f4)[125*groupID], &((__local float *)la_f2)[0], 125, (event_t)0);
7:  wait_group_events (1, &event_1);
7: }
7: terminate called after throwing an instance of 'std::logic_error'
7:   what():  ASL ERROR: Program::build() (-11).
7:  ------------------------------------------------
7/8 Test #7: testASLData ......................***Exception: Other 12.22 sec
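
For anyone reading the logs above: every failing test ends with `ASL ERROR: Program::build() (-11)`. In the OpenCL headers, -11 is `CL_BUILD_PROGRAM_FAILURE`, meaning the device compiler rejected the kernel source, which matches the BUILD LOG sections printed before each exception. A minimal sketch of a code-to-name lookup for a handful of common error codes; the function name `clErrorName` is hypothetical and the table is deliberately incomplete:

```cpp
#include <string>

// Hypothetical helper: maps a few OpenCL error codes to their symbolic
// names as defined in the OpenCL headers. Codes not listed here fall
// through to a generic message.
std::string clErrorName(int code)
{
    switch (code) {
    case 0:   return "CL_SUCCESS";
    case -1:  return "CL_DEVICE_NOT_FOUND";
    case -2:  return "CL_DEVICE_NOT_AVAILABLE";
    case -3:  return "CL_COMPILER_NOT_AVAILABLE";
    case -5:  return "CL_OUT_OF_RESOURCES";
    case -6:  return "CL_OUT_OF_HOST_MEMORY";
    case -11: return "CL_BUILD_PROGRAM_FAILURE";
    default:  return "unknown OpenCL error " + std::to_string(code);
    }
}
```

Printing the symbolic name next to the numeric code in the exception text would make reports like these easier to triage.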