CHIP-SPV / chipStar

chipStar is a tool for compiling and running HIP/CUDA on SPIR-V via OpenCL or Level Zero APIs.

1.1-RC3: Intel Data Center GPU Max failures on OpenCL #739

Closed franz closed 6 months ago

franz commented 8 months ago
$ python3 ~/0/source/chipStar/scripts/check.py -v --num-threads 1 --num-tries 1 --timeout 600 -m off $PWD dgpu opencl

87% tests passed, 126 tests failed out of 950

Label Time Summary:
cuda        =  46.08 sec*proc (26 tests)
internal    = 101.76 sec*proc (73 tests)

Total Test time (real) = 1947.39 sec

The following tests FAILED:
          5 - Unit_deviceFunctions_CompileTest___fadd_rn_float (SEGFAULT)
         22 - Unit_deviceFunctions_CompileTest___fmul_rn_float (SEGFAULT)
         25 - Unit_deviceFunctions_CompileTest___frcp_rd_float (SEGFAULT)
         37 - Unit_deviceFunctions_CompileTest___fsub_rz_float (SEGFAULT)
         44 - Unit_deviceFunctions_CompileTest___sinf_float (SEGFAULT)
         48 - Unit_deviceFunctions_CompileTest___dadd_ru_double (SEGFAULT)
         53 - Unit_deviceFunctions_CompileTest___ddiv_rz_double (SEGFAULT)
         64 - Unit_deviceFunctions_CompileTest___dsqrt_ru_double (SEGFAULT)
         85 - Unit_deviceFunctions_CompileTest___hadd_int (SEGFAULT)
         87 - Unit_deviceFunctions_CompileTest___mul64hi_int (SEGFAULT)
         88 - Unit_deviceFunctions_CompileTest___mulhi_int (Subprocess aborted)
         91 - Unit_deviceFunctions_CompileTest___rhadd_int (Failed)
         93 - Unit_deviceFunctions_CompileTest___uhadd_int (SEGFAULT)
         94 - Unit_deviceFunctions_CompileTest___umul24_int (SEGFAULT)
         95 - Unit_deviceFunctions_CompileTest___umul64hi_int (SEGFAULT)
        104 - Unit_deviceFunctions_CompileTest_atanf_float (SEGFAULT)
        106 - Unit_deviceFunctions_CompileTest_cbrtf_float (SEGFAULT)
        112 - Unit_deviceFunctions_CompileTest_cyl_bessel_i0f_float (SEGFAULT)
        115 - Unit_deviceFunctions_CompileTest_erfcinvf_float (SEGFAULT)
        118 - Unit_deviceFunctions_CompileTest_erfinvf_float (SEGFAULT)
        124 - Unit_deviceFunctions_CompileTest_fdimf_float (Subprocess aborted)
        125 - Unit_deviceFunctions_CompileTest_fdividef_float (SEGFAULT)
        127 - Unit_deviceFunctions_CompileTest_fmaf_float (SEGFAULT)
        130 - Unit_deviceFunctions_CompileTest_fmodf_float (SEGFAULT)
        132 - Unit_deviceFunctions_CompileTest_hypotf_float (SEGFAULT)
        133 - Unit_deviceFunctions_CompileTest_ilogbf_float (SEGFAULT)
        135 - Unit_deviceFunctions_CompileTest_isinf_float (SEGFAULT)
        136 - Unit_deviceFunctions_CompileTest_isnan_float (SEGFAULT)
        137 - Unit_deviceFunctions_CompileTest_j0f_float (Subprocess aborted)
        139 - Unit_deviceFunctions_CompileTest_jnf_float (SEGFAULT)
        142 - Unit_deviceFunctions_CompileTest_llrintf_float (SEGFAULT)
        145 - Unit_deviceFunctions_CompileTest_log1pf_float (SEGFAULT)
        149 - Unit_deviceFunctions_CompileTest_lrintf_float (Subprocess aborted)
        150 - Unit_deviceFunctions_CompileTest_lroundf_float (SEGFAULT)
        153 - Unit_deviceFunctions_CompileTest_modff_float (Subprocess aborted)
        154 - Unit_deviceFunctions_CompileTest_nanf_float (SEGFAULT)
        155 - Unit_deviceFunctions_CompileTest_nearbyintf_float (SEGFAULT)
        156 - Unit_deviceFunctions_CompileTest_nextafterf_float (SEGFAULT)
        157 - Unit_deviceFunctions_CompileTest_norm3df_float (SEGFAULT)
        159 - Unit_deviceFunctions_CompileTest_normcdff_float (SEGFAULT)
        160 - Unit_deviceFunctions_CompileTest_normcdfinvf_float (SEGFAULT)
        171 - Unit_deviceFunctions_CompileTest_roundf_float (SEGFAULT)
        173 - Unit_deviceFunctions_CompileTest_scalblnf_float (SEGFAULT)
        174 - Unit_deviceFunctions_CompileTest_scalbnf_float (SEGFAULT)
        175 - Unit_deviceFunctions_CompileTest_signbit_float (SEGFAULT)
        177 - Unit_deviceFunctions_CompileTest_sincospif_float (SEGFAULT)
        180 - Unit_deviceFunctions_CompileTest_sinpif_float (Subprocess aborted)
        181 - Unit_deviceFunctions_CompileTest_sqrtf_float (Failed)
        184 - Unit_deviceFunctions_CompileTest_tgammaf_float (SEGFAULT)
        186 - Unit_deviceFunctions_CompileTest_y0f_float (Failed)
        196 - Unit_deviceFunctions_CompileTest_cbrt_double (SEGFAULT)
        197 - Unit_deviceFunctions_CompileTest_ceil_double (SEGFAULT)
        203 - Unit_deviceFunctions_CompileTest_cyl_bessel_i1_double (Subprocess aborted)
        206 - Unit_deviceFunctions_CompileTest_erfcinv_double (SEGFAULT)
        207 - Unit_deviceFunctions_CompileTest_erfcx_double (Failed)
        209 - Unit_deviceFunctions_CompileTest_exp_double (SEGFAULT)
        214 - Unit_deviceFunctions_CompileTest_fdim_double (SEGFAULT)
        218 - Unit_deviceFunctions_CompileTest_fmin_double (SEGFAULT)
        226 - Unit_deviceFunctions_CompileTest_j0_double (SEGFAULT)
        227 - Unit_deviceFunctions_CompileTest_j1_double (SEGFAULT)
        231 - Unit_deviceFunctions_CompileTest_llrint_double (SEGFAULT)
        232 - Unit_deviceFunctions_CompileTest_llround_double (SEGFAULT)
        239 - Unit_deviceFunctions_CompileTest_lround_double (SEGFAULT)
        243 - Unit_deviceFunctions_CompileTest_nan_double (SEGFAULT)
        244 - Unit_deviceFunctions_CompileTest_nearbyint_double (Subprocess aborted)
        249 - Unit_deviceFunctions_CompileTest_normcdf_double (SEGFAULT)
        257 - Unit_deviceFunctions_CompileTest_rnorm_double (SEGFAULT)
        259 - Unit_deviceFunctions_CompileTest_rnorm4d_double (SEGFAULT)
        272 - Unit_deviceFunctions_CompileTest_tanh_double (SEGFAULT)
        277 - Unit_deviceFunctions_CompileTest_yn_double (SEGFAULT)
        280 - Unit_deviceFunctions_CompileTest_abs_longlongint_int (SEGFAULT)
        282 - Unit_deviceFunctions_CompileTest_labs_longlongint_int (SEGFAULT)
        284 - Unit_deviceFunctions_CompileTest_max_int (SEGFAULT)
        288 - Unit_deviceFunctions_CompileTest___float_as_uint_unsigned (SEGFAULT)
        290 - Unit_deviceFunctions_CompileTest___longlong_as_double_double (SEGFAULT)
        296 - Unit_deviceFunctions_CompileTest_atomicAdd_double (Failed)
        297 - Unit_deviceFunctions_CompileTest_atomicAdd_system_int (SEGFAULT)
        298 - Unit_deviceFunctions_CompileTest_atomicAdd_system_usigned_int (Failed)
        299 - Unit_deviceFunctions_CompileTest_atomicAdd_system_unsigned_long_long (SEGFAULT)
        300 - Unit_deviceFunctions_CompileTest_atomicAdd_system_float (Failed)
        301 - Unit_deviceFunctions_CompileTest_atomicAdd_system_double (SEGFAULT)
        302 - Unit_deviceFunctions_CompileTest_atomicAnd_int (SEGFAULT)
        303 - Unit_deviceFunctions_CompileTest_atomicAnd_unsigned_int (SEGFAULT)
        305 - Unit_deviceFunctions_CompileTest_atomicAnd_system_int (Failed)
        306 - Unit_deviceFunctions_CompileTest_atomicAnd_system_unsigned_int (SEGFAULT)
        307 - Unit_deviceFunctions_CompileTest_atomicAnd_system_unsigned_long_long (SEGFAULT)
        311 - Unit_deviceFunctions_CompileTest_atomicCAS_system_int (SEGFAULT)
        316 - Unit_deviceFunctions_CompileTest_atomicExch_int (SEGFAULT)
        318 - Unit_deviceFunctions_CompileTest_atomicExch_unsigned_long_long (Subprocess aborted)
        322 - Unit_deviceFunctions_CompileTest_atomicExch_system_unsigned_long_long (SEGFAULT)
        330 - Unit_deviceFunctions_CompileTest_atomicMax_system_usigned_int (SEGFAULT)
        333 - Unit_deviceFunctions_CompileTest_atomicMin_unsigned_long_long (SEGFAULT)
        336 - Unit_deviceFunctions_CompileTest_atomicOr_int (SEGFAULT)
        337 - Unit_deviceFunctions_CompileTest_atomicOr_usigned_int (Subprocess aborted)
        373 - Unit_hipGraphAddMemcpyNodeFromSymbol_GlobalMemory (SEGFAULT)
        374 - Unit_hipGraphAddMemcpyNodeFromSymbol_GlobalConstMemory (SEGFAULT)
        380 - Unit_hipGraphAddMemcpyNodeToSymbol_GlobalMemory (SEGFAULT)
        381 - Unit_hipGraphAddMemcpyNodeToSymbol_GlobalConstMemory (SEGFAULT)
        406 - Unit_hipGraphMemcpyNodeSetParamsFromSymbol_Functional (SEGFAULT)
        581 - Unit_hipMemcpyFromToSymbol_Negative (SEGFAULT)
        582 - Unit_hipMemcpyToFromSymbol_SyncAndAsync (SEGFAULT)
        589 - Unit_hipMemcpyWithStream_TestwithTwoStream (Subprocess aborted)
        604 - Unit_hipMalloc_LoopRegressionAllocFreeCycles (Subprocess aborted)
        611 - Unit_hipHostMalloc_Basic (Timeout)
        618 - Unit_hipMemcpy_KernelLaunch - double (SEGFAULT)
        636 - Unit_hipMemcpyAsync_hipMultiMemcpyMultiThreadMultiStream - int (SEGFAULT)
        708 - Unit_hipMultiStream_sameDevice (SEGFAULT)
        714 - Unit_hipStreamCreate_MultistreamBasicFunctionalities (Subprocess aborted)
        798 - Unit_hipClassKernel_Virtual (Subprocess aborted)
        799 - Unit_hipClassKernel_Value (Subprocess aborted)
        806 - ABM_AddKernel_MultiTypeMultiSize - float (SEGFAULT)
        815 - syncthreadsExitedThreads (Timeout)
        841 - TestWholeProgramCompilation (Failed)
        850 - hipcc-TestFastMath (Failed)
        852 - TestLazyModuleInit (SEGFAULT)
        856 - TestLargeGlobalVar (SEGFAULT)
        858 - TestGlobalVarInit (SEGFAULT)
        862 - TestStlFunctionsDouble (SEGFAULT)
        899 - hipConstantTestDeviceSymbol (SEGFAULT)
        900 - hipTestSymbolReset (SEGFAULT)
        903 - hipTestVariableTemplateSymbols (SEGFAULT)
        924 - hipTestDeviceLink (SEGFAULT)
        937 - cuda-convolutionSeparable (Subprocess aborted)
        940 - cuda-binomialoptions (SEGFAULT)
        942 - cuda-qrng (Failed)
        947 - cuda-FDTD3d (SEGFAULT)
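
To see how the failures split by kind, the list above can be tallied with a short script. This is a minimal sketch, not part of chipStar: `tally_failures` and `SAMPLE` are hypothetical names, and the sample holds only a handful of lines copied from the log, so the counts it prints describe the sample and not the full run.

```python
import re
from collections import Counter

# Matches lines such as "    5 - Unit_..._float (SEGFAULT)".
# The test name may itself contain " - " (e.g. "... - double").
FAILED_LINE = re.compile(r"^\s*(\d+)\s+-\s+(.+?)\s+\(([^()]+)\)\s*$")


def tally_failures(ctest_output: str) -> Counter:
    """Count failed tests per failure kind in a ctest summary."""
    counts = Counter()
    in_failed_section = False
    for line in ctest_output.splitlines():
        if line.startswith("The following tests FAILED"):
            in_failed_section = True
            continue
        if not in_failed_section:
            continue
        match = FAILED_LINE.match(line)
        if match:
            counts[match.group(3)] += 1
        elif line.strip():
            # First non-matching, non-empty line ends the section.
            break
    return counts


# A few lines copied from the log above (not the complete list).
SAMPLE = """\
The following tests FAILED:
          5 - Unit_deviceFunctions_CompileTest___fadd_rn_float (SEGFAULT)
         88 - Unit_deviceFunctions_CompileTest___mulhi_int (Subprocess aborted)
         91 - Unit_deviceFunctions_CompileTest___rhadd_int (Failed)
        611 - Unit_hipHostMalloc_Basic (Timeout)
        618 - Unit_hipMemcpy_KernelLaunch - double (SEGFAULT)
"""

for kind, count in tally_failures(SAMPLE).most_common():
    print(f"{kind}: {count}")
```

Feeding the complete ctest output to `tally_failures` would give the real per-kind totals for this run.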

pvelesko commented 8 months ago

Can you please also post all the relevant system info? These look like the same tests that fail on Aurora (except that they time out on Aurora). Can you also post the results for RC1 here?

franz commented 8 months ago

1.1-RC1 results:

99% tests passed, 1 tests failed out of 948

Label Time Summary:
cuda        =  66.70 sec*proc (26 tests)
internal    = 115.76 sec*proc (73 tests)

Total Test time (real) = 1426.78 sec

The following tests did not run:
        539 - Unit_hipMallocManaged_HostDeviceConcurrent (Skipped)
        540 - Unit_hipMallocManaged_MultiChunkSingleDevice (Skipped)
        541 - Unit_hipMallocManaged_MultiChunkMultiDevice (Skipped)
        542 - Unit_hipMallocManaged_OverSubscription (Skipped)
        543 - Unit_hipMallocManaged_TwoPointers - int (Skipped)
        544 - Unit_hipMallocManaged_TwoPointers - float (Skipped)
        545 - Unit_hipMallocManaged_TwoPointers - double (Skipped)
        571 - Unit_hipMallocManaged_FlgParam (Skipped)
        572 - Unit_hipMallocManaged_AccessMultiStream (Skipped)
        604 - Unit_hipMallocManaged_Advanced (Skipped)
        681 - Unit_hipMemsetSync (Skipped)
        682 - Unit_hipMemsetDSync - int8_t (Skipped)
        683 - Unit_hipMemsetDSync - int16_t (Skipped)
        684 - Unit_hipMemsetDSync - uint32_t (Skipped)
        685 - Unit_hipMemset2DSync (Skipped)
        686 - Unit_hipMemset3DSync (Skipped)
        754 - Unit_hipDeviceTotalMem_NonSelectedDevice (Skipped)
        759 - Unit_hipGetDeviceCount_HideDevices (Skipped)
        766 - Unit_hipSetGetDevice_Positive_Threaded_Basic (Skipped)
        768 - Unit_hipDeviceGetP2PAttribute_Basic (Skipped)
        769 - Unit_hipDeviceGetP2PAttribute_Negative (Skipped)
        770 - Unit_hipDeviceCanAccessPeer_positive (Skipped)
        771 - Unit_hipDeviceCanAccessPeer_negative (Skipped)
        772 - Unit_hipDeviceEnableDisablePeerAccess_positive (Skipped)
        773 - Unit_hipDeviceEnablePeerAccess_negative (Skipped)
        774 - Unit_hipDeviceDisablePeerAccess_negative (Skipped)

The following tests FAILED:
        817 - syncthreadsExitedThreads (Timeout)

system info:

Clang/LLVM & SPIRV-translator: both from `https://github.com/CHIP-SPV/`, branch `chipStar-llvm-17`
OS: Ubuntu 22.04.2 LTS
GCC: gcc (Ubuntu 11.4.0-1ubuntu1~22.04)
CPU: Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz
GPU: Intel(R) Data Center GPU Max 1550
GPU driver version: 23.35.27191.40

RC3 was built with:

cmake -G Ninja -DCHIP_USE_INTEL_USM=ON -DCMAKE_INSTALL_PREFIX=/nfs/site/home/mbabej/0/INSTALL/chip_17_11rc3_strict -DLLVM_CONFIG_BIN=/nfs/site/home/mbabej/0/INSTALL/LLVM_17/bin/llvm-config  -DCMAKE_BUILD_TYPE=Release -DCMAKE_CXX_FLAGS_RELEASE="-O0 -ggdb3 -march=native"   /nfs/site/home/mbabej/0/source/chipStar

Changing `CMAKE_CXX_FLAGS_RELEASE` to `-O2 -march=native -DNDEBUG` does not seem to make any difference.

pjaaskel commented 8 months ago

OK, this is a regression between RC1 and RC3. We need to investigate, and it blocks 1.1.

pvelesko commented 8 months ago

@franz Can you also post the IGC version and a copy of the stack trace?

franz commented 8 months ago

IGC (libigc1) version: 1.0.14508.23-704~22.04
intel-opencl-icd version: 23.35.27191.40-775~22.04

example stack trace:

gdb-oneapi ./samples/cuda_samples/cuda-FDTD3d

Thread 1 "cuda-FDTD3d" received signal SIGSEGV, Segmentation fault.
0x0000155554c513fe in __GI___libc_free (mem=0x100000001) at ./malloc/malloc.c:3368
3368    ./malloc/malloc.c: No such file or directory.
(gdb) bt
#0  0x0000155554c513fe in __GI___libc_free (mem=0x100000001) at ./malloc/malloc.c:3368
#1  0x000015553fc46944 in ?? () from /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so
#2  0x000015553fc476f2 in ?? () from /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so
#3  0x000015553fc4ae78 in ?? () from /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so
#4  0x000015553fc4c6c5 in ?? () from /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so
#5  0x000015553f7cfa9b in ?? () from /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so
#6  0x000015553f7cfc0d in ?? () from /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so
#7  0x000015553f7a0995 in ?? () from /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so
#8  0x000015553f7a0bdd in ?? () from /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so
#9  0x000015553f7a6066 in ?? () from /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so
#10 0x000015553f7a616d in ?? () from /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so
#11 0x000015553f723e2b in ?? () from /usr/lib/x86_64-linux-gnu/intel-opencl/libigdrcl.so
#12 0x00001555554a0aae in cl::detail::Wrapper<_cl_kernel*>::~Wrapper() () from /nfs/site/home/mbabej/0/build/b_chip_17_FULL_BUILD/libCHIP.so
#13 0x00001555554a17b3 in CHIPKernelOpenCL::~CHIPKernelOpenCL() () from /nfs/site/home/mbabej/0/build/b_chip_17_FULL_BUILD/libCHIP.so
#14 0x00001555554a17d7 in CHIPKernelOpenCL::~CHIPKernelOpenCL() () from /nfs/site/home/mbabej/0/build/b_chip_17_FULL_BUILD/libCHIP.so
#15 0x000015555542ed9b in chipstar::Module::~Module() () from /nfs/site/home/mbabej/0/build/b_chip_17_FULL_BUILD/libCHIP.so
#16 0x00001555554a1620 in CHIPModuleOpenCL::~CHIPModuleOpenCL() () from /nfs/site/home/mbabej/0/build/b_chip_17_FULL_BUILD/libCHIP.so
#17 0x0000155555430c6a in chipstar::Device::~Device() () from /nfs/site/home/mbabej/0/build/b_chip_17_FULL_BUILD/libCHIP.so
#18 0x00001555554a171d in CHIPDeviceOpenCL::~CHIPDeviceOpenCL() () from /nfs/site/home/mbabej/0/build/b_chip_17_FULL_BUILD/libCHIP.so
#19 0x00001555554a165d in CHIPContextOpenCL::~CHIPContextOpenCL() () from /nfs/site/home/mbabej/0/build/b_chip_17_FULL_BUILD/libCHIP.so
#20 0x00001555554a16a7 in CHIPContextOpenCL::~CHIPContextOpenCL() () from /nfs/site/home/mbabej/0/build/b_chip_17_FULL_BUILD/libCHIP.so
#21 0x0000155555434e39 in chipstar::Backend::~Backend() () from /nfs/site/home/mbabej/0/build/b_chip_17_FULL_BUILD/libCHIP.so
#22 0x00001555554a18b5 in CHIPBackendOpenCL::~CHIPBackendOpenCL() () from /nfs/site/home/mbabej/0/build/b_chip_17_FULL_BUILD/libCHIP.so
#23 0x000015555542a980 in CHIPUninitializeCallOnce() () from /nfs/site/home/mbabej/0/build/b_chip_17_FULL_BUILD/libCHIP.so
#24 0x0000155554c45ee8 in __pthread_once_slow (once_control=0x155555517518 <Uninitialized>, init_routine=0x155554eaed50 <__once_proxy>)
    at ./nptl/pthread_once.c:116
#25 0x000015555542ae45 in void std::call_once<void (*)()>(std::once_flag&, void (*&&)()) ()
   from /nfs/site/home/mbabej/0/build/b_chip_17_FULL_BUILD/libCHIP.so
#26 0x000015555542a806 in CHIPUninitialize() () from /nfs/site/home/mbabej/0/build/b_chip_17_FULL_BUILD/libCHIP.so
#27 0x0000155555477d88 in __hipUnregisterFatBinary () from /nfs/site/home/mbabej/0/build/b_chip_17_FULL_BUILD/libCHIP.so
#28 0x0000555555556ae2 in __hip_module_dtor ()
#29 0x0000155554bf1495 in __run_exit_handlers (status=0, listp=0x155554dc5838 <__exit_funcs>, run_list_atexit=run_list_atexit@entry=true,
    run_dtors=run_dtors@entry=true) at ./stdlib/exit.c:113
#30 0x0000155554bf1610 in __GI_exit (status=<optimized out>) at ./stdlib/exit.c:143
#31 0x0000555555557c8a in main ()

pvelesko commented 8 months ago

Is the crash similar for the super simple tests, like Unit_deviceFunctions_CompileTest___fmul_rn_float?

pjaaskel commented 8 months ago

And is the freed pointer always something that looks like a sentinel/dummy value, such as mem=0x100000001?

pjaaskel commented 8 months ago

Is git bisect helpful here?

franz commented 8 months ago

Finished git bisect: 228281b98eef35e2e7bd6ebbb1ae2d35d46590bb ("Add capability based HIP device library link") is the first bad commit.
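
A bisect like this can be automated with `git bisect run`, which treats exit code 0 as good, 125 as "skip this commit", and any other code from 1 to 127 as bad. The script below is a minimal sketch of such a predicate, not what was actually used here: the file name `bisect_check.py` and the function names are hypothetical, the 600-second limit mirrors the `--timeout 600` from the test run, and counting a hang as bad is an assumption. It also assumes something else rebuilds chipStar at each step.

```python
import subprocess
import sys

# Exit codes understood by `git bisect run`.
GOOD, BAD, SKIP = 0, 1, 125


def bisect_verdict(returncode: int) -> int:
    """Map a test binary's return code to a `git bisect run` exit code."""
    if returncode == 0:
        return GOOD
    # subprocess reports death by signal as a negative code
    # (e.g. -11 for SIGSEGV); any non-zero result counts as bad.
    return BAD


def main(argv) -> int:
    if len(argv) < 2:
        print("usage: bisect_check.py <test-binary> [args...]", file=sys.stderr)
        return SKIP
    try:
        result = subprocess.run(argv[1:], timeout=600)
    except (FileNotFoundError, PermissionError):
        # Binary missing or not executable, e.g. the build failed here.
        return SKIP
    except subprocess.TimeoutExpired:
        # Assumption: a hang is treated as a failure.
        return BAD
    return bisect_verdict(result.returncode)


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

After marking one known-good and one known-bad revision, it could be invoked as `git bisect run python3 bisect_check.py ./samples/cuda_samples/cuda-FDTD3d`, wrapped in a step that rebuilds the tree first.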

franz commented 8 months ago

I've checked with different driver versions: 23.30.26918.19 and 23.30.26918.20 do not crash; 23.30.26918.28, 23.30.26918.50, and 23.35.27191.29 do crash.
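
Comparing the versions as integer tuples narrows down where the driver behaviour changes. This is a small sketch using only the versions reported in this thread. `parse_version` is a hypothetical helper, and it assumes the dotted fields order numerically.

```python
def parse_version(text: str) -> tuple:
    """Turn '23.30.26918.20' into (23, 30, 26918, 20) for ordering."""
    return tuple(int(part) for part in text.split("."))


# Driver versions reported in this thread.
good = ["23.30.26918.19", "23.30.26918.20"]
bad = [
    "23.30.26918.28",
    "23.30.26918.50",
    "23.35.27191.29",
    "23.35.27191.40",  # the version from the original report
]

last_good = max(good, key=parse_version)
first_bad = min(bad, key=parse_version)

# The two sets do not interleave, so there is a single boundary.
assert parse_version(last_good) < parse_version(first_bad)
print(f"last good: {last_good}, first bad: {first_bad}")
# prints "last good: 23.30.26918.20, first bad: 23.30.26918.28"
```

So, among the versions tested, the change in behaviour falls between 23.30.26918.20 and 23.30.26918.28.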