JuliaGPU / oneAPI.jl

Julia support for the oneAPI programming toolkit.
https://juliagpu.org/oneapi/
Other
179 stars 21 forks source link

SYCL batching causes invalid results #445

Open sylvaticus opened 2 months ago

sylvaticus commented 2 months ago

Hello, I am trying to run a simplified "forward" passage of a neural network with GPU. On CUDA/CuArray I have always the same, correct results for my output, but with oneAPI/oneArray, the first time I have the correct result, but the subsequent times I have random results and random crashes (but never the first time). Any clue ?

using Test, oneAPI, BenchmarkTools, LinearAlgebra

# Function definitions
relu(x) = max(0,x)
forward_layer(x,w,w0,f) = f.(w*x .+ w0)
function forward_network!(y,x,w1,w2,w3,w01,w02,w03,f=relu)
    x1 = forward_layer(x,w1,w01,f)
    x2 = forward_layer(x1,w2,w02,f)
    y  .= forward_layer(x2,w3,w03,identity)
    return nothing
end

# CPU data
(nd0,nd1,nd2,ndy) = (200,300,300,1)
x   = rand(Float32,nd0);      y = Vector{Float32}(undef,ndy)
w1  = rand(Float32,nd1,nd0); w2 = rand(Float32,nd2,nd1); w3 = rand(Float32,ndy,nd2)
w01 = rand(Float32,nd1);    w02 = rand(Float32,nd2);    w03 = rand(Float32,ndy);
# CPU call
forward_network!(y,x,w1,w2,w3,w01,w02,w03,relu)

# GPU data
y_g   = oneArray{Float32}(undef,ndy)
x_g   = oneArray(x)
w1_g  = oneArray(w1);  w2_g  = oneArray(w2);  w3_g  = oneArray(w3);
w01_g = oneArray(w01); w02_g = oneArray(w02); w03_g = oneArray(w03); 
# GPU call
forward_network!(y_g,x_g,w1_g,w2_g,w3_g,w01_g,w02_g,w03_g,relu)

# Correctness check..
y ≈ Array(y_g) # true

# Second (and further) attempt..
y_g   = oneArray{Float32}(undef,ndy)
forward_network!(y_g,x_g,w1_g,w2_g,w3_g,w01_g,w02_g,w03_g,relu)
y ≈ Array(y_g) # false !

Perhaps linked to https://github.com/JuliaGPU/oneAPI.jl/issues/327 ?

Ubuntu 22.04, oneAPI v1.5.0, Intel CPU i5-8350U, UHD Graphics 620

maleadt commented 2 months ago

MWE for the correctness issue:

julia> oneMKL.gemv!('N', 1f0, oneAPI.ones(Float32, 3, 2), oneAPI.ones(Float32, 2), 0f0, oneAPI.zeros(Float32, 3))
3-element oneArray{Float32, 1, oneAPI.oneL0.DeviceBuffer}:
 2.0
 2.0
 2.0

julia> oneMKL.gemv!('N', 1f0, oneAPI.ones(Float32, 3, 2), oneAPI.ones(Float32, 2), 0f0, oneAPI.zeros(Float32, 3))
3-element oneArray{Float32, 1, oneAPI.oneL0.DeviceBuffer}:
 0.0
 0.0
 0.0

Crashes I've seen:

[14750] signal (11.1): Segmentation fault
in expression starting at REPL[51]:1
NEO::DrmAllocation::makeBOsResident(NEO::OsContext*, unsigned int, std::vector<NEO::BufferObject*, std::allocator<NEO::BufferObject*> >*, bool) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
NEO::DrmCommandStreamReceiver<NEO::XeHpgCoreFamily>::processResidency(std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> > const&, unsigned int) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
NEO::DrmCommandStreamReceiver<NEO::XeHpgCoreFamily>::flushInternal(NEO::BatchBuffer const&, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> > const&) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
NEO::DrmCommandStreamReceiver<NEO::XeHpgCoreFamily>::flush(NEO::BatchBuffer&, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
NEO::CommandStreamReceiver::submitBatchBuffer(NEO::BatchBuffer&, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
L0::CommandQueueImp::submitBatchBuffer(unsigned long, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&, void*, bool) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
L0::CommandQueueHw<(GFXCORE_FAMILY)3079>::executeCommandListsRegular(L0::CommandQueueHw<(GFXCORE_FAMILY)3079>::CommandListExecutionContext&, unsigned int, _ze_command_list_handle_t**, _ze_fence_handle_t*, _ze_event_handle_t*, unsigned int, _ze_event_handle_t**) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
L0::CommandQueueHw<(GFXCORE_FAMILY)3079>::executeCommandLists(unsigned int, _ze_command_list_handle_t**, _ze_fence_handle_t*, bool, _ze_event_handle_t*, unsigned int, _ze_event_handle_t**) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
L0::zeCommandQueueExecuteCommandLists(_ze_command_queue_handle_t*, unsigned int, _ze_command_list_handle_t**, _ze_fence_handle_t*) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
ur_queue_handle_t_::executeCommandList(std::__1::__hash_map_iterator<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::__hash_value_type<_ze_command_list_handle_t*, ur_command_list_info_t>, void*>*> >, bool, bool) at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libpi_level_zero.so (unknown line)
urEnqueueKernelLaunch at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libpi_level_zero.so (unknown line)
piEnqueueKernelLaunch at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libpi_level_zero.so (unknown line)
_pi_result sycl::_V1::detail::plugin::call_nocheck<(sycl::_V1::detail::PiApiKind)76, _pi_queue*, _pi_kernel*, unsigned long, unsigned long*, unsigned long*, unsigned long*, unsigned long, _pi_event**, _pi_event**>(_pi_queue*, _pi_kernel*, unsigned long, unsigned long*, unsigned long*, unsigned long*, unsigned long, _pi_event**, _pi_event**) const at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libsycl.so.7 (unknown line)
sycl::_V1::detail::enqueueImpKernel(std::shared_ptr<sycl::_V1::detail::queue_impl> const&, sycl::_V1::detail::NDRDescT&, std::vector<sycl::_V1::detail::ArgDesc, std::allocator<sycl::_V1::detail::ArgDesc> >&, std::shared_ptr<sycl::_V1::detail::kernel_bundle_impl> const&, std::shared_ptr<sycl::_V1::detail::kernel_impl> const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::vector<_pi_event*, std::allocator<_pi_event*> >&, std::shared_ptr<sycl::_V1::detail::event_impl> const&, std::function<void* (sycl::_V1::detail::AccessorImplHost*)> const&, _pi_kernel_cache_config) at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libsycl.so.7 (unknown line)
_ZZN4sycl3_V17handler8finalizeEvENK3 at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libsycl.so.7 (unknown line)
sycl::_V1::handler::finalize() at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libsycl.so.7 (unknown line)
void sycl::_V1::detail::queue_impl::finalizeHandler<sycl::_V1::handler>(sycl::_V1::handler&, sycl::_V1::detail::CG::CGTYPE const&, sycl::_V1::event&) at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libsycl.so.7 (unknown line)
sycl::_V1::detail::queue_impl::submit_impl(std::function<void (sycl::_V1::handler&)> const&, std::shared_ptr<sycl::_V1::detail::queue_impl> const&, std::shared_ptr<sycl::_V1::detail::queue_impl> const&, std::shared_ptr<sycl::_V1::detail::queue_impl> const&, sycl::_V1::detail::code_location const&, std::function<void (bool, bool, sycl::_V1::event&)> const*) at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libsycl.so.7 (unknown line)
sycl::_V1::detail::queue_impl::submit(std::function<void (sycl::_V1::handler&)> const&, std::shared_ptr<sycl::_V1::detail::queue_impl> const&, sycl::_V1::detail::code_location const&, std::function<void (bool, bool, sycl::_V1::event&)> const*) at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libsycl.so.7 (unknown line)
sycl::_V1::queue::submit_impl(std::function<void (sycl::_V1::handler&)>, sycl::_V1::detail::code_location const&) at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libsycl.so.7 (unknown line)
oneapi::mkl::gpu::sscal_sycl_internal(sycl::_V1::queue*, long, oneapi::mkl::value_or_pointer<float>, float*, long, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&, bool) at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libmkl_sycl_blas.so.4 (unknown line)
oneapi::mkl::gpu::sscal_sycl(sycl::_V1::queue*, long, oneapi::mkl::value_or_pointer<float>, float*, long, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&, bool) at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libmkl_sycl_blas.so.4 (unknown line)
oneapi::mkl::gpu::sgemv_sycl_internal(sycl::_V1::queue*, MKL_LAYOUT, MKL_TRANSPOSE, long, long, oneapi::mkl::value_or_pointer<float>, float const*, long, float const*, long, oneapi::mkl::value_or_pointer<float>, float*, long, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&) at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libmkl_sycl_blas.so.4 (unknown line)
oneapi::mkl::gpu::sgemv_sycl(sycl::_V1::queue*, MKL_LAYOUT, MKL_TRANSPOSE, long, long, oneapi::mkl::value_or_pointer<float>, float const*, long, float const*, long, oneapi::mkl::value_or_pointer<float>, float*, long, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&) at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libmkl_sycl_blas.so.4 (unknown line)
oneapi::mkl::blas::sgemv(sycl::_V1::queue&, MKL_LAYOUT, oneapi::mkl::transpose, long, long, oneapi::mkl::value_or_pointer<float>, float const*, long, float const*, long, oneapi::mkl::value_or_pointer<float>, float*, long, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&) at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libmkl_sycl_blas.so.4 (unknown line)
oneapi::mkl::blas::column_major::gemv(sycl::_V1::queue&, oneapi::mkl::transpose, long, long, oneapi::mkl::value_or_pointer<float>, float const*, long, float const*, long, oneapi::mkl::value_or_pointer<float>, float*, long, std::vector<sycl::_V1::event, std::allocator<sycl::_V1::event> > const&) at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libmkl_sycl_blas.so.4 (unknown line)
onemklSgemv at /workspace/srcdir/oneAPI.jl/deps/src/onemkl.cpp:716
onemklSgemv at /home/tim/Julia/pkg/oneAPI/lib/support/liboneapi_support.jl:750

And during process exit:

[13420] signal (11.1): Segmentation fault
in expression starting at none:0
NEO::DrmAllocation::bindBO(NEO::BufferObject*, NEO::OsContext*, unsigned int, std::vector<NEO::BufferObject*, std::allocator<NEO::BufferObject*> >*, bool) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
NEO::DrmAllocation::makeBOsResident(NEO::OsContext*, unsigned int, std::vector<NEO::BufferObject*, std::allocator<NEO::BufferObject*> >*, bool) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
NEO::DrmCommandStreamReceiver<NEO::XeHpgCoreFamily>::processResidency(std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> > const&, unsigned int) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
NEO::DrmCommandStreamReceiver<NEO::XeHpgCoreFamily>::flushInternal(NEO::BatchBuffer const&, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> > const&) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
NEO::DrmCommandStreamReceiver<NEO::XeHpgCoreFamily>::flush(NEO::BatchBuffer&, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
NEO::CommandStreamReceiver::submitBatchBuffer(NEO::BatchBuffer&, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
L0::CommandQueueImp::submitBatchBuffer(unsigned long, std::vector<NEO::GraphicsAllocation*, std::allocator<NEO::GraphicsAllocation*> >&, void*, bool) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
L0::CommandQueueHw<(GFXCORE_FAMILY)3079>::executeCommandListsRegular(L0::CommandQueueHw<(GFXCORE_FAMILY)3079>::CommandListExecutionContext&, unsigned int, _ze_command_list_handle_t**, _ze_fence_handle_t*, _ze_event_handle_t*, unsigned int, _ze_event_handle_t**) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
L0::CommandQueueHw<(GFXCORE_FAMILY)3079>::executeCommandLists(unsigned int, _ze_command_list_handle_t**, _ze_fence_handle_t*, bool, _ze_event_handle_t*, unsigned int, _ze_event_handle_t**) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
L0::zeCommandQueueExecuteCommandLists(_ze_command_queue_handle_t*, unsigned int, _ze_command_list_handle_t**, _ze_fence_handle_t*) at /home/tim/.julia/artifacts/f6b6f7783395fabf32b0337c23e95719f94b00fd/lib/libze_intel_gpu.so.1 (unknown line)
ur_queue_handle_t_::executeCommandList(std::__1::__hash_map_iterator<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::__hash_value_type<_ze_command_list_handle_t*, ur_command_list_info_t>, void*>*> >, bool, bool) at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libpi_level_zero.so (unknown line)
ur_queue_handle_t_::executeAllOpenCommandLists() at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libpi_level_zero.so (unknown line)
urQueueRelease at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libpi_level_zero.so (unknown line)
piQueueRelease at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libpi_level_zero.so (unknown line)
_pi_result sycl::_V1::detail::plugin::call_nocheck<(sycl::_V1::detail::PiApiKind)26, _pi_queue*>(_pi_queue*) const at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libsycl.so.7 (unknown line)
sycl::_V1::detail::queue_impl::~queue_impl() at /home/tim/.julia/artifacts/9ee16b343b9f98a2396dcd835a5203fa30b10f26/lib/libsycl.so.7 (unknown line)
_M_release at /opt/x86_64-linux-gnu/x86_64-linux-gnu/include/c++/8.1.0/bits/shared_ptr_base.h:161 [inlined]
~__shared_count at /opt/x86_64-linux-gnu/x86_64-linux-gnu/include/c++/8.1.0/bits/shared_ptr_base.h:712 [inlined]
~__shared_ptr at /opt/x86_64-linux-gnu/x86_64-linux-gnu/include/c++/8.1.0/bits/shared_ptr_base.h:1151 [inlined]
~queue at /opt/x86_64-linux-gnu/x86_64-linux-gnu/sys-root/usr/local/include/sycl/queue.hpp:119 [inlined]
~syclQueue_st at /workspace/srcdir/oneAPI.jl/deps/src/sycl.hpp:19 [inlined]
syclQueueDestroy at /workspace/srcdir/oneAPI.jl/deps/src/sycl.cpp:60
syclQueueDestroy at /home/tim/Julia/pkg/oneAPI/lib/support/liboneapi_support.jl:58 [inlined]
maleadt commented 2 months ago

Looks like setting SYCL_PI_LEVEL_ZERO_BATCH_SIZE from __init__ doesn't work, as we've already loaded MKL then and SYCL doesn't re-parse the environment variable (which makes sense). As a result, there's outstanding SYCL operations that aren't materialized by our synchronize().

@pengtu Is there perhaps a different way to change the batch size? It would require a bit of engineering across the Julia BinaryBuilder stack to make it possible to set env vars when the library is dlopened.