JuliaGPU / oneAPI.jl

Julia support for the oneAPI programming toolkit.
https://juliagpu.org/oneapi/
Other
182 stars 22 forks source link

When should we use __FORCE_MKL_FLUSH__ in the C interface? #401

Closed amontoison closed 6 months ago

amontoison commented 7 months ago

I don't understand why I have a segementation fault when I call some C functions that contains __FORCE_MKL_FLUSH__: https://github.com/JuliaGPU/oneAPI.jl/blob/master/deps/src/onemkl.cpp#L11-L12

I don't have anymore a segmentation fault when I remove __FORCE_MKL_FLUSH__ but it only concerns a few routines (geqrf -- LAPACK and set_csr_data -- SPARSE). Why don't we have the same behaviour with all routines? I use __FORCE_MKL_FLUSH__ after the routines that return void.

[5119] signal (11.1): Erreur de segmentation
in expression starting at REPL[7]:1
_ZNK4sycl3_V15event11get_backendEv at /home/alexis/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/conda/lib/libsycl.so.7 (unknown line)
_ZN4sycl3_V110get_nativeILNS0_7backendE2ENS0_5eventEEENS0_14backend_traitsIXT_EE11return_typeIT0_EERKS7_ at /home/alexis/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/deps/lib/liboneapi_support.so (unknown line)
onemklSgeqrf at /home/alexis/.julia/scratchspaces/8f75cd03-7ff8-4ecb-9b8f-daf728133b1b/deps/lib/liboneapi_support.so (unknown line)
onemklSgeqrf at /home/alexis/Bureau/git/oneAPI.jl/lib/support/liboneapi_support.jl:2473
unknown function (ip: 0x7fb6b8136699)
pengtu commented 7 months ago

The FORCE_MKL_FLUSH is used to make sure that the MKL task submitted to the SYCL queue has been dispatched. The SYCL runtime can temporarily hold a SYCL kernel without submitting it to the GPU driver (L0 driver in our case). The oneAPI.jl runtime works directly on L0 queue to synchronize between MKL SYCL function call and Julia statements. If a MKL kernel was held by the SYCL runtime and oneAPI.jl runtime calls zeQueueSynchronize() to wait for the MKL kernel to finish, they will be out of order. Hence, we call FORCE_MKL_FLUSH to make sure the SYCL kernel has been submitted to the L0 queue.

The FORCE_MKL_FLUSH(cmd) calls sycl::get_native<sycl::backend::ext_oneapi_level_zero(cmd) supposes to take a SYCL event returned by the MKL function as 'cmd'. If the MKL function doesn't return an event, it segfaults.

amontoison commented 7 months ago

Thanks @pengtu! The issue is how to be sure that the "usm" version and not the "buffer" version of a routine is used in the C interface? For example with geqrf here, we have the same parameters if we don't provide the argument events: documentation of geqrf.

Should we provide an empty list {} as a last parameter to the MKL routines to be sure that the usm version is used and we can call FORCE_MKL_FLUSH?

pengtu commented 7 months ago

@amontoison: Indeed that the C wrapper might have been invoking the "buffer" version. Please try passing an empty list {} as the last argument to be sure that the 'usm' version is invoked.

amontoison commented 7 months ago

@pengtu Should we only wrap the "usm" version if both version are available?

pengtu commented 6 months ago

@pengtu Should we only wrap the "usm" version if both version are available?

Yes, we shall always call the "usm" version of the oneMKL since Julia directly allocate the device array without using SYCL buffer interface.