Xiangyu-Hu / SPHinXsys

SPHinXsys provides C++ APIs for engineering simulation and optimization. It aims at complex systems driven by fluid, structure, multi-body dynamics and beyond. The multi-physics library is based on a unique and unified computational framework by which strong coupling has been achieved for all involved physics.
https://www.sphinxsys.org/
Apache License 2.0

Particle iterators using SYCL (initial implementation) #306

Closed nR3D closed 11 months ago

nR3D commented 11 months ago

Problem addressed

In order to run particle iterators in parallel using SYCL, some limitations need to be addressed:

  1. Any object used inside local_dynamics_function needs to be device copyable; calls to virtual methods are also not supported by SYCL
  2. References to objects initialized in host memory cannot be used inside local_dynamics_function, which means that the object parameter this cannot be captured and used as a reference, e.g. [&](size_t i){ this->update(i); }
  3. Each call to an iterator needs to request access to some sycl::buffers, passing the corresponding command-group handler cgh

Solution implemented so far

Each of these problems has been (partially) solved as follows:

  1. Objects used inside SYCL kernels (i.e. the code executed on the device) may contain data members that cannot easily be copied to the device, e.g. vectors, references and pointers. Therefore, a new "kernel" class needs to be written for every object that is not device copyable or whose virtual methods are being called. Each kernel class only contains sycl::accessors to the relevant data members of its corresponding base class (point 3 discusses how those accessors are initialized), or objects of other kernel classes. This ensures that every kernel class is device copyable and does not rely on virtual functions. Lastly, kernel classes need to implement the methods that substitute their base class methods. To avoid code duplication, a static method containing the shared logic can be implemented (using a template to pass different argument types depending on which class is calling it) and used to implement both the kernel and the base class.
  2. Kernel classes substitute the object parameter this in local_dynamics_function whenever particle iterators are called with the ExecutionPolicy corresponding to SYCL, while this is kept for every other ExecutionPolicy. To achieve this without changing every single LocalDynamics to a new interface, or forking code execution with an if/else, a class ExecutionProxy has been introduced. It decides at compile time (based on the given ExecutionPolicy) which object to pass to local_dynamics_function, which now has one more parameter in its signature. The proxy corresponding to the object is passed to the particle iterator, which uses it to get the right object (kernel or base class) and, when needed, to initialize the kernel class accessors.
  3. Classes that implement corresponding kernel classes don't need to manage buffers or implement custom proxy classes; they just need to initialize a DeviceVariable for every data member used on the device (which manages the corresponding sycl::buffer) and a DeviceProxy, which implements the ExecutionProxy discussed in point 2. Particle iterators pass the SYCL handler to the given DeviceProxy, which propagates it to every DeviceVariable; those DeviceVariables return the sycl::accessors that are then passed to the kernel class.
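To make the proxy mechanism of point 2 concrete, here is a heavily simplified, plain-C++ sketch of the idea. All names (Dynamics, DynamicsKernel, get_proxy, particle_for) are hypothetical stand-ins, and raw pointers stand in for the sycl::accessors the real implementation uses; the loop is sequential, not an actual SYCL kernel.

```cpp
#include <cstddef>
#include <type_traits>
#include <vector>

// Execution policies standing in for SPHinXsys' real ones.
struct ParallelPolicy {};
struct SYCLPolicy {};

// Device-copyable "kernel" counterpart: holds only a raw pointer (in the real
// code, a sycl::accessor) and uses no virtual methods.
struct DynamicsKernel {
    double *rho;
    // Shared logic in a static method, usable by both classes.
    static void update_impl(double *rho, std::size_t i) { rho[i] += 1.0; }
    void update(std::size_t i) const { update_impl(rho, i); }
};

// Host-side class: owns the data, not device copyable (std::vector member).
struct Dynamics {
    std::vector<double> rho;
    DynamicsKernel kernel;
    explicit Dynamics(std::size_t n) : rho(n, 0.0), kernel{rho.data()} {}
    void update(std::size_t i) { DynamicsKernel::update_impl(rho.data(), i); }
};

// Compile-time choice of which object the loop body receives.
template <class Policy>
auto &get_proxy(Dynamics &d) {
    if constexpr (std::is_same_v<Policy, SYCLPolicy>)
        return d.kernel;  // device-copyable kernel object
    else
        return d;         // plain host object, used like `this`
}

// Particle iterator: the loop body takes the proxied object as an extra
// parameter instead of capturing `this`.
template <class Policy, class F>
void particle_for(Dynamics &d, std::size_t n, F &&f) {
    auto &obj = get_proxy<Policy>(d);
    for (std::size_t i = 0; i < n; ++i)
        f(obj, i);  // sequential stand-in for the real parallel loop
}
```

The same generic lambda, e.g. [](auto &obj, size_t i){ obj.update(i); }, then runs against either the host object or the kernel object depending on the policy, without any runtime branch.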

This solution is admittedly cumbersome and will need to be restructured, but it ensures a working execution on SYCL devices without refactoring any unrelated class in the code-base that might otherwise conflict with the changes required by other possible solutions.

Run code

Only two LocalDynamics classes are implemented with corresponding kernel classes so far, and only the Dambreak test case is considered, so CMakeLists.txt has been modified to import only that test case.

In order to use SYCL, the following CMake flags need to be passed:

-DCMAKE_CXX_COMPILER=~/dpcpp/build-release/bin/clang++
-DCMAKE_CXX_FLAGS="-fsycl -fsycl-targets=nvptx64-nvidia-cuda --cuda-path=~/nvhpc_23.3/build/Linux_x86_64/23.3/cuda/11.0"

depending on where your DPC++ and CUDA installations are located (in addition to the other flags needed by SPHinXsys). At some point I will add custom flags for the SYCL targets and the CUDA path, instead of using CXX_FLAGS.

In addition, the path ~/dpcpp/build-release/lib needs to be added to the environment variable LD_LIBRARY_PATH, in case the provided CXX_COMPILER is located in a custom location.
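For example, assuming the DPC++ build directory used above (adjust the path to your own installation):

```shell
# Hypothetical path: point this at the lib/ directory of your DPC++ build
export LD_LIBRARY_PATH="$HOME/dpcpp/build-release/lib${LD_LIBRARY_PATH:+:$LD_LIBRARY_PATH}"
echo "$LD_LIBRARY_PATH"
```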

See the DPC++ installation guide for more information.

DrChiZhang commented 11 months ago

What's the speedup of your implementation?

nR3D commented 11 months ago

@ChiZhangatTUM

I don't currently have a speedup to report. This PR just lays a foundation: every LocalDynamics class (relevant to a given test case) will need to be adapted to run this version of the particle iterators. I have already adapted two of them (TimeStepInitialization and AdvectionTimeStepSizeForImplicitViscosity) to test particle_for and particle_reduce respectively. However, in order to keep running the rest of the code without SYCL, I had to copy all the data back and forth between CPU and GPU at every iteration, which results in a slowdown compared to a pure CPU execution. Once the entire Dambreak test case has been adapted for SYCL execution, all the iterations inside the outermost while-loop will run exclusively on the GPU.

I am currently looking for a way to loosen the need to adapt every LocalDynamics class, in order to get some insight into the speedup much sooner.

DrChiZhang commented 11 months ago

@nR3D I have tried to compile your code with the clang++ compiler; sycl-ls reports:

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.15.3.0.20_160000]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700KF 3.0 [2023.15.3.0.20_160000]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3090 Ti 0.0 [CUDA 12.0]

while encountering the following error:

In file included from /home/chi/MyCode/GitHub/sphinxsys-nr/src/for_2D_build/common/scalar_functions_supplementary.cpp:1:
In file included from /home/chi/MyCode/GitHub/sphinxsys-nr/src/shared/common/scalar_functions.h:31:
In file included from /home/chi/MyCode/GitHub/sphinxsys-nr/src/shared/common/base_data_type.h:41:
In file included from /home/chi/MyPrefix/vcpkg/installed/x64-linux/include/eigen3/Eigen/Core:171:
/home/chi/MyPrefix/vcpkg/installed/x64-linux/include/eigen3/Eigen/src/Core/MathFunctions.h:986:13: error: no member named 'isfinite' in the global namespace; did you mean 'std::isfinite'?
    return (::isfinite)(x);
/usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/cmath:562:3: note: 'std::isfinite' declared here
    isfinite(float __x)

which seems related to eigen3. Any clue about this error?

nR3D commented 11 months ago

@ChiZhangatTUM

No, I have never encountered that error. What flags are you passing to CMake?

> /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/cmath:562:3: note: 'std::isfinite' declared here
>     isfinite(float __x)

It seems to be including from GCC-11, while my installation points to GCC-9, but I don't know whether that's relevant to the issue. What's the output of clang++ -v?



DrChiZhang commented 11 months ago

> @ChiZhangatTUM No, never encountered the same error. What flags are you passing to CMake?
>
> /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/cmath:562:3: note: 'std::isfinite' declared here
>     isfinite(float __x)
>
> It seems to be including from GCC-11, while my installation is pointing to GCC-9, but I don't know if that's relevant to the issue. What's the output of clang++ -v?

clang++ -v:

Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320)
Target: x86_64-unknown-linux-gnu
Thread model: posix
InstalledDir: /opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/11
Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/12
Selected GCC installation: /usr/lib/gcc/x86_64-linux-gnu/12
Candidate multilib: .;@m64
Selected multilib: .;@m64
Found CUDA installation: /usr/lib/cuda, version 11.5

I also tried the icpx compiler; the same error is reported.

cxx flags:

if(SPHINXSYS_USE_SYCL)
    message("-- SPHinXsys use SYCL")
    target_compile_options(sphinxsys_core INTERFACE $<$<BOOL:${SPHINXSYS_USE_SYCL}>:-fsycl -fsycl-targets=nvptx64-nvidia-cuda>)
endif()

The eigen code where the error is reported:

template<typename T>
EIGEN_DEVICE_FUNC
typename internal::enable_if<(!internal::is_integral<T>::value)&&(!NumTraits<T>::IsComplex),bool>::type
isinf_impl(const T& x)
{
  #if defined(EIGEN_GPU_COMPILE_PHASE)
    return (::isinf)(x);
  #elif EIGEN_USE_STD_FPCLASSIFY
    using std::isinf;
    return isinf EIGEN_NOT_A_MACRO (x);
  #else
    return x>NumTraits<T>::highest() || x<NumTraits<T>::lowest();
  #endif
}

nR3D commented 11 months ago

> @ChiZhangatTUM No, never encountered the same error. What flags are you passing to CMake?
>
> /usr/lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/cmath:562:3: note: 'std::isfinite' declared here isfinite(float __x)
>
> It seems to be including from GCC-11, while my installation is pointing to GCC-9, but I don't know if that's relevant to the issue. What's the output of clang++ -v?
>
> Clang -v: Intel(R) oneAPI DPC++/C++ Compiler 2023.1.0 (2023.1.0.20230320) Target: x86_64-unknown-linux-gnu Thread model: posix InstalledDir: /opt/intel/oneapi/compiler/2023.1.0/linux/bin-llvm Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/11 Found candidate GCC installation: /usr/lib/gcc/x86_64-linux-gnu/12 Selected GCC installation: /usr/lib/gcc/x86_64-linux-gnu/12 Candidate multilib: .;@m64 Selected multilib: .;@m64 Found CUDA installation: /usr/lib/cuda, version 11.5
>
> I also tried icpx compiler, same error reported.

I just replicated the error with the same DPC++ version (2023.1), while the version I was using (built from the git source) does not present the same problem.

> The eigen code where the error is reported: template<typename T> EIGEN_DEVICE_FUNC typename internal::enable_if<(!internal::is_integral<T>::value)&&(!NumTraits<T>::IsComplex),bool>::type isinf_impl(const T& x) { #if defined(EIGEN_GPU_COMPILE_PHASE) return (::isinf)(x); #elif EIGEN_USE_STD_FPCLASSIFY using std::isinf; return isinf EIGEN_NOT_A_MACRO (x); #else return x>NumTraits<T>::highest() || x<NumTraits<T>::lowest(); #endif }

It seems to be the EIGEN_GPU_COMPILE_PHASE macro; strange that the newer DPC++ version solves the problem, I'll check whether they added a bugfix lately. Eigen is probably using the global namespace because it expects to be compiled with nvcc or some other GPU compiler, but SYCL is not compiling that part for the GPU, hence the error, I suppose.

DrChiZhang commented 11 months ago

I solved this error by adding #include <math.h> before the other includes in base_data_type.h, but the compilation then failed due to: undefined reference to `sycl::_V1::queue::submit_impl(std::function<void (sycl::_V1::handler&)>, sycl::_V1::detail::code_location const&)'
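The include order matters here because Eigen's MathFunctions.h calls (::isfinite)(x), i.e. it looks the name up in the global namespace, while some standard library / compiler combinations only declare std::isfinite. A minimal illustration of the workaround (plain C++, no SYCL; the function name is hypothetical, for illustration only):

```cpp
#include <math.h>  // C header: makes ::isfinite visible in the global namespace

// With only <cmath>, ::isfinite may be absent on some toolchains, which is
// what triggered Eigen's "no member named 'isfinite' in the global namespace"
// error; including <math.h> first makes the global-namespace call well-formed.
inline bool globally_finite(double x) { return (::isfinite)(x); }
```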

nR3D commented 11 months ago

> I solved this error by adding #include <math.h> before the other includes in base_data_type.h, but the compilation then failed due to: undefined reference to `sycl::_V1::queue::submit_impl(std::function<void (sycl::_V1::handler&)>, sycl::_V1::detail::code_location const&)'

Yes, math.h solves the problem, but I don't get the same reference error; on my side it compiles and executes fine after adding that include. Yours seems to be a linking problem with the SYCL library.

> if(SPHINXSYS_USE_SYCL) message("-- SPHinXsys use SYCL") target_compile_options(sphinxsys_core INTERFACE $<$<BOOL:${SPHINXSYS_USE_SYCL}>:-fsycl -fsycl-targets=nvptx64-nvidia-cuda>) endif()

Are you sure those flags are getting set properly?


DrChiZhang commented 11 months ago

> I solve this error by adding #include <math.h> before the other includes in base_data_type.h, while the compilation failed due to: undefined reference to `sycl::_V1::queue::submit_impl(std::function<void (sycl::_V1::handler&)>, sycl::_V1::detail::code_location const&)'
>
> Yes, math.h solves the problem, but I don't have the same reference error, on my side it's compiling and executing fine after adding that include. Yours seems to be a linking problem with the sycl library.
>
> if(SPHINXSYS_USE_SYCL) message("-- SPHinXsys use SYCL") target_compile_options(sphinxsys_core INTERFACE $<$<BOOL:${SPHINXSYS_USE_SYCL}>:-fsycl -fsycl-targets=nvptx64-nvidia-cuda>) endif()
>
> Are you sure those flags are getting set properly?

It seems that with target_compile_options the flags cannot be set properly; moving them after the cmake command works fine. I will check later.

The following modification works fine:

if(SPHINXSYS_USE_SYCL)
    message("-- SPHinXsys use SYCL")
    target_compile_options(sphinxsys_core INTERFACE $<$<BOOL:${SPHINXSYS_USE_SYCL}>:-fsycl -fsycl-targets=nvidia_gpu_sm_86>)
    target_link_options(sphinxsys_core INTERFACE -fsycl -fsycl-targets=nvidia_gpu_sm_86)
endif()

nR3D commented 11 months ago

I can add a new commit to include those targets inside CMakeLists, for example:

if(SPHINXSYS_SYCL_TARGETS)
    message("-- SPHinXsys use SYCL")
    target_compile_options(sphinxsys_core INTERFACE -fsycl -fsycl-targets=${SPHINXSYS_SYCL_TARGETS})
    target_link_options(sphinxsys_core INTERFACE -fsycl -fsycl-targets=${SPHINXSYS_SYCL_TARGETS})
endif()

Is it better to have a single SPHINXSYS_SYCL_TARGETS variable, or two separate SPHINXSYS_USE_SYCL and SPHINXSYS_SYCL_TARGETS?

DrChiZhang commented 11 months ago

> I can add a new commit to include those targets inside CMakeLists, for example:
>
>     if(SPHINXSYS_SYCL_TARGETS)
>         message("-- SPHinXsys use SYCL")
>         target_compile_options(sphinxsys_core INTERFACE -fsycl -fsycl-targets=${SPHINXSYS_SYCL_TARGETS})
>         target_link_options(sphinxsys_core INTERFACE -fsycl -fsycl-targets=${SPHINXSYS_SYCL_TARGETS})
>     endif()
>
> Is it better to have a single SPHINXSYS_SYCL_TARGETS variable, or two separate SPHINXSYS_USE_SYCL and SPHINXSYS_SYCL_TARGETS?

Two variables would be better. If SPHINXSYS_SYCL_TARGETS is not set, use a default, for example nvptx64-nvidia-cuda.
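A sketch of that two-variable setup (assuming the same sphinxsys_core target as above; SPHINXSYS_USE_SYCL gates the feature, and SPHINXSYS_SYCL_TARGETS overrides the suggested default):

```cmake
option(SPHINXSYS_USE_SYCL "Enable SYCL offloading" OFF)
set(SPHINXSYS_SYCL_TARGETS "nvptx64-nvidia-cuda" CACHE STRING "SYCL offload targets")

if(SPHINXSYS_USE_SYCL)
    message("-- SPHinXsys use SYCL")
    target_compile_options(sphinxsys_core INTERFACE -fsycl -fsycl-targets=${SPHINXSYS_SYCL_TARGETS})
    target_link_options(sphinxsys_core INTERFACE -fsycl -fsycl-targets=${SPHINXSYS_SYCL_TARGETS})
endif()
```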

DrChiZhang commented 11 months ago


@FabienPean-Virtonomy Hi Fabien, I am a little confused by the link options added here; would you please review this commit?

DrChiZhang commented 11 months ago

@FabienPean-Virtonomy Great review, Fabien. I would like to go through all the conversations you and Alberto have had. Thanks

Xiangyu-Hu commented 11 months ago

I merge this branch to document the progress.