[SYCL] Fluid density relaxation

nR3D commented 10 months ago

Implemented the BaseIntegration2ndHalfWithWall kernel class to enable device execution of density relaxation.

Here are the benchmark results with 80'000 particles:

SYCL fluid_density_relaxation computation: 0.27803 seconds
Reference fluid_density_relaxation computation: 13.0099 seconds
Speedup fluid_density_relaxation (main loop): 46.793x
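
For reference, the speedup figure is the reference wall time divided by the SYCL wall time. A minimal sketch of that arithmetic follows; the helper name `speedup` is hypothetical and is not taken from the SPHinXsys benchmark sources.

```cpp
// Hypothetical helper, not part of the SPHinXsys benchmark code.
// Speedup = reference (host) wall time / SYCL (device) wall time.
constexpr double speedup(double reference_seconds, double sycl_seconds)
{
    return reference_seconds / sycl_seconds;
}

// Timings reported above for fluid_density_relaxation (80'000 particles).
constexpr double density_relaxation_speedup = speedup(13.0099, 0.27803);
```

A value above 1 means the SYCL run was faster than the reference; a value below 1 means it was slower.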

Xiangyu-Hu commented 10 months ago

Great!

DrChiZhang commented 9 months ago

This is impressive. Well done.

DrChiZhang commented 9 months ago

@nR3D, hi, it seems that "execution_unit/execution_event.hpp" is missing in the latest branch "Xiangyu-Hu/sycl". Would you please double-check that?

nR3D commented 9 months ago

@ChiZhangatTUM execution_event.hpp contains a class that I have been prototyping for a while but never committed. If you are referring to the include in particle_iterators.h, I left it there by mistake and you can remove it for the time being. I will open another PR this week that includes this bugfix as well. Thank you for the heads-up.

DrChiZhang commented 9 months ago

Understood. Then, the inclusion can be removed.

DrChiZhang commented 9 months ago

@nR3D, I have compiled the sycl branch and have two things to check with you.

1. It seems that I have to change "base_data_type.h" line 158 to "struct DataTypeIndex<DeviceReal, std::enable_if<std::negation_v<std::is_same<Real, DeviceReal>>>>" for it to compile; otherwise the compiler reports the error "no type named 'type' in 'std::enable_if'; 'enable_if' cannot be used to disable this declaration".

2. The performance case reports the following speedups:

SYCL memory operations: 3.33334 seconds.

SYCL all methods computation: 0.82506 seconds
Reference all methods computation: 5.04647 seconds
Speedup all methods (main loop): 6.11649x

SYCL fluid_step_initialization computation: 0.0178942 seconds
Reference fluid_step_initialization computation: 0.0348679 seconds
Speedup fluid_step_initialization (main loop): 1.94855x

SYCL fluid_advection_time_step computation: 0.0488581 seconds
Reference fluid_advection_time_step computation: 0.0364404 seconds
Speedup fluid_advection_time_step (main loop): 0.745841x

SYCL fluid_density_by_summation computation: 0.153842 seconds
Reference fluid_density_by_summation computation: 0.179183 seconds
Speedup fluid_density_by_summation (main loop): 1.16472x

SYCL fluid_acoustic_time_step computation: 0.0496296 seconds
Reference fluid_acoustic_time_step computation: 0.0510392 seconds
Speedup fluid_acoustic_time_step (main loop): 1.0284x

SYCL fluid_pressure_relaxation computation: 0.446109 seconds
Reference fluid_pressure_relaxation computation: 1.99061 seconds
Speedup fluid_pressure_relaxation (main loop): 4.46216x

SYCL fluid_density_relaxation computation: 0.108484 seconds
Reference fluid_density_relaxation computation: 2.12183 seconds
Speedup fluid_density_relaxation (main loop): 19.559x

My environment:

export CC=icx
export CXX=icpx

cmake ../ -DCMAKE_BUILD_TYPE=Release -DSPHINXSYS_USE_FLOAT=ON -DSPHINXSYS_USE_SYCL=ON -DSPHINXSYS_SYCL_TARGETS=nvidia_gpu_sm_86

/usr/local/cuda-11.8/bin/nvcc

[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.6.0.22_223734]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700KF 3.0 [2023.16.6.0.22_223734]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3090 Ti 8.8 [CUDA 12.0]

Is the difference in speedup related to the CUDA version?

nR3D commented 9 months ago

@ChiZhangatTUM thank you for checking.

1. I am aware of this bug. To avoid it without touching the code, just disable `SPHINXSYS_USE_FLOAT`. The `enable_if` was originally meant to disable the `DataTypeIndex` specialization when `Real == DeviceReal` (otherwise the compiler would complain that `DataTypeIndex<DeviceReal>` is a redefinition of `DataTypeIndex<Real>`). Unfortunately, this seems to be a case in which `enable_if` cannot be used. I will investigate the cause further, but for the time being I will probably push a bugfix that relies on the `SPHINXSYS_USE_FLOAT` macro to catch the case in which both are floats.

2. Performance depends on your machine. My configuration is based on an Intel Xeon and an RTX 2080Ti, while you are using an Intel i7 and an RTX 3090Ti. If you ran the benchmark with the same number of particles as I did (the default case is 80'000), then for density relaxation your GPU computation takes roughly 0.4x the time of mine (0.108 s vs 0.278 s), but your CPU took about a sixth of mine (2.12 s vs 13.0 s), hence the lower speedup. The speedup will increase with a larger number of particles.
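
On the `enable_if` issue described above: in an explicit specialization the condition does not depend on any template parameter, so there is likely no substitution context for SFINAE and a false condition becomes a hard error. One possible workaround that avoids both the redefinition and a preprocessor check is to specialize on a placeholder type when the two precisions coincide. This is only a sketch under assumptions: `TypeIndex`, `UnusedDeviceReal`, and `DistinctDeviceReal` are hypothetical names, and the real `DataTypeIndex` in base_data_type.h may be laid out differently.

```cpp
#include <type_traits>

// Assumption: both precisions are float, as with SPHINXSYS_USE_FLOAT=ON.
using Real = float;
using DeviceReal = float;

// Hypothetical placeholder that can never collide with a real data type.
struct UnusedDeviceReal {};

// DeviceReal when it differs from Real, otherwise the placeholder.
using DistinctDeviceReal =
    std::conditional_t<std::is_same_v<Real, DeviceReal>, UnusedDeviceReal, DeviceReal>;

// Hypothetical stand-in for DataTypeIndex; -1 marks an unregistered type.
template <typename T>
struct TypeIndex
{
    static constexpr int value = -1;
};

template <>
struct TypeIndex<Real>
{
    static constexpr int value = 0;
};

// Never a redefinition: if Real == DeviceReal this specializes the placeholder.
template <>
struct TypeIndex<DistinctDeviceReal>
{
    static constexpr int value = 1;
};
```

With both aliases set to `float`, `TypeIndex<DeviceReal>` resolves to the `Real` specialization; if `Real` were `double`, `TypeIndex<DeviceReal>` would get its own index without any change to the code.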

DrChiZhang commented 9 months ago

The performance is due to the cuda version. CUDA 11.8 is not fully supported by SYCL. Thanks for sharing the info.

nR3D commented 9 months ago

> The performance is due to the cuda version. CUDA 11.8 is not fully supported by SYCL.

11.8 is the latest partially supported version. Both our GPUs support CUDA 12, but DPC++ only uses CUDA features up to 11.8, so I think we might be compiling with the same version. The only difference is our CUDA architecture (yours is sm_86, mine is sm_75).

Xiangyu-Hu / SPHinXsys