Closed nR3D closed 9 months ago
- Implemented `BaseIntegration2ndHalfWithWalll` kernel class to enable device execution of density relaxation

Here are the benchmark results with 80'000 particles:

SYCL fluid_density_relaxation computation: 0.27803 seconds
Reference fluid_density_relaxation computation: 13.0099 seconds
Speedup fluid_density_relaxation (main loop): 46.793x
Great!
This is impressive. Well done.
@nR3D, hi, it seems that "execution_unit/execution_event.hpp" is missing in the latest branch "Xiangyu-Hu/sycl". Would you please double-check that?
@ChiZhangatTUM `execution_event.hpp` contains a class that I have been prototyping for a while but I never committed. If you are referring to `particle_iterators.h`, which includes it, then I left it by mistake and you can remove it for the time being. I will open another PR this week and I will add this bugfix too, thank you for the heads-up.
Understood. Then, the inclusion can be removed.
@nR3D, I have compiled the sycl branch and have two things to check with you.
1. It seems that I have to change "base_data_type.h" line 158 to the following
"struct DataTypeIndex<DeviceReal, std::enable_if<std::negation_v<std::is_same<Real, DeviceReal>>>>" for compilation; otherwise the compiler reports the error " no type named 'type' in 'std::enable_if
SYCL fluid_density_relaxation computation: 0.108484 seconds
Reference fluid_density_relaxation computation: 2.12183 seconds
Speedup fluid_density_relaxation (main loop): 19.559x "
My environment:
export CC=icx
export CXX=icpx
cmake ../ -DCMAKE_BUILD_TYPE=Release -DSPHINXSYS_USE_FLOAT=ON -DSPHINXSYS_USE_SYCL=ON -DSPHINXSYS_SYCL_TARGETS=nvidia_gpu_sm_86
/usr/local/cuda-11.8/bin/nvcc
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.6.0.22_223734]
[opencl:cpu:1] Intel(R) OpenCL, 13th Gen Intel(R) Core(TM) i7-13700KF 3.0 [2023.16.6.0.22_223734]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA GeForce RTX 3090 Ti 8.8 [CUDA 12.0]
Is the difference in speedup related to the CUDA version?
@ChiZhangatTUM thank you for checking

1. I am aware of this bug; to avoid it without touching the code, just disable `SPHINXSYS_USE_FLOAT`. Initially `enable_if` was meant to disable the `DataTypeIndex` specialization in the case in which `Real == DeviceReal` (otherwise the compiler would complain that `DataTypeIndex<DeviceReal>` is a redefinition of `DataTypeIndex<Real>`). Unfortunately this seems to be a case in which `enable_if` cannot be used. I will investigate the causes further, but for the time being I will probably push a bugfix that relies on the macro `SPHINXSYS_USE_FLOAT` to catch the case in which they are both floats.
2. Performance depends on your machine: my configuration is based on an Intel Xeon and an RTX 2080Ti, while you are using an Intel i7 and an RTX 3090Ti. If you ran the benchmark with the same number of particles as I did (the default case is 80'000), then your GPU computation takes a bit over a third of the time mine did, while your CPU took about a sixth of mine, hence the difference in speedup. The speedup will increase with a larger number of particles.
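For illustration only, here is a minimal sketch of one way the `Real == DeviceReal` clash could be avoided without a macro. It is not the code in `base_data_type.h`: the primary template, the `value` member, and the index numbers are simplified stand-ins assumed for this example. The idea is that an explicit (full) specialization has no template parameter left to deduce, so a false `enable_if` condition there is a hard error; in a partial specialization the condition depends on `T` and is silently discarded instead.

```cpp
#include <type_traits>

// Hypothetical stand-ins for the project's aliases. Both are float here,
// mimicking a build configured with SPHINXSYS_USE_FLOAT=ON.
using Real = float;
using DeviceReal = float;

// Simplified primary template (assumed shape, not the project's definition).
template <typename T, typename Enable = void>
struct DataTypeIndex
{
    static constexpr int value = -1; // type not registered
};

// Explicit specialization for the host floating-point type.
template <>
struct DataTypeIndex<Real>
{
    static constexpr int value = 0;
};

// Partial specialization for the device type. The condition depends on T,
// so when Real and DeviceReal are the same type it never matches and is
// dropped via SFINAE rather than redefining DataTypeIndex<Real>.
template <typename T>
struct DataTypeIndex<T, std::enable_if_t<std::is_same_v<T, DeviceReal> &&
                                         !std::is_same_v<T, Real>>>
{
    static constexpr int value = 1;
};

// With both aliases set to float, the device type resolves to the host entry.
static_assert(DataTypeIndex<Real>::value == 0);
static_assert(DataTypeIndex<DeviceReal>::value == 0);
static_assert(DataTypeIndex<int>::value == -1);
```

Changing `Real` to `double` in this sketch makes `DataTypeIndex<DeviceReal>::value` resolve to `1` through the partial specialization, with no redefinition in either configuration. Whether this fits the actual `DataTypeIndex` layout would need to be checked against the repository.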
The performance is due to the cuda version. CUDA 11.8 is not fully supported by SYCL. Thanks for sharing the info.
> The performance is due to the cuda version. CUDA 11.8 is not fully supported by SYCL.
11.8 is the latest partially supported version. Both our GPUs support CUDA 12, but DPC++ is using CUDA features up to 11.8.
I think we might be compiling using the same version; the only difference is our CUDA architecture (yours is `sm_86`, mine is `sm_75`).
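As a rough cross-check of the two benchmark runs quoted in this thread, the speedup gap can be split into a CPU factor and a GPU factor. This assumes both runs used the same 80'000-particle case and are otherwise comparable, which the thread does not confirm. Subscript $A$ is the Xeon + RTX 2080Ti run and $B$ is the i7-13700KF + RTX 3090 Ti run.

```latex
\begin{aligned}
S_A &= \frac{T^{\mathrm{cpu}}_A}{T^{\mathrm{gpu}}_A} = \frac{13.0099}{0.27803} \approx 46.79,
\qquad
S_B = \frac{T^{\mathrm{cpu}}_B}{T^{\mathrm{gpu}}_B} = \frac{2.12183}{0.108484} \approx 19.56, \\[4pt]
\frac{S_A}{S_B} &= \frac{T^{\mathrm{cpu}}_A / T^{\mathrm{cpu}}_B}{T^{\mathrm{gpu}}_A / T^{\mathrm{gpu}}_B}
\approx \frac{6.13}{2.56} \approx 2.39 .
\end{aligned}
```

In these numbers the CPU reference got about 6.1 times faster between the two machines while the GPU kernel got about 2.6 times faster, which accounts for the lower speedup figure in run $B$ without involving the CUDA version.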