Measure Performance without OpenGL

vitduck commented 1 year ago

Hello,

I've successfully built the CUDA version of the code.

Is it possible to measure performance without relying on OpenGL or Xvfb? In a public supercomputer environment, it is very difficult to request installation of the dependencies required for running the test.

In the case of GLM, GLFW, and GLEW, I was able to install them locally via Spack. However, as mentioned in the repo, running the code through X-tunneling is not recommended.
In the case of Xvfb, it is part of Xorg, and we cannot request installation of these packages in a shared environment.

Also, when running the CUDA version, the following error is generated on CentOS 7.9:

$ ./nbody_cuda
Error id : 65543, GLX: Failed to create context: 161

I would appreciate it if you could provide some suggestions to circumvent these issues.

Regards.

DuncanMcBain commented 1 year ago

Hi @vitduck,

as a temporary solution it might be possible to use the solution described here instead of messing around with the X virtual framebuffer stuff, though I haven't tried personally. It should be possible to compile Mesa and LLVMPipe without requiring that they are installed to the system.

We don't have any quick fixes for removing the graphical dependency but it's something we're considering doing in some fashion. It might be possible to simply remove the OpenGL code from the main file, though I think if we pick this task up I'd like to make a second target which builds from a separate main file that has no graphical component.
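A minimal sketch of what a headless entry point for such a second target could look like. The `Simulator`, `RunningStats`, and `time_steps` names are hypothetical placeholders and are not taken from the repository; the point is only that the timing loop needs no window, GL context, or draw call.

```cpp
#include <chrono>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the simulation; not the repository's class.
struct Simulator {
  std::vector<float> x, v;
  explicit Simulator(std::size_t n) : x(n, 0.0f), v(n, 1.0f) {}
  // Advance every particle by one time step (placeholder physics).
  void step(float dt) {
    for (std::size_t i = 0; i < x.size(); ++i) x[i] += v[i] * dt;
  }
};

// Running mean and sample standard deviation (Welford's algorithm).
struct RunningStats {
  std::size_t n = 0;
  double mean = 0.0, m2 = 0.0;
  void add(double sample) {
    ++n;
    const double delta = sample - mean;
    mean += delta / static_cast<double>(n);
    m2 += delta * (sample - mean);
  }
  // Zero until at least two samples have been recorded.
  double stddev() const {
    return n > 1 ? std::sqrt(m2 / static_cast<double>(n - 1)) : 0.0;
  }
};

// Time `steps` simulation steps in milliseconds, with no rendering involved.
RunningStats time_steps(Simulator& sim, int steps, float dt) {
  RunningStats stats;
  for (int s = 0; s < steps; ++s) {
    const auto t0 = std::chrono::steady_clock::now();
    sim.step(dt);
    const auto t1 = std::chrono::steady_clock::now();
    stats.add(std::chrono::duration<double, std::milli>(t1 - t0).count());
  }
  return stats;
}
```

A headless `main` would then only construct the simulator, call `time_steps`, and print `mean` and `stddev()`. For a GPU backend the timed region would also need to wait for the kernel to finish before reading the clock.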

Duncan.

vitduck commented 1 year ago

Hi Duncan,

Thanks for your reply.

I do agree that a second target without a graphical component is better than removing OpenGL altogether. Looking at the code, it seems that the rendering is strongly coupled with the simulation part, so I am not sure it is worth the effort on your end to isolate it.

For now, I will set up a Linux box to test the code.

DuncanMcBain commented 10 months ago

Hi @vitduck,

We have a PR open that should fix this issue (#30).

I hope this helps!

vitduck commented 10 months ago

Hi @DuncanMcBain, thanks very much for the notice.

I am testing the latest commit as follows:

$ module purge 
$ module load cuda/10.1 
$ sh scripts/build_cuda.sh no_render 
-- The CXX compiler identification is GNU 4.8.5
-- Check for working CXX compiler: /usr/bin/c++
-- Check for working CXX compiler: /usr/bin/c++ -- works
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- The CUDA compiler identification is NVIDIA 10.1.243
-- Check for working CUDA compiler: /apps/cuda/10.1/bin/nvcc
-- Check for working CUDA compiler: /apps/cuda/10.1/bin/nvcc -- works
-- Detecting CUDA compiler ABI info
-- Detecting CUDA compiler ABI info - done
-- Found PkgConfig: /usr/bin/pkg-config (found version "0.27.1") 
-- Looking for C++ include pthread.h
-- Looking for C++ include pthread.h - found
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Failed
-- Looking for pthread_create in pthreads
-- Looking for pthread_create in pthreads - not found
-- Looking for pthread_create in pthread
-- Looking for pthread_create in pthread - found
-- Found Threads: TRUE  
-- Found CUDA: /apps/cuda/10.1 (found version "10.1") 
-- Configuring done
-- Generating done
CMake Warning:
  Manually-specified variables were not used by the project:

    GLEW_LIBRARY

-- Build files have been written to: /scratch/optpar01/work/2024/cuda-to-sycl-nbody/build_cuda
Scanning dependencies of target nbody_cuda
[ 25%] Building CXX object src/CMakeFiles/nbody_cuda.dir/nbody.cpp.o
[ 50%] Building CXX object src/CMakeFiles/nbody_cuda.dir/sim_param.cpp.o
[ 75%] Building CUDA object src/CMakeFiles/nbody_cuda.dir/simulator.cu.o
[100%] Linking CXX executable ../../nbody_cuda
[100%] Built target nbody_cuda
Scanning dependencies of target release
[100%] Built target release

So OpenGL libs are no longer required!

However, I encountered the following error when running the compiled binary:

$ ./scripts/run_nbody.sh -b cuda 100 10  
GPUassert: initialization error /scratch/optpar01/work/2024/cuda-to-sycl-nbody/src/simulator.cuh 94

Looking at the relevant line of simulator.cuh, it is just a standard cudaMalloc:

 92   ParticleData_d(size_t n) {
 93     // Allocate device memory for particle coords & velocity...
 94     gpuErrchk(cudaMalloc((void **)&x, sizeof(coords_t) * n));
 95     gpuErrchk(cudaMalloc((void **)&y, sizeof(coords_t) * n));
 96     gpuErrchk(cudaMalloc((void **)&z, sizeof(coords_t) * n));
 97   };

I tried a smaller system size as well, but the error persists (we have 40 GB of memory). Do you have any insight into this issue?

DuncanMcBain commented 10 months ago

Hi @vitduck,

We won't really be able to help with the pure CUDA version of the code (we didn't write it), but if you're able to try the SYCL version we'd be happy to help with that!

vitduck commented 10 months ago

Duncan, sorry for the oversight on my part. The aforementioned CUDA error was caused by the MIG partition. Both the CUDA and the SYCL-migrated code can now be built and run without rendering.

Could you kindly confirm whether the following output is expected? (If I understand correctly, the kernel time is measured in ms.)

Backend enumeration

$ sycl-ls 
[opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device 1.2 [2023.16.6.0.22_223734]
[opencl:cpu:1] Intel(R) OpenCL, AMD EPYC 7543 32-Core Processor                 3.0 [2023.16.6.0.22_223734]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA A100-SXM4-80GB 8.8 [CUDA 11.6]

CUDA performance

$ ./nbody_cuda 50 10 0.999998 0.005 1.0e-7 2 10000
... 
At step 10000 kernel time is 15.4361 and mean is 15.435 and stddev is: 0.0853953

SYCL/CUDA performance
```
$ SYCL_DEVICE_FILTER=cuda ./nbody_dpcpp 50 10 0.999998 0.005 1.0e-7 2 10000
...
At step 10000 kernel time is 8.60655 and mean is 8.60897 and stddev is: 0.0694211
```
I would have expected some level of parity between native CUDA and SYCL, with a slight edge for the former. Here, the result unexpectedly shows that SYCL/CUDA is almost two times faster (about 1.8x, going by the mean kernel times). I am not sure how to interpret this outcome.
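For reference, the ratio implied by the two reported means can be checked directly. The helpers below are a hypothetical sketch (the function names are illustrative, and the millisecond unit follows the assumption stated above); they compute the speedup and express the gap between the means in units of the larger standard deviation.

```cpp
// Ratio of two mean kernel times; a value above 1 means the candidate is faster.
double speedup(double baseline_ms, double candidate_ms) {
  return baseline_ms / candidate_ms;
}

// Gap between the two means in units of the larger standard deviation,
// a rough indication of whether the gap exceeds run-to-run noise.
double gap_in_stddevs(double mean_a, double sd_a, double mean_b, double sd_b) {
  const double gap = mean_a > mean_b ? mean_a - mean_b : mean_b - mean_a;
  const double sd = sd_a > sd_b ? sd_a : sd_b;
  return gap / sd;
}
```

With the numbers reported above, `speedup(15.435, 8.60897)` is about 1.79, and `gap_in_stddevs(15.435, 0.0853953, 8.60897, 0.0694211)` is roughly 80, so the difference is far larger than the step-to-step variation within either run.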

DuncanMcBain commented 10 months ago

Hi @vitduck, we have a section in the README (the last one) that covers performance. Back when we were working on this, we managed to get roughly the same results for CUDA and SYCL on a 3060 GPU. The software stack has changed since then, so it's hard to say exactly what might now be different.

I'll check with a colleague; we might be able to send you some of our updated numbers. You could also run the NVIDIA Nsight Compute profiling tool on both builds to see if anything obvious stands out.

codeplaysoftware / cuda-to-sycl-nbody

Measure Performance without OpenGL #28