RenderKit / ospray

An Open, Scalable, Portable, Ray Tracing Based Rendering Engine for High-Fidelity Visualization
http://ospray.org
Apache License 2.0

Rendering not using all threads? #377

Closed paulmelis closed 4 years ago

paulmelis commented 4 years ago

I just noticed that rendering uses only roughly 7 of the 8 cores in my system, as reported by htop and the system monitor graphs. I see this with ospExamples on the default scene, but also with BLOSPRAY. On the boxes scene of ospExamples it's even worse, at 60% usage. Until now I did not pay much attention to CPU usage. Has it always been the case that not all CPU cores are fully used, or is this a regression?

Detection of the number of cores seems fine, btw:

$ ~/software/ospray-superbuild-git/bin/ospExamples --osp:debug 
Embree Ray Tracing Kernels 3.6.1 (d17aec1491a6c664f4da2f0b8ba03c171cbf36a1)
  Compiler  : Intel Compiler 17.0.1
  Build     : Release 
  Platform  : Linux (64bit)
  CPU       : Unknown CPU (GenuineIntel)
   Threads  : 8
   ISA      : XMM YMM SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 POPCNT AVX F16C RDRAND AVX2 FMA3 LZCNT BMI1 BMI2 
   Targets  : SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 AVX AVXI AVX2 
   MXCSR    : FTZ=1, DAZ=1
  Config
    Threads : 1
    ISA     : XMM YMM SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 POPCNT AVX F16C RDRAND AVX2 FMA3 LZCNT BMI1 BMI2 
    Targets : SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 AVX AVXI AVX2  (supported)
              SSE2 SSE4.2 AVX AVX2 AVX512KNL AVX512SKX  (compile time enabled)
    Features: intersection_filter 
    Tasking : TBB2019.0 TBB_header_interface_11002 TBB_lib_interface_11006 

general:
  build threads      = 1
  build user threads = 0
  start_threads      = 0
  affinity           = 0
  frequency_level    = simd256
  hugepages          = enabled
  verbosity          = 2
  cache_size         = 134.218 MB
  max_spatial_split_replications = 2

This is with a superbuild of release-2.0.x (754d790d2b63ee7645a31c1d6878418d9db0456a) with this configuration:

#!/bin/sh
cmake \
    -DCMAKE_INSTALL_PREFIX=$HOME/software/ospray-superbuild-git \
    -DCMAKE_BUILD_TYPE=Release \
    -DBUILD_JOBS=5 \
    -DBUILD_EMBREE_FROM_SOURCE=OFF \
    -DBUILD_OIDN=ON \
    -DBUILD_OIDN_FROM_SOURCE=OFF \
    -DINSTALL_IN_SEPARATE_DIRECTORIES=OFF \
    ../scripts/superbuild
tanolino commented 4 years ago

I can second that. I tested on a 4-core Intel with SMT, a 4-core Intel without SMT, and a 6-core AMD with SMT.

I was using:

It seems that it leaves exactly one CPU core free.


Embree Ray Tracing Kernels 3.9.0 (87b805ed617c58dc3467bc6192c70fa97dbc97bd)
  Compiler  : Intel Compiler 17.0.8
  Build     : Release
  Platform  : Windows (64bit)
  CPU       : Unknown CPU (AuthenticAMD)
   Threads  : 12
   ISA      : XMM YMM SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 POPCNT AVX F16C RDRAND AVX2 FMA3 LZCNT BMI1 BMI2
   Targets  : SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 AVX AVXI AVX2
   MXCSR    : FTZ=1, DAZ=1
  Config
    Threads : default
    ISA     : XMM YMM SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 POPCNT AVX F16C RDRAND AVX2 FMA3 LZCNT BMI1 BMI2
    Targets : SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 AVX AVXI AVX2  (supported)
              SSE2 SSE4.2 AVX AVX2 AVX512SKX  (compile time enabled)
    Features: intersection_filter
    Tasking : TBB2019.9 TBB_header_interface_11009 TBB_lib_interface_11102

general:
  build threads      = 0
  build user threads = 0
  start_threads      = 0
  affinity           = 1
  frequency_level    = simd256
  hugepages          = disabled
  verbosity          = 2
  cache_size         = 134.218 MB
  max_spatial_split_replications = 1.2
paulmelis commented 4 years ago

In the 2.1.0 changelog there is a line

Fix issue with OSPRay ignoring tasking system thread count settings

I was hoping that had fixed this issue, but apparently not.

miroslawpawlowski commented 4 years ago

So far, I have not been able to reproduce a situation where OSPRay does not use all logical CPU cores. Could you change your testing procedure and see whether the problem persists?

The ospExamples application wasn't designed to use all available CPU time. Its frame rate is always synced to VSYNC (60 Hz), so it can't schedule more than 60 ospRenderFrame() calls per second. If OSPRay takes less than 1/60 of a second to render a frame, the CPU sits idle for the remainder. Interestingly, you may see values above 60 fps in the window title bar, but that number is calculated as the reciprocal of the OSPRay rendering time. For example, 100 fps in the title bar doesn't mean that OSPRay rendered 100 frames per second. It still rendered no more than 60 frames per second, but each frame took 1/100 of a second to render. For the rest of each interval (1/60 – 1/100 of a second) the CPU was idle.
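The arithmetic above can be written out as a small sketch. `estimateVsync` and `VsyncEstimate` are hypothetical names introduced here for illustration and are not part of OSPRay; the model also assumes, as a simplification, that a frame slower than one refresh interval leaves no idle time at all.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical helper (not an OSPRay API): estimates how much of each
// display refresh interval is spent rendering when frame submission is
// capped by VSYNC.
struct VsyncEstimate
{
  double framesPerSecond; // frames actually rendered per second
  double idleSeconds;     // idle time per frame period
  double busyFraction;    // share of each frame period spent rendering
};

VsyncEstimate estimateVsync(double renderSeconds, double refreshHz = 60.0)
{
  assert(renderSeconds > 0.0 && refreshHz > 0.0);
  const double interval = 1.0 / refreshHz;
  // Simplifying assumption: if rendering takes longer than one refresh
  // interval, the frame period equals the render time (no idle time).
  const double period = std::max(interval, renderSeconds);
  return {1.0 / period, period - renderSeconds, renderSeconds / period};
}
```

For a render time of 1/100 s at 60 Hz, this model gives 60 rendered frames per second, roughly 6.7 ms of idle time per frame, and a busy fraction of 60%, even though the title bar would show 100 fps.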

This all means that ospExamples is not well suited for benchmarking, especially when workloads are relatively light.

To test whether OSPRay really uses all cores, I would recommend using ospBenchmark, or at least setting a demanding workload in ospExamples, e.g. cornell_box/pathtracer/spp=64. In general, aim for a frame rate below 5 fps. On Windows, it is also sometimes possible to disable VSYNC for OpenGL applications in the graphics driver settings.

paulmelis commented 4 years ago

Here's with the 2.2.0 official binaries on the ospExamples cornell scene, pathtracer, 31 samples per pixel. FPS is easily below 5. There's always one CPU core that is underutilized:

[Screenshot at 2020-08-29 09-37-11]

I'm having some trouble using ospBenchmark due to its continuous switching between tests. Is there a way to select just one test to run?

Debug info:

melis@juggle 09:37:~/software/ospray-2.2.0.x86_64.linux/bin$ ./ospExamples --osp:debug

Embree Ray Tracing Kernels 3.11.0 (cc8c3aac00cdb63fb1acb44bcd0fa89a17139533)
  Compiler  : Intel Compiler 19.10.0
  Build     : Release 
  Platform  : Linux (64bit)
  CPU       : Haswell (GenuineIntel)
   Threads  : 4
   ISA      : XMM YMM SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 POPCNT AVX F16C RDRAND AVX2 FMA3 LZCNT BMI1 BMI2 
   Targets  : SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 AVX AVXI AVX2 
   MXCSR    : FTZ=1, DAZ=1
  Config
    Threads : default
    ISA     : XMM YMM SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 POPCNT AVX F16C RDRAND AVX2 FMA3 LZCNT BMI1 BMI2 
    Targets : SSE SSE2 SSE3 SSSE3 SSE4.1 SSE4.2 AVX AVXI AVX2  (supported)
              SSE2 SSE4.2 AVX AVX2 AVX512SKX  (compile time enabled)
    Features: intersection_filter 
    Tasking : TBB2020.2 TBB_header_interface_11102 TBB_lib_interface_11102 
miroslawpawlowski commented 4 years ago

We have made some improvements to our threading system that will be included in the next release. Performance has improved and full CPU saturation is now achieved in all applications that call ospWait() after scheduling a frame with ospRenderFrame(); for example, this can be observed with ./ospBenchmark --benchmark_filter=cornell_box/spp_256/pathtracer.

The change will have no effect in ospExamples, which has a GUI thread that mostly waits on OpenGL glSwapBuffers(). The only ways to improve CPU utilization in that case are to spawn more threads (oversubscription) or to give up on GUI interactivity. We don't want to do either of these.

paulmelis commented 2 years ago

Sorry to reply to a closed issue (hint: enable the Discussions section of a GitHub project, it works really well ;-)). Just to be sure (since I'm running into an under-used CPU core again): the only way to get all CPU cores busy is to use ospWait() on the future? And increasing the number of render threads, e.g. with OSPRAY_NUM_THREADS, is not going to improve CPU utilization?

miroslawpawlowski commented 2 years ago

Correct. By calling ospWait(), the current thread's CPU time is given back to the OSPRay threading subsystem, where it can be used for rendering. Creating more threads than there are logical cores is suboptimal: the threads have to compete for CPU time, which causes a lot of thread switching. Nevertheless, it may sometimes help with CPU utilization.
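One way to read the observations reported in this thread (about 7 of 8 cores busy, "exactly one core free") alongside this explanation is as a simple counting argument. The sketch below assumes, purely for illustration, that the tasking system runs one worker thread fewer than there are logical cores and that the application thread only contributes to rendering while it is blocked in ospWait(). `maxBusyFraction` is a hypothetical helper, not an OSPRay function, and the model ignores oversubscription and any other load on the machine.

```cpp
#include <cassert>

// Hypothetical model (an assumption, not OSPRay's documented behavior):
// with P logical cores there are P - 1 worker threads, and the calling
// thread acts as the P-th participant only while it is waiting on the frame.
double maxBusyFraction(int logicalCores, bool callerJoinsRendering)
{
  assert(logicalCores > 0);
  const int busyCores = callerJoinsRendering ? logicalCores : logicalCores - 1;
  return static_cast<double>(busyCores) / logicalCores;
}
```

Under these assumptions, 8 logical cores with an application thread that never waits give an upper bound of 7/8 = 87.5% utilization, and the bound rises to 100% once the caller joins the rendering work.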

If you see this problem with the ospExamples app, let us know how we can reproduce it and we will reopen the issue.

paulmelis commented 2 years ago

I don't see an issue with the example. But in my own app I don't want to call ospWait() as I'd like to do (lightweight) processing on incoming network messages instead. I really like the asynchronous API OSPRay offers and don't want to move (back) to a setup in which I have a separate render handling thread that does call ospWait().
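The setup described here, where the application thread keeps handling messages while a frame renders asynchronously, might be sketched as a polling loop. The template below is a generic stand-in that does not depend on OSPRay: `pollWhileWorking`, `isReady`, and `handleMessages` are hypothetical names, and the commented usage shows where `ospIsReady()` on the future returned by `ospRenderFrame()` would plug in. Following the explanation earlier in the thread, a loop like this keeps the application thread out of the rendering work, so it would not be expected to remove the one under-used core.

```cpp
#include <cassert>

// Generic sketch: do lightweight application work between readiness checks.
// Returns how many times handleMessages was invoked before the frame was ready.
template <typename IsReady, typename HandleMessages>
int pollWhileWorking(IsReady isReady, HandleMessages handleMessages, int maxPolls = 1000000)
{
  assert(maxPolls >= 0);
  int polls = 0;
  while (polls < maxPolls && !isReady()) {
    // In a real application this would typically block briefly on the
    // socket (with a timeout) rather than spin.
    handleMessages();
    ++polls;
  }
  return polls;
}

// Hypothetical usage with OSPRay (requires <ospray/ospray.h>, not compiled here):
//   OSPFuture frame = ospRenderFrame(framebuffer, renderer, camera, world);
//   pollWhileWorking(
//       [&] { return ospIsReady(frame, OSP_TASK_FINISHED) != 0; },
//       [&] { /* process incoming network messages */ });
//   ospRelease(frame);
```

The `maxPolls` bound is only there to keep the sketch from looping forever if the readiness check never succeeds; a real application would more likely rely on cancellation or a timeout.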