fiedl / hole-ice-study

This project aims to incorporate the effects of hole ice into the clsim photon propagation simulation of the icecube neutrino observatory.
Other
4 stars 2 forks source link

Is flasher study performance bad? #68

Closed fiedl closed 6 years ago

fiedl commented 6 years ago

During preliminary flasher studies (https://github.com/fiedl/hole-ice-study/issues/59), several workers run into resource limits (https://github.com/fiedl/hole-ice-study/issues/62) including run time.


EDIT: This issue has been resolved by memory optimizations (#70).

fiedl commented 6 years ago

Quantify performance

The flasher-pulse-width and brightness settings scale with the total number of recorded hits, i.e. scale with the total number of photons (https://github.com/fiedl/hole-ice-study/issues/59#issuecomment-385410216, https://github.com/fiedl/hole-ice-study/issues/57#issuecomment-384267878). They also scale (non linear) with the job run time.

image

Image source: https://github.com/fiedl/hole-ice-study/issues/57#issuecomment-384267878

fiedl commented 6 years ago

I'm extracting the number of photons from clsim propagation log files.

These files contain several bunches of photons:

[2018-05-24 11:25:47] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/results/flasher_parameter_scan/esca3.5_r2.0rdom_abs100_width127_bright63
▶ grep photon clsim.log
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 622.818 nanoseconds/photon (util: 76%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event&, const boost::posix_time::ptime&, uint64_t, bool, const std::string&, const std::string&, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 1398208 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption&, bool&, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 622.402 nanoseconds/photon (util: 96%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event&, const boost::posix_time::ptime&, uint64_t, bool, const std::string&, const std::string&, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 1396351 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption&, bool&, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 623.593 nanoseconds/photon (util: 98%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event&, const boost::posix_time::ptime&, uint64_t, bool, const std::string&, const std::string&, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 1398945 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption&, bool&, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 623.366 nanoseconds/photon (util: 98%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event&, const boost::posix_time::ptime&, uint64_t, bool, const std::string&, const std::string&, uint64_t))
# ...

Sum them up:

[2018-05-24 11:41:34] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/results/flasher_pulse_widths/esca0.05_r0.5rdom_abs100_width112
▶ grep photons clsim.log |cut -d " " -f 9 |paste -sd+ |bc
36993469

For different flasher-pulse widths:

[2018-05-24 11:42:56] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/results/flasher_pulse_widths
▶ for folder in esca*; do echo $folder; grep photons $folder/clsim.log |cut -d " " -f 9 |paste -sd+ |bc; done
esca0.05_r0.5rdom_abs100_width112
36993469
esca0.05_r0.5rdom_abs100_width127
41876232
esca0.05_r0.5rdom_abs100_width16
5745055
esca0.05_r0.5rdom_abs100_width32
10949845
esca0.05_r0.5rdom_abs100_width48
16158510
esca0.05_r0.5rdom_abs100_width64
21365987
esca0.05_r0.5rdom_abs100_width80
26573831
esca0.05_r0.5rdom_abs100_width96
31791195
simulation_total_number_of_photons_by_width = {
  112: 36993469,
  127: 41876232,
  16: 5745055,
  32: 10949845,
  48: 16158510,
  64: 21365987,
  80: 26573831,
  96: 31791195
}

Plot those:

[2018-05-24 11:50:06] fiedl@fiedl-mbp ~
▶ which plot
plot () {
    plot_expression="$*"
    [[ -z $plot_expression ]] && plot_expression="'<cat'"
    gnuplot -e "set term png; plot $plot_expression" | ~/.iterm2/imgcat
}
▶ echo "112 36993469\n127 41876232\n16 5745055\n32 10949845\n48 16158510\n64 21365987\n80 26573831\n96 31791195" |plot

plot

The number of propagated photons scales linear with the pulse width setting.

fiedl commented 6 years ago

Plot the simulation time vs. the number of propagated photons:

[2018-05-24 12:07:56] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_simulation_time_vs_number_of_photons.py

bildschirmfoto 2018-05-24 um 12 09 06

fiedl commented 6 years ago

Plot simulation time per photon vs. pulse width:

[2018-05-24 13:28:01] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_simulation_time_vs_number_of_photons.py

bildschirmfoto 2018-05-24 um 13 29 58

Simulation time is about 1e6 ns per photon.

fiedl commented 6 years ago

This is really strange, because the propagation time is much smaller.

[2018-05-24 13:35:33] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/results/flasher_pulse_widths
▶ grep "kernel statistics" esca0.05_r0.5rdom_abs100_width*/clsim.log |head -n 1
esca0.05_r0.5rdom_abs100_width112/clsim.log:INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 553.535 nanoseconds/photon (util: 76%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event&, const boost::posix_time::ptime&, uint64_t, bool, const std::string&, const std::string&, uint64_t))

▶ grep "kernel statistics" esca0.05_r0.5rdom_abs100_width*/clsim.log |cut -d " " -f 5 |shuf |head -n 20
253.643
258.869
253.707
651.095
253.765
254.317
259.274
253.622
653.176
558.418
253.789
554.337
555.549
571.87
259.242
676.258
652.15
674.927
253.72
570.226

The mean is about 500ns for the propagation of each photon (compared to 1e6 ns per photon for the whole simulation):

[2018-05-24 13:39:29] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/results/flasher_pulse_widths
▶ grep "kernel statistics" esca0.05_r0.5rdom_abs100_width*/clsim.log |cut -d " " -f 5 |awk '{ total += $1; count++ } END { print total/count }'
476.894

This means the huge performance loss does happen outside the gpu kernel.

fiedl commented 6 years ago

Compare performance to standard clsim

I'm running a simple flasher simulation with my clsim vs. standard clsim.

clsim with hole-ice simulation

I'm running a flasher simulation with pulse width settings and a brightness settings of 8, 16 and 24.

[2018-05-24 13:39:50] fiedl@wgs16
▶ qrsh -l gpu -l tmpdir_size=10G -l h_rss=8G -l s_rt=11:00:00

[2018-05-24 14:21:47] fiedl@kepler26 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FlasherSimulation
▶ ./run.rb --help
Usage: bundle exec ruby run.rb [options]
        --hole-ice=METHOD            simulation or approximation
        --no-hole-ice                Run without hole-ice code
        --width=WIDTH                Flahser pulse width, min = 0, max = 127
        --brightness=BRIGHTNESS      Flasher pulse brightness, min = 0, max = 127
        --effective-scattering-length=METRES
        --absorption-length=METRES
        --hole-ice-radius-in-dom-radii=R
        --cables                     Simulate cable shadows

[2018-05-24 14:23:46] fiedl@kepler26 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FlasherSimulation
▶ $ICESIM/env-shell.sh

[2018-05-24 14:35:23] fiedl@kepler26 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FlasherSimulation
▶ rm tmp/*
▶ time ./run.rb --hole-ice=simulation --width=8 --brightness=8 --effective-scattering-length=0.1 --absorption-length=0.1 --hole-ice-radius-in-dom-radii=0.5
0.43s user 0.42s system 2% cpu 29.088 total

[2018-05-24 14:51:06] fiedl@kepler26 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FlasherSimulation
▶ rm tmp/*
▶ time ./run.rb --hole-ice=simulation --width=16 --brightness=16 --effective-scattering-length=0.1 --absorption-length=0.1 --hole-ice-radius-in-dom-radii=0.5
0.42s user 0.37s system 0% cpu 2:10.41 total

[2018-05-24 16:44:36] fiedl@kepler21 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FlasherSimulation
▶ rm tmp/*
▶ time ./run.rb --hole-ice=simulation --width=24 --brightness=24 --effective-scattering-length=100.0 --absorption-length=100.0 --hole-ice-radius-in-dom-radii=0.5

NOTICE (I3Tray): I3Tray finishing... (I3Tray.cxx:473 in void I3Tray::Execute(unsigned int))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 568.109 nanoseconds/photon (util: 87%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 162012 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 588.473 nanoseconds/photon (util: 100%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 155908 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 595.862 nanoseconds/photon (util: 100%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 121164 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimModule): Flushing I3Tray.. (I3CLSimModule.cxx:1639 in virtual void I3CLSimModule::Finish())

0.34s user 0.30s system 0% cpu 12:21.84 total

Setting: 8, runtime: 29s. Setting: 16 (4 times as much photons), runtime: 130s (4.5 times as much simulation time). Setting: 24 (9 times as much photons), runtime: 741s (25.5 times as much simulation time).

fiedl commented 6 years ago

standard clsim

How to compile: 2018-01-23 Installing IceSim in Zeuthen

Switch to standard clsim for icesim-v05-00-07:

[2018-05-24 15:17:48] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/clsim
▶ git co icesim-v05-00-07

[2018-05-24 15:00:46] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/software/icecube-simulation-V05-00-07
▶ ll src |grep clsim
lrwxr-xr-x  1 fiedl icecube   44 Jan 25 11:48 clsim -> /afs/ifh.de/group/amanda/scratch/fiedl/clsim

Compile:

[2018-05-24 15:30:18] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/software/icecube-simulation-V05-00-07
▶ rm -r debug_build
▶ mkdir debug_build

[2018-05-24 15:30:52] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/software/icecube-simulation-V05-00-07/debug_build
▶ cmake -D CMAKE_BUILD_TYPE=Debug -D SYSTEM_PACKAGES=true -D CMAKE_BUILD_TYPE:STRING=Debug ../src
▶ ./env-shell.sh

[2018-05-24 15:32:00] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/software/icecube-simulation-V05-00-07/debug_build
▶ make -j 6
fiedl commented 6 years ago

Perform simulation with standard clsim:

[2018-05-24 16:33:24] fiedl@kepler21 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FlasherSimulation
▶ rm tmp/*
▶ time ./run.rb --hole-ice=approximation --width=24 --brightness=24 --effective-scattering-length=100.0 --absorption-length=100.0 --hole-ice-radius-in-dom-radii=0.5

NOTICE (I3Tray): I3Tray finishing... (I3Tray.cxx:473 in void I3Tray::Execute(unsigned int))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 37.0069 nanoseconds/photon (util: 30%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1108 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 130810 (I3CLSimStepToPhotonConverterOpenCL.cxx:990 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 36.9052 nanoseconds/photon (util: 79%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1108 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 124503 (I3CLSimStepToPhotonConverterOpenCL.cxx:990 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 35.8627 nanoseconds/photon (util: 99%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1108 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 95246 (I3CLSimStepToPhotonConverterOpenCL.cxx:990 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimModule): Flushing I3Tray.. (I3CLSimModule.cxx:1588 in virtual void I3CLSimModule::Finish())

0.31s user 0.28s system 0% cpu 2:02.91 total

Switch back to hole-ice clsim:

[2018-05-24 16:39:49] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/clsim
▶ git co sf/hole-ice-2018

[2018-05-24 16:40:29] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/software/icecube-simulation-V05-00-07/debug_build
▶ ./env-shell.sh
▶ make
fiedl commented 6 years ago

Performance comparison

With the same flasher settings and large scattering and absorption lengths on the same hardware:

Simulation run time

Kernel run time

This does not look good.

fiedl commented 6 years ago

I'm not convinced that the number of downloaded photons is the same as the number of propagated photons. If only hits are written down, then the number of propagated photons is much larger.

If a large number of medium changes is responsible for the performance loss, then this simulation will be useless in a real scenario where all cylinders and cables are implemented in the geometry.

If ppc is faster, what is ppc doing differently?

fiedl commented 6 years ago

How much does the number of cylinders affect simulation time and kernel time?

Configuration:

Run flasher simulation:

[2018-05-24 17:55:29] fiedl@kepler21 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FlasherSimulation
▶ rm tmp/*
▶ time ./run.rb --hole-ice=simulation --width=24 --brightness=24 --effective-scattering-length=100.0 --absorption-length=100.0 --hole-ice-radius-in-dom-radii=0.5

# ...
NOTICE (I3Tray): I3Tray finishing... (I3Tray.cxx:473 in void I3Tray::Execute(unsigned int))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 544.713 nanoseconds/photon (util: 86%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 162068 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 562.101 nanoseconds/photon (util: 100%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 155938 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 569.499 nanoseconds/photon (util: 100%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 121236 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimModule): Flushing I3Tray.. (I3CLSimModule.cxx:1639 in virtual void I3CLSimModule::Finish())
# ...

0.29s user 0.26s system 0% cpu 12:30.89 total
Simulation run time
Kernel run time

The number of cylinder appears to play a subdominant role.

fiedl commented 6 years ago

How much do ice layers affect the simulation and kernel run time?

Configuration:

Run the simulation:

[2018-05-25 09:13:58] fiedl@wgs16 ~
▶ qrsh -l gpu -l tmpdir_size=10G -l h_rss=8G -l s_rt=14:00:00

[2018-05-25 09:14:19] fiedl@kepler19 ~
▶ $ICESIM/env-shell.sh

[2018-05-25 09:23:06] fiedl@kepler19 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FlasherSimulation
▶ time ./run.rb --hole-ice=simulation --width=24 --brightness=24 --effective-scattering-length=100.0 --absorption-length=100.0 --hole-ice-radius-in-dom-radii=0.5

NOTICE (I3Tray): I3Tray finishing... (I3Tray.cxx:473 in void I3Tray::Execute(unsigned int))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 598.538 nanoseconds/photon (util: 82%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 167356 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 606.615 nanoseconds/photon (util: 100%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 161083 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 608.811 nanoseconds/photon (util: 100%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 123460 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimModule): Flushing I3Tray.. (I3CLSimModule.cxx:1639 in virtual void I3CLSimModule::Finish())

0.36s user 0.29s system 0% cpu 14:21.80 total

This cluster worker seemed overall slow, also when creating the geometry.

Simulation run time
Kernel run time
fiedl commented 6 years ago

What when commenting out most of the medium-propagation code

Configuration:

Run the flasher simulation:

[2018-05-25 09:57:34] fiedl@kepler19 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FlasherSimulation
▶ time ./run.rb --hole-ice=simulation --width=24 --brightness=24 --effective-scattering-length=100.0 --absorption-length=100.0 --hole-ice-radius-in-dom-radii=0.5

NOTICE (I3Tray): I3Tray finishing... (I3Tray.cxx:473 in void I3Tray::Execute(unsigned int))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 22.3237 nanoseconds/photon (util: 15%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 167503 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 22.2416 nanoseconds/photon (util: 40%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 161078 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 21.6648 nanoseconds/photon (util: 36%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 123383 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimModule): Flushing I3Tray.. (I3CLSimModule.cxx:1639 in virtual void I3CLSimModule::Finish())

0.30s user 0.28s system 0% cpu 2:34.29 total

This is much faster

Simulation run time
Kernel run time
fiedl commented 6 years ago

These results suggest that the new code for medium changes takes 30 times as long as standard clsim.

But earlier performance measurements didn't indicate this at all:

image

Source: https://github.com/fiedl/hole-ice-study/issues/49

fiedl commented 6 years ago

Trying to narrow down where the performance is lost.

Hole Ice      Layers    Sorting                        Result
0             0         0                              fast
1             0         1                              slow
1             1         1                              slow
0             1         0                              slow
1             0         0                              slow
0             0         1                              fast

If hole ice or layers are on, then it's slow. This suggests that the performance loss happens if there are medium changes.

fiedl commented 6 years ago

Adding an arbitrary medium change by hand:

int number_of_medium_changes = 1;
floating_t distances_to_medium_changes[MEDIUM_LAYERS] = {0.0, 1.0};
int currentPhotonLayer = min(max(findLayerForGivenZPos(photonPosAndTime.z), 0), MEDIUM_LAYERS-1);
floating_t local_scattering_lengths[MEDIUM_LAYERS] = {getScatteringLength(currentPhotonLayer, photonDirAndWlen.w), 100.0};
floating_t local_absorption_lengths[MEDIUM_LAYERS] = {getAbsorptionLength(currentPhotonLayer, photonDirAndWlen.w), 100.0};
Hole Ice      Layers    Sorting                               Result
0             0         0         +arbitrary medium change    fast
0             0         1         +arbitrary medium change    fast
fiedl commented 6 years ago

Why is the arbitrary medium change fast and one hole-ice cylinder slow?

Hole Ice      Layers    Sorting                               Result
1             0         1                                     slow
0             0         1         +arbitrary medium change    fast

What if adding the hole-ice cylinder but re-setting to the stipulative values after that?

  #ifdef HOLE_ICE
    add_hole_ice_cylinders_on_photon_path_to_medium_changes(
      photonPosAndTime,
      photonDirAndWlen,
      photonRange,
      numberOfCylinders,
      cylinderPositionsAndRadii,

      // These values will be updates within this function:
      &number_of_medium_changes,
      distances_to_medium_changes,
      local_scattering_lengths,
      local_absorption_lengths
    );
  #endif

  number_of_medium_changes = 1;
  distances_to_medium_changes[0] = 0.0;
  distances_to_medium_changes[1] = 1.0;
  currentPhotonLayer = min(max(findLayerForGivenZPos(photonPosAndTime.z), 0), MEDIUM_LAYERS-1);
  local_scattering_lengths[0] = getScatteringLength(currentPhotonLayer, photonDirAndWlen.w);
  local_scattering_lengths[1] = 100.0;
  local_absorption_lengths[0] = getAbsorptionLength(currentPhotonLayer, photonDirAndWlen.w);
  local_absorption_lengths[1] = 100.0;

This is interesting: It's not as fast as without hole ice, but it's also not slow: 48 ns per photon.

NOTICE (I3Tray): I3Tray finishing... (I3Tray.cxx:473 in void I3Tray::Execute(unsigned int))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 43.7711 nanoseconds/photon (util: 31%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 127445 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 49.148 nanoseconds/photon (util: 81%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 132378 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 51.7027 nanoseconds/photon (util: 100%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1210 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 106994 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimModule): Flushing I3Tray.. (I3CLSimModule.cxx:1639 in virtual void I3CLSimModule::Finish())
Hole Ice      Layers    Sorting                               Result
1             0         1                                     slow
0             0         1         +arbitrary medium change    fast
1             0         1         +reset medium change        somewhat fast
fiedl commented 6 years ago

What if adding hole ice, without arbitrary medium change, but skipping loop_over_media_and_calculate_geometrical_distances_up_to_the_next_scattering_point, replacing it with a less complicated calculation.

  // loop_over_media_and_calculate_geometrical_distances_up_to_the_next_scattering_point(
  //   number_of_medium_changes,
  //   distances_to_medium_changes,
  //   local_scattering_lengths,
  //   local_absorption_lengths,
  //
  //   // These values will be updates within this function:
  //   sca_step_left,
  //   abs_lens_left,
  //   distancePropagated,
  //   distanceToAbsorption
  // );

  // Spend the rest of the budget with the last medium properties.
  if (*sca_step_left > 0) {
    *distancePropagated += *sca_step_left * local_scattering_lengths[number_of_medium_changes];
    *distanceToAbsorption += *abs_lens_left * local_absorption_lengths[number_of_medium_changes];
    *abs_lens_left -= my_divide(*distancePropagated, local_absorption_lengths[number_of_medium_changes]);
  }

  // If the photon is absorbed, only propagate up to the absorption point.
  if (*distanceToAbsorption < *distancePropagated) {
    *distancePropagated = *distanceToAbsorption;
    *distanceToAbsorption = ZERO;
    *abs_lens_left = ZERO;
  }
Hole Ice      Layers    Sorting                               Result
1             0         1          simple calculation         slow
fiedl commented 6 years ago

Profiling

I'm trying some kernel profiling in https://github.com/fiedl/hole-ice-study/issues/69.

fiedl commented 6 years ago

Total number of photons

I'm not convinced that the downloaded number of photons is the total number of propagated photons. Depending on the configuration, only photons that hit a DOM need to be downloaded.

fiedl commented 6 years ago

I'm repeating a small flasher simulation on the cluster with that new output in order to get the scaling factor of the number of downloaded photons and the total number of propagated photons.

[2018-05-26 11:03:59] fiedl@wgs16 ~
▶ qrsh -l gpu -l tmpdir_size=10G -l h_rss=8G -l s_rt=15:00:00

[2018-05-26 11:04:41] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/clsim
▶ git pull

[2018-05-26 11:07:10] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/software/icecube-simulation-V05-00-07/debug_build
▶ $ICESIM/env-shell.sh
▶ make

[2018-05-26 11:05:34] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/software/icecube-simulation-V05-00-07/debug_build/clsim
▶ $ICESIM/env-shell.sh

[2018-05-26 11:09:27] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FlasherSimulation
▶ rm tmp/*
▶ time ./run.rb --hole-ice=simulation --width=24 --brightness=24 --effective-scattering-length=100.0 --absorption-length=100.0 --hole-ice-radius-in-dom-radii=0.5

NOTICE (I3Tray): I3Tray finishing... (I3Tray.cxx:473 in void I3Tray::Execute(unsigned int))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 546.852 nanoseconds/photon, 409291894 photons (util: 88%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 162068 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 563.272 nanoseconds/photon, 409291942 photons (util: 100%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 155938 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 571.064 nanoseconds/photon, 323089575 photons (util: 100%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
INFO (I3CLSimStepToPhotonConverterOpenCL): Num photons to copy (buffer 0): 121236 (I3CLSimStepToPhotonConverterOpenCL.cxx:1092 in void I3CLSimStepToPhotonConverterOpenCL::OpenCLThread_impl_downloadPhotons(boost::this_thread::disable_interruption &, bool &, unsigned int, uint32_t))
INFO (I3CLSimModule): Flushing I3Tray.. (I3CLSimModule.cxx:1639 in virtual void I3CLSimModule::Finish())

0.32s user 0.29s system 0% cpu 12:18.44 total

From that output, the scaling factor is 2600.

fiedl commented 6 years ago

I'm updating the above plots to take this scaling factor into account: https://github.com/fiedl/hole-ice-study/commit/a4a466b3b00612017cd2574c0e66e4441bb53b1a

[2018-05-26 11:34:25] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_simulation_time_vs_number_of_photons.py
bildschirmfoto 2018-05-26 um 11 35 11

Note that the above gauging simulation has not the same hole-ice cylinder configuration as the simulations that are the data source of the plot. Thus, the simulation time per photon can only show the order of magnitude.

Nevertheless, it shows that the simulation time per photon is of the same magnitude as the time spent in the kernel per photon. This is plausible because the main purpose of the simulation is photon propagation, which is done in the kernel. This should be where the intensive work is done. :ok:

Furthermore, with the new output of the total number of photons, the time spent in the kernel can actually be compared to the overall simulation time, which will be done during profiling: https://github.com/fiedl/hole-ice-study/issues/69.

fiedl commented 6 years ago

This issue has been resolved by memory optimizations (#70).