Open fiedl opened 6 years ago
Preparation: https://github.com/fiedl/clsim/commit/7d7414d1455ee73826e21bf23594afad11fb3a64
[2018-05-26 13:26:30] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ ./run.rb --number-of-photons=1e4 --number-of-runs=1 --number-of-parallel-runs=1 --cpu
[2018-05-26 13:55:23] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log > tmp/profiling.txt
The individual steps of the medium propagation code make a small portion of the propagation simulation time.
[2018-05-26 13:56:45] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ ../lib/plot_profiling_results.py tmp/profiling.txt
Standard clsim handles the medium propagation differently. The original source can be found here: https://github.com/fiedl/clsim/blob/icesim-v05-00-07/resources/kernels/propagation_kernel.c.cl
I've copied the original source to clsim/resources/kernels/lib/propagation_through_media/standard_clsim.c
.
The method can be switched from the propagation_kernel.c.cl
by applying the desired method and commenting out the other one:
// propagation_kernel.c.cl
// ...
floating_t distancePropagated = 0;
floating_t distanceToAbsorption = 0;
// apply_propagation_through_different_media(
// photonPosAndTime,
// photonDirAndWlen,
// #ifdef HOLE_ICE
// numberOfCylinders,
// cylinderPositionsAndRadii,
// cylinderScatteringLengths,
// cylinderAbsorptionLengths,
// #endif
// &sca_step_left,
// &abs_lens_left,
// &distancePropagated,
// &distanceToAbsorption
// );
apply_propagation_through_different_media_with_standard_clsim(
photonPosAndTime,
photonDirAndWlen,
&sca_step_left,
&abs_lens_left,
&distancePropagated,
&distanceToAbsorption
);
Run the profiling on the cpu with standard clsim:
[2018-05-26 15:44:19] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ time ./run.rb --number-of-photons=1e4 --number-of-runs=1 --number-of-parallel-runs=1 --cpu
# Ctrl-C
9794,71s user 7443,22s system 99% cpu 4:48:27,72 total
I cancelled the job, because it ran forever. Also, it might not be comparable to the previous run, because this run has less profiling outputs, i.e. takes a shorter amount of time for profiling.
Nevertheless, try to get some histograms out of that run:
[2018-05-26 20:37:28] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ ../lib/plot_profiling_results.py tmp/profiling.txt
In order to make it comparable, profile hole-ice-2018 again, but without intermediate profiling steps. Profile just the whole scattering step and the portion of it that goes into the new hole-ice code.
[2018-05-26 21:23:59] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 --cpu > tmp/output.log
236,82s user 248,03s system 193% cpu 4:10,93 total
Some statistics:
[2018-05-26 21:28:34] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep "kernel statistics" tmp/clsim.log
PROFILINFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.69622e+08 nanoseconds/photon, 1024 photons (util: 95%) (Apple Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz) (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
[2018-05-26 21:33:31] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ ../lib/plot_profiling_results.py tmp/profiling.txt
And, to compare it properly, the same statistics with standard clsim:
[2018-05-26 21:35:21] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ rm tmp/*
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 --cpu > tmp/output.log
194,35s user 212,68s system 184% cpu 3:40,22 total
[2018-05-26 21:41:04] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ ../lib/plot_profiling_results.py tmp/profiling.txt
In order to get an idea how much the profiling itself costs, I'm adding a second profiling output at the same position in the code:
clock_t t2 = clock();
clock_t t1 = clock();
apply_propagation_through_different_media(
photonPosAndTime,
photonDirAndWlen,
#ifdef HOLE_ICE
numberOfCylinders,
cylinderPositionsAndRadii,
cylinderScatteringLengths,
cylinderAbsorptionLengths,
#endif
&sca_step_left,
&abs_lens_left,
&distancePropagated,
&distanceToAbsorption
);
printf("PROFILING apply_propagation_through_different_media %lu\n", clock() - t1);
printf("PROFILING apply_propagation_through_different_media_2 %lu\n", clock() - t2);
Without this extra output, the clock count has been 2.50 +- 2.61 (https://github.com/fiedl/hole-ice-study/issues/69#issuecomment-392282882).
With this extra output, the clock count is:
This is really strange. Thus, I can't really tell from this how much the performance measurement itself costs. Repeating the same measurement does confirm the strange result.
Another attempt to measure the performance cost of the profiling itself: Drop the inner clock counter and measure just the overall scattering step.
//clock_t t1 = clock();
apply_propagation_through_different_media(
photonPosAndTime,
photonDirAndWlen,
#ifdef HOLE_ICE
numberOfCylinders,
cylinderPositionsAndRadii,
cylinderScatteringLengths,
cylinderAbsorptionLengths,
#endif
&sca_step_left,
&abs_lens_left,
&distancePropagated,
&distanceToAbsorption
);
//printf("PROFILING apply_propagation_through_different_media %lu\n", clock() - t1);
When both the t0 counter and the t1 counter are active, the outer measurement is 30.02 +- 29.37.
When only the t0 counter is active, the outer measurement is 3.00 +- 3.79.
These results suggest that the dominant (most expensive) operation is the printf
that outputs the performance measurement result, which costs about 25 ticks whereas the scattering step only costs about 5 ticks.
Repeat the same measurement of inner and outer clock counter, but print the result outside the counter.
// start profiling the simulation step here
clock_t t0 = clock();
floating_t sca_step_left = -my_log(RNG_CALL_UNIFORM_OC);
floating_t distancePropagated = 0;
floating_t distanceToAbsorption = 0;
clock_t t1 = clock();
clock_t t2 = clock();
apply_propagation_through_different_media(
photonPosAndTime,
photonDirAndWlen,
#ifdef HOLE_ICE
numberOfCylinders,
cylinderPositionsAndRadii,
cylinderScatteringLengths,
cylinderAbsorptionLengths,
#endif
&sca_step_left,
&abs_lens_left,
&distancePropagated,
&distanceToAbsorption
);
clock_t t3 = clock();
clock_t t4 = clock();
// ...
clock_t t5 = clock();
printf("PROFILING propagation_kernel_simulation_step %lu\n", t5 - t0);
printf("PROFILING apply_propagation_through_different_media %lu\n", t3 - t2);
printf("PROFILING apply_propagation_through_different_media_outer %lu\n", t4 - t1);
This suggests that the performance measurement itself costs about 4.2 ticks. Thus, the overall scattering step is 3.5 ticks, including the the hole-ice-2018 code, which is 2.8 ticks. If this is correct, the hole-ice-2018 code costs 80% of the total scattering step.
Repeating the same for standard clsim:
clock_t t1 = clock();
clock_t t2 = clock();
apply_propagation_through_different_media_with_standard_clsim(
photonPosAndTime,
photonDirAndWlen,
&sca_step_left,
&abs_lens_left,
&distancePropagated,
&distanceToAbsorption
);
clock_t t3 = clock();
clock_t t4 = clock();
With standard clsim, the performance measurement again is about 4.1 ticks. The overall scattering step is 3.3 ticks, the standard-clsim media propagation code is 2.5 ticks, which is 76% of the total scattering step.
Comparing the hole-ice-2018 algorithm to the standard-clsim algorithm:
When subtracting the performance measurement, the total scattering step costs:
Thus, the hole-ice-2018 code costs about 6% more cpu time than standard clsim.
[2018-05-26 23:45:26] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/clsim
▶ git pull
[2018-05-26 23:45:57] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study
▶ git pull
[2018-05-26 23:45:47] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ rm tmp/*
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 > tmp/output.log
5.14s user 3.93s system 12% cpu 1:15.24 total
[2018-05-26 23:49:15] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ grep "kernel statistics" tmp/clsim.log
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.54137e+06 nanoseconds/photon, 1005 photons (util: 4%) (NVIDIA CUDA Tesla K80) (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
[2018-05-26 23:50:31] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mv tmp $SCRATCH/hole-ice-study/results/profiling/hole-ice-2018_gpu
[2018-05-27 00:00:50] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_profiling_results.py ~/hole-ice-study/results/profiling/hole-ice-2018_gpu/profiling.txt
The performance measurement costs 117 ticks. The medium propagation costs 38022 ticks, the total scattering step costs 41060 ticks. The medium propagation makes 92.6% of the total scattering step gpu time.
[2018-05-26 23:53:19] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/clsim
▶ git pull
[2018-05-26 23:51:34] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mkdir tmp
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 > tmp/output.log
5.58s user 3.90s system 8% cpu 1:49.88 total
[2018-05-26 23:57:04] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ grep "kernel statistics" tmp/clsim.log
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.04102e+06 nanoseconds/photon, 1005 photons (util: 2%) (NVIDIA CUDA Tesla K80) (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
[2018-05-26 23:57:19] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mv tmp $SCRATCH/hole-ice-study/results/profiling/standard-clsim_gpu
[2018-05-27 00:03:38] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_profiling_results.py ~/hole-ice-study/results/profiling/standard-clsim_gpu/profiling.txt
Performance measurement itself costs 114 ticks. The standard-clsim media propagation costs 1773 ticks, the total scattering step costs 5571 ticks. The media propagation makes 32% of the total scattering step when running standard clsim on the gpu.
On the gpu, the hole-ice-2018 code costs 7.37 as much time as standard clsim.
Run detailed gpu profiling for the hole-ice-2018 code.
[2018-05-27 00:26:25] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/clsim
▶ git pull
[2018-05-27 00:27:41] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mkdir tmp
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 > tmp/output.log
1.95s user 1.44s system 3% cpu 1:31.76 total
[2018-05-27 00:29:47] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ grep "kernel statistics" tmp/clsim.log
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.96359e+06 nanoseconds/photon, 1005 photons (util: 10%) (NVIDIA CUDA Tesla K80) (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))
[2018-05-27 00:30:17] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mv tmp $SCRATCH/hole-ice-study/results/profiling/hole-ice-2018_detail_gpu
[2018-05-27 00:38:12] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_profiling_results.py ~/hole-ice-study/results/profiling/hole-ice-2018_detail_gpu/profiling.txt
This is pretty unexpected. The start code that has nothing to do with the hole ice is most expensive. The relevant parts cost 2740 ticks. (Standard clsim medium propagation costs 1773 ticks.)
Why is the start code so expensive? Profile it in more detail:
clock_t t0 = clock();
int number_of_medium_changes = 0;
clock_t t01 = clock();
floating_t distances_to_medium_changes[MEDIUM_LAYERS] = {0.0};
clock_t t02 = clock();
int currentPhotonLayer = min(max(findLayerForGivenZPos(photonPosAndTime.z), 0), MEDIUM_LAYERS-1);
clock_t t03 = clock();
floating_t local_scattering_lengths[MEDIUM_LAYERS] = {getScatteringLength(currentPhotonLayer, photonDirAndWlen.w)};
clock_t t04 = clock();
floating_t local_absorption_lengths[MEDIUM_LAYERS] = {getAbsorptionLength(currentPhotonLayer, photonDirAndWlen.w)};
clock_t t05 = clock();
const floating_t photonRange = *sca_step_left * local_scattering_lengths[0];
clock_t t1 = clock();
// ...
printf("PROFILING start %lu\n", t1 - t0);
printf("PROFILING start_01 %lu\n", t01 - t0);
printf("PROFILING start_02 %lu\n", t02 - t01);
printf("PROFILING start_03 %lu\n", t03 - t02);
printf("PROFILING start_04 %lu\n", t04 - t03);
printf("PROFILING start_05 %lu\n", t05 - t04);
printf("PROFILING start_1 %lu\n", t1 - t05);
printf("PROFILING add_ice_layers %lu\n", t2 - t1);
printf("PROFILING add_hole_ice %lu\n", t3 - t2);
printf("PROFILING sort %lu\n", t4 - t3);
printf("PROFILING loop_over_media %lu\n", t5 - t4);
printf("PROFILING total %lu\n", t5 - t0);
[2018-05-27 00:52:53] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_profiling_results.py ~/hole-ice-study/results/profiling/hole-ice-2018_start_gpu/profiling.txt
From the detailed profiling of the start code, one can see that these three commands are the most expensive ones in the hole-ice-2018 code:
floating_t distances_to_medium_changes[MEDIUM_LAYERS] = {0.0};
floating_t local_scattering_lengths[MEDIUM_LAYERS] = {getScatteringLength(currentPhotonLayer, photonDirAndWlen.w)};
floating_t local_absorption_lengths[MEDIUM_LAYERS] = {getAbsorptionLength(currentPhotonLayer, photonDirAndWlen.w)};
As the first and the second have almost the same performance cost, the costly part is not getScatteringLength
but the allocation of the MEDIUM_LAYERS
-sized array.
Which part of the new code does cost how much computation time?
Some considerations
What needs to be done
Results