Profiling - Githubissues

fiedl commented 6 years ago

Which part of the new code does cost how much computation time?

Some considerations

Profiling within the kernel code and outputting the results is very imperformant itself and thus can only be done for a small number of photons, whereas the real performance issues arise when there are many photons as in flasher simulations (https://github.com/fiedl/hole-ice-study/issues/68).
In the GPU kernel, one can't just ask for the current time. But one can use the GPU clock. The clock counter are not nanoseconds, but are just GPU cycle ticks. Nevertheless, this gives an understanding of the relative amount of computation spent at the several parts of the code.

What needs to be done

[x] Profiling the new medium propagation code: Layers, hole ice, sorting, medium loop.
[x] What portion of the total propagation is spent in the new code?
[x] How does that compare to standard clsim?
[x] Are there differences when using GPU vs. CPU?

Results

On the CPU, the hole-ice-2018 code is almost as fast as standard clsim: https://github.com/fiedl/hole-ice-study/issues/69#issuecomment-392289709
But on the GPU, the hole-ice-2018 code costs 7 times as much time as standard clsim! https://github.com/fiedl/hole-ice-study/issues/69#issuecomment-392290662
The most expensive operations are allocating large arrays: https://github.com/fiedl/hole-ice-study/issues/69#issuecomment-392293208

bildschirmfoto 2018-08-02 um 17 34 45

fiedl commented 6 years ago

Profiling the propagation kernel on the cpu

Preparation: https://github.com/fiedl/clsim/commit/7d7414d1455ee73826e21bf23594afad11fb3a64

[2018-05-26 13:26:30] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ ./run.rb --number-of-photons=1e4 --number-of-runs=1 --number-of-parallel-runs=1 --cpu

[2018-05-26 13:55:23] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log > tmp/profiling.txt

fiedl commented 6 years ago

The individual steps of the medium propagation code make a small portion of the propagation simulation time.

[2018-05-26 13:56:45] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ ../lib/plot_profiling_results.py tmp/profiling.txt

fiedl commented 6 years ago

How does this compare to standard clsim?

Standard clsim handles the medium propagation differently. The original source can be found here: https://github.com/fiedl/clsim/blob/icesim-v05-00-07/resources/kernels/propagation_kernel.c.cl

I've copied the original source to clsim/resources/kernels/lib/propagation_through_media/standard_clsim.c.

The method can be switched from the propagation_kernel.c.cl by applying the desired method and commenting out the other one:

// propagation_kernel.c.cl
// ...

floating_t distancePropagated = 0;
floating_t distanceToAbsorption = 0;

// apply_propagation_through_different_media(
//   photonPosAndTime,
//   photonDirAndWlen,
//   #ifdef HOLE_ICE
//     numberOfCylinders,
//     cylinderPositionsAndRadii,
//     cylinderScatteringLengths,
//     cylinderAbsorptionLengths,
//   #endif
//   &sca_step_left,
//   &abs_lens_left,
//   &distancePropagated,
//   &distanceToAbsorption
// );

apply_propagation_through_different_media_with_standard_clsim(
  photonPosAndTime,
  photonDirAndWlen,
  &sca_step_left,
  &abs_lens_left,
  &distancePropagated,
  &distanceToAbsorption
);

fiedl commented 6 years ago

Run the profiling on the cpu with standard clsim:

[2018-05-26 15:44:19] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ time ./run.rb --number-of-photons=1e4 --number-of-runs=1 --number-of-parallel-runs=1 --cpu

# Ctrl-C
9794,71s user 7443,22s system 99% cpu 4:48:27,72 total

I cancelled the job, because it ran forever. Also, it might not be comparable to the previous run, because this run has less profiling outputs, i.e. takes a shorter amount of time for profiling.

Nevertheless, try to get some histograms out of that run:

[2018-05-26 20:37:28] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ ../lib/plot_profiling_results.py tmp/profiling.txt

fiedl commented 6 years ago

In order to make it comparable, profile hole-ice-2018 again, but without intermediate profiling steps. Profile just the whole scattering step and the portion of it that goes into the new hole-ice code.

[2018-05-26 21:23:59] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 --cpu > tmp/output.log

236,82s user 248,03s system 193% cpu 4:10,93 total

Some statistics:

[2018-05-26 21:28:34] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep "kernel statistics" tmp/clsim.log
PROFILINFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.69622e+08 nanoseconds/photon, 1024 photons (util: 95%) (Apple Intel(R) Core(TM) i7-4870HQ CPU @ 2.50GHz)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))

[2018-05-26 21:33:31] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ ../lib/plot_profiling_results.py tmp/profiling.txt

fiedl commented 6 years ago

And, to compare it properly, the same statistics with standard clsim:

[2018-05-26 21:35:21] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ rm tmp/*
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 --cpu > tmp/output.log

194,35s user 212,68s system 184% cpu 3:40,22 total

[2018-05-26 21:41:04] fiedl@fiedl-mbp ~/hole-ice-study/scripts/FiringRange master ⚡
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ ../lib/plot_profiling_results.py tmp/profiling.txt

fiedl commented 6 years ago

[x] add mean and standard deviation to these plots.
[x] calculate how much the profiling itself costs and subtract this

fiedl commented 6 years ago

In order to get an idea how much the profiling itself costs, I'm adding a second profiling output at the same position in the code:

        clock_t t2 = clock();
        clock_t t1 = clock();
        apply_propagation_through_different_media(
          photonPosAndTime,
          photonDirAndWlen,
          #ifdef HOLE_ICE
            numberOfCylinders,
            cylinderPositionsAndRadii,
            cylinderScatteringLengths,
            cylinderAbsorptionLengths,
          #endif
          &sca_step_left,
          &abs_lens_left,
          &distancePropagated,
          &distanceToAbsorption
        );
        printf("PROFILING apply_propagation_through_different_media %lu\n", clock() - t1);
        printf("PROFILING apply_propagation_through_different_media_2 %lu\n", clock() - t2);

Without this extra output, the clock count has been 2.50 +- 2.61 (https://github.com/fiedl/hole-ice-study/issues/69#issuecomment-392282882).

With this extra output, the clock count is:

3.16 +- 3.65 for the inner clock counter
32.59 +- 32.70 for the outer clock counter
56.74 +- 44.80 for the whole scattering step clock counter

This is really strange. Thus, I can't really tell from this how much the performance measurement itself costs. Repeating the same measurement does confirm the strange result.

fiedl commented 6 years ago

Another attempt to measure the performance cost of the profiling itself: Drop the inner clock counter and measure just the overall scattering step.

        //clock_t t1 = clock();
        apply_propagation_through_different_media(
          photonPosAndTime,
          photonDirAndWlen,
          #ifdef HOLE_ICE
            numberOfCylinders,
            cylinderPositionsAndRadii,
            cylinderScatteringLengths,
            cylinderAbsorptionLengths,
          #endif
          &sca_step_left,
          &abs_lens_left,
          &distancePropagated,
          &distanceToAbsorption
        );
        //printf("PROFILING apply_propagation_through_different_media %lu\n", clock() - t1);

When both the t0 counter and the t1 counter are active, the outer measurement is 30.02 +- 29.37.

When only the t0 counter is active, the outer measurement is 3.00 +- 3.79.

These results suggest that the dominant (most expensive) operation is the printf that outputs the performance measurement result, which costs about 25 ticks whereas the scattering step only costs about 5 ticks.

fiedl commented 6 years ago

Repeat the same measurement of inner and outer clock counter, but print the result outside the counter.

        // start profiling the simulation step here
        clock_t t0 = clock();

        floating_t sca_step_left = -my_log(RNG_CALL_UNIFORM_OC);
        floating_t distancePropagated = 0;
        floating_t distanceToAbsorption = 0;

        clock_t t1 = clock();
        clock_t t2 = clock();
        apply_propagation_through_different_media(
          photonPosAndTime,
          photonDirAndWlen,
          #ifdef HOLE_ICE
            numberOfCylinders,
            cylinderPositionsAndRadii,
            cylinderScatteringLengths,
            cylinderAbsorptionLengths,
          #endif
          &sca_step_left,
          &abs_lens_left,
          &distancePropagated,
          &distanceToAbsorption
        );
        clock_t t3 = clock();
        clock_t t4 = clock();

        // ...

        clock_t t5 = clock();
        printf("PROFILING propagation_kernel_simulation_step %lu\n", t5 - t0);
        printf("PROFILING apply_propagation_through_different_media %lu\n", t3 - t2);
        printf("PROFILING apply_propagation_through_different_media_outer %lu\n", t4 - t1);

This suggests that the performance measurement itself costs about 4.2 ticks. Thus, the overall scattering step is 3.5 ticks, including the the hole-ice-2018 code, which is 2.8 ticks. If this is correct, the hole-ice-2018 code costs 80% of the total scattering step.

fiedl commented 6 years ago

Repeating the same for standard clsim:

        clock_t t1 = clock();
        clock_t t2 = clock();
        apply_propagation_through_different_media_with_standard_clsim(
          photonPosAndTime,
          photonDirAndWlen,
          &sca_step_left,
          &abs_lens_left,
          &distancePropagated,
          &distanceToAbsorption
        );
        clock_t t3 = clock();
        clock_t t4 = clock();

With standard clsim, the performance measurement again is about 4.1 ticks. The overall scattering step is 3.3 ticks, the standard-clsim media propagation code is 2.5 ticks, which is 76% of the total scattering step.

fiedl commented 6 years ago

Comparing the hole-ice-2018 algorithm to the standard-clsim algorithm:

When subtracting the performance measurement, the total scattering step costs:

hole-ice-2018: 3.5 ticks
standard clsim: 3.3 ticks

Thus, the hole-ice-2018 code costs about 6% more cpu time than standard clsim.

fiedl commented 6 years ago

GPU profiling

hole-ice-2018 code

[2018-05-26 23:45:26] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/clsim
▶ git pull

[2018-05-26 23:45:57] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study
▶ git pull

[2018-05-26 23:45:47] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ rm tmp/*
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 > tmp/output.log
5.14s user 3.93s system 12% cpu 1:15.24 total

[2018-05-26 23:49:15] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ grep "kernel statistics" tmp/clsim.log
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.54137e+06 nanoseconds/photon, 1005 photons (util: 4%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))

[2018-05-26 23:50:31] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mv tmp $SCRATCH/hole-ice-study/results/profiling/hole-ice-2018_gpu

[2018-05-27 00:00:50] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_profiling_results.py ~/hole-ice-study/results/profiling/hole-ice-2018_gpu/profiling.txt

The performance measurement costs 117 ticks. The medium propagation costs 38022 ticks, the total scattering step costs 41060 ticks. The medium propagation makes 92.6% of the total scattering step gpu time.

fiedl commented 6 years ago

GPU profiling for standard clsim

[2018-05-26 23:53:19] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/clsim
▶ git pull

[2018-05-26 23:51:34] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mkdir tmp
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 > tmp/output.log

5.58s user 3.90s system 8% cpu 1:49.88 total

[2018-05-26 23:57:04] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ grep "kernel statistics" tmp/clsim.log
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.04102e+06 nanoseconds/photon, 1005 photons (util: 2%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))

[2018-05-26 23:57:19] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mv tmp $SCRATCH/hole-ice-study/results/profiling/standard-clsim_gpu

[2018-05-27 00:03:38] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_profiling_results.py ~/hole-ice-study/results/profiling/standard-clsim_gpu/profiling.txt

Performance measurement itself costs 114 ticks. The standard-clsim media propagation costs 1773 ticks, the total scattering step costs 5571 ticks. The media propagation makes 32% of the total scattering step when running standard clsim on the gpu.

On the gpu, the hole-ice-2018 code costs 7.37 as much time as standard clsim.

fiedl commented 6 years ago

Run detailed gpu profiling for the hole-ice-2018 code.

[2018-05-27 00:26:25] fiedl@wgs16 /afs/ifh.de/group/amanda/scratch/fiedl/clsim
▶ git pull

[2018-05-27 00:27:41] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mkdir tmp
▶ time ./run.rb --number-of-photons=1e3 --number-of-runs=1 --number-of-parallel-runs=1 > tmp/output.log

1.95s user 1.44s system 3% cpu 1:31.76 total

[2018-05-27 00:29:47] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ grep PROFILING tmp/clsim.log |grep -v INFO > tmp/profiling.txt
▶ grep "kernel statistics" tmp/clsim.log
INFO (I3CLSimStepToPhotonConverterOpenCL): kernel statistics: 1.96359e+06 nanoseconds/photon, 1005 photons (util: 10%) (NVIDIA CUDA Tesla K80)  (I3CLSimStepToPhotonConverterOpenCL.cxx:1212 in boost::posix_time::ptime I3CLSimStepToPhotonConverterOpenCL::DumpStatistics(const cl::Event &, const boost::posix_time::ptime &, uint64_t, bool, const std::string &, const std::string &, uint64_t))

[2018-05-27 00:30:17] fiedl@kepler18 /afs/ifh.de/group/amanda/scratch/fiedl/hole-ice-study/scripts/FiringRange
▶ mv tmp $SCRATCH/hole-ice-study/results/profiling/hole-ice-2018_detail_gpu

[2018-05-27 00:38:12] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_profiling_results.py ~/hole-ice-study/results/profiling/hole-ice-2018_detail_gpu/profiling.txt

This is pretty unexpected. The start code that has nothing to do with the hole ice is most expensive. The relevant parts cost 2740 ticks. (Standard clsim medium propagation costs 1773 ticks.)

fiedl commented 6 years ago

Why is the start code so expensive? Profile it in more detail:

  clock_t t0 = clock();
  int number_of_medium_changes = 0;
  clock_t t01 = clock();
  floating_t distances_to_medium_changes[MEDIUM_LAYERS] = {0.0};
  clock_t t02 = clock();
  int currentPhotonLayer = min(max(findLayerForGivenZPos(photonPosAndTime.z), 0), MEDIUM_LAYERS-1);
  clock_t t03 = clock();
  floating_t local_scattering_lengths[MEDIUM_LAYERS] = {getScatteringLength(currentPhotonLayer, photonDirAndWlen.w)};
  clock_t t04 = clock();
  floating_t local_absorption_lengths[MEDIUM_LAYERS] = {getAbsorptionLength(currentPhotonLayer, photonDirAndWlen.w)};
  clock_t t05 = clock();
  const floating_t photonRange = *sca_step_left * local_scattering_lengths[0];
  clock_t t1 = clock();

  // ...

  printf("PROFILING start %lu\n", t1 - t0);
  printf("PROFILING start_01 %lu\n", t01 - t0);
  printf("PROFILING start_02 %lu\n", t02 - t01);
  printf("PROFILING start_03 %lu\n", t03 - t02);
  printf("PROFILING start_04 %lu\n", t04 - t03);
  printf("PROFILING start_05 %lu\n", t05 - t04);
  printf("PROFILING start_1 %lu\n", t1 - t05);
  printf("PROFILING add_ice_layers %lu\n", t2 - t1);
  printf("PROFILING add_hole_ice %lu\n", t3 - t2);
  printf("PROFILING sort %lu\n", t4 - t3);
  printf("PROFILING loop_over_media %lu\n", t5 - t4);
  printf("PROFILING total %lu\n", t5 - t0);

[2018-05-27 00:52:53] fiedl@fiedl-mbp ~/hole-ice-study/scripts/lib master ⚡
▶ python plot_profiling_results.py ~/hole-ice-study/results/profiling/hole-ice-2018_start_gpu/profiling.txt

fiedl commented 6 years ago

From the detailed profiling of the start code, one can see that these three commands are the most expensive ones in the hole-ice-2018 code:

  floating_t distances_to_medium_changes[MEDIUM_LAYERS] = {0.0};
  floating_t local_scattering_lengths[MEDIUM_LAYERS] = {getScatteringLength(currentPhotonLayer, photonDirAndWlen.w)};
  floating_t local_absorption_lengths[MEDIUM_LAYERS] = {getAbsorptionLength(currentPhotonLayer, photonDirAndWlen.w)};

As the first and the second have almost the same performance cost, the costly part is not getScatteringLength but the allocation of the MEDIUM_LAYERS-sized array.

fiedl commented 6 years ago

fiedl / hole-ice-study

Profiling #69

Some considerations

What needs to be done

Results

Profiling the propagation kernel on the cpu

How does this compare to standard clsim?

GPU profiling

hole-ice-2018 code

GPU profiling for standard clsim