slizzered opened this issue 9 years ago
So I did some tests and throughput drops quite significantly for lower numbers of rays. I will follow this up with some profiling, maybe there is a way to optimize the overhead away (CUDA streams might be a solution if the GPU is not fully utilized).
Executed on Node Kepler002 in the Hypnos cluster
C example:

```
./bin/calcPhiASE -c calcPhiASE.cfg --min-rays=X
```

where X and the executable[1] vary between the runs. The config file is the one supplied with the example in the current `old`, except for min-rays:
minRays | runtime old [s] | runtime new [s] | throughput old/new
---|---|---|---
10^5 | 137 | 224 | 0.61
10^6 | 448 | 531 | 0.84
10^7 | 3100** | 3150** | 0.98
\* `old` is the current dev 2272f9bba5140cafd patched with 726b0473827fa08; `new` is basically `old`, but additionally patched with 0973d1ac7bab12b6
\*\* runtimes estimated after 10% of the simulation was completed. These times should be representative enough to get a good grasp on the performance implications.
OK, so I did some refactoring and debugging, and the code got a lot faster. As an added benefit, it would be trivial to add CUDA streams.
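As a rough sketch of what "trivial to add CUDA streams" could look like once each reflection slice gets its own kernel call. All names and parameters here (`calcSampleSliceKernel`, `calcSamplePoint`, etc.) are hypothetical placeholders, not the actual calcPhiASE interface:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Hypothetical per-slice kernel: one launch handles the rays of a single
// reflection slice; the real gain computation is elided.
__global__ void calcSampleSliceKernel(const unsigned *indicesOfPrisms,
                                      float *phiAse,
                                      unsigned raysInSlice,
                                      unsigned sliceOffset) {
  unsigned ray = blockIdx.x * blockDim.x + threadIdx.x;
  if (ray >= raysInSlice) return;
  // ... per-ray gain computation for this reflection slice ...
  atomicAdd(phiAse, 0.0f); // placeholder for the per-ray contribution
}

// Host side: one kernel call per reflection slice, each on its own stream,
// so independent slices can overlap when the GPU is not fully utilized.
void calcSamplePoint(const unsigned *dIndicesOfPrisms, float *dPhiAse,
                     const unsigned *raysPerSlice, unsigned numSlices) {
  std::vector<cudaStream_t> streams(numSlices);
  unsigned offset = 0;
  for (unsigned s = 0; s < numSlices; ++s) {
    cudaStreamCreate(&streams[s]);
    unsigned rays = raysPerSlice[s];
    dim3 block(128);
    dim3 grid((rays + block.x - 1) / block.x);
    calcSampleSliceKernel<<<grid, block, 0, streams[s]>>>(
        dIndicesOfPrisms, dPhiAse, rays, offset);
    offset += rays;
  }
  for (unsigned s = 0; s < numSlices; ++s) {
    cudaStreamSynchronize(streams[s]);
    cudaStreamDestroy(streams[s]);
  }
}
```

Whether the overlap actually helps depends on occupancy: for large min-rays a single slice may already saturate the GPU, in which case the streams only add launch overhead.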
minRays | runtime old [s] | runtime new [s] | throughput old/new
---|---|---|---
10^5 | 137 | 190 | 0.72
10^6 | 448 | 476 | 0.94
10^7 | 3100** | 3000** | 1.03
\* `old` is the current dev 2272f9bba5140cafd patched with 726b0473827fa08; `new` is basically `old`, but additionally patched with 0973d1ac7bab12b6 and 42cf48b668d25e7e2ae4494
\*\* runtimes estimated after 10% of the simulation was completed. These times should be representative enough to get a good grasp on the performance implications.
This is a test to see if we can split the calculation of a sample point into multiple kernel calls (one per reflection slice). The reason is that our current code computes all reflection slices in a single huge array. This old style has several disadvantages that could be fixed:

- `numberOfReflectionSlices` is huge: as big as `indicesOfPrisms`, so it is part of the bottleneck for the number of rays.
- `numberOfReflectionSlices` and `raysPerPrism` are actually linearized 2D arrays that contain all the reflection planes. This leads to more difficult code when we do the `mapRaysToPrisms`.

This is nice and all, but splitting the reflections might introduce some problems.
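The linearized-2D-array disadvantage above can be sketched as follows. The layout and function names are illustrative assumptions, not the actual calcPhiASE code:

```cuda
// Old style: raysPerPrism is a linearized 2D array over
// (reflection slice, prism), so every lookup needs a computed index
// and the kernel must carry the slice dimension around:
__device__ unsigned oldLookup(const unsigned *raysPerPrism,
                              unsigned slice, unsigned prism,
                              unsigned numPrisms) {
  return raysPerPrism[slice * numPrisms + prism];
}

// After the split, each kernel call handles exactly one reflection slice,
// so it receives a plain 1D array for that slice; the slice index (and
// the huge numberOfReflectionSlices array) disappear from the kernel:
__device__ unsigned newLookup(const unsigned *raysPerPrismSlice,
                              unsigned prism) {
  return raysPerPrismSlice[prism];
}
```

The same simplification would apply to `mapRaysToPrisms`, which would then operate on one slice at a time instead of the whole linearized block.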
All in all, the performance implications need to be tested. I believe that this commit can improve long-term code quality and will directly enable #2. But if the performance suffers, we might need to code some workaround (maybe use the split functionality only for really high ray numbers where the tradeoff is not so bad and we really NEED it).