psharda closed this pull request 2 months ago.
The numbers below are for a 3D PopIII test simulation on moth (AMD GPU) at ANU. The test ran for 1000 steps, starting with 0 AMR levels and finishing with 1 AMR level on top of the base level.
Test A: Current version of actual_rhs.H
Test B: Current version + free-free cooling
Test C: Test B but with the new version of actual_rhs.H implemented in this PR.
| Setup | Test time (s) | Mupdates/s | % chemistry |
|---|---|---|---|
| Test A | 1202.919 | 1.504 | 76.72% |
| Test B | 1206.848 | 1.499 | 76.55% |
| Test C | 1143.627 | 1.582 | 75.61% |
So the new version is faster by ~60 seconds (about 5% of the total wall time relative to Test A). This difference is likely to grow at later times, when the density increases, the chemistry takes longer, and more AMR levels are added.
Ok, so the chemistry itself is 7% faster on AMD with these improvements.
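The derived figures can be checked against the table with a few lines of C++. This sketch assumes that the "% chemistry" column is the share of the wall time spent in chemistry; the struct, constant, and function names are illustrative only.

```cpp
#include <cassert>

// One row of the benchmark table.
struct BenchRun {
    double wall_s;     // "Test time (s)" column
    double chem_frac;  // "% chemistry" column as a fraction (assumed share of wall time)
};

// Seconds of wall time spent in chemistry for one run.
inline double chemistry_seconds(const BenchRun& r) {
    assert(r.chem_frac >= 0.0 && r.chem_frac <= 1.0);
    return r.wall_s * r.chem_frac;
}

// Speedup of the chemistry in `after` relative to `before` (> 1 means faster).
inline double chemistry_speedup(const BenchRun& before, const BenchRun& after) {
    return chemistry_seconds(before) / chemistry_seconds(after);
}

constexpr BenchRun kTestA{1202.919, 0.7672};
constexpr BenchRun kTestB{1206.848, 0.7655};
constexpr BenchRun kTestC{1143.627, 0.7561};
```

With these values, kTestA.wall_s - kTestC.wall_s is about 59 s, and chemistry_speedup(kTestB, kTestC) is about 1.07, i.e. roughly 7% more chemistry throughput.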
I am a bit puzzled as to why it's not faster. Maybe there is an underlying performance issue with the AMD compiler. Do you have performance numbers for NVIDIA, or on CPU?
Did you have SymPy simplify the expressions after doing the pow -> exp/log conversion?
@BenWibking I do CSE first, then run cxxcode on everything. The second step does the pow --> exp/log conversion.
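For reference, the pow -> exp/log rewrite relies on the identity x^y = exp(y log x). The sketch below puts the two C++ forms side by side; the function names are hypothetical, and the rewrite is only valid for strictly positive bases, so it needs care for quantities such as number densities that can reach zero.

```cpp
#include <cassert>
#include <cmath>

// Original form, as a C++ code printer would emit x**y.
inline double power_pow(double x, double y) {
    return std::pow(x, y);
}

// Rewritten form: x**y == exp(y * log(x)), valid only for x > 0.
inline double power_exp_log(double x, double y) {
    assert(x > 0.0);  // log(x) is -inf or NaN for x <= 0
    return std::exp(y * std::log(x));
}
```

The two forms agree closely but are not guaranteed to be bitwise identical; the error of the exp/log form grows with |y log x|, because that product is rounded before exp is applied.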
The real performance improvements should come from merging products of exponentials into a single std::exp() call and reusing the log computations between terms. Is it possible to do more CSE after transforming pow to exp/log?
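A sketch of that merging, for a hypothetical rate of the form k(T) = A T^alpha exp(-beta/T) (the form and the parameter names are assumptions, not taken from actual_rhs.H). After a plain pow -> exp/log rewrite the rate costs one log and two exp calls; merging the exponents leaves a single exp, and taking log(T) as an argument lets every rate share one log.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical rate parameters: k(T) = A * T^alpha * exp(-beta / T).
struct RateParams {
    double A;
    double alpha;
    double beta;
};

// After a plain pow -> exp/log rewrite: one log and two exp calls.
inline double rate_unmerged(const RateParams& p, double T) {
    return p.A * std::exp(p.alpha * std::log(T)) * std::exp(-p.beta / T);
}

// Exponents merged into one exp; lnT = log(T) is computed once by the
// caller and can be shared by every rate that depends on T.
inline double rate_merged(const RateParams& p, double T, double lnT) {
    assert(T > 0.0);
    return p.A * std::exp(p.alpha * lnT - p.beta / T);
}
```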
To expand a bit further: the compiler won't (in general) combine exp(a)*exp(b) into exp(a+b) because the rounding properties are different, even though the latter form should be twice as fast.
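The rounding difference can be seen directly: exp(a)*exp(b) and exp(a+b) agree to within a few ulps for moderate arguments but are not guaranteed to be bitwise equal, so a compiler that preserves floating-point results (i.e. without options that relax FP semantics) cannot substitute one for the other. The helper names below are illustrative.

```cpp
#include <cassert>
#include <cmath>

// Two exp calls plus a multiply, as in the generated expressions.
inline double product_of_exps(double a, double b) {
    return std::exp(a) * std::exp(b);
}

// One exp call: mathematically equal, but a + b is rounded before exp.
inline double exp_of_sum(double a, double b) {
    return std::exp(a + b);
}

// Relative difference between the two forms; small, but not always zero.
inline double relative_difference(double a, double b) {
    const double lhs = product_of_exps(a, b);
    const double rhs = exp_of_sum(a, b);
    return std::fabs(lhs - rhs) / std::fabs(rhs);
}
```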
It looks like you can do the above simplification with sympy.powsimp: https://docs.sympy.org/latest/tutorials/intro-tutorial/simplification.html#powsimp
I don't think the compiler knows (or is even allowed to assume) anything about the mathematical properties of exp() and log(). It also looks like the standard library can't mark many of the math functions with __attribute__((const)) to let the compiler do CSE, as the global floating-point rounding mode can change their results.
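Since the compiler cannot be relied on to merge repeated std::log calls, the shared log can be hoisted by hand. The two temperature-dependent terms below, with made-up coefficients, are hypothetical; the point is that the hoisted version calls std::log once instead of twice while performing otherwise identical arithmetic.

```cpp
#include <cassert>
#include <cmath>

// Two hypothetical terms that both need log(T).
// The coefficients (0.7, -1.3, 500.0) are made up for illustration.
struct TempTerms {
    double term1;
    double term2;
};

// Naive form: each term calls std::log(T) on its own.
inline TempTerms temp_terms_naive(double T) {
    return {std::exp(0.7 * std::log(T)),
            std::exp(-1.3 * std::log(T) - 500.0 / T)};
}

// Hand-hoisted form: one std::log call shared by both terms.
inline TempTerms temp_terms_hoisted(double T) {
    assert(T > 0.0);
    const double lnT = std::log(T);
    return {std::exp(0.7 * lnT),
            std::exp(-1.3 * lnT - 500.0 / T)};
}
```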
I replaced most calls to std::pow with a combination of std::exp and std::log to speed up the chemistry. I also added free-free cooling, which was missing from the previous version. In the future, I will follow #1586 and #1591 to use fast_exp and fast_log.
The unit test passes, with small differences in the H and H2 number densities compared to the reference solution (3.5% and 0.3%, respectively). The difference in H is most likely due to free-free cooling.
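For context on the new cooling term, the sketch below uses the standard textbook bremsstrahlung emissivity, 1.4e-27 sqrt(T) n_e n_i Z^2 g_B erg s^-1 cm^-3 in cgs units. This is an assumption for illustration, not necessarily the expression or Gaunt factor implemented in actual_rhs.H; the function name and its default arguments are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Free-free (bremsstrahlung) cooling rate per unit volume in erg s^-1 cm^-3.
// T in K, number densities in cm^-3; Z is the ion charge and gaunt the
// (assumed, caller-supplied) frequency-averaged Gaunt factor.
inline double free_free_cooling(double T, double n_e, double n_i,
                                double Z = 1.0, double gaunt = 1.0) {
    assert(T > 0.0 && n_e >= 0.0 && n_i >= 0.0);
    return 1.4e-27 * std::sqrt(T) * n_e * n_i * Z * Z * gaunt;
}
```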