Fast exp algorithm implementation.

zhichen3 commented 2 months ago

A fast exp algorithm implementation following: Ref:

A Fast, Compact Approximation of the Exponential Function by Schraudolph 1999 (https://doi.org/10.1162/089976699300016467).
On a Fast Compact Approximation of the Exponential Function by Cawley 2000 (https://doi.org/10.1162/089976600300015033).
https://gist.github.com/jrade/293a73f89dfef51da6522428c857802d
https://github.com/ekmett/approximate/blob/master/cbits/fast.c

There are two versions: jrade_exp: Its roughly 6-times faster than std::exp. It has reasonable accuracy across all range (roughly ~2% relative error), we follow Ref (3)

ekmett_exp: It is about twice as slow as jrade_exp, but still ~ 3 times faster than std::exp, but it gives much better accuracy, ~0.1% relative error for most cases.

There are float and double versions for both versions.

The main driver is fast_exp, which also included a simple taylor approximation, 1+x, when x < 0.1, currently it is defaulted to use jrade_exp but one can just switch to ekmett_exp for better accuracy.

We also use memcpy approach to avoid undefined behavior from type-punning when using union (pointed out by Eric), and this approach is demonstrated in Ref 3, which we adapt for ekmett_exp.

Sidenote: when using it with NSE solve, it makes the convergence rate go bad in nse_test. Maybe its because of the insufficient accuracy?

BenWibking commented 2 months ago

@psharda this could make chemistry a lot faster...

yut23 commented 2 months ago

Ref 1 says:

Versions of EXP that use 8-byte (long long) integers do not suffer from this staircase effect, but were found to be unacceptably slow on the typical workstation platforms.

I wonder if that's still true with today's machines?

zhichen3 commented 2 months ago

Ref 1 says:

Versions of EXP that use 8-byte (long long) integers do not suffer from this staircase effect, but were found to be unacceptably slow on the typical workstation platforms.

I wonder if that's still true with today's machines?

Not only that. I think I might actually need this long long int version (hopefully still faster than std::exp) if it can really get rid of this staircase effect. When I tried to use fast_exp in NSE solve, I had to increase the tolerance for the NSE solve to ~1.e-6 to 1.e-7 to be able in order to solve it successfully. And I think the main issue is the staircase effect, which happens right at $\Delta x$ ~ 1.e-6.

I also tried to just work with higher tolerances, but it doesn't give good results in detonation. I wonder if its because the algorithm is so sensitive to getting the correct NSE massfractions...

yut23 commented 2 months ago

There's a 64-bit version here, if you want to try it: https://gist.github.com/jrade/293a73f89dfef51da6522428c857802d

yut23 commented 2 months ago

It also uses memcpy to avoid the UB from type punning through a union. We should probably do that too, seeing as we've had issues with reinterpret_cast in ROCm before.

zhichen3 commented 2 months ago

It also uses memcpy to avoid the UB from type punning through a union. We should probably do that too, seeing as we've had issues with reinterpret_cast in ROCm before.

I think you can do it directly by just shifting the it by 4 byte. So its just a single long long int but instead doing 2^20 do 2^52 when calculating the constants

zhichen3 commented 2 months ago

I find good references here: Some useful links: https://stackoverflow.com/questions/53882855/simple-explanation-of-the-ankerl-fast-exponent-algorithm https://github.com/ekmett/approximate/blob/master/cbits/fast.c https://martin.ankerl.com/2007/02/11/optimized-exponential-functions-for-java/ https://martin.ankerl.com/2007/10/04/optimized-pow-approximation-for-java-and-c-c/

I just followed: https://stackoverflow.com/questions/53882855/simple-explanation-of-the-ankerl-fast-exponent-algorithm

zhichen3 commented 2 months ago

might be worth it to implement those fast log and fast pow as well.

zhichen3 commented 2 months ago

int64 version does got rid of the stair effect and its actually faster than int32 version.

zhichen3 commented 2 months ago

It also uses memcpy to avoid the UB from type punning through a union. We should probably do that too, seeing as we've had issues with reinterpret_cast in ROCm before.

ah okay.

zhichen3 commented 2 months ago

For whatever reason, it doesn't work well with NSE solve. Theres not an issue of solving using a low tolerance, but it would slow things down and it ruins convergence rate. So I'm leaving it out for now.

zingale commented 2 months ago

this looks good to me. @yut23 are you okay as well?

zingale commented 2 months ago

@zhichen3 can you update the comment in the first cell to reflect the current state of things so we can have a good / accurate merge message

zhichen3 commented 2 months ago

updated, I assume you meant the cell on the top of the page.

AMReX-Astro / Microphysics

Fast exp algorithm implementation. #1586