ComputationalRadiationPhysics / picongpu

Performance-Portable Particle-in-Cell Simulations for the Exascale Era :sparkles:
https://picongpu.readthedocs.io

Fast Math Verification #2832

Open ax3l opened 5 years ago

ax3l commented 5 years ago

This issue documents a verification run for "fast math" (-ffast-math on gcc or --use_fast_math on nvcc).

I suspect that for long-running simulations (e.g. 30k steps or more), fast math has a significant influence on the final energy spectra.

Fast math is enabled by default in the PIConGPU GPU/CUDA backend (via ALPAKA_CUDA_FAST_MATH) and disabled by default on the CPU backends (such as OpenMP/omp2b), where it has to be controlled through CXXFLAGS, e.g. via export CXXFLAGS="-g0 -O3 -m64 -ffast-math".
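As a rough illustration of why fast math could matter over many steps: GCC's -ffast-math allows the compiler to reassociate floating-point operations, and floating-point addition is not associative. The sketch below only emulates that effect in Python by changing the summation order by hand; it is not PIConGPU code, the function name `reassociation_gap` is hypothetical, and reassociation is just one of several things fast math changes (on nvcc, --use_fast_math mainly affects division, square root, denormals and intrinsics).

```python
import math


def reassociation_gap(values):
    """Sum the same numbers in two different orders.

    Returns (left_to_right_sum, right_to_left_sum, reference_sum), where the
    reference uses math.fsum, which tracks the rounding error exactly.
    """
    forward = 0.0
    for v in values:
        forward += v

    backward = 0.0
    for v in reversed(values):
        backward += v

    return forward, backward, math.fsum(values)


# Smallest classic case: the grouping alone changes the last bit.
a = (0.1 + 0.2) + 0.3
b = 0.1 + (0.2 + 0.3)
print(a == b)

# Accumulation case: one large value followed by many tiny contributions.
# Added left to right, every tiny value is rounded away against 1.0;
# added right to left, the tiny values first build up to something visible.
values = [1.0] + [1e-16] * 1_000_000
fwd, bwd, ref = reassociation_gap(values)
print(fwd, bwd, ref)
```

Whether an effect of this kind becomes visible in the energy spectra depends on the actual kernels and on what the compiler chooses to reorder, which is what the verification runs are meant to find out.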

Method

Running the FoilLCT (a0=5, plane wave laser, 192 n_c, 1mu foil) example with 8.cfg repeatedly with varied:

runs: (first number as in cmakeFlags)

Diff to Default example

Commit

Branch: topic-fastMathFoilTest

Output

On HZDR file dirs in /bigdata/hplsim/development/huebl/foilLCT_fm/.

Version & Software

PIConGPU 0.4.2 on Hemera (HZDR) P100 with CUDA 9.2

ax3l commented 5 years ago

First test looks surprisingly stable after 60'000 steps (plane wave, a0=5):

(figure: h_all_60000)

The energy is not fully converged yet. Possible follow-up tests:

sbastrakov commented 5 years ago

From your data, the speedup due to fast math seems rather minor. Comparing configurations 000 and 300, the calculation time is 38:52 vs. 39:32, and the full simulation times are also about 40 seconds apart. In relative terms that is only about a 1.7% speedup.
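For reference, the 1.7% figure can be reproduced with a few lines of Python. This is a minimal sketch that assumes the two calculation times are in minutes:seconds; the helper names `to_seconds` and `relative_saving` are made up for this example.

```python
def to_seconds(mmss):
    """Convert a 'minutes:seconds' string such as '38:52' to seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def relative_saving(shorter, longer):
    """Fraction of the longer run time that the shorter run saves."""
    t_short = to_seconds(shorter)
    t_long = to_seconds(longer)
    return (t_long - t_short) / t_long


# Calculation times quoted for configurations 000 and 300 (assumed mm:ss).
difference = to_seconds("39:32") - to_seconds("38:52")
saving = relative_saving("38:52", "39:32")
print(difference)        # 40 (seconds)
print(f"{saving:.1%}")   # 1.7%
```

Dividing by the shorter time instead gives a value that also rounds to 1.7%, so the conclusion does not depend on which run is taken as the baseline.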

ax3l commented 5 years ago

Just for the record: I did a bit of I/O (text-based histograms every 100 steps) during the sim, and one might want to compare pure simulation time without init. But I tend to agree that, for the little speedup seen so far, the "risk" of fast math might not be worth making it "default on".

Still, at first glance the influence seems small. The next steps outlined above are needed to verify this further, for several setups and without I/O. Further research welcome!