QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org

Mixed precision preview #44

Closed: qmc-robot closed this issue 7 years ago

qmc-robot commented 7 years ago

Reported by: ye-luo

Hi all, mixed precision support has been added to QMCPACK. Add -D QMC_MIXED_PRECISION=1 to your CMake command line to activate it. Single precision (SP) is the base precision; double precision (DP) is the full precision.

Single precision is used almost everywhere, including particle/lattice coordinates, distance tables, wave functions (SPOs, determinants, Jastrows), and Hamiltonians. To retain accuracy, many reductions (estimators for energy components, gradient/Laplacian of the WF), the Coulomb/pseudopotential initialization, and the random-walk trajectory updates are done in DP. The uniform and Gaussian RNGs are always in DP; this is not strictly needed for accuracy, but it is useful for checking MP against DP.
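As a minimal sketch of this SP-compute/DP-reduce pattern (names are illustrative, not QMCPACK's actual API):

    #include <vector>

    using RealType         = float;   // base precision: values, positions, WF
    using FullPrecRealType = double;  // full precision: reductions

    // Energy components are evaluated in SP, but the block sum is
    // accumulated in DP so the estimator does not lose accuracy.
    FullPrecRealType sumLocalEnergies(const std::vector<RealType>& e_local)
    {
      FullPrecRealType block_sum = 0;
      for (RealType e : e_local)
        block_sum += e; // each SP value is promoted to DP before the add
      return block_sum;
    }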

A recompute step has been introduced that rebuilds the inverse matrices of the determinants from scratch, with the inversion done in DP.


By default, the recompute happens at the end of every block, as in the GPU code.
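A rough sketch of that recompute, assuming a simple flat-matrix layout (the naive inversion below is only a stand-in for the optimized LU-based routines QMCPACK actually uses):

    #include <vector>

    // Naive Gauss-Jordan inversion in DP; pivoting omitted for brevity.
    void invertDP(std::vector<double>& A, int n)
    {
      std::vector<double> inv(n * n, 0.0);
      for (int i = 0; i < n; ++i)
        inv[i * n + i] = 1.0;
      for (int col = 0; col < n; ++col)
      {
        const double piv = A[col * n + col];
        for (int j = 0; j < n; ++j)
        {
          A[col * n + j]   /= piv;
          inv[col * n + j] /= piv;
        }
        for (int row = 0; row < n; ++row)
          if (row != col)
          {
            const double f = A[row * n + col];
            for (int j = 0; j < n; ++j)
            {
              A[row * n + j]   -= f * A[col * n + j];
              inv[row * n + j] -= f * inv[col * n + j];
            }
          }
      }
      A = inv;
    }

    // Promote the SP Slater matrix to DP, invert from scratch in DP,
    // then store the fresh inverse back in SP. This discards the
    // rounding error accumulated by the SP rank-1 updates.
    void recomputeInverse(const std::vector<float>& slater,
                          std::vector<float>& inverse, int n)
    {
      std::vector<double> work(slater.begin(), slater.end());
      invertDP(work, n);
      inverse.assign(work.begin(), work.end());
    }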

The short tests in the test suite all pass. The mixed precision code has been tested mainly on solids, with real and complex builds, in VMC, VMC+drift, and DMC runs using an SD+J1+J2 wavefunction. Certain parts of the WF optimization need DP; this is not fixed yet.

In the results below, DP is a fully DP calculation, SP is the DP code with an SP spline, and MP is a mostly SP calculation.

1) In solid ZrO2 with 144x2 electrons, I checked the complex code.

VMC runs, no drift:

                  LocalEnergy                 Variance                  ratio
    DP            -953.463912 +/- 0.000598    19.293480 +/- 0.013491    0.0202
    SP            -953.465124 +/- 0.000534    19.286142 +/- 0.005167    0.0202
    Good news to tell:
    MP-nocompute  -953.463879 +/- 0.000757    19.287460 +/- 0.006794    0.0202
    MP            -953.464368 +/- 0.000717    19.296810 +/- 0.009718    0.0202

VMC runs, with drift:

                  LocalEnergy                 Variance                  ratio
    DP            -953.464105 +/- 0.000571    19.285185 +/- 0.010345    0.0202
    SP            -953.464476 +/- 0.000586    19.288203 +/- 0.007079    0.0202
    MP-nocompute  -953.464267 +/- 0.000510    19.293128 +/- 0.011464    0.0202
    MP            -953.464501 +/- 0.000635    19.291002 +/- 0.006508    0.0202

DMC runs

    tw_id   energy      error    tw_x    tw_y    tw_z    kpoint_id  weight
    tw0     -955.0420   0.0028   -0.25    0.25    0.25   3          0.5000000
    tw1     -955.0408   0.0021   -0.25   -0.25    0.25   2          1.0000000
    tw2     -955.0375   0.0025   -0.25   -0.25   -0.25   1          0.5000000

    all_tw                -955.04027  0.00141
    all_tw/12 (per cell)   -79.58669  0.00012

    tw_id   energy      error    tw_x    tw_y    tw_z    kpoint_id  weight
    tw0     -955.0374   0.0027   -0.25    0.25    0.25   3          0.5000000
    tw1     -955.0423   0.0026   -0.25   -0.25    0.25   2          1.0000000
    tw2     -955.0344   0.0021   -0.25   -0.25   -0.25   1          0.5000000

    all_tw                -955.03910  0.00156
    all_tw/12 (per cell)   -79.58659  0.00013

The DMC results are consistent to within 0.1 mHa per formula unit.

2) In solid TiO2 with 864x2 electrons, VMC runs:

                       LocalEnergy                   Variance                   ratio
    cpu-MP-recompute4  -6513.765656 +/- 0.005351     171.717760 +/- 1.120446    0.0264
    cpu-MP             -6513.767989 +/- 0.006393     170.280852 +/- 0.323286    0.0261
    cpu-SP             -6513.758155 +/- 0.007937     170.783247 +/- 0.306188    0.0262
    cpu-DP             -6513.767100 +/- 0.007118     170.458162 +/- 0.217447    0.0262
    gpu                -6513.756115 +/- 0.009468     170.240843 +/- 0.190607    0.0261

In brief, accuracy is not compromised by the mixed precision code. In fact, reducing the recompute frequency doesn't seem to hurt the accuracy at all.

By default, mixed precision is switched off and the code should behave like the trunk. Please provide feedback under this post. Ye

qmc-robot commented 7 years ago

Comment by: prckent

It would be helpful to describe where mixed precision is implemented, what benefits have been seen, what testing has been done, what the limitations are, etc.

qmc-robot commented 7 years ago

Comment by: prckent

What is the recompute frequency? What is being recomputed?

The cofactor/inverse Slater matrices? If so, this needs to be consistent with the GPU version, or the GPU version updated appropriately, etc.

qmc-robot commented 7 years ago

Comment by: prckent

I see the recompute is indeed the Slater matrix recompute.

Can we call the parameter blocks_between_slater_matrix_recompute? It is longer, but more obvious about what it does. Other ideas are welcome. Can you implement the same in the GPU code? We have to avoid feature divergence at all costs.

qmc-robot commented 7 years ago

Comment by: ye-luo

Direct comparison between a double precision calculation with a single precision spline and the mixed precision calculation: the energy difference for each component vs. iterations, on a ZrO2 Hartree-Fock VMC no-drift run with the complex code. The fluctuations are in a reasonable range and grow with time, but the statistics are OK. [Image: Screenshotfrom2016-09-20162739.png]

qmc-robot commented 7 years ago

Comment by: ye-luo

The recompute is implemented so that it propagates from TrialWaveFunction down to each individual component. The default is to do nothing, as for the Jastrows; SD has a specialization that recomputes in DP. This is how both the CPU and GPU code implement it. For this reason, I'd like to call it blocks_between_recompute rather than blocks_between_slater_matrix_recompute, to keep the logic consistent; users also don't need to know exactly what is recomputed. Adding a consistent implementation to the GPU code should be very simple: I only need to add a counter check in front of the recompute call, so I will do it soon.
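A minimal sketch of that propagation; names other than TrialWaveFunction, SlaterDet, and blocks_between_recompute are illustrative, not QMCPACK's real API:

    #include <memory>
    #include <vector>

    struct WaveFunctionComponent
    {
      virtual ~WaveFunctionComponent() = default;
      virtual void recompute() {} // default: no-op, e.g. for the Jastrows
    };

    struct SlaterDet : WaveFunctionComponent
    {
      void recompute() override
      {
        // specialization: rebuild the inverse matrices from scratch, in DP
      }
    };

    struct TrialWaveFunction
    {
      std::vector<std::unique_ptr<WaveFunctionComponent>> components;
      int blocks_between_recompute = 1; // default: recompute every block
      int block_count = 0;

      void endOfBlock()
      {
        // the "counter check in front of calling the recompute"
        if (++block_count % blocks_between_recompute == 0)
          for (auto& c : components)
            c->recompute();
      }
    };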

qmc-robot commented 7 years ago

Comment by: prckent

Understood.

[ This is not something we necessarily have to fix now, but we really ought to check the differences found after recomputing; this will become increasingly problematic for larger runs. We might also have to fix our "log" handling, which does not look to have the best numerics as written. ]
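One minimal way to quantify those differences (a hypothetical helper, not existing QMCPACK code): compare the incrementally updated SP inverse against the freshly recomputed one and log the largest deviation.

    #include <algorithm>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    // Largest element-wise deviation between the drifted SP inverse and the
    // fresh DP-recomputed inverse; growth with system size signals trouble.
    double maxRecomputeDeviation(const std::vector<float>& inv_updated,
                                 const std::vector<float>& inv_recomputed)
    {
      double max_dev = 0.0;
      for (std::size_t i = 0; i < inv_updated.size(); ++i)
        max_dev = std::max(max_dev,
                           std::fabs(double(inv_updated[i]) - double(inv_recomputed[i])));
      return max_dev;
    }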

qmc-robot commented 7 years ago

Comment by: ye-luo

I have aligned the recompute behaviours between CPU and GPU codes. The fix is trivial.

qmc-robot commented 7 years ago

Comment by: prckent

A note will be needed in the manual...

qmc-robot commented 7 years ago

Comment by: markdewing

How did you get the SP results? I tried compiling with OHMMS_PRECISION set to 'float' and it failed to compile (on both the mixed_precision branch and trunk).

qmc-robot commented 7 years ago

Comment by: ye-luo

Do not touch OHMMS_PRECISION. Add -D MIXED_PRECISION=1 in your cmake command line.

qmc-robot commented 7 years ago

Comment by: markdewing

In the initial text, there are results for DP, SP, and MP (and MP-nocompute). I assume DP is the original code (double precision), MP is mixed precision (-D MIXED_PRECISION=1), and SP is single precision? Where did the SP results come from? (Or do DP/SP refer to just the orbital spline precision?)

qmc-robot commented 7 years ago

Comment by: markdewing

For consistency, the flag name should start with QMC_ (QMC_MIXED_PRECISION, to match QMC_MPI and QMC_COMPLEX).

There seem to be two ways of thinking about mixed precision:

  1. Most of the code is in single precision, with some parts in double precision where precision is required
  2. Most of the code is in double precision, with some parts in single precision where precision can be reduced, and the code can go faster

Given that enabling mixed precision sets the base precision to 'float', my guess is that option 1 might be the better way to think about it. This choice affects naming (base precision + extended/full precision vs. base precision + reduced precision).

For increased clarity, the two precision values should be set in the same place. Currently, OHMMS_PRECISION is adjusted in CMakeLists.txt while the full precision value is hard-coded to 'double' in coulomb_types.h. Maybe add an OHMMS_PRECISION_FULL variable?
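A sketch of that single-place definition (hypothetical defaults; in practice the macro values would be injected by CMake rather than defined here):

    // Both precisions defined together, instead of hard-coding 'double'
    // in coulomb_types.h. The #ifndef defaults stand in for CMake.
    #ifndef OHMMS_PRECISION
    #define OHMMS_PRECISION float        // base precision (mixed precision build)
    #endif
    #ifndef OHMMS_PRECISION_FULL
    #define OHMMS_PRECISION_FULL double  // full precision, set in the same place
    #endif

    using RealType         = OHMMS_PRECISION;
    using FullPrecRealType = OHMMS_PRECISION_FULL;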

The type variables for the two precision values (at least for the Coulomb interactions) are pRealType and mRealType. What do the 'p' and 'm' prefixes mean?

qmc-robot commented 7 years ago

Comment by: ye-luo

Yes, it is better to add QMC_. I will do it for MIXED_PRECISION. I'm also considering controlling CUDA_PRECISION via MIXED_PRECISION. It is a good idea to add OHMMS_PRECISION_FULL to unify the hard-coded doubles. During development I needed piece-by-piece control, but now it is time to merge them.

The CPU code takes option 1: base = float, with the full precision hard-coded only in Configure.h and coulomb_types.h, from where it propagates through the code. The GPU code takes option 2: base = double, with the reduced precision (CUDA_PRECISION) = float.

This discrepancy should be unified, probably by taking option 2 in the new code, though not in the current QMCPACK.

In the Coulomb case, pRealType means the RealType of the ParticleSet, and mRealType means "my RealType".

Mark, could you please check your tests and replace the hard-coded double with RealType?

qmc-robot commented 7 years ago

Comment by: prckent

This is a great improvement. For non-experts, we should give a guide to which variables are in which precision in the output, e.g. << "\n Base precision = " << GET_MACRO_VAL(OHMMS_PRECISION) << " for wavefunctions, positions and lattice coordinates" (this will need some wordsmithing to not be too long).
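A self-contained sketch of that printout; GET_MACRO_VAL is the usual two-step stringification trick, and the OHMMS_PRECISION* defines below are stand-ins for the values CMake would normally supply:

    #include <iostream>

    #define GET_MACRO_VAL_(x) #x
    #define GET_MACRO_VAL(x) GET_MACRO_VAL_(x)
    #define OHMMS_PRECISION float        // stand-in; normally set by CMake
    #define OHMMS_PRECISION_FULL double  // stand-in; normally set by CMake

    int main()
    {
      std::cout << "\n  Base precision = " << GET_MACRO_VAL(OHMMS_PRECISION)
                << " (wavefunctions, positions, lattice coordinates)"
                << "\n  Full precision = " << GET_MACRO_VAL(OHMMS_PRECISION_FULL)
                << " (estimator sums, determinant recompute)\n";
    }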

qmc-robot commented 7 years ago

Comment by: ye-luo

Did some further tidying up; the precision definitions are now clean. Host side: base = OHMMS_PRECISION, full = OHMMS_PRECISION_FULL. GPU side: base = CUDA_PRECISION, full = CUDA_PRECISION_FULL.

qmc-robot commented 7 years ago

Comment by: ye-luo

@prckent I added a line above the precision printout pointing the user to the manual.

prckent commented 7 years ago

OK to close? There are some ongoing issues with some of the mixed precision tests, covered by issue #46.