Closed by qmc-robot 7 years ago
Comment by: prckent
Wow! @ye-luo
This is an important breakthrough in the spline bug arena. This is a VMC run (even though the files are labeled DMC) and the energies are crazy in the real code on a point-by-point basis, i.e. it is not a small part of phase space that is wrong. This suggests a problem with pointers, indexing, conversions, etc. It is interesting that the code does not crash and the complex version appears to get a reasonable (correct?) result.
My suggestion is that Ye (@ye-luo) takes a look at this as a priority unless he is "full". A first challenge is to reproduce the problem on another system. It is certainly a very real bug on a Cray Intel system (eos) so is likely to be general.
This is our scariest and most important known bug. Hopefully spline bug the second is the same one as spline bug the first.
(Comment written after speaking to Jaron by phone)
Comment by: prckent
This bug is particularly scary, just like the first spline bug, because we cannot rule out the possibility that it is slightly biasing the results of production runs that otherwise appear normal.
Comment by: ye-luo
No problems show on BG/Q. I did the DFT (QE 5.3.0) and VMC (QMCPACK rev 7337) both on BG/Q:
qmca -q ev *.scalar.dat
dmcJ0_comp_twist0 series 0 -2554.870550 +/- 0.007271 404.970022 +/- 1.933910 0.1585
dmcJ0_real_twist0 series 0 -2554.833197 +/- 0.010036 407.298302 +/- 1.146522 0.1594
I then transferred the h5 to EOS and ran QMCPACK rev7344 on EOS:
qmca -q ev -e 5 *.scalar.dat
dmcJ0_comp_twist0 series 0 -2554.834115 +/- 0.079253 404.658918 +/- 5.366533 0.1584
dmcJ0_real_twist0 series 0 -2555.016185 +/- 0.084016 400.174285 +/- 3.251338 0.1566
@jtkrogel Please:
1, transfer the h5 file to Mira and I will run QMCPACK.
2, copy my h5 (/gpfs/mira-fs1/projects/QMCSim/yeluo/spline_bug/Jaron/rerun/pwscf_output/pwscf.pwscf.h5) to your machine and run vmc with your qmcpack build.
So we can further investigate the issue.
Comment by: jtkrogel
I was able to track down the original orbital file and I am now transferring it to Mira (/gpfs/mira-fs1/projects/QMCSim/jtkrogel/transfer/for_ye/01_spline_bug2/pwscf.pwscf.h5). I reran with this file and the large variance behavior is present; Ye's H5 file resulted in a normal variance with rev7044 on EOS.
The file itself is irregularly large, which may relate to the problems seen (it is ~106 GB (!!) compared to 3.5 GB from Ye). This is directly due to the presence of psi_r data in the file. There may be a bug in QMCPACK's handling of the file.
It is unclear (a) why the file is so large, (b) why QMCPACK failed w/ the real code at this and one other volume, but was (apparently) fine at other volumes w/ real or complex code even though large H5 files were produced at each volume.
I will rebuild QE on EOS to see if normal size files and normal behavior result. I will also check the "spline bug 1" orbital file for size irregularities.
Comment by: jtkrogel
The "spline bug 1" files do not present size irregularities as compared w/ Ye's file. They were also generated with QE 5.1, but on another machine (OIC5).
Comment by: prckent
It is definitely important to exchange the exact file that is known to cause problems on at least one machine.
Perhaps the FFTs are incorrect when the plane wave orbitals are transformed to the real space mesh inside QMCPACK, or some of the surrounding code is bad? This could be FFT and machine ( BG vs Intel ) dependent.
Comment by: ye-luo
@jtkrogel Is the transfer still going? I don't see anything but a single txt file.
Try this on your sick file.
h5ls -r pwscf.pwscf.h5/electrons > size.out
grep psi_g size.out | awk '{print $3,$4}' | uniq
My file yields {56799, 2}. I was wondering if one k-point associated with the gamma-point WF was corrupted or at least had the wrong size.
Comment by: jtkrogel
@ye-luo The file transfer is still going; estimated remaining time is 1 hour (transfer 60% complete at 10 MB/s).
The 106 GB file yields the same:
eos> h5ls -r pwscf.pwscf.h5/electrons | grep psi_g | awk '{print $3,$4}' | uniq
{56799, 2}
I found that the file contains psi_r data on a 120x120x120 grid. This accounts for the size difference between your file (3.5 GB) and mine (106 GB): (120**3 / 56799) * 3.5 ≈ 106.
Am I correct that QMCPACK is reading this psi_r data? If so, this clearly has a bearing on the source and/or location of the bug.
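The size estimate can be verified in a few lines (all numbers taken from this thread; the per-value storage detail is otherwise irrelevant since only the ratio matters):

```python
# Quick check of the psi_r size estimate, using numbers from this thread.
# Each orbital stores 56799 psi_g coefficients but a full 120x120x120
# psi_r grid, so the grid-to-coefficient ratio sets the file-size blowup.
n_g = 56799          # plane-wave (psi_g) coefficients per orbital
n_r = 120 ** 3       # real-space (psi_r) grid points per orbital
small_gb = 3.5       # size of Ye's psi_g-only file
print(round(small_gb * n_r / n_g, 1))  # ~106.5, matching the ~106 GB file
```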
Comment by: jtkrogel
@ye-luo The file transfer to Mira is now complete (/gpfs/mira-fs1/projects/QMCSim/jtkrogel/transfer/for_ye/01_spline_bug2/pwscf.pwscf.h5).
Comment by: ye-luo
Since you also got {56799, 2}, it means we have the same number of g-vectors in reciprocal space in the DFT. Our files agree on the psi_g size per k-point and band. Your estimate of the file size is correct; the psi_r dominates.
I remember now: when I did the conversion, I commented out "write_psir = .true." to avoid writing the real-space WF to the h5. This makes the huge difference in file size. QMCPACK should not use psi_r; it loads psi_g and does the FFT internally.
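To illustrate why psi_r is redundant (a plain-Python sketch, not QMCPACK's actual FFT code): the real-space orbital is fully determined by the psi_g coefficients.

```python
import cmath

# Minimal sketch (not QMCPACK code): the real-space orbital value is fully
# determined by the psi_g coefficients, psi(r) = sum_G c_G exp(i G.r),
# which QMCPACK evaluates on a mesh via an internal FFT before splining.
def eval_psi_r(coeffs, gvecs, r):
    """coeffs: complex c_G; gvecs: G-vectors (3-tuples); r: position."""
    return sum(c * cmath.exp(1j * sum(g * x for g, x in zip(gv, r)))
               for c, gv in zip(coeffs, gvecs))
```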
Comment by: jtkrogel
There is one psi_r for each psi_g in the file. Since psi_r dominates the file size, the ratio of psi_r size (120**3) to psi_g size (56799) gives the approximate file size ratio as well.
Comment by: ye-luo
@jtkrogel QMCPACK may have used psi_r in the past, but not now; the current code ignores psi_r.
I tried the h5 from Jaron; the issue was reproduced.
rerun-h5-Jaron/dmcJ0_comp_twist0 series 0 -2554.833358 +/- 0.015526 404.594802 +/- 1.375452 0.1584
rerun-h5-Jaron/dmcJ0_real_twist0 series 0 -2178.009295 +/- 1.189484 1375076.599433 +/- 21024.229610 631.3456
I reran the DFT with the exact input files (using collect, and write_psir=.true.) and got an h5 of size 83 GB. The size can differ from machine to machine due to different libhdf5 versions.
rerun-dft-collect/dmcJ0_comp_twist0 series 0 -2554.798981 +/- 0.041567 399.644731 +/- 0.489775 0.1564
rerun-dft-collect/dmcJ0_real_twist0 series 0 -2554.839310 +/- 0.024979 406.804079 +/- 2.831341 0.1592
I also tried a conversion with write_psir=.false.:
rerun-dft-collect-nopsi_r/dmcJ0_comp_twist0 series 0 -2554.860441 +/- 0.017068 406.506065 +/- 2.525934 0.1591
rerun-dft-collect-nopsi_r/dmcJ0_real_twist0 series 0 -2554.848793 +/- 0.023298 405.541774 +/- 1.655292 0.1587
Another case, with neither collect nor write_psir:
rerun-dft-no-collect/dmcJ0_comp_twist0 series 0 -2554.824338 +/- 0.013409 405.586031 +/- 1.107173 0.1588
rerun-dft-no-collect/dmcJ0_real_twist0 series 0 -2554.844554 +/- 0.015553 404.325833 +/- 1.217949 0.1583
So it seems that a corrupt h5 causes the crazy behaviour.
Comment by: jtkrogel
As far as I can tell, the H5 file is valid; it just contains what we currently consider to be irrelevant information. It seems clear that QMCPACK is mishandling the file (same file but complex works and real doesn't).
The bigger question I have is whether this mishandling is generic to large files, i.e. will we run into this routinely in the future in, say, 256-512 atom defect cells of NiO? I think we should still track down and patch the source of this mishandling.
Comment by: prckent
I think we need to narrow down the difference further (or at least I need to understand it better).
e.g. Is the file being read incorrectly or somehow processed incorrectly internally after reading?
Could we be near an integer limit somewhere, hence the apparent "large file" dependency?
Have we reproduced this problem on enough different systems that we can claim it is not (say) a problem with a particular HDF5 version and installation?
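The integer-limit worry can at least be sanity-checked on the back of an envelope using numbers reported in this thread (the assumption of one 8-byte value per mesh point per orbital is mine, purely for illustration):

```python
# Back-of-envelope check of the integer-limit question, using numbers
# from this thread: a 120^3 spline mesh and nup=208 + ndown=200 occupied
# orbitals. Assuming one 8-byte value per mesh point per orbital, the
# total byte count already exceeds what a signed 32-bit offset can index.
mesh_points = 120 ** 3
orbitals = 208 + 200
total_bytes = mesh_points * orbitals * 8
print(total_bytes)                # 5640192000
print(total_bytes > 2 ** 31 - 1)  # True: int32 byte offsets would overflow
```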
Comment by: prckent
@jtkrogel Are the kinetic energies of the orbitals in the two files identical?
Comment by: jtkrogel
We've also regenerated the orbital file w/ write_psir=.false. on EOS using QE 5.1. The file size is 3.4 GB, as expected.
The large change in energy and variance is still present for this file (will refer to this file as "vol0.98_eos_small"):
dmcJ0_real_twist0 series 0 -2177.251376 +/- 3.313361 1383981.935105 +/- 56682.731771 635.6556
Rerunning Ye's small file ("vol0.98_mira_small") on EOS does not present the problem, consistent w/ Ye's runs:
dmcJ0_real_twist0 series 0 -2554.890268 +/- 0.081604 399.433829 +/- 4.111847 0.1563
I've calculated the per orbital kinetic energy by directly summing the coefficients (and k^2) with a Python tool of mine. The largest KE difference across all orbitals in the two files above is 0.5 mHa. Typical KE's per orbital range from 0.5-8.0 Ha.
Conclusions: the problem can persist w/o psi_r data (i.e. in small files). The two small files that do/do not trigger the bug are nearly identical in orbital contents (as manifested by matching orbital KE), so the problem is most likely limited to QMCPACK's usage of the files, not the files themselves (i.e. probably not the converter, unless something else in the files differs materially).
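The per-orbital KE sum used above can be sketched as follows (a hypothetical helper, not Jaron's actual Python tool; Hartree units and a normalized-sum convention assumed):

```python
# Hypothetical sketch of the per-orbital KE check described above (not
# Jaron's actual tool). For psi = sum_G c_G exp(i(k+G).r), the kinetic
# energy is KE = 0.5 * sum_G |c_G|^2 |k+G|^2 / sum_G |c_G|^2 (Hartree).
def orbital_kinetic_energy(coeffs, kpg_sq):
    """coeffs: complex PW coefficients c_G; kpg_sq: |k+G|^2 for each G."""
    norm = sum(abs(c) ** 2 for c in coeffs)
    ke = 0.5 * sum(abs(c) ** 2 * g2 for c, g2 in zip(coeffs, kpg_sq))
    return ke / norm
```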
Comment by: prckent
Does the current converter & QE 5.3 give the same file? Ye made a lot of changes. Do you need any help with an eos build?
Comment by: jtkrogel
So far I'm of the opinion that the evidence points away from a bug in the QE SCF+conversion. We have cases w/ the bug appearing using orbital files from QE5.1 and QE5.3. Tests outside of QMCPACK on files w/ and w/o apparent bug show the orbitals are the same. Bug only shows up going from complex to real. To me everything is pointing pretty clearly to a bug inside QMCPACK and not with the orbital files. Thoughts?
I'm happy to explore QE5.3 on EOS. It would be good to know for sure that the converter is behaving consistently across versions/builds. I think we should start digging into QMCPACK itself concurrent with this.
I also plan to rerun the KE checks across all of Ye's QE 5.3 files from spline bug 1 that showed huge sensitivity in the VMC variance when making small changes in the DFT convergence parameters. If the KE's are not sensitive, then we will know the problem is in QMCPACK and it will give us an easier way to track down the bug (force VMC variance to match across files) than doing full walker traces, etc.
Comment by: prckent
Q1. How do we explain Ye's "rerun" results? https://app.assembla.com/spaces/qmcdev/tickets/49/details?comment=1113850913
Q2. Have we showed a difference between complex and real versions for a single electron? i.e. cut down one of the "bad" runs to have only 1 or 2 electrons?
Comment by: prckent
No real need to run 5.3 on eos. 5.1 is plenty old though.
Comment by: ye-luo
@jtkrogel "We have cases w/ the bug appearing using orbital files from QE5.1 and QE5.3." Do we have one from QE5.3?
I will try to compare the real/complex QMCPACK in the following aspects: spline coefficients, phase, evaluation. If possible, I will run directly with the plane-wave WF. Not possible today due to the maintenance, but probably tomorrow.
Comment by: prckent
Planewave is super slow. I advise trying to cut down before running.
Comment by: jtkrogel
Paul/Ye: you are correct, we do not yet have a QE5.3 run that produces a buggy result for this ticket (I misread/misremembered Ye's rerun result). We do know that odd results can happen using the 5.3 toolchain (in context of similarly huge variances obtained intermittently for spline bug 1 https://app.assembla.com/spaces/qmcdev/tickets/40/details?comment=1113785513).
Might this be the time to put orbital KE sums (PW sum from H5 and Riemann sum or analytic for splined orbitals) into QMCPACK for orbital quality checks? A fringe benefit would be no more need to do meshfactor scans at the VMC level.
We are starting some low electron count runs here.
Comment by: prckent
Let us get to the root cause of the spline bugs before adding general tests. Anything looking at orbital quality should be coordinated with the APW projection conversion which needs something similar.
5.3 runs needed then. Will remove one parameter from our comparison matrix.
Comment by: jtkrogel
Agreed on 1 and 2. QE 5.3 runs will be useful. I'm suspicious of the intermittency issues shown in the other bug, so I may vary the convergence parameters and look for a similar pattern. If the resulting orbital KE's come out similar across the board then I think we can safely conclude that any variability seen at the VMC level resides solely in QMCPACK.
Comment by: jtkrogel
The problematic "small" 3.4 GB file ("vol0.98_eos_small") is now available on Mira: /gpfs/mira-fs1/projects/QMCSim/jtkrogel/transfer/for_ye/01_spline_bug2/vol0.98_eos_small/pwscf.pwscf.h5
Comment by: jtkrogel
Some brief results for a reduced number of electrons (occupying from 2 to 100 up and down orbitals, full occupation is nup=208, ndown=200).
Results with real code:
nud LocalEnergy Variance ratio
2 -1711.586138 +/- 0.022133 25.163081 +/- 0.201123 0.0147
4 -1757.027013 +/- 0.035981 52.466286 +/- 0.386141 0.0299
5 -1764.564283 +/- 3.576613 31020.140442 +/- 6531.582336 17.5795
10 -1500.868653 +/- 19.722575 1505729.495207 +/- 111609.817630 1003.2387
25 -2021.780528 +/- 4.744664 255578.583315 +/- 21096.982477 126.4126
50 -2375.185522 +/- 0.573504 12489.309935 +/- 2157.800910 5.2582
100 -2511.362514 +/- 0.064731 375.992336 +/- 10.289014 0.1497
Combined w/ the complex results below, it looks as though problem orbitals might start around nup/down==5. Still, the variance pattern is strange, with nup/down=100 appearing almost normal. This suggests more intermittency to me rather than specific problem orbitals.
Results with complex code:
nud LocalEnergy Variance ratio
2 -1711.568537 +/- 0.000000 26.088409 +/- 0.000000 0.0152
4 -1757.114780 +/- 0.024181 49.651033 +/- 0.349108 0.0283
5 -1779.701254 +/- 0.042572 63.252476 +/- 1.437347 0.0355
10 -1867.579168 +/- 4.446231 2696.267913 +/- 2406.244516 1.4437 <=== large variance
25 -2095.414049 +/- 0.081082 203.158111 +/- 1.780770 0.0970
50 -2378.986474 +/- 0.290910 249.256439 +/- 3.030250 0.1048
100 -2511.442451 +/- 0.115850 315.629761 +/- 1.803683 0.1257
The results at nup/down==10 show the first signs that there may also be a problem with the complex code. We are currently rerunning the nup/down=10 complex case w/ variations in the DFT convergence parameters to see if this behavior remains.
Comment by: prckent
This hints at an MPI bug in our code, or memory usage problem (bad pointer usage, incorrect free/alloc etc.). One possible strategy would be to compute a checksum of the spline buffers on each MPI task.
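Kent's checksum strategy could be prototyped along these lines (illustrative Python, not QMCPACK's C++; the idea is simply to gather one digest per MPI rank and compare):

```python
import hashlib

# Illustrative sketch of the checksum idea (not QMCPACK code): hash the
# raw bytes of each rank's spline-coefficient buffer; any rank whose
# digest differs from rank 0 holds a corrupted or divergent copy.
def spline_buffer_checksum(buf):
    """buf: bytes-like view of a rank's spline coefficient array."""
    return hashlib.sha256(bytes(buf)).hexdigest()

def ranks_with_bad_buffers(digests):
    """digests: per-rank checksums, indexed by MPI rank (rank 0 first)."""
    return [rank for rank, d in enumerate(digests) if d != digests[0]]
```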
Comment by: jtkrogel
We have rerun this in serial, and the problem persists. Also, running with traces on confirms that every walker on every node has a large kinetic energy.
The large variance seen above with complex for nud==10 is due to the runs being too short (equilibration issues w/ partial occupation, similar to isolated molecules in a box).
With equilibration properly accounted for, the real code demonstrates no issue w/ occupation up to nud=100. We are currently performing a bisection search on nud between 100 and 200 to find at least one orbital that has a large kinetic energy.
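The bisection over occupation can be sketched as below (hypothetical driver only; `run_is_bad` stands in for a full VMC run at a given occupation followed by a variance check):

```python
# Hypothetical driver for the bisection described above: run_is_bad(nud)
# stands in for "run VMC with nud up/down orbitals occupied and check
# whether the kinetic energy / variance blows up".
def first_bad_nud(run_is_bad, lo=100, hi=200):
    """Assumes run_is_bad(lo) is False and run_is_bad(hi) is True;
    returns the smallest occupation in (lo, hi] that shows the bug."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_is_bad(mid):
            hi = mid  # bug present: the first bad orbital is at or below mid
        else:
            lo = mid  # bug absent: the first bad orbital is above mid
    return hi
```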
Comment by: ye-luo
The bug has been fixed by improving the orbital phase rotation algorithm. The QMCPACK real code is no longer picky about the h5. Image: Orbital_scan.png
Comment by: ye-luo
@jtkrogel Could you confirm that the new fix from last Friday solves the bug? I would like to close the ticket ASAP.
Comment by: jtkrogel
@ye-luo I can confirm that spline bug 2 is now resolved; for our test case, we get results identical to the complex code. I am running long trace runs for spline bug 1 to see whether it is also resolved.
Comment by: ye-luo
Fantastic. In principle, I would like to urge everyone using the real code + splines to adopt this fix. It is critical.
Comment by: jtkrogel
I absolutely agree. Has anyone run long versions of the ctest runs to see if there are changes vs the reference values?
Comment by: ye-luo
In principle, the energy should not change because any rotation is valid, but the variance may be reduced a bit if the old rotation scheme doesn't like the h5. For the tests, I noticed the diamond files were generated with an old pwscf. When I scan the orbitals, there is some strange behaviour; if I rerun the DFT, it becomes normal. However, there is no change in the energy or flux estimator.
Comment by: prckent
Long tests: no. These need to be run. We have not been running them on oxygen recently due to clashes with the nightlies (something is taking too long and needs to be investigated).
Reported by: jtkrogel
Note: the crystal structure associated with this ticket is to remain private among the developers
Problem
Observations
Problem is apparent w/ Jastrow (twist0 is gamma)
And also w/o Jastrow
The issue is localized to the kinetic energy
In contrast to the first "spline bug", the energy and variance are significantly larger at each and every step
Other details
Files
File: bug_package.tgz