Closed by qmc-robot 7 years ago
Comment by: prckent
Wow! @ye-luo
This is an important breakthrough in the spline bug arena. This is a VMC run (even though the files are labeled DMC) and the energies are crazy in the real code on a point-by-point basis, i.e. it is not a small part of phase space that is wrong. This suggests a problem with pointers, indexing, conversions, etc. It is interesting that the code does not crash and the complex version appears to get a reasonable (correct?) result.
My suggestion is that Ye (@ye-luo) takes a look at this as a priority unless he is "full". A first challenge is to reproduce the problem on another system. It is certainly a very real bug on a Cray Intel system (eos) so is likely to be general.
This is our scariest and most important known bug. Hopefully spline bug the second is the same one as spline bug the first.
(Comment written after speaking to Jaron by phone)
Comment by: prckent
This bug is particularly scary, just like the first spline bug, because we cannot rule out the possibility that it is slightly biasing the results of production runs that otherwise appear normal.
Comment by: ye-luo
No problems show on BG/Q. I did the DFT (QE 5.3.0) and VMC (QMCPACK rev 7337) both on BG/Q:
qmca -q ev *.scalar.dat
dmcJ0_comp_twist0 series 0 -2554.870550 +/- 0.007271 404.970022 +/- 1.933910 0.1585
dmcJ0_real_twist0 series 0 -2554.833197 +/- 0.010036 407.298302 +/- 1.146522 0.1594
I then transferred the h5 to EOS and ran QMCPACK rev7344 on EOS:
qmca -q ev -e 5 *.scalar.dat
dmcJ0_comp_twist0 series 0 -2554.834115 +/- 0.079253 404.658918 +/- 5.366533 0.1584
dmcJ0_real_twist0 series 0 -2555.016185 +/- 0.084016 400.174285 +/- 3.251338 0.1566
@jtkrogel Please:
1, transfer the h5 file to Mira and I will run QMCPACK.
2, copy my h5 (/gpfs/mira-fs1/projects/QMCSim/yeluo/spline_bug/Jaron/rerun/pwscf_output/pwscf.pwscf.h5) to your machine and run vmc with your qmcpack build.
So we can further investigate the issue.
Comment by: jtkrogel
I was able to track down the original orbital file and I am now transferring it to Mira (/gpfs/mira-fs1/projects/QMCSim/jtkrogel/transfer/for_ye/01_spline_bug2/pwscf.pwscf.h5). I reran with this file and the large variance behavior is present; Ye's H5 file resulted in a normal variance with rev7044 on EOS.
The file itself is irregularly large, which may relate to the problems seen (it is ~106 GB (!!) compared to 3.5 GB from Ye). This is directly due to the presence of psi_r data in the file. There may be a bug in QMCPACK's handling of the file.
It is unclear (a) why the file is so large, (b) why QMCPACK failed w/ the real code at this and one other volume, but was (apparently) fine at other volumes w/ real or complex code even though large H5 files were produced at each volume.
I will rebuild QE on EOS to see if normal size files and normal behavior result. I will also check the "spline bug 1" orbital file for size irregularities.
Comment by: jtkrogel
The "spline bug 1" files do not present size irregularities as compared w/ Ye's file. They were also generated with QE 5.1, but on another machine (OIC5).
Comment by: prckent
It is definitely important to exchange the exact file that is known to cause problems on at least one machine.
Perhaps the FFTs are incorrect when the plane wave orbitals are transformed to the real space mesh inside QMCPACK, or some of the surrounding code is bad? This could be FFT and machine ( BG vs Intel ) dependent.
Comment by: ye-luo
@jtkrogel Is the transfer still going? I don't see anything but a single txt file.
Try this on your sick file.
h5ls -r pwscf.pwscf.h5/electrons > size.out
grep psi_g size.out | awk '{print $3,$4}' | uniq
My file yields {56799, 2}. I was wondering if one k-point associated with the gamma-point WF was corrupted or at least had the wrong size.
Comment by: jtkrogel
@ye-luo The file transfer is still going; estimated remaining time is 1 hour (transfer 60% complete at 10 MB/s).
The 106 GB file yields the same:
eos> h5ls -r pwscf.pwscf.h5/electrons | grep psi_g | awk '{print $3,$4}' | uniq
{56799, 2}
I found that the file contains psi_r data on a 120x120x120 grid. This accounts for the size difference between your file (3.5 GB) and mine (106 GB): (120**3 / 56799) * 3.5 ≈ 106.
Am I correct that QMCPACK is reading this psi_r data? If so, this clearly has a bearing on the source and/or location of the bug.
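The size estimate can be verified in a few lines (all numbers taken from this thread; the per-value storage detail is otherwise irrelevant since only the ratio matters):

```python
# Quick check of the psi_r size estimate, using numbers from this thread.
# Each orbital stores 56799 psi_g coefficients but a full 120x120x120
# psi_r grid, so the grid-to-coefficient ratio sets the file-size blowup.
n_g = 56799          # plane-wave (psi_g) coefficients per orbital
n_r = 120 ** 3       # real-space (psi_r) grid points per orbital
small_gb = 3.5       # size of Ye's psi_g-only file
print(round(small_gb * n_r / n_g, 1))  # ~106.5, matching the ~106 GB file
```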
Comment by: jtkrogel
@ye-luo The file transfer to Mira is now complete (/gpfs/mira-fs1/projects/QMCSim/jtkrogel/transfer/for_ye/01_spline_bug2/pwscf.pwscf.h5).
Comment by: ye-luo
Since you also got {56799, 2}, it means we have the same number of g-vectors in reciprocal space in the DFT. Our files agree on the psi_g size per k-point and band. Your estimate of the file size is correct; the psi_r dominates.
I remember now: when I did the conversion, I commented out "write_psir = .true." to avoid writing the real-space WF to the h5. This makes the huge difference in file size. QMCPACK should not use psi_r; it loads psi_g and does the FFT internally.
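To illustrate why psi_r is redundant (a plain-Python sketch, not QMCPACK's actual FFT code): the real-space orbital is fully determined by the psi_g coefficients.

```python
import cmath

# Minimal sketch (not QMCPACK code): the real-space orbital value is fully
# determined by the psi_g coefficients, psi(r) = sum_G c_G exp(i G.r),
# which QMCPACK evaluates on a mesh via an internal FFT before splining.
def eval_psi_r(coeffs, gvecs, r):
    """coeffs: complex c_G; gvecs: G-vectors (3-tuples); r: position."""
    return sum(c * cmath.exp(1j * sum(g * x for g, x in zip(gv, r)))
               for c, gv in zip(coeffs, gvecs))
```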
Comment by: jtkrogel
There is one psi_r for each psi_g in the file. Since psi_r dominates the file size, the ratio of psi_r size (120**3) to psi_g size (56799) gives the approximate file size ratio as well.
Comment by: ye-luo
@jtkrogel QMCPACK may have used psi_r in the past, but not now; the current code ignores psi_r.
I tried the h5 from Jaron; the issue was reproduced.
rerun-h5-Jaron/dmcJ0_comp_twist0 series 0 -2554.833358 +/- 0.015526 404.594802 +/- 1.375452 0.1584
rerun-h5-Jaron/dmcJ0_real_twist0 series 0 -2178.009295 +/- 1.189484 1375076.599433 +/- 21024.229610 631.3456
I reran the DFT with the exact input files (using collect, and write_psir=.true.) and got an h5 of size 83 GB. The size can differ from machine to machine due to different libhdf5 versions.
rerun-dft-collect/dmcJ0_comp_twist0 series 0 -2554.798981 +/- 0.041567 399.644731 +/- 0.489775 0.1564
rerun-dft-collect/dmcJ0_real_twist0 series 0 -2554.839310 +/- 0.024979 406.804079 +/- 2.831341 0.1592
I also tried a conversion with write_psir=.false.:
rerun-dft-collect-nopsi_r/dmcJ0_comp_twist0 series 0 -2554.860441 +/- 0.017068 406.506065 +/- 2.525934 0.1591
rerun-dft-collect-nopsi_r/dmcJ0_real_twist0 series 0 -2554.848793 +/- 0.023298 405.541774 +/- 1.655292 0.1587
Another case, with neither collect nor write_psir:
rerun-dft-no-collect/dmcJ0_comp_twist0 series 0 -2554.824338 +/- 0.013409 405.586031 +/- 1.107173 0.1588
rerun-dft-no-collect/dmcJ0_real_twist0 series 0 -2554.844554 +/- 0.015553 404.325833 +/- 1.217949 0.1583
So it seems that a corrupt h5 causes the crazy behaviour.
Comment by: jtkrogel
As far as I can tell, the H5 file is valid; it just contains what we currently consider to be irrelevant information. It seems clear that QMCPACK is mishandling the file (same file but complex works and real doesn't).
The bigger question I have is whether this mishandling is generic to large files, i.e. will we run into this routinely in the future in, say, 256-512 atom defect cells of NiO? I think we should still track down and patch the source of this mishandling.
Comment by: prckent
I think we need to narrow down the difference further (or at least I need to understand it better).
e.g. Is the file being read incorrectly or somehow processed incorrectly internally after reading?
Could we be near an integer limit somewhere, hence the apparent "large file" dependency?
Have we reproduced this problem on enough different systems that we can claim it is not (say) a problem with a particular HDF5 version and installation?
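The integer-limit worry can at least be sanity-checked on the back of an envelope using numbers reported in this thread (the assumption of one 8-byte value per mesh point per orbital is mine, purely for illustration):

```python
# Back-of-envelope check of the integer-limit question, using numbers
# from this thread: a 120^3 spline mesh and nup=208 + ndown=200 occupied
# orbitals. Assuming one 8-byte value per mesh point per orbital, the
# total byte count already exceeds what a signed 32-bit offset can index.
mesh_points = 120 ** 3
orbitals = 208 + 200
total_bytes = mesh_points * orbitals * 8
print(total_bytes)                # 5640192000
print(total_bytes > 2 ** 31 - 1)  # True: int32 byte offsets would overflow
```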
Comment by: prckent
@jtkrogel Are the kinetic energies of the orbitals in the two files identical?
Comment by: jtkrogel
We've also regenerated the orbital file w/ write_psir=.false. on EOS using QE 5.1. The file size is 3.4 GB, as expected.
The large change in energy and variance is still present for this file (will refer to this file as "vol0.98_eos_small"):
dmcJ0_real_twist0 series 0 -2177.251376 +/- 3.313361 1383981.935105 +/- 56682.731771 635.6556
Rerunning Ye's small file ("vol0.98_mira_small") on EOS does not present the problem, consistent w/ Ye's runs:
dmcJ0_real_twist0 series 0 -2554.890268 +/- 0.081604 399.433829 +/- 4.111847 0.1563
I've calculated the per orbital kinetic energy by directly summing the coefficients (and k^2) with a Python tool of mine. The largest KE difference across all orbitals in the two files above is 0.5 mHa. Typical KE's per orbital range from 0.5-8.0 Ha.
Conclusions: the problem can persist w/o psi_r data (i.e. in small files). The two small files that do/do not trigger the bug are nearly identical in orbital contents (as manifested by matching orbital KE), so the problem is most likely limited to QMCPACK's usage of the files, not the files themselves (i.e. probably not the converter, unless something else in the files differs materially).
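The per-orbital KE sum used above can be sketched as follows (a hypothetical helper, not Jaron's actual Python tool; Hartree units and a normalized-sum convention assumed):

```python
# Hypothetical sketch of the per-orbital KE check described above (not
# Jaron's actual tool). For psi = sum_G c_G exp(i(k+G).r), the kinetic
# energy is KE = 0.5 * sum_G |c_G|^2 |k+G|^2 / sum_G |c_G|^2 (Hartree).
def orbital_kinetic_energy(coeffs, kpg_sq):
    """coeffs: complex PW coefficients c_G; kpg_sq: |k+G|^2 for each G."""
    norm = sum(abs(c) ** 2 for c in coeffs)
    ke = 0.5 * sum(abs(c) ** 2 * g2 for c, g2 in zip(coeffs, kpg_sq))
    return ke / norm
```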
Comment by: prckent
Does the current converter & QE 5.3 give the same file? Ye made a lot of changes. Do you need any help with an eos build?
Comment by: jtkrogel
So far I'm of the opinion that the evidence points away from a bug in the QE SCF+conversion. We have cases w/ the bug appearing using orbital files from QE5.1 and QE5.3. Tests outside of QMCPACK on files w/ and w/o apparent bug show the orbitals are the same. Bug only shows up going from complex to real. To me everything is pointing pretty clearly to a bug inside QMCPACK and not with the orbital files. Thoughts?
I'm happy to explore QE5.3 on EOS. It would be good to know for sure that the converter is behaving consistently across versions/builds. I think we should start digging into QMCPACK itself concurrent with this.
I also plan to rerun the KE checks across all of Ye's QE 5.3 files from spline bug 1 that showed huge sensitivity in the VMC variance when making small changes in the DFT convergence parameters. If the KE's are not sensitive, then we will know the problem is in QMCPACK and it will give us an easier way to track down the bug (force VMC variance to match across files) than doing full walker traces, etc.
Comment by: prckent
Q1. How do we explain Ye's "rerun" results? https://app.assembla.com/spaces/qmcdev/tickets/49/details?comment=1113850913
Q2. Have we showed a difference between complex and real versions for a single electron? i.e. cut down one of the "bad" runs to have only 1 or 2 electrons?
Comment by: prckent
No real need to run 5.3 on eos. 5.1 is plenty old though.
Comment by: ye-luo
@jtkrogel "We have cases w/ the bug appearing using orbital files from QE5.1 and QE5.3." Do we have one from QE5.3?
I will try to compare the real/complex QMCPACK in the following aspects: spline coefficients, phase, evaluation. If possible, I will run directly with the plane-wave WF. Not possible today due to the maintenance, but probably tomorrow.
Comment by: prckent
Planewave is super slow. I advise trying to cut down before running.
Comment by: jtkrogel
Paul/Ye: you are correct, we do not yet have a QE5.3 run that produces a buggy result for this ticket (I misread/misremembered Ye's rerun result). We do know that odd results can happen using the 5.3 toolchain (in context of similarly huge variances obtained intermittently for spline bug 1 https://app.assembla.com/spaces/qmcdev/tickets/40/details?comment=1113785513).
Might this be the time to put orbital KE sums (PW sum from H5 and Riemann sum or analytic for splined orbitals) into QMCPACK for orbital quality checks? A fringe benefit would be no more need to do meshfactor scans at the VMC level.
We are starting some low electron count runs here.
Comment by: prckent
Let us get to the root cause of the spline bugs before adding general tests. Anything looking at orbital quality should be coordinated with the APW projection conversion which needs something similar.
5.3 runs needed then. Will remove one parameter from our comparison matrix.
Comment by: jtkrogel
Agreed on 1 and 2. QE 5.3 runs will be useful. I'm suspicious of the intermittency issues shown in the other bug, so I may vary the convergence parameters and look for a similar pattern. If the resulting orbital KE's come out similar across the board then I think we can safely conclude that any variability seen at the VMC level resides solely in QMCPACK.
Comment by: jtkrogel
The problematic "small" 3.4 GB file ("vol0.98_eos_small") is now available on Mira: /gpfs/mira-fs1/projects/QMCSim/jtkrogel/transfer/for_ye/01_spline_bug2/vol0.98_eos_small/pwscf.pwscf.h5
Comment by: jtkrogel
Some brief results for a reduced number of electrons (occupying from 2 to 100 up and down orbitals, full occupation is nup=208, ndown=200).
Results with real code:
nud LocalEnergy Variance ratio
2 -1711.586138 +/- 0.022133 25.163081 +/- 0.201123 0.0147
4 -1757.027013 +/- 0.035981 52.466286 +/- 0.386141 0.0299
5 -1764.564283 +/- 3.576613 31020.140442 +/- 6531.582336 17.5795
10 -1500.868653 +/- 19.722575 1505729.495207 +/- 111609.817630 1003.2387
25 -2021.780528 +/- 4.744664 255578.583315 +/- 21096.982477 126.4126
50 -2375.185522 +/- 0.573504 12489.309935 +/- 2157.800910 5.2582
100 -2511.362514 +/- 0.064731 375.992336 +/- 10.289014 0.1497
Combined w/ the complex results below, it looks as though problem orbitals might start around nup/down==5. Still, the variance pattern is strange, with nup/down=100 appearing almost normal. This suggests more intermittency to me rather than specific problem orbitals.
Results with complex code:
nud LocalEnergy Variance ratio
2 -1711.568537 +/- 0.000000 26.088409 +/- 0.000000 0.0152
4 -1757.114780 +/- 0.024181 49.651033 +/- 0.349108 0.0283
5 -1779.701254 +/- 0.042572 63.252476 +/- 1.437347 0.0355
10 -1867.579168 +/- 4.446231 2696.267913 +/- 2406.244516 1.4437 <=== large variance
25 -2095.414049 +/- 0.081082 203.158111 +/- 1.780770 0.0970
50 -2378.986474 +/- 0.290910 249.256439 +/- 3.030250 0.1048
100 -2511.442451 +/- 0.115850 315.629761 +/- 1.803683 0.1257
The results at nup/down==10 show the first signs that there may also be a problem with the complex code. We are currently rerunning the nup/down=10 complex case w/ variations in the DFT convergence parameters to see if this behavior remains.
Comment by: prckent
This hints at an MPI bug in our code, or memory usage problem (bad pointer usage, incorrect free/alloc etc.). One possible strategy would be to compute a checksum of the spline buffers on each MPI task.
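Kent's checksum strategy could be prototyped along these lines (illustrative Python, not QMCPACK's C++; the idea is simply to gather one digest per MPI rank and compare):

```python
import hashlib

# Illustrative sketch of the checksum idea (not QMCPACK code): hash the
# raw bytes of each rank's spline-coefficient buffer; any rank whose
# digest differs from rank 0 holds a corrupted or divergent copy.
def spline_buffer_checksum(buf):
    """buf: bytes-like view of a rank's spline coefficient array."""
    return hashlib.sha256(bytes(buf)).hexdigest()

def ranks_with_bad_buffers(digests):
    """digests: per-rank checksums, indexed by MPI rank (rank 0 first)."""
    return [rank for rank, d in enumerate(digests) if d != digests[0]]
```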
Comment by: jtkrogel
We have rerun this in serial, and the problem persists. Also, running with traces on confirms that every walker on every node has a large kinetic energy.
The large variance seen above with complex for nud==10 is due to the runs being too short (equilibration issues w/ partial occupation, similar to isolated molecules in a box).
With equilibration properly accounted for, the real code demonstrates no issue w/ occupation up to nud=100. We are currently performing a bisection search on nud between 100 and 200 to find at least one orbital that has a large kinetic energy.
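The bisection over occupation can be sketched as below (hypothetical driver only; `run_is_bad` stands in for a full VMC run at a given occupation followed by a variance check):

```python
# Hypothetical driver for the bisection described above: run_is_bad(nud)
# stands in for "run VMC with nud up/down orbitals occupied and check
# whether the kinetic energy / variance blows up".
def first_bad_nud(run_is_bad, lo=100, hi=200):
    """Assumes run_is_bad(lo) is False and run_is_bad(hi) is True;
    returns the smallest occupation in (lo, hi] that shows the bug."""
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if run_is_bad(mid):
            hi = mid  # bug present: the first bad orbital is at or below mid
        else:
            lo = mid  # bug absent: the first bad orbital is above mid
    return hi
```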
Comment by: ye-luo
The bug has been fixed by improving the orbital phase rotation algorithm. The QMCPACK real code is no longer picky about the h5. Image: Orbital_scan.png
Comment by: ye-luo
@jtkrogel Could you confirm that the new fix from last Friday solves the bug? I would like to close the ticket ASAP.
Comment by: jtkrogel
@ye-luo I can confirm that spline bug 2 is now resolved; for our test case, we get results identical to the complex code. I am running long trace runs for spline bug 1 to see whether it is also resolved.
Comment by: ye-luo
Fantastic. In principle, I would like to urge everyone using the real code + splines to adopt this fix. It is critical.
Comment by: jtkrogel
I absolutely agree. Has anyone run long versions of the ctest runs to see if there are changes vs the reference values?
Comment by: ye-luo
In principle, the energy should not change because any rotation is valid, but the variance may be reduced a bit if the old rotation scheme doesn't like the h5. For the tests, I noticed the diamond files were generated with an old pwscf. When I scan the orbitals, there is some strange behaviour; if I rerun the DFT, it becomes normal. However, there is no change in the energy or flux estimator.
Comment by: prckent
Long tests: no. These need to be run. We have not been running them on oxygen recently due to clashes with the nightlies (something is taking too long and needs to be investigated).
Reported by: jtkrogel
Note: the crystal structure associated with this ticket is to remain private among the developers
Problem
Observations
Problem is apparent w/ Jastrow (twist0 is gamma)
And also w/o Jastrow
The issue is localized to the kinetic energy
In contrast to the first "spline bug", the energy and variance are significantly larger at each and every step
Other details
Files
File: bug_package.tgz