QMCPACK / qmcpack

Main repository for QMCPACK, an open-source production level many-body ab initio Quantum Monte Carlo code for computing the electronic structure of atoms, molecules, and solids with full performance portable GPU support
http://www.qmcpack.org

Test failure with mixed precision on KNL #46

Closed qmc-robot closed 7 years ago

qmc-robot commented 7 years ago

Reported by: naromero77

A number of tests are failing with mixed precision complex. Most of the failures seem to be an energy that differs from the reference by more than three sigma.

This is occurring on KNL with the Intel 17 Update 1 compiler.

I am attaching my build script and several regression test log files.

hyperion build script, configure and build log
File: build_hyperion.sh
File: configure-hyperion.log
File: build-hyperion.log

long tests summary
File: tests_long.log

long tests log file
File: LastTest.log

short tests log file
File: all_tests.log

unit tests log file
File: unit_test.log

Anouar has tested the double precision version and it does not seem to have this issue. The way to get the "correct" answer in mixed precision appears to be to run longer. Is this really the path forward?

qmc-robot commented 7 years ago

Comment by: prckent

Answer: No.

If the double precision version is reliable with the current run length, then by definition the single precision version should be as well, if and only if single precision is sufficiently precise and there are no bugs.
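For readers unfamiliar with the "three sigma" criterion mentioned above, here is a minimal sketch of such a check. The function name `within_nsigma` and all numbers are hypothetical and are not taken from the attached logs; the actual QMCPACK test scripts may work differently.

```python
def within_nsigma(measured, reference, error_bar, nsigma=3.0):
    """Return True if measured agrees with reference within nsigma error bars.

    Hypothetical helper for illustration only, not QMCPACK's test code.
    """
    if error_bar <= 0:
        raise ValueError("error_bar must be positive")
    return abs(measured - reference) <= nsigma * error_bar


# Hypothetical energies in Hartree with a 0.001 error bar.
reference_energy = -10.5000
print(within_nsigma(-10.5020, reference_energy, 0.0010))  # True: 2 sigma away
print(within_nsigma(-10.5050, reference_energy, 0.0010))  # False: 5 sigma away
```

If both builds are run for the same length, the same check with the same error bar applies to both, so a mixed precision result that fails where double precision passes points at precision or a bug rather than at run length.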

Which revision of the code are you running? There are not many KNLs around yet and until the "Intel patches" are input we can't test them.

qmc-robot commented 7 years ago

Comment by: naromero77

I am testing the one on the Argonne GitLab, but I am told this is identical to the one that was pushed to the Assembla trunk. It does not have any KNL-specific options (e.g. Jeongnim's tiling is only used in the mini-app, which is a separate thing). Anouar or Ye can confirm.

Has anyone else run the entire regression test suite on the mixed precision code (on another architecture)?

qmc-robot commented 7 years ago

Comment by: ye-luo

I have built the Assembla trunk code with Intel compiler 17 Update 1. The compiler no longer crashes. @naromero77 could you please paste the test summary for each log file you uploaded? The files are very tough to read. Mixed precision has been carefully tested in only a few cases; see ticket #44.

qmc-robot commented 7 years ago

Comment by: ye-luo

@naromero77 I have not tried the long tests. From what you have shown, most calculations did not complete. The -samples tests failed because the runs were incomplete, i.e. they did not collect sufficient statistics, so failures on the total energy are plausible. For the short tests (ctest -R short), short-diamondC_1x1x1_pp-dmc_sdj-1-16-totenergy is the only one failing in my runs.

qmc-robot commented 7 years ago

Comment by: naromero77

@ye-luo

Log files for test 1 - 100 (sorry, I accidentally killed it before it produced a Summary)

[naromero @ye-luo build_KNL_SP_read_debug]$ grep failed all_tests.log
4: I/O warning : failed to load external entity "bad.xml"
6: test cases:  5 |  4 passed | 1 failed
6: /home/naromero/qmcpack-cels-git/src/Particle/tests/test_distance_table.cpp:469: FAILED:
6:   REQUIRE( expect[idx] == dist )
6: with expansion:
6:   0.0 == 0.0000001788
6: assertions: 38 | 37 passed | 1 failed
12: Gold file comparison failed
14: Gold file comparison failed
15: Gold file comparison failed
17: Gold file comparison failed
18: Gold file comparison failed
19: Gold file comparison failed

These are all converter4qmc issues. Is this also a timeout?
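A note on the unit test failure in test_distance_table.cpp reported above (`0.0 == 0.0000001788`): the residual is about 1.5 times the single-precision machine epsilon (2^-23, roughly 1.19e-7), which is what rounding error looks like when an exact comparison is run on single-precision data. The sketch below illustrates this with the reported value. The tolerance of 10 epsilons is an assumption chosen for illustration, not the value used in the QMCPACK unit tests.

```python
import math

FLOAT32_EPS = 2.0 ** -23  # single-precision machine epsilon, about 1.19e-7

expected = 0.0
dist = 0.0000001788  # value reported in the failing REQUIRE

# Exact comparison, as in REQUIRE( expect[idx] == dist ): fails.
print(expected == dist)

# Size of the residual in units of single-precision epsilon: roughly 1.5.
print(dist / FLOAT32_EPS)

# Comparison with an absolute tolerance (hypothetical choice): passes.
print(math.isclose(expected, dist, abs_tol=10 * FLOAT32_EPS))
```

An absolute tolerance is used here because a relative tolerance is meaningless when the expected value is exactly zero.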

Log files for test 101 - 121: Total Test time (real) = 16208.08 sec

The following tests FAILED:
    101 - long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16 (Timeout)
    103 - long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16-samples (Failed)
    106 - long-diamondC_1x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
    107 - long-diamondC_2x1x1_pp-vmc_sdj-1-16 (Timeout)
    109 - long-diamondC_2x1x1_pp-vmc_sdj-1-16-samples (Failed)
    112 - long-diamondC_2x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
    113 - long-hcpBe_1x1x1_pp-vmc_sdj-1-16 (Timeout)
    115 - long-hcpBe_1x1x1_pp-vmc_sdj-1-16-samples (Failed)
    117 - long-monoO_1x1x1_pp-vmc_sdj-1-16 (Timeout)
    119 - long-monoO_1x1x1_pp-vmc_sdj-1-16-samples (Failed)

qmc-robot commented 7 years ago

Comment by: ye-luo

@naromero77 I found the summary file for the long tests all_tests_2.log, but not the test log. I found the log file for short tests all_tests.log, but not the summary file. Could you upload the missing files for investigation? Ye

qmc-robot commented 7 years ago

Comment by: naromero77

@ye-luo

I already uploaded the log files in the original issue. Which additional files do you need?

qmc-robot commented 7 years ago

Comment by: ye-luo

@naromero77 I have edited the original post to add descriptions to the files. The summary file all_tests_2.log and the test log file all_tests.log don't match: all_tests_2.log has only long tests, while all_tests.log has most tests but not all of the long ones listed in all_tests_2.log. I need a complete test log of all the long tests you ran.

qmc-robot commented 7 years ago

Comment by: naromero77

@ye-luo

The first time that I ran the tests they got interrupted, hence the output is split into two files.

If you prefer to see one proper file, I will have to start a new ctest run. It would not be ready until tomorrow morning.

Let me know if this is what you need.

qmc-robot commented 7 years ago

Comment by: ye-luo

Got it. I don't need to see all the tests; I can run the long tests myself now.

qmc-robot commented 7 years ago

Comment by: ye-luo

File: tests_long.log

qmc-robot commented 7 years ago

Comment by: ye-luo

File: LastTest.log

qmc-robot commented 7 years ago

Comment by: ye-luo

@naromero77 In my short test runs "ctest -R short",

70/89 Test #89: short-diamondC_1x1x1_pp-dmc_sdj-1-16-totenergy ..................Failed 0.24 sec

is the only failure. In my long test runs "ctest -R long" (log files are uploaded to the original message),

5/29 Test #113: long-bccH_1x1x1_ae-dmc_sdj-1-16-totenergy .............Failed 0.28 sec
15/29 Test #123: long-diamondC_1x1x1_pp-dmc_sdj-1-16-totenergy .........***Failed 0.12 sec

are the only two failures. This is no surprise to me: diamondC_1x1x1 uses some code paths I didn't check, and so does the all-electron case.

qmc-robot commented 7 years ago

Comment by: naromero77

@ye-luo

These are the failures I get for QMC_COMPLEX=1 and MIXED_PRECISION=1 on my Xeon. There is a high overlap with the failures on KNL, though it is not complete.

The following tests FAILED:
    9 - unit_test_hamiltonian (Failed)
    12 - converter_test_He_sto3g (Failed)
    14 - converter_test_Be_ccd (Failed)
    15 - converter_test_O_ext (Failed)
    17 - converter_test_HCNp (Failed)
    18 - converter_test_aldet1 (Failed)
    19 - converter_test_aldet5 (Failed)
    59 - short-diamondC_1x1x1_pp-vmc_sdj-1-16-nonlocalecp (Failed)
    60 - short-diamondC_1x1x1_pp-vmc_sdj-1-16-flux (Failed)
    72 - short-diamondC_1x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
    84 - short-diamondC_2x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
    92 - long-bccH_1x1x1_ae-vmc_sdj-1-16 (Timeout)
    94 - long-bccH_1x1x1_ae-vmc_sdj-1-16-samples (Failed)
    96 - long-bccH_1x1x1_ae-dmc_sdj-1-16-totenergy (Failed)
    101 - long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16 (Timeout)
    103 - long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16-samples (Failed)
    106 - long-diamondC_1x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
    112 - long-diamondC_2x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
Errors while running CTest
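The overlap between the KNL and Xeon failures can be checked mechanically. The sketch below parses two "The following tests FAILED" summaries and compares them by test name. The helper `failed_tests` is hypothetical. The comparison is restricted to tests 101 and above, because the KNL run for tests 1 - 100 was killed before it produced a summary.

```python
import re

# Matches entries such as "106 - long-..-totenergy (Failed)".
FAILED_ENTRY = re.compile(r"(\d+) - (\S+) \((Failed|Timeout)\)")


def failed_tests(summary):
    """Map test name -> failure kind for every entry found in a ctest summary."""
    return {name: kind for _num, name, kind in FAILED_ENTRY.findall(summary)}


# Entries copied from the KNL summary for tests 101 - 121 in this thread.
knl = failed_tests("""
101 - long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16 (Timeout)
103 - long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16-samples (Failed)
106 - long-diamondC_1x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
107 - long-diamondC_2x1x1_pp-vmc_sdj-1-16 (Timeout)
109 - long-diamondC_2x1x1_pp-vmc_sdj-1-16-samples (Failed)
112 - long-diamondC_2x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
113 - long-hcpBe_1x1x1_pp-vmc_sdj-1-16 (Timeout)
115 - long-hcpBe_1x1x1_pp-vmc_sdj-1-16-samples (Failed)
117 - long-monoO_1x1x1_pp-vmc_sdj-1-16 (Timeout)
119 - long-monoO_1x1x1_pp-vmc_sdj-1-16-samples (Failed)
""")

# Entries numbered 101 and above, copied from the Xeon summary in this thread.
xeon = failed_tests("""
101 - long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16 (Timeout)
103 - long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16-samples (Failed)
106 - long-diamondC_1x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
112 - long-diamondC_2x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
""")

common = sorted(set(knl) & set(xeon))
knl_only = sorted(set(knl) - set(xeon))
xeon_only = sorted(set(xeon) - set(knl))

print("fail on both:", *common, sep="\n  ")
print("fail on KNL only:", *knl_only, sep="\n  ")
print("fail on Xeon only:", *xeon_only, sep="\n  ")
```

In this range every Xeon failure also appears on KNL, and the KNL-only entries are the timeouts and their dependent -samples tests, plus long-diamondC_2x1x1_pp-vmc_sdj-1-16.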

I want to make sure I understand, so let me summarize:

qmc-robot commented 7 years ago

Comment by: ye-luo

@naromero77 From your nightly tests on the KNL box, the gnu+complex/real+mixed/full precision builds passed all the short/converter/unit tests. So the unit tests and converter tests need to be investigated with different compilers. For the long test results, see my previous comment, in which I listed the failed ones.

qmc-robot commented 7 years ago

Comment by: naromero77

This is to summarize a conversation that I had with Ye in my office yesterday.

The failing converter tests have to do with the local orbital code and are independent of the compiler. It's a single-precision issue, and it might be that the tolerances are too tight.

I will try to rename the issue, or close it and open a new one.

@ye-luo

You said that there was still a test failing due to the Intel 17 compiler. Which one is it and do you have an issue currently open for that one?

qmc-robot commented 7 years ago

Comment by: ye-luo

@naromero77 The converter test failures have nothing to do with the local orbital code. The gamess to qmcpack converter has a single-precision issue. The cdash log from last night shows the issue.

example_H2O-1-1 is the trouble maker. No ticket has been opened but Paul warned us earlier.

qmc-robot commented 7 years ago

Comment by: naromero77

@ye-luo

OK, that's what I meant. A part of the local orbital workflow.

qmc-robot commented 7 years ago

Comment by: markdewing

@ye-luo How much work would it take to build the converters in full precision, even in the mixed precision builds? Do we need the performance increase of single precision in the converters?

qmc-robot commented 7 years ago

Comment by: prckent

Are the converters even sensitive to mixed precision? Why should they be? Converters should be full precision all the time.

qmc-robot commented 7 years ago

Comment by: ye-luo

@prckent Converters should not be sensitive to mixed precision. But some of the classes are using RealType from the main code and thus get polluted. I guess the fix should be very simple after I trap the bug.
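To illustrate why a single-precision RealType leaking into a converter can break a gold file comparison, here is a minimal sketch. It rounds a value through single precision and prints it with the same format as the full-precision value. The helper `to_single`, the coefficient, and the output format are all hypothetical; the actual converter classes are not shown in this thread.

```python
import struct


def to_single(x):
    """Round a double to the nearest IEEE-754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


coefficient = 0.1  # hypothetical value written by a converter

full = f"{coefficient:.10f}"              # value kept in double precision
mixed = f"{to_single(coefficient)::.10f}".replace("::", ":") if False else f"{to_single(coefficient):.10f}"

print(full)           # 0.1000000000
print(mixed)          # 0.1000000015
print(full == mixed)  # False: a text comparison against the gold file fails
```

Single precision carries only about 7 significant decimal digits, so any output written with more digits than that will differ from a reference generated in double precision.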

qmc-robot commented 7 years ago

Comment by: ye-luo

The converter issue has been fixed. The discrepancy happens in the commented Slater orbital (I don't know who actually uses it) written by the converter. Some computation is involved in generating the Slater orbitals from contracted Gaussian orbitals. So the issue falls back to the unchecked LCAO calculation with mixed precision code (still on the TODO list).

qmc-robot commented 7 years ago

Comment by: ye-luo

The short-diamondC_1x1x1_pp-dmc_sdj-1-16 failure has been fixed.

qmc-robot commented 7 years ago

Comment by: naromero77

@ye-luo I think this issue can be closed?

qmc-robot commented 7 years ago

Comment by: ye-luo

The urgent bugs are fixed, but there are still failing tests which I would like to solve later. Keeping the ticket open or closing it are both fine with me.

qmc-robot commented 7 years ago

Comment by: naromero77

@ye-luo Are these the H2O or He tests that I see failing in cdash?

qmc-robot commented 7 years ago

Comment by: ye-luo

Yes. They still have issues.

prckent commented 7 years ago

Is this now fixed? Mixed precision runs on KNL appear OK in cdash, but they are only the short runs.

ye-luo commented 7 years ago

The issues listed in this ticket have all been fixed. The only remaining issue with mixed precision is in #138.

prckent commented 7 years ago

Closes #46