Closed qmc-robot closed 7 years ago
Comment by: prckent
Answer: No.
If the double precision version is reliable with the current run length, then by definition the single version should be - if and only if single precision is a high enough precision and there are no bugs.
Which revision of the code are you running? There are not many KNLs around yet and until the "Intel patches" are input we can't test them.
Comment by: naromero77
I am testing the one at the Argonne gitlab, but I am told this is identical to the one that was pushed to the Assembla trunk. It does not have any KNL specific options (e.g. Jeongnim tiling is only used in the mini-app which is a separate thing). Anouar or Ye can confirm.
Has anyone else run the entire regression test suite on the mixed precision code (on another architecture)?
Comment by: ye-luo
I have built the Assembla trunk code with Intel compiler 17 Update 1. The compiler no longer crashes.
@naromero77 could you please paste the test summary for each log file you uploaded? The files are very tough to read.
Mixed precision has been carefully tested in only a few cases; see ticket #44.
Comment by: ye-luo
@naromero77
I have not tried the long tests. From what you have shown, most calculations did not complete. The -samples tests failed because the runs were incomplete and so did not collect sufficient statistics. A failure on the total energy is therefore possible.
For the short run ctest -R short, short-diamondC_1x1x1_pp-dmc_sdj-1-16-totenergy is the only failing one in my runs.
Comment by: naromero77
@ye-luo
Log files for test 1 - 100 (sorry, I accidentally killed it before it produced a Summary)
[naromero @ye-luo build_KNL_SP_read_debug]$ grep failed all_tests.log
4: I/O warning : failed to load external entity "bad.xml"
6: test cases: 5 | 4 passed | 1 failed
6: /home/naromero/qmcpack-cels-git/src/Particle/tests/test_distance_table.cpp:469: FAILED:
6: REQUIRE( expect[idx] == dist )
6: with expansion:
6: 0.0 == 0.0000001788
6: assertions: 38 | 37 passed | 1 failed
12: Gold file comparison failed
14: Gold file comparison failed
15: Gold file comparison failed
17: Gold file comparison failed
18: Gold file comparison failed
19: Gold file comparison failed
These are all converter4qmc issues. Is this also a timeout?
Log files for test 101 - 121: Total Test time (real) = 16208.08 sec
The following tests FAILED:
101 - long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16 (Timeout)
103 - long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16-samples (Failed)
106 - long-diamondC_1x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
107 - long-diamondC_2x1x1_pp-vmc_sdj-1-16 (Timeout)
109 - long-diamondC_2x1x1_pp-vmc_sdj-1-16-samples (Failed)
112 - long-diamondC_2x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
113 - long-hcpBe_1x1x1_pp-vmc_sdj-1-16 (Timeout)
115 - long-hcpBe_1x1x1_pp-vmc_sdj-1-16-samples (Failed)
117 - long-monoO_1x1x1_pp-vmc_sdj-1-16 (Timeout)
119 - long-monoO_1x1x1_pp-vmc_sdj-1-16-samples (Failed)
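The unit-test failure in the grep output above (`0.0 == 0.0000001788` in test_distance_table.cpp) is an exact equality check, and the residual is about 1.5 times the single-precision machine epsilon, 2^-23 ≈ 1.19e-7. A minimal Python sketch of the difference between an exact and a tolerance-based comparison follows. The tolerance of 4 epsilon is an illustrative assumption, not a value taken from the QMCPACK test suite.

```python
import math

# Machine epsilon of IEEE-754 single precision (spacing of floats just above 1.0).
FLOAT32_EPS = 2.0 ** -23

expected = 0.0            # value the unit test expects
observed = 0.0000001788   # value reported in the log above

# Exact comparison, as in REQUIRE( expect[idx] == dist ): fails for any residual.
exact_match = (expected == observed)

# Absolute-tolerance comparison. The factor 4 is a hypothetical choice.
abs_tol = 4 * FLOAT32_EPS
within_tolerance = math.isclose(expected, observed, rel_tol=0.0, abs_tol=abs_tol)

print(f"residual / float32 epsilon = {observed / FLOAT32_EPS:.2f}")
print(f"exact match: {exact_match}, within tolerance: {within_tolerance}")
```

A residual this small is what single-precision rounding alone would be expected to produce, which suggests the check rather than the distance table is what needs adjusting in a mixed precision build.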
Comment by: ye-luo
@naromero77
I found the summary file for the long tests all_tests_2.log, but not the test log.
I found the log file for short tests all_tests.log, but not the summary file.
Could you upload the missing files for investigation?
Ye
Comment by: naromero77
@ye-luo
I already uploaded the log files in the original issue, which additional files do you need?
Comment by: ye-luo
@naromero77
I have changed the original thread by adding descriptions to the files.
The summary file all_tests_2.log and the test log file all_tests.log don't match.
all_tests_2.log has only long tests.
all_tests.log has most tests, but not all the long ones listed in all_tests_2.log.
I need a complete test log of all the long tests you ran.
Comment by: naromero77
@ye-luo
The first time I ran the tests they were interrupted, so the output is split into two files.
If you prefer to see one proper file, I will have to start a new ctest run. It would not be ready until tomorrow morning.
Let me know if this is what you need.
Comment by: ye-luo
Got it. I don't need to see all the tests; I can run the long tests now.
Comment by: ye-luo
File: tests_long.log
Comment by: ye-luo
File: LastTest.log
Comment by: ye-luo
@naromero77
In my short test runs "ctest -R short",
70/89 Test #89: short-diamondC_1x1x1_pp-dmc_sdj-1-16-totenergy ..................Failed 0.24 sec
is the only failure.
In my long test runs "ctest -R long" (log files are uploaded to the original message),
5/29 Test #113: long-bccH_1x1x1_ae-dmc_sdj-1-16-totenergy .............Failed 0.28 sec
15/29 Test #123: long-diamondC_1x1x1_pp-dmc_sdj-1-16-totenergy .........***Failed 0.12 sec
are the only two failures.
This is no surprise to me. diamondC_1x1x1 uses some code paths I didn't check, and the same goes for the all-electron case.
Comment by: naromero77
@ye-luo
These are the failures I get for QMC_COMPLEX=1 and MIXED_PRECISION=1 on my Xeon. They overlap heavily, though not completely, with the failures on KNL.
The following tests FAILED:
9 - unit_test_hamiltonian (Failed)
12 - converter_test_He_sto3g (Failed)
14 - converter_test_Be_ccd (Failed)
15 - converter_test_O_ext (Failed)
17 - converter_test_HCNp (Failed)
18 - converter_test_aldet1 (Failed)
19 - converter_test_aldet5 (Failed)
59 - short-diamondC_1x1x1_pp-vmc_sdj-1-16-nonlocalecp (Failed)
60 - short-diamondC_1x1x1_pp-vmc_sdj-1-16-flux (Failed)
72 - short-diamondC_1x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
84 - short-diamondC_2x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
92 - long-bccH_1x1x1_ae-vmc_sdj-1-16 (Timeout)
94 - long-bccH_1x1x1_ae-vmc_sdj-1-16-samples (Failed)
96 - long-bccH_1x1x1_ae-dmc_sdj-1-16-totenergy (Failed)
101 - long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16 (Timeout)
103 - long-diamondC_1x1x1_pp-vmc_sdj-meshf-1-16-samples (Failed)
106 - long-diamondC_1x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
112 - long-diamondC_2x1x1_pp-dmc_sdj-1-16-totenergy (Failed)
Errors while running CTest
I want to make sure I understand, so let me summarize:
#6 and #9 are considered sensitive, i.e. no direct connection to mixed precision.
Comment by: ye-luo
@naromero77
From your nightly tests on the KNL box, the gnu+complex/real+mixed/full precision passed all the short/converter/unit tests.
So the unit tests and converter tests need to be investigated with different compilers.
For the long test results, see my previous comment, in which I listed the failed ones.
Comment by: naromero77
This is to summarize a conversation that I had with Ye in my office yesterday.
The failing converter tests have to do with the local orbital code and are independent of the compiler. It's a single-precision issue, and it might be that the tolerances are too tight.
I will try to rename the issue, or close it and open a new one.
@ye-luo
You said that there was still a test failing due to the Intel 17 compiler. Which one is it and do you have an issue currently open for that one?
Comment by: ye-luo
@naromero77
The converter test failures have nothing to do with the local orbital code. The gamess to qmcpack converter has a single-precision issue.
The cdash log from last night shows the issue.
example_H2O-1-1 is the troublemaker. No ticket has been opened, but Paul warned us earlier.
Comment by: naromero77
@ye-luo
OK, that's what I meant. A part of the local orbital workflow.
Comment by: markdewing
@ye-luo
How much work would it take to build the converters in full precision, even in the mixed precision builds?
Do we need the performance increase of single precision in the converters?
Comment by: prckent
Are the converters even sensitive to mixed precision? Why should they be? Converters should be full precision all the time.
Comment by: ye-luo
@prckent
Converters should not be sensitive to mixed precision. But some of the classes use RealType from the main code and thus get polluted.
I guess the fix should be very simple once I trap the bug.
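To illustrate the pollution described above: if a converter stores its numbers in a type that follows the main code's precision, every value it writes in a mixed precision build carries a single-precision rounding error. The sketch below is in Python rather than C++ and emulates the effect with the standard struct module. The helper `as_single` and the sample coefficient are hypothetical and are not taken from the converter.

```python
import struct

def as_single(x: float) -> float:
    """Round a double to the nearest IEEE-754 single and return it as a double."""
    return struct.unpack("f", struct.pack("f", x))[0]

coefficient = 0.123456789012345    # hypothetical orbital coefficient
full = coefficient                 # what a full-precision converter would keep
polluted = as_single(coefficient)  # what a single-precision real type keeps

rel_err = abs(polluted - full) / abs(full)
print(f"full precision : {full:.15f}")
print(f"single storage : {polluted:.15f}")
print(f"relative error : {rel_err:.1e}")
```

The relative error is bounded by 2^-24, roughly 6e-8, so a gold-file comparison with a tighter tolerance than that could fail even though nothing else is wrong.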
Comment by: ye-luo
The converter issue has been fixed. The discrepancy happens in the commented Slater orbital (I don't know who actually uses it) written by the converter. Some computation is involved in generating the Slater orbitals from contracted Gaussian orbitals. So the issue falls back to the unchecked LCAO calculation with the mixed precision code (still on the TODO list).
Comment by: ye-luo
The short-diamondC_1x1x1_pp-dmc_sdj-1-16 failure has been fixed.
Comment by: naromero77
@ye-luo
I think this issue can be closed?
Comment by: ye-luo
The urgent bugs are fixed, but there are still failing tests that I would like to solve later. Keeping the ticket open or closing it are both fine with me.
Comment by: naromero77
@ye-luo
Is this the H2O or He tests that I see failing in cdash?
Comment by: ye-luo
Yes. They still have issues.
Is this now fixed? Mixed precision runs on KNL appear OK in cdash, but they are only the short runs.
The issues listed in this ticket all have been fixed. The only remaining issue with mixed precision is in #138
Closes #46
Reported by: naromero77
A number of tests are failing with mixed precision complex. Most of the tests seem to have an energy exceeding three sigma.
This is occurring on KNL with the Intel 17 Update 1 compiler.
I am attaching my build script and several regression test log files.
hyperion build script, configure and build log
File: build_hyperion.sh
File: configure-hyperion.log
File: build-hyperion.log
long tests summary
File: tests_long.log
long tests log file
File: LastTest.log
short tests log file
File: all_tests.log
unit tests log file
File: unit_test.log
Anouar has tested the double precision version and it does not seem to have this issue. The solution to getting that "correct" answer in mixed precision is to run longer --- is this really the path forward?
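As a rough sketch of the statistics behind this question: for independent samples the error bar shrinks as one over the square root of the run length, while a systematic offset (from a bug or from insufficient precision) does not shrink. Running longer therefore only clears a three-sigma failure that was a statistical fluctuation. The numbers below are made up for illustration, the mean is assumed to stay fixed as the run is extended, and the function names are hypothetical rather than part of the QMCPACK test scripts.

```python
import math

def exceeds_nsigma(value, reference, error_bar, nsigma=3.0):
    """True if value differs from reference by more than nsigma error bars."""
    return abs(value - reference) > nsigma * error_bar

def error_bar_after(error_bar, length_factor):
    """Error bar after running length_factor times longer (independent samples)."""
    return error_bar / math.sqrt(length_factor)

reference = -10.500      # made-up reference energy
biased_mean = -10.496    # made-up mean carrying a fixed offset of 0.004
error_bar = 0.002        # made-up error bar at the current run length

# At the current length the offset is 2 error bars, so the check passes.
print(exceeds_nsigma(biased_mean, reference, error_bar))
# Four times longer halves the error bar; the same offset is now 4 error bars.
print(exceeds_nsigma(biased_mean, reference, error_bar_after(error_bar, 4)))
```

In this picture a longer run makes a genuine mixed precision bias easier to detect, not harder, so tests that only pass after running longer point to a statistical fluctuation rather than a precision problem.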