ExpHP / rsp2

phonons in rust
Apache License 2.0

NaN in relaxation #64

Open colin-daniels opened 6 years ago

colin-daniels commented 6 years ago

This only fails sometimes (?). The input is an all-carbon, high-symmetry structure with 960 atoms. Output files are in g10.tar.gz; the commit ref is 6b6a7b884fec4171b2bbb824af0b627a1f2fca74. Standard output/error is as follows:

[   0.225s][INFO] Available resources for parallelism:
[   0.225s][INFO]     MPI: 4 process(es)
[   0.225s][INFO]  OpenMP: 1 thread(s) per process (OMP_NUM_THREADS)
[   0.225s][INFO]        : 4 thread(s) in single-process tasks (RSP2_MAX_THREADS)
[   0.225s][INFO]   rayon: 4 thread(s) on the root process
[   0.228s][WARN] 'lammps-update-style: fast' is experimental (this message will not be shown again)
[   0.229s][TRACE] bond graph: intermediate supercell: [1, 1, 1], r = 1.70017
[   0.229s][TRACE] bond graph: true supercell: centered_diagonal([1, 1, 1])
[   0.317s][TRACE] Writing 'g10/initial.structure'
[   0.365s][TRACE] ============================
[   0.365s][TRACE] Begin relaxation # 1
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: FloatIsNaN', libcore/result.rs:945:5
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
             at libstd/sys_common/backtrace.rs:71
             at libstd/sys_common/backtrace.rs:59
   2: std::panicking::default_hook::{{closure}}
             at libstd/panicking.rs:211
   3: std::panicking::default_hook
             at libstd/panicking.rs:227
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:511
   5: std::panicking::continue_panic_fmt
             at libstd/panicking.rs:426
   6: rust_begin_unwind
             at libstd/panicking.rs:337
   7: core::panicking::panic_fmt
             at libcore/panicking.rs:92
   8: core::result::unwrap_failed
   9: <alloc::vec::Vec<T> as alloc::vec::SpecExtend<T, I>>::from_iter
  10: rsp2_minimize::acgsd::_acgsd
  11: rsp2_minimize::acgsd::acgsd
  12: rsp2_tasks::cmd::relaxation::do_relax
  13: rsp2_tasks::cmd::<impl rsp2_tasks::cmd::trial::TrialDir>::run_relax_with_eigenvectors
  14: <rsp2_lammps_wrap::low_level::mpi_helper::MpiOnDemand<D>>::install
  15: rsp2_tasks::entry_points::wrap_main_with_lammps_on_demand
  16: rsp2_tasks::entry_points::wrap_main
  17: rsp2_tasks::entry_points::_rsp2_acgsd
  18: rsp2_tasks::entry_points::rsp2
  19: std::rt::lang_start::{{closure}}
  20: std::panicking::try::do_call
             at libstd/rt.rs:59
             at libstd/panicking.rs:310
  21: __rust_maybe_catch_panic
             at libpanic_unwind/lib.rs:105
  22: std::rt::lang_start_internal
             at libstd/panicking.rs:289
             at libstd/panic.rs:392
             at libstd/rt.rs:58
  23: main
  24: __libc_start_main
  25: _start
[   0.489s][INFO] successfully leaked tempdir at /tmp/rsp2-.fKwiFZCYsDt7

colin-daniels commented 6 years ago

Actually, it seems to fail harder if you let it run: see g10-fail2.txt (exact same inputs as before).

Edit: This time it at least properly kills the other threads; in the first crash it didn't (or I killed them before it could).
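
For anyone reading the first backtrace: frames 8 to 10 show `Result::unwrap()` failing inside a `Vec` `from_iter` call made from `rsp2_minimize::acgsd::_acgsd`, which is consistent with each float being checked and unwrapped while a vector is being collected. The sketch below is only an illustration of that pattern and of one way to propagate the error instead of panicking. The `NotNan` wrapper, the `index` field and `check_all` are hypothetical names, not rsp2's actual code.

```rust
use std::fmt;

/// Hypothetical error type, named after the message in the panic above.
#[derive(Debug, PartialEq)]
pub struct FloatIsNaN {
    /// Index of the first offending element (an addition for this sketch).
    pub index: usize,
}

impl fmt::Display for FloatIsNaN {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "value at index {} is NaN", self.index)
    }
}

impl std::error::Error for FloatIsNaN {}

/// Hypothetical NaN-free wrapper; not the type rsp2 actually uses.
#[derive(Debug, Clone, Copy, PartialEq, PartialOrd)]
pub struct NotNan(f64);

impl NotNan {
    pub fn get(self) -> f64 {
        self.0
    }
}

/// Validate a whole slice. Collecting into `Result<Vec<_>, _>` stops at the
/// first `Err` and returns it, rather than unwrapping each element.
pub fn check_all(values: &[f64]) -> Result<Vec<NotNan>, FloatIsNaN> {
    values
        .iter()
        .enumerate()
        .map(|(index, &x)| {
            if x.is_nan() {
                Err(FloatIsNaN { index })
            } else {
                Ok(NotNan(x))
            }
        })
        .collect()
}
```

With something like this, a caller could log which gradient component went bad before aborting, which may help narrow down where the NaN first appears.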

ExpHP commented 6 years ago

There is something very seriously not right here. Looking into it.

The initial value of the potential seems to vary wildly, from around +4000 eV to as much as +16000 eV, even when I am running single-process and single-threaded. This ought to be impossible; when run in serial, the code is supposed to be 100% deterministic up to the point where it first performs eigsh. (Correction: These observations were made with OMP_NUM_THREADS=4.)
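
The determinism claim above can be checked mechanically by evaluating the potential several times on identical coordinates and comparing the results bit for bit. This is a minimal sketch under stated assumptions: `is_deterministic` is a hypothetical helper, and the closure stands in for whatever actually calls into lammps.

```rust
/// Evaluate `potential` `repeats` times on the same input and report whether
/// every result is bit-for-bit identical to the first one.
/// Comparing `to_bits()` also treats a repeated NaN as "equal", so a
/// consistently-NaN potential is reported as deterministic.
pub fn is_deterministic<F>(mut potential: F, coords: &[f64], repeats: usize) -> bool
where
    F: FnMut(&[f64]) -> f64,
{
    let mut first: Option<u64> = None;
    for _ in 0..repeats {
        let bits = potential(coords).to_bits();
        match first {
            None => first = Some(bits),
            Some(expected) if expected != bits => return false,
            Some(_) => {}
        }
    }
    true
}
```

A run that reports +4000 eV one time and +16000 eV the next on the same structure would fail this check on the second evaluation.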

ExpHP commented 6 years ago

Another weird output that might be a separate problem. Apparently my symmetry code can be inconsistent.

$ RUST_LOG=rsp2_minimize=trace cargo run --bin=rsp2 initial.structure -c settings.yaml -o out --force
    Finished dev [unoptimized + debuginfo] target(s) in 0.16s
     Running `/home/lampam/cpp/other/rust/rsp2/target/debug/rsp2 initial.structure -c settings.yaml -o out --force`
[   0.283s][INFO] Available resources for parallelism:
[   0.283s][INFO]     MPI: 1 process(es)
[   0.283s][INFO]  OpenMP: 4 thread(s) per process (OMP_NUM_THREADS)
[   0.283s][INFO]        : 1 thread(s) in single-process tasks (RSP2_MAX_THREADS)
[   0.283s][INFO]   rayon: 1 thread(s) on the root process
[   0.292s][WARN] 'lammps-update-style: fast' is experimental (this message will not be shown again)
[   0.351s][TRACE] bond graph: intermediate supercell: [1, 1, 1], r = 1.70017
[   0.351s][TRACE] bond graph: true supercell: centered_diagonal([1, 1, 1])
[  16.026s][TRACE] Writing 'out/initial.structure'
[  16.102s][TRACE] ============================
[  16.103s][TRACE] Begin relaxation # 1
[  16.443s][TRACE]  i:      0  v: 16807.78272218822531 dv:  +0.00e0  g:  5.4438451e2                            00000011111111111122 192
[  16.444s][DEBUG] Using steepest descent. (i: 1)
[  16.926s][TRACE]  i:      1  v: 16597.86604968855681 dv:  -2.10e2  g:  3.4076611e2                            00001111111111111112 96
[  17.295s][TRACE]  i:      2  v: 16368.45912211309769 dv:  -2.29e2  g:  3.7902783e2  cos: +0.53                01111111111111122222 672
[  17.737s][DEBUG] update_interval: Exit by strange guess (U0),  (5.304989328257774e0, 7.439512162874651e0) vs 4.455803318164412e0
[  17.973s][DEBUG] update_interval: Exit by strange guess (U0),  (5.410382863194543e0, 5.817743649194453e0) vs 4.839036044166654e0
[  18.097s][DEBUG] update_interval: Exit by strange guess (U0),  (5.410382863194543e0, 5.6064040856796415e0) vs 1.161193949085568e0
[  18.211s][DEBUG] update_interval: Exit by strange guess (U0),  (5.508393474437092e0, 5.5998271153573045e0) vs 1.2181174603992762e0
[  18.331s][DEBUG] update_interval: Exit by strange guess (U0),  (5.508393474437092e0, 5.551078957093247e0) vs 1.8723516841905392e0
[  18.379s][TRACE]  i:      3  v: 15385.62764117925690 dv:  -9.83e2  g:  1.2897596e3  cos: +0.79 +0.36          01111111111111222222 768
[  18.542s][DEBUG] update_interval: Exit by strange guess (U0),  (0e0, 7.902233502103718e-2) vs -5.014386492772605e0
[  18.666s][DEBUG] update_interval: Exit by strange guess (U0),  (0e0, 3.130444235201886e-2) vs -4.21061536510545e0
[  18.786s][DEBUG] update_interval: Exit by strange guess (U0),  (0e0, 1.241620158541894e-2) vs -3.7832530834179163e0
[  18.949s][TRACE]  i:      4  v: 15385.05297506718125 dv: -5.75e-1  g:  1.1538031e3  cos: +1.00 +0.79 +0.36    01111111111111222222 864
[  19.511s][TRACE]  i:      5  v: 14620.02690701393658 dv:  -7.65e2  g:  7.8813361e2  cos: -0.09 -0.09 +0.06    00111111111111111222 384
[  20.207s][DEBUG] update_interval: Exit by strange guess (U0),  (1.4459285369837874e1, 1.9738699100999074e1) vs 1.1491684230228522e1
[  20.326s][TRACE]  i:      6  v: 4715.66708952707540 dv:  -9.90e3  g:  2.1005543e3  cos: +0.55 +0.22 +0.22    00001111111111112222 480
[  21.020s][TRACE]  i:      7  v: 3303.16106153555302 dv:  -1.41e3  g:  3.4190172e3  cos: +1.00 +0.55 +0.22    00001111111111112222 576
[  21.690s][DEBUG] update_interval: Exit by strange guess (U0),  (1.3718229976016578e0, 1.6326237921249265e0) vs 1.6928890566512633e0
[  21.773s][TRACE]  i:      8  v: 3138.65584232136280 dv:  -1.65e2  g:  4.1522559e3  cos: +0.99 +0.99 +0.52    00000001111111112222 480
[  22.861s][TRACE]  i:      9  v: -1734.26338160659816 dv:  -4.87e3  g:  2.4708546e3  cos: +0.92 +0.86 +0.85    00000000000111111122 192
[  23.379s][TRACE]  i:     10  v: -1862.24086489048318 dv:  -1.28e2  g:  8.8354976e2  cos: +0.99 +0.89 +0.82    00000000111111111122 192
[  24.045s][TRACE]  i:     11  v: -2478.61552236364923 dv:  -6.16e2  g:  1.5864535e2  cos: -0.11 -0.19 -0.29    00111111111111222222 864
[  24.637s][DEBUG] update_interval: Exit by strange guess (U0),  (1.5421424192197553e0, 1.6326237921249265e0) vs 1.6734195586776628e0
[  24.640s][TRACE]  i:     12  v: -2950.43877366409652 dv:  -4.72e2  g:  3.0028254e2  cos: +0.17 +0.12 +0.08    00000001111111111122 192
[  24.955s][DEBUG] update_interval: Exit by strange guess (U0),  (0e0, 9.030988105274831e-2) vs -6.2074628941942915e0
[  25.038s][TRACE]  i:     13  v: -2953.14043478378971 dv:  -2.70e0  g:  3.0963038e2  cos: +0.99 +0.15 +0.13    00000011111111112222 480
[  25.858s][TRACE]  i:     14  v: -3204.87836643366836 dv:  -2.52e2  g:  1.2905360e2  cos: +0.72 +0.62 -0.02    00001111111111112222 480
[  26.448s][TRACE]  i:     15  v: -3283.03041176446686 dv:  -7.82e1  g:  9.6511716e1  cos: -0.14 -0.64 -0.69    00111111111111112222 576
[  27.191s][TRACE]  i:     16  v: -3407.94390360426314 dv:  -1.25e2  g:  1.1059690e2  cos: +0.55 +0.49 -0.23    00000001111111111122 288
[  27.849s][TRACE]  i:     17  v: -3476.73570961014138 dv:  -6.88e1  g:  1.4579908e2  cos: +0.81 +0.22 +0.49    00000000000011111122 288
[  28.518s][TRACE]  i:     18  v: -3582.07776354725229 dv:  -1.05e2  g:  9.2406217e1  cos: +0.69 +0.18 -0.14    00001111111111112222 480
[  29.106s][TRACE]  i:     19  v: -3618.42041234991757 dv:  -3.63e1  g:  3.2919930e1  cos: +0.67 +0.47 +0.19    00000000001111112222 576
[  29.890s][TRACE]  i:     20  v: -3648.20756526457535 dv:  -2.98e1  g:  4.6643024e1  cos: +0.59 +0.56 +0.89    11111111111111111122 288
[  30.637s][TRACE]  i:     21  v: -3667.12670506857512 dv:  -1.89e1  g:  2.3726226e1  cos: +0.94 +0.50 +0.67    00111111111111112222 576
[  31.070s][TRACE]  i:     22  v: -3669.08878420527299 dv:  -1.96e0  g:  6.2980158e0  cos: +0.79 +0.73 +0.19    00000000011111111122 288
[  31.426s][TRACE]  i:     23  v: -3669.40034712308443 dv: -3.12e-1  g:  1.8988176e0  cos: +0.11 -0.06 -0.02    00111111111111112222 576
[  31.788s][TRACE]  i:     24  v: -3669.41250008772295 dv: -1.22e-2  g: 7.4869188e-1  cos: +0.29 -0.00 +0.17    00111111111111111122 192
[  32.144s][TRACE]  i:     25  v: -3669.41955771251378 dv: -7.06e-3  g:  1.0505491e0  cos: +0.42 +0.11 +0.37    00000011111111111122 288
[  32.656s][TRACE]  i:     26  v: -3669.44137342401700 dv: -2.18e-2  g: 5.2767463e-1  cos: +0.83 +0.29 +0.13    00000011111111112222 576
[  33.017s][TRACE]  i:     27  v: -3669.44241150852667 dv: -1.04e-3  g: 2.5522894e-1  cos: +0.67 +0.55 +0.75    00001111111111112222 480
[  33.369s][TRACE]  i:     28  v: -3669.44279366960882 dv: -3.82e-4  g: 4.2268518e-2  cos: +0.59 +0.38 +0.46    00000001111111112222 480
[  33.731s][TRACE]  i:     29  v: -3669.44279965109536 dv: -5.98e-6  g: 1.3910208e-2  cos: -0.06 -0.35 +0.28    00000001111111111122 288
[  34.007s][TRACE]  i:     30  v: -3669.44280100071956 dv: -1.35e-6  g: 1.3180855e-2  cos: +0.32 -0.02 +0.56    00000001111111112222 480
[  34.530s][TRACE]  i:     31  v: -3669.44280573254218 dv: -4.73e-6  g: 1.3381393e-2  cos: +0.71 +0.22 +0.35    00000000001111111122 192
[  34.885s][TRACE]  i:     32  v: -3669.44280745922151 dv: -1.73e-6  g: 6.2512621e-3  cos: +0.82 +0.58 +0.04    00000000001111111122 288
[  35.163s][TRACE]  i:     33  v: -3669.44280757288016 dv: -1.14e-7  g: 4.8384000e-4  cos: +0.63 +0.52 +0.19    00000011111111111122 192
[  35.447s][TRACE]  i:     34  v: -3669.44280757383513 dv: -9.55e-10  g: 7.7220042e-5  cos: +0.10 +0.06 -0.17    00000011111111112222 480
[  35.805s][TRACE]  i:     35  v: -3669.44280757389106 dv: -5.59e-11  g: 8.9262403e-5  cos: +0.16 +0.02 -0.48    00000001111111112222 576
[  36.242s][TRACE]  i:     36  v: -3669.44280757394336 dv: -5.23e-11  g: 6.3562784e-5  cos: +0.76 +0.12 -0.57    00000000111111112222 576
[  36.598s][TRACE]  i:     37  v: -3669.44280757398565 dv: -4.23e-11  g: 1.1559662e-4  cos: +0.96 +0.67 +0.06    00000000111111111122 288
[  37.114s][TRACE]  i:     38  v: -3669.44280757408706 dv: -1.01e-10  g: 4.0457698e-5  cos: +0.92 +0.80 +0.67    00111111111111112222 576
[  37.398s][TRACE]  i:     39  v: -3669.44280757410115 dv: -1.41e-11  g: 2.7568275e-5  cos: +0.96 +0.79 +0.61    00000000000011111122 288
[  37.827s][TRACE]  i:     40  v: -3669.44280757410115 dv:  +0.00e0  g: 3.4712491e-6  cos: +0.73 +0.69 +0.72    00000000111111111122 192
[  38.140s][TRACE]  i:     41  v: -3669.44280757410070 dv: +4.55e-13  g: 5.5512018e-7  cos: +0.18 +0.13 +0.22    00000001111111111122 288
[  38.618s][TRACE]  i:     42  v: -3669.44280757410161 dv: -9.09e-13  g: 3.0598803e-7  cos: +0.16 +0.03 +0.62    00000011111111112222 480
[  39.014s][TRACE]  i:     43  v: -3669.44280757410161 dv:  +0.00e0  g: 3.4078309e-7  cos: +0.49 +0.08 +0.77    00011111111111122222 672
[  39.446s][TRACE]  i:     44  v: -3669.44280757410070 dv: +9.09e-13  g: 2.5830104e-7  cos: +0.79 +0.38 +0.36    00000000111111111222 384
[  39.958s][TRACE]  i:     45  v: -3669.44280757410115 dv: -4.55e-13  g: 2.7910936e-7  cos: +0.98 +0.72 +0.40    00000011111111111122 192
[  40.267s][TRACE]  i:     46  v: -3669.44280757410161 dv: -4.55e-13  g: 5.9695147e-8  cos: +0.79 +0.73 +0.70    00000000000111111122 288
[  40.707s][TRACE]  i:     47  v: -3669.44280757410161 dv:  +0.00e0  g: 1.6299424e-8  cos: +0.33 +0.26 +0.36    00000000001111111122 288
[  41.017s][TRACE]  i:     48  v: -3669.44280757410070 dv: +9.09e-13  g: 6.3374115e-9  cos: +0.28 +0.09 +0.32    00000011111111111122 288
[  41.332s][TRACE]  i:     49  v: -3669.44280757410161 dv: -9.09e-13  g: 1.1303685e-8  cos: +0.38 +0.10 +0.88    00000011111111111122 288
[  41.805s][TRACE]  i:     50  v: -3669.44280757410161 dv:  +0.00e0  g: 6.5506183e-9  cos: +0.89 +0.33 +0.46    00000001111111111122 192
[  42.123s][TRACE]  i:     51  v: -3669.44280757410115 dv: +4.55e-13  g: 1.7447008e-9  cos: +0.78 +0.69 +0.57    00000000011111111122 288
[  42.609s][TRACE]  i:     52  v: -3669.44280757410161 dv: -4.55e-13  g: 9.6728816e-10  cos: +0.39 +0.31 +0.51    00000011111111111122 245
[  42.929s][TRACE]  i:     53  v: -3669.44280757410252 dv: -9.09e-13  g: 1.4248736e-10  cos: +0.52 +0.20 +0.35    00000011111111111122 192
[  43.363s][TRACE]  i:     54  v: -3669.44280757410161 dv: +9.09e-13  g: 2.6479234e-10  cos: +0.17 +0.09 +0.81    00000011111111111122 203
[  43.646s][TRACE]  i:     55  v: -3669.44280757410206 dv: -4.55e-13  g: 1.9694943e-10  cos: +0.89 +0.15 +0.41    00001111111111111122 170
[  44.037s][TRACE]  i:     56  v: -3669.44280757410115 dv: +9.09e-13  g: 1.1031621e-10  cos: +0.84 +0.75 +0.63    00000000111111111122 209
[  44.349s][TRACE]  i:     57  v: -3669.44280757410070 dv: +4.55e-13  g: 7.8663621e-11  cos: +0.97 +0.70 +0.63    00000111111111111122 245
[  44.671s][TRACE]  i:     58  v: -3669.44280757410161 dv: -9.09e-13  g: 2.1707441e-11  cos: +0.71 +0.68 +0.66    00000111111111111112 124
[  44.983s][TRACE]  i:     59  v: -3669.44280757410161 dv:  +0.00e0  g: 1.6307851e-11  cos: +0.36 +0.25 +0.33    00000000111111111112 24
[  45.378s][TRACE]  i:     60  v: -3669.44280757410161 dv:  +0.00e0  g: 1.4125871e-11  cos: +0.62 +0.18 +0.32    00000001111111111112 41
[  45.694s][TRACE]  i:     61  v: -3669.44280757410161 dv:  +0.00e0  g: 1.1995141e-11  cos: +0.73 +0.39 +0.33    00000001111111111112 122
[  46.128s][TRACE]  i:     62  v: -3669.44280757410161 dv:  +0.00e0  g: 1.3708902e-11  cos: +0.72 +0.44 +0.42    00000000011111111112 26
[  46.129s][INFO] ACGSD Finished.
[  46.129s][INFO] Iterations: 62
[  46.129s][INFO]      Value: -3669.4428075741016
[  46.129s][INFO]  Grad Norm: 1.370890187584052e-11
[  46.129s][INFO]   Grad Max: 1.9766528305187246e-12
[  46.130s][TRACE] ============================
[  46.130s][TRACE] Writing 'out/ev-loop-01.1.structure'
[  46.261s][TRACE] Computing symmetry
[  46.640s][TRACE] Computing deperms in primitive cell
thread 'main' panicked at 'compute_stars: input deperms violate the group axioms!', src/tasks/math/stars.rs:66:13
note: Run with `RUST_BACKTRACE=1` for a backtrace.
[  47.474s][INFO] successfully leaked tempdir at /tmp/rsp2-.a6h5k4pELaPO

ExpHP commented 6 years ago

Well that's funny.

When I said I was running without threads, I forgot to reset OMP_NUM_THREADS. If I do reset OMP_NUM_THREADS, then the following behavior is observed:

Half of the time it dies with FloatIsNaN, and the other half of the time, the initial energy is exactly -6919.09741961267173 (a reasonable value).

Anyways, something is clearly wrong with how rsp2 communicates with lammps. (surprising nobody)

ExpHP commented 6 years ago

I tried getting rid of create_atoms random + set atom in favor of create_atoms single, but to no avail.

ExpHP commented 6 years ago

Colin, you're not gonna like this.

Guess what happens when I use pair_style rebo instead of pair_style rebo/omp?

It works. Consistently.

ExpHP commented 6 years ago

Workaround provided in https://github.com/ExpHP/rsp2/commit/db0156ac242e99ef6de06817d39ffcd4037b4433, which makes it possible to select rebo.
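
As an illustration only, a switch between the two pair styles named in this thread could look roughly like the following. The `ReboStyle` enum and its method are hypothetical and are not taken from the linked commit; only the `pair_style rebo` and `pair_style rebo/omp` command strings come from the discussion above.

```rust
/// Hypothetical setting for choosing between the two pair styles
/// discussed in this issue. Not the actual rsp2 configuration type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ReboStyle {
    /// Serial implementation: `pair_style rebo`.
    Plain,
    /// OpenMP implementation: `pair_style rebo/omp`.
    Omp,
}

impl ReboStyle {
    /// The lammps command that selects this style.
    pub fn pair_style_command(self) -> &'static str {
        match self {
            ReboStyle::Plain => "pair_style rebo",
            ReboStyle::Omp => "pair_style rebo/omp",
        }
    }
}
```

Keeping the choice in one place like this would make it easy to flip back to the OpenMP style once the underlying problem is understood.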

colin-daniels commented 6 years ago

Thanks for the workaround (it fixed that one at least). Unfortunately, these structures seem to be the gifts that just keep giving. A new error for these inputs:

[   0.228s][INFO] Available resources for parallelism:
[   0.229s][INFO]     MPI: 1 process(es)
[   0.229s][INFO]  OpenMP: 1 thread(s) per process (OMP_NUM_THREADS)
[   0.229s][INFO]        : 4 thread(s) in single-process tasks (RSP2_MAX_THREADS)
[   0.229s][INFO]   rayon: 4 thread(s) on the root process
[   0.230s][WARN] 'lammps-update-style: fast' is experimental (this message will not be shown again)
[   0.231s][TRACE] bond graph: intermediate supercell: [1, 1, 1], r = 1.70017
[   0.232s][TRACE] bond graph: true supercell: centered_diagonal([1, 1, 1])
[   0.308s][TRACE] Writing 'gyroid15/initial.structure'
[   0.373s][TRACE] ============================
[   0.373s][TRACE] Begin relaxation # 1
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `1440`,
 right: `1416`', src/io/lammps/lib.rs:1073:9
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
             at libstd/sys/unix/backtrace/tracing/gcc_s.rs:49
   1: std::sys_common::backtrace::print
             at libstd/sys_common/backtrace.rs:71
             at libstd/sys_common/backtrace.rs:59
   2: std::panicking::default_hook::{{closure}}
             at libstd/panicking.rs:211
   3: std::panicking::default_hook
             at libstd/panicking.rs:227
   4: std::panicking::rust_panic_with_hook
             at libstd/panicking.rs:475
   5: std::panicking::continue_panic_fmt
             at libstd/panicking.rs:390
   6: std::panicking::begin_panic_fmt
             at libstd/panicking.rs:345
   7: <rsp2_lammps_wrap::Lammps<P>>::update_computation
   8: <<rsp2_tasks::potential::lammps::Builder<P>>::lammps_diff_fn::MyDiffFn<Mm> as rsp2_tasks::potential::DiffFn<Mm>>::compute
   9: rsp2_tasks::potential::PotentialBuilder::initialize_flat_diff_fn::{{closure}}
  10: rsp2_minimize::acgsd::_acgsd::{{closure}}
  11: rsp2_minimize::acgsd::_acgsd
  12: rsp2_minimize::acgsd::acgsd
  13: rsp2_tasks::cmd::relaxation::do_relax
  14: rsp2_tasks::cmd::<impl rsp2_tasks::cmd::trial::TrialDir>::run_relax_with_eigenvectors
  15: <rsp2_lammps_wrap::low_level::mpi_helper::MpiOnDemand<D>>::install
  16: rsp2_tasks::entry_points::wrap_main_with_lammps_on_demand
  17: rsp2_tasks::entry_points::wrap_main
  18: rsp2_tasks::entry_points::_rsp2_acgsd
  19: rsp2_tasks::entry_points::rsp2
  20: std::rt::lang_start::{{closure}}
  21: std::panicking::try::do_call
             at libstd/rt.rs:59
             at libstd/panicking.rs:310
  22: __rust_maybe_catch_panic
             at libpanic_unwind/lib.rs:105
  23: std::rt::lang_start_internal
             at libstd/panicking.rs:289
             at libstd/panic.rs:392
             at libstd/rt.rs:58
  24: main
  25: __libc_start_main
  26: _start
[   0.460s][INFO] successfully leaked tempdir at /tmp/rsp2-.cyRoFAyIZGyZ

colin-daniels commented 6 years ago

Bonus segfault if I try to use mpirun with omp off on this input:

Edit: The command line was: OMP_NUM_THREADS=1 mpirun -np 4 rsp2 -c input.yaml -o g10-segfault --force g10-relaxed.vasp

[   0.216s][INFO] Available resources for parallelism:
[   0.216s][INFO]     MPI: 4 process(es)
[   0.216s][INFO]  OpenMP: 1 thread(s) per process (OMP_NUM_THREADS)
[   0.216s][INFO]        : 4 thread(s) in single-process tasks (RSP2_MAX_THREADS)
[   0.216s][INFO]   rayon: 4 thread(s) on the root process
[   0.218s][WARN] 'lammps-update-style: fast' is experimental (this message will not be shown again)
[   0.219s][TRACE] bond graph: intermediate supercell: [1, 1, 1], r = 1.70017
[   0.219s][TRACE] bond graph: true supercell: centered_diagonal([1, 1, 1])
[   0.254s][TRACE] Writing 'g10-segfault/initial.structure'
[   0.298s][TRACE] ============================
[   0.298s][TRACE] Begin relaxation # 1
[   0.333s][TRACE]  i:      0  v: -6923.85556855887171 dv:  +0.00e0  g: 2.5554223e-4                            00000000000000000112 3
[   0.333s][DEBUG] Using steepest descent. (i: 1)
[   0.338s][TRACE]  i:      1  v: -6923.85556855640607 dv: +2.47e-9  g: 1.6012033e-4                            00000000000000111112 3
[   0.343s][TRACE]  i:      2  v: -6923.85556856183484 dv: -5.43e-9  g: 9.9874784e-5  cos: +0.53                00000000001111111112 9
[   0.349s][TRACE]  i:      3  v: -6923.85556855880714 dv: +3.03e-9  g: 1.0545607e-4  cos: +0.59 +0.31          00000000001111111112 4
[   0.354s][TRACE]  i:      4  v: -6923.85556856332551 dv: -4.52e-9  g: 1.0082739e-4  cos: +0.80 +0.47 +0.25    00000000011111111112 10
[   0.357s][TRACE]  i:      5  v: -6923.85556856646417 dv: -3.14e-9  g: 1.5434653e-4  cos: +0.94 +0.72 +0.43    00000000000000011112 6
[   0.363s][TRACE]  i:      6  v: -6923.85556856235507 dv: +4.11e-9  g: 9.3231352e-5  cos: +0.96 +0.90 +0.70    00000011111111111112 11
[   0.367s][TRACE]  i:      7  v: -6923.85556856504900 dv: -2.69e-9  g: 1.4937256e-4  cos: +0.99 +0.93 +0.84    00000000000000111112 7
[   0.377s][TRACE]  i:      8  v: -6923.85556856378753 dv: +1.26e-9  g: 1.2081709e-4  cos: +0.96 +0.93 +0.90    00000000011111111112 12
[   0.383s][TRACE]  i:      9  v: -6923.85556856064704 dv: +3.14e-9  g: 1.2126181e-4  cos: +0.95 +0.92 +0.87    00000000000011111112 7
[   0.388s][TRACE]  i:     10  v: -6923.85556856454059 dv: -3.89e-9  g: 1.2523105e-4  cos: +0.95 +0.91 +0.83    00000000000011111112 7
[   0.393s][TRACE]  i:     11  v: -6923.85556855894174 dv: +5.60e-9  g: 1.1327453e-4  cos: +0.96 +0.92 +0.83    00000000001111111112 11
[   0.397s][TRACE]  i:     12  v: -6923.85556856279800 dv: -3.86e-9  g: 1.5823793e-4  cos: +0.96 +0.92 +0.84    00000000000000111112 7
[   0.402s][TRACE]  i:     13  v: -6923.85556855827599 dv: +4.52e-9  g: 1.1697806e-4  cos: +0.98 +0.94 +0.86    00000000111111111112 10
[   0.407s][TRACE]  i:     14  v: -6923.85556855843970 dv: -1.64e-10  g: 1.9073105e-4  cos: +0.99 +0.96 +0.90    00000000000000111112 6
[   0.412s][TRACE]  i:     15  v: -6923.85556856436051 dv: -5.92e-9  g: 1.4542335e-4  cos: +0.98 +0.95 +0.94    00000000001111111112 7
[   0.417s][TRACE]  i:     16  v: -6923.85556856625590 dv: -1.90e-9  g: 1.3379912e-4  cos: +0.97 +0.95 +0.91    00000000000111111112 6
[   0.426s][TRACE]  i:     17  v: -6923.85556855889081 dv: +7.37e-9  g: 2.0577697e-4  cos: +0.96 +0.93 +0.90    00000000000000001112 10
[   0.430s][TRACE]  i:     18  v: -6923.85556856400763 dv: -5.12e-9  g: 1.5059070e-4  cos: +0.99 +0.95 +0.90    00000000011111111112 11
[   0.436s][TRACE]  i:     19  v: -6923.85556856325184 dv: +7.56e-10  g: 1.8905567e-4  cos: +0.97 +0.96 +0.91    00000000011111111112 8
[   0.441s][TRACE]  i:     20  v: -6923.85556856813037 dv: -4.88e-9  g: 2.4710529e-4  cos: +0.98 +0.96 +0.93    00000000000000011112 8
[   0.446s][TRACE]  i:     21  v: -6923.85556856199128 dv: +6.14e-9  g: 1.6636304e-4  cos: +0.99 +0.97 +0.94    00000000001111111112 5
[   0.450s][TRACE]  i:     22  v: -6923.85556856385665 dv: -1.87e-9  g: 2.0753738e-4  cos: +0.98 +0.97 +0.95    00000000000011111112 6
[   0.455s][TRACE]  i:     23  v: -6923.85556856283074 dv: +1.03e-9  g: 2.0287575e-4  cos: +0.99 +0.97 +0.95    00000000000011111112 9
[   0.461s][TRACE]  i:     24  v: -6923.85556856100993 dv: +1.82e-9  g: 1.4585492e-4  cos: +0.99 +0.98 +0.95    00000000111111111112 4
[   0.461s][INFO] ACGSD Finished.
[   0.461s][INFO] Iterations: 24
[   0.461s][INFO]      Value: -6923.85556856101
[   0.461s][INFO]  Grad Norm: 1.458549176948771e-4
[   0.461s][INFO]   Grad Max: 2.493511354585698e-5
[   0.461s][TRACE] ============================
[   0.461s][TRACE] Writing 'g10-segfault/ev-loop-01.1.structure'
[   0.528s][TRACE] Computing symmetry
[   1.182s][TRACE] Computing deperms in primitive cell
[   1.182s][DEBUG] Surveying displacement implementations:
[   1.183s][DEBUG]   axial: Produces 5760
[   1.185s][DEBUG]    diag: Produces 5760
[   1.191s][DEBUG]  diag-2: Produces 5760
[   1.197s][TRACE] num spacegroup ops: 1
[   1.197s][TRACE] num displacements:  5760
[   1.197s][TRACE] Computing forces at displacements
disp 5760 of 5760
[   7.547s][TRACE] Done computing forces at displacements
[   7.547s][TRACE] Computing deperms in supercell
[   7.548s][TRACE] Computing sparse force constants
[   7.577s][TRACE] Computing sparse dynamical matrix
[   9.680s][TRACE] nnz: 18720 out of 921600 blocks (matrix density: 2.031e-2)
[   9.680s][TRACE] Diagonalizing dynamical matrix
[   9.680s][TRACE] Computing most negative eigensolutions.
[  10.513s][WARN] trace: precomputing OPinv for shift-invert
[  10.832s][WARN] /usr/lib/python3.7/site-packages/scipy/sparse/linalg/dsolve/linsolve.py:295: SparseEfficiencyWarning: splu requires CSC matrix format
[  10.832s][WARN]   warn('splu requires CSC matrix format', SparseEfficiencyWarning)
[  10.832s][WARN] trace: shift-invert call 1
[  13.529s][WARN] trace: shift-invert call 2
[  16.432s][WARN] trace: shift-invert call 3
[  19.144s][WARN] trace: shift-invert call 4
[  22.478s][WARN]  Good -- Bad (Old Wrong OrthoFail OrthoBad)
[  22.478s][WARN]   3   --  1  ( 0    1       0        0    )
[  22.478s][WARN]   0   --  4  ( 3    1       0        0    )
[  22.478s][WARN]   0   --  4  ( 3    1       0        0    )
[  22.478s][WARN]   0   --  3  ( 3    0       0        0    )
[  22.479s][WARN] trace: trying non-shift-invert
[  25.727s][TRACE] Done diagonalizing dynamical matrix
[  25.729s][TRACE] ============================
[  25.729s][TRACE] Finished diagonalization
[  27.721s][TRACE] Computing eigensystem info
[  27.721s][TRACE] computing EvAcousticness
[  27.721s][TRACE] computing EvPolarization
[  27.722s][TRACE] not computing EvLayerAcousticness due to missing requirement SiteLayers
[  27.722s][TRACE] not computing UnfoldProbs due to missing requirement SiteLayers
[  27.722s][TRACE] computing EvRamanTensors
[  27.723s][INFO] # (C)  Frequency(cm-1)       Acoust. RamnA RamnB [ X  ,  Y  ,  Z  ]
[  27.723s][INFO]   (T) -0.2451772453052291    1-1e-06 0     0     [0.00, 0.00, 1.00]
[  27.723s][INFO]   (T) -0.0010127036152851687 1-1e-11 0     0     [0.83, 0.17, 0.00]
[  27.724s][INFO]   (T) -0.0008918143931896466 1-1e-11 0     0     [0.17, 0.83, 0.00]
[  27.724s][INFO]   (-) 45.95981583710969        1e-08 1e0   1e0   [0.24, 0.35, 0.41]
[  27.724s][INFO]   (-) 45.985920719463614       1e-08 3e-1  3e-1  [0.41, 0.41, 0.18]
[  27.724s][INFO]   (-) 46.010025797323486       1e-09 4e-1  1e-1  [0.35, 0.24, 0.41]
[  27.724s][INFO]   (-) 58.08965647010484        1e-10 6e-2  6e-2  [0.24, 0.24, 0.52]
[  27.724s][INFO]   (-) 58.17720758382288        1e-09 3e-1  4e-1  [0.36, 0.49, 0.15]
[  27.724s][INFO]   (-) 58.368162080215804       1e-08 1e-1  2e-1  [0.50, 0.34, 0.16]
[  27.724s][INFO]   (-) 58.47182753235027        1e-08 2e-1  2e-1  [0.23, 0.28, 0.50]
[  27.724s][INFO]   (-) 60.16169460590661        1e-10 2e-1  2e-1  [0.34, 0.33, 0.33]
[  27.724s][INFO]   (-) 60.36572993565027        1e-09 8e-1  1e0   [0.33, 0.33, 0.34]
[  27.724s][TRACE] Writing 'g10-segfault/ev-loop-01.2.structure'
-------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
-------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 2 with PID 0 on node engine exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------
#0  LAMMPS_NS::NPairFullBinGhostOmp::build(LAMMPS_NS::NeighList*) [clone ._omp_fn.0] () at ../npair_full_bin_ghost_omp.cpp:105
#1  0x00007fa221487e83 in GOMP_parallel (fn=0x7fa2221488a0 <_ZN9LAMMPS_NS20NPairFullBinGhostOmp5buildEPNS_9NeighListE._omp_fn.0(void)>, data=0x7ffd11a0a570, num_threads=1, flags=0)
    at /build/gcc/src/gcc/libgomp/parallel.c:168
#2  0x00007fa222148863 in LAMMPS_NS::NPairFullBinGhostOmp::build (this=<optimized out>, list=0x7fa2204a36c0) at ../npair_full_bin_ghost_omp.cpp:47
#3  0x00007fa222331deb in LAMMPS_NS::Neighbor::build (this=0x7fa2204c6200, topoflag=1) at ../neighbor.cpp:2147
#4  0x00007fa222394eab in LAMMPS_NS::Verlet::run (this=0x7fa220578c40, n=1) at ../verlet.cpp:285
#5  0x00007fa2221dd051 in LAMMPS_NS::Run::command (this=this@entry=0x7ffd11a0a7e0, narg=narg@entry=5, arg=arg@entry=0x7fa220440c40) at ../run.cpp:183
#6  0x00007fa221d2603e in LAMMPS_NS::Input::command_creator<LAMMPS_NS::Run> (lmp=<optimized out>, narg=5, arg=0x7fa220440c40) at ../input.cpp:861
#7  0x00007fa221d246ca in LAMMPS_NS::Input::execute_command (this=this@entry=0x7fa220575680) at ../input.cpp:844
#8  0x00007fa221d250b5 in LAMMPS_NS::Input::one (this=0x7fa220575680, single=0x7fa22053d950 "run 1 pre no post no") at ../input.cpp:312
#9  0x00007fa221f3c3b3 in lammps_command (ptr=<optimized out>, str=<optimized out>) at ../library.cpp:223
#10 0x0000556909022b98 in <rsp2_lammps_wrap::low_level::plain::LammpsOwner as rsp2_lammps_wrap::low_level::LowLevelApi>::command ()
#11 0x0000556908d818c7 in <rsp2_lammps_wrap::low_level::mpi::LammpsDispatch as rsp2_lammps_wrap::low_level::mpi_helper::DispatchMultiProcess>::dispatch ()
#12 0x0000556908dd1a0e in <rsp2_lammps_wrap::low_level::mpi_helper::MpiOnDemandInner<D>>::non_root_event_loop ()
#13 0x0000556908dcfff6 in <rsp2_lammps_wrap::low_level::mpi_helper::MpiOnDemand<D>>::install ()
#14 0x0000556908d8f4f1 in rsp2_tasks::entry_points::wrap_main_with_lammps_on_demand ()
#15 0x0000556908d8f456 in rsp2_tasks::entry_points::wrap_main ()
#16 0x0000556908d905f0 in rsp2_tasks::entry_points::_rsp2_acgsd ()
#17 0x0000556908d905d8 in rsp2_tasks::entry_points::rsp2 ()
#18 0x0000556908d7ec03 in std::rt::lang_start::{{closure}} ()
#19 0x000055690917bd93 in std::rt::lang_start_internal::{{closure}} () at libstd/rt.rs:59
#20 std::panicking::try::do_call () at libstd/panicking.rs:310
#21 0x00005569091a114a in __rust_maybe_catch_panic () at libpanic_unwind/lib.rs:105
#22 0x000055690917e5a6 in std::panicking::try () at libstd/panicking.rs:289
#23 std::panic::catch_unwind () at libstd/panic.rs:392
#24 std::rt::lang_start_internal () at libstd/rt.rs:58
#25 0x0000556908d7ec74 in main ()