geodynamics / vq

Virtual Quake is a boundary element code designed to investigate long term fault system behavior and interactions between faults through stress transfer.

Heisenbug #59

Closed kwschultz closed 8 years ago

kwschultz commented 9 years ago

When running VQ on multiple processors with a model more complex than a few faults and a few thousand elements, the simulation gets stuck and cannot continue. The error is only randomly reproducible, occurring at different points in repeated runs of the same simulation.

We must fix this bug before we can reliably run simulations on multiple processors. Right now we are effectively dead in the water with respect to a full CA simulation or even including aftershocks on smaller simulations.

Eric's summary of the bug when it occurred on a 3-processor run of a 6-fault subset of the UCERF2 California model: "So, here's my Sherlock Holmes take: From the backtrace, we know processes 0 and 1 are stuck in distributeUpdateField(), while process 2 is in MPI_Recv() in processBlocksSecondaryFailures(). Since the processes are in order, this means the MPI_Recv that is stuck must correspond to the solution of Ax=b being sent back from the root (process 0) to process 2. The only way this could happen is if the number of MPI_Send() calls from root does not match the number of MPI_Recv() calls in the other processes. The only way this mismatch could happen is if the total number of entries in global_id_list is not equal to the sum of the number of entries in local_id_list for each process, or if processes have different understandings of the assignment of blocks to each process. Since my laptop run is already at 4300 events with no problems, it seems more likely this is a bug caused by bad memory writing, such that one of these structures is being corrupted by something overwriting the existing data. So the question is how do we check whether this corruption is happening."
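For reference, here is a minimal sketch (not VQ code; the block counts are hypothetical) of the send/receive count mismatch Eric describes: the root sends one solution entry for every block it believes each rank owns, while each rank posts one receive for every block it believes it owns. If those counts disagree, some rank blocks forever in MPI_Recv(), which is exactly the hang seen in processBlocksSecondaryFailures().

#include <mpi.h>
#include <cstdio>

// Run with e.g. mpirun -np 3 to see rank 2 hang in its third MPI_Recv().
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Hypothetical block assignment: the root believes every other rank owns 2 blocks,
    // but suppose memory corruption made rank 2 believe it owns 3 blocks instead.
    int num_local = (rank == 2) ? 3 : 2;

    if (rank == 0) {
        for (int dest = 1; dest < size; ++dest) {
            for (int i = 0; i < 2; ++i) {          // root's count: 2 entries per rank
                double x = 1.0;                    // stand-in for one entry of the Ax=b solution
                MPI_Send(&x, 1, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);
            }
        }
    } else {
        for (int i = 0; i < num_local; ++i) {      // each rank's own (possibly corrupted) count
            double x;
            MPI_Recv(&x, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        }
        std::printf("rank %d done\n", rank);       // rank 2 never gets here: the hang
    }

    MPI_Finalize();
    return 0;
}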

markyoder commented 9 years ago

I think I'm observing the same bug on several Linux machines (Ubuntu 14.04 and Mint 17). The simulation seems to run fine on a single processor (SPP) -- though the actual output data still need to be verified. For multiple processors (MPP), the Greens functions calculate successfully, then the whole thing quits. I get an error message like this:

# note: this error occurs after the Greens function calcs.
[Umbasa:04607] Signal: Segmentation fault (11)
[Umbasa:04607] Signal code: Address not mapped (1)
[Umbasa:04607] Failing at address: 0xffffffffffffffe8
[Umbasa:04607] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7f4a75654d40]
[Umbasa:04607] [ 1] ../../build/src/vq(_ZN10GreensInit4initEP12SimFramework+0x6af) [0x43c4df]
[Umbasa:04607] [ 2] ../../build/src/vq(_ZN12SimFramework4initEv+0x59e) [0x45370e]
[Umbasa:04607] [ 3] ../../build/src/vq(_ZN10Simulation4initEv+0x29) [0x467a49]
[Umbasa:04607] [ 4] ../../build/src/vq(main+0x109b) [0x42a42b]
[Umbasa:04607] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f4a7563fec5]
[Umbasa:04607] [ 6] ../../build/src/vq() [0x42b732]

[Umbasa:04607] * End of error message *

mpirun noticed that process rank 1 with PID 4607 on node Umbasa exited on signal 11 (Segmentation fault).

Looking at "make test", it seems that the single processor tests go quite well; there is a significant failure rate for multi-processor tests (and note some sort of error at the end as well): 80% tests passed, 47 tests failed out of 238

Total Test time (real) = 69.30 sec

The following tests FAILED: 82 - run_P2_none_6000 (Failed) 84 - test_slip_P2_none_6000 (Failed) 85 - test_interevent_P2_none_6000 (Failed) 88 - run_P2_none_4000 (Failed) 90 - test_slip_P2_none_4000 (Failed) 91 - test_interevent_P2_none_4000 (Failed) 94 - run_P2_none_3000 (Failed) 96 - test_slip_P2_none_3000 (Failed) 97 - test_interevent_P2_none_3000 (Failed) 100 - run_P2_none_2000 (Failed) 102 - test_slip_P2_none_2000 (Failed) 103 - test_interevent_P2_none_2000 (Failed) 106 - run_P2_taper_6000 (Failed) 110 - run_P2_taper_4000 (Failed) 114 - run_P2_taper_3000 (Failed) 118 - run_P2_taper_2000 (Failed) 122 - run_P2_taper_renorm_6000 (Failed) 126 - run_P2_taper_renorm_4000 (Failed) 130 - run_P2_taper_renorm_3000 (Failed) 134 - run_P2_taper_renorm_2000 (Failed) 138 - run_P4_none_6000 (Failed) 140 - test_slip_P4_none_6000 (Failed) 141 - test_interevent_P4_none_6000 (Failed) 144 - run_P4_none_4000 (Failed) 146 - test_slip_P4_none_4000 (Failed) 147 - test_interevent_P4_none_4000 (Failed) 150 - run_P4_none_3000 (Failed) 152 - test_slip_P4_none_3000 (Failed) 153 - test_interevent_P4_none_3000 (Failed) 156 - run_P4_none_2000 (Failed) 158 - test_slip_P4_none_2000 (Failed) 159 - test_interevent_P4_none_2000 (Failed) 162 - run_P4_taper_6000 (Failed) 166 - run_P4_taper_4000 (Failed) 170 - run_P4_taper_3000 (Failed) 174 - run_P4_taper_2000 (Failed) 178 - run_P4_taper_renorm_6000 (Failed) 182 - run_P4_taper_renorm_4000 (Failed) 186 - run_P4_taper_renorm_3000 (Failed) 190 - run_P4_taper_renorm_2000 (Failed) 222 - check_sum_P1_green_3000 (Failed) 228 - run_gen_P2_green_3000 (Failed) 229 - check_sum_P2_green_3000 (Failed) 230 - run_full_P2_green_3000 (Failed) 235 - run_gen_P4_green_3000 (Failed) 236 - check_sum_P4_green_3000 (Failed) 237 - run_full_P4_green_3000 (Failed) Errors while running CTest make: *\ [test] Error 8

markyoder commented 9 years ago

... and the "failing at" address: Failing at address: 0xffffffffffffffe8

appears to be consistent across at least two runs (seemingly, right at the end of the address space), AND the exception appears to occur after the Greens functions are calculated but before they are written to file (in the event that the run is configured to save them). At least, when I did an MPP run to pre-calculate the Greens functions for L=3000m, the Greens functions finished, the simulation failed (with a message like the one above), and the Greens-functions HDF5 file was never created.

markyoder commented 9 years ago

... and then, if I run vq in MPP mode using the pre-calculated Greens functions, I get the same error (and note, the same failing address at the end of the address space: Failing at address: 0xffffffffffffffe8):

***

* Virtual Quake *

* Version 1.2.0 *

* Git revision ID 52894364400720f17931fcad531dbc2c4c971fae *

* QuakeLib 1.2.0 Git revision 52894364400720f17931fcad531dbc2c4c971fae *

* MPI process count : 2 *

* OpenMP not enabled *

***

Initializing blocks.

To gracefully quit, create the file quit_vq in the run directory or use a SIGINT (Control-C).

Reading Greens function data from file all_cal_greens_5000.h5....

Greens function took 3.95872 seconds.

Greens shear matrix takes 45.5098 megabytes

Greens normal matrix takes 45.5098 megabytes

[yodubuntu:15069] * Process received signal *

Global Greens shear matrix takes 91.0195 megabytes.

Global Greens normal matrix takes 91.0195 megabytes.

[yodubuntu:15069] Signal: Segmentation fault (11)
[yodubuntu:15069] Signal code: Address not mapped (1)
[yodubuntu:15069] Failing at address: 0xffffffffffffffe8
[yodubuntu:15069] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7f4fed504d40]
[yodubuntu:15069] [ 1] ../../build/src/vq(_ZN10GreensInit4initEP12SimFramework+0x6af) [0x43c4df]
[yodubuntu:15069] [ 2] ../../build/src/vq(_ZN12SimFramework4initEv+0x59e) [0x45370e]
[yodubuntu:15069] [ 3] ../../build/src/vq(_ZN10Simulation4initEv+0x29) [0x467a49]
[yodubuntu:15069] [ 4] ../../build/src/vq(main+0x109b) [0x42a42b]
[yodubuntu:15069] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f4fed4efec5]
[yodubuntu:15069] [ 6] ../../build/src/vq() [0x42b732]
[yodubuntu:15069] * End of error message *


mpirun noticed that process rank 1 with PID 15069 on node yodubuntu.physics.ucdavis.edu exited on signal 11 (Segmentation fault).

eheien commented 9 years ago

I'm guessing it's in the Green's function calculation then. Can you try commenting out the call to symmetrizeMatrix (misc/GreensFunctions.cpp:81), recompile, and see if it still crashes? And if it still crashes, then try removing the whole assignment loop (lines 83-89) after that?

eheien commented 9 years ago

FYI Kasey, this is also a common debugging technique - keep removing code until it runs, then figure out what was wrong about the code you removed.

markyoder commented 9 years ago

So I tried commenting out misc/GreensFunctions.cpp:81 and lines 83-89, but no joy. The Greens functions calculate, but they don't write to file (or otherwise proceed to the next step in the sim).

i also see this in the error output: vq: /home/myoder/Documents/Research/yoder/VC/vq/src/io/GreensFileOutput.cpp:31: virtual void GreensFileOutput::initDesc(const SimFramework*) const: Assertion `false' failed.

which points to this bit in GreensFileOutput.cpp:

#ifndef HDF5_IS_PARALLEL
    if (sim->getWorldSize() > 1) {
        assertThrow(false, "# ERROR: Greens HDF5 output in parallel only allowed if using HDF5 parallel library.");
    }
#endif

is this a parallel vs serial HDF5 problem?

eheien commented 9 years ago

So when you say they don't write to file, does it crash in the same way as before or does it just have the assertion failure?

I don't think it's the parallel vs. serial HDF5 problem, because that would manifest as a different sort of error.

markyoder commented 9 years ago

The sim breaks before the write to file; always the same error. I've added some debugging code to SimFramework.cpp (amongst other places). In particular, in the initialization function void SimFramework::init(void), I added some debugging lines to the 'plugin' initialization loop: for (it=ordered_plugins.begin(); it!=ordered_plugins.end(); ++it) { ... }. In MPI mode with 2 processors: on the first process (in this case 6407), I get to the 5th (index=4) plugin; the initDesc() statement (first of 3) runs; the init() statement is started, and then we get the "Signal: Segmentation fault..." bit.

... but the second process (6408) has only reached the 4th (i=3) plugin. initDesc() completes; init() has started, then we get a segmentation fault reported by 6408. output pasted below.

so i'm not sure yet which plugins those are.


myoder@Umbasa ~/Documents/Research/yoder/VC/vq/examples/ca_model $ mpirun -np 2 ../../build/src/vq params.d Debug(SimFramework::SimFramework()): run mpi initilizations. 6407.. Debug(SimFramework::SimFramework()): run mpi initilizations. 6408.. Debug(SimFramework::SimFramework()): mpi initializations finished.6407.. Debug(SimFramework::SimFramework()): mpi initializations finished.6408.. Debug: Initialize SimFramework... Debug: Initialize SimFramework... Debug: SimFramework::init(), 'dry run'/normal initialization loop??6408..

***

Debug: plugin_init 0, initDesc() pid: 6408

* Virtual Quake *

* Version 1.2.0 *

* Git revision ID dde492c687b4983084b36f27ddb6eb1145877eb1 *

Debug: plugin_init 0, init() pid: 6408

* QuakeLib 1.2.0 Git revision dde492c687b4983084b36f27ddb6eb1145877eb1 *

* MPI process count : 2 *

* OpenMP not enabled *

***

Debug: SimFramework::init(), 'dry run'/normal initialization loop??6407.. Debug: plugin_init 0, initDesc() pid: 6407 Debug: plugin_init 0, init() pid: 6407 Debug: plugin_init 0, timer bit... pid: 6408 plugin cycle finished 0/6408 Debug: plugin_init 1, initDesc() pid: 6408 Debug: plugin_init 1, init() pid: 6408 Debug: plugin_init 0, timer bit... pid: 6407 plugin cycle finished 0/6407 Debug: plugin_init 1, initDesc() pid: 6407

Initializing blocks.

Debug: plugin_init 1, init() pid: 6407 Debug: plugin_init 1, timer bit... pid: 6407 plugin cycle finished 1/6407 Debug: plugin_init 2, initDesc() pid: 6407

To gracefully quit, create the file quit_vq in the run directory or use a SIGINT (Control-C).

Debug: plugin_init 2, init() pid: 6407 Debug: plugin_init 2, timer bit... pid: 6407 plugin cycle finished 2/6407 Debug: plugin_init 3, initDesc() pid: 6407 Debug: plugin_init 3, init() pid: 6407

Reading Greens function data from file greens_15000.h5.Debug: plugin_init 1, timer bit... pid: 6408

plugin cycle finished 1/6408 Debug: plugin_init 2, initDesc() pid: 6408 Debug: plugin_init 2, init() pid: 6408 Debug: plugin_init 2, timer bit... pid: 6408 plugin cycle finished 2/6408 Debug: plugin_init 3, initDesc() pid: 6408 Debug: plugin_init 3, init() pid: 6408

Greens function took 0.012284 seconds.

Greens shear matrix takes 1014 kilobytes

Greens normal matrix takes 1014 kilobytes

[Umbasa:06408] * Process received signal *

Global Greens shear matrix takes 1.98047 megabytes.

Global Greens normal matrix takes 1.98047 megabytes.

Debug: plugin_init 3, timer bit... pid: 6407 plugin cycle finished 3/6407 Debug: plugin_init 4, initDesc() pid: 6407 Debug: plugin_init 4, init() pid: 6407
[Umbasa:06408] Signal: Segmentation fault (11)
[Umbasa:06408] Signal code: Address not mapped (1)
[Umbasa:06408] Failing at address: 0xffffffffffffffe8
[Umbasa:06408] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7f56ab074d40]
[Umbasa:06408] [ 1] ../../build/src/vq(_ZN10GreensInit4initEP12SimFramework+0x6af) [0x43c59f]
[Umbasa:06408] [ 2] ../../build/src/vq(_ZN12SimFramework4initEv+0x606) [0x453896]
[Umbasa:06408] [ 3] ../../build/src/vq(_ZN10Simulation4initEv+0x36) [0x469196]
[Umbasa:06408] [ 4] ../../build/src/vq(main+0x109b) [0x42a4db]
[Umbasa:06408] [ 5] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f56ab05fec5]
[Umbasa:06408] [ 6] ../../build/src/vq() [0x42b7f2]
[Umbasa:06408] * End of error message *


mpirun noticed that process rank 1 with PID 6408 on node Umbasa exited on signal 11 (Segmentation fault).


eheien commented 9 years ago

To make sure it's not HDF5 related, can you recompile with HDF5 disabled and run again?

kwschultz commented 9 years ago

I'm not sure if it's the same error, but here is an error that another graduate student (an outside user) is getting on multiprocessor runs (this is the only output he gave me; I'll ask for the full output):

vq: /home/user/Desktop/vq-master3.3/src/core/SimDataBlocks.h:47: Block& VCSimDataBlocks::getBlock(const BlockID&): Assertion `block_num<blocks.size()' failed.
[PHP-06:07000] * Process received signal *
[PHP-06:07000] Signal: Aborted (6)
[PHP-06:07000] Signal code: (-6)
[PHP-06:07000] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36c30) [0x7f726bf67c30]
[PHP-06:07000] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7f726bf67bb9]
[PHP-06:07000] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f726bf6afc8]
[PHP-06:07000] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2fa76) [0x7f726bf60a76]
[PHP-06:07000] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2fb22) [0x7f726bf60b22]
[PHP-06:07000] [ 5] ./vq() [0x4292ea]
[PHP-06:07000] [ 6] ./vq(_ZN17UpdateBlockStress4initEP12SimFramework+0x51d) [0x448e5d]
[PHP-06:07000] [ 7] ./vq(_ZN12SimFramework4initEv+0x51e) [0x45379e]
[PHP-06:07000] [ 8] ./vq(_ZN10Simulation4initEv+0x29) [0x467bf9]
[PHP-06:07000] [ 9] ./vq(main+0x109b) [0x42a47b]
[PHP-06:07000] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f726bf52ec5]
[PHP-06:07000] [11] ./vq() [0x42b782]

[PHP-06:07000] * End of error message *

mpirun noticed that process rank 0 with PID 7000 on node PHP-06 exited on signal 6 (Aborted).

markyoder commented 9 years ago

... and the winner is (I think): it looks like this all comes down to the sim->console() call reporting the global Greens data in GreensInit.cpp, in GreensInit::init(SimFramework *_sim), inside the "#ifdef MPI_C_FOUND" block. At the end of it all, we get two lines like:

sim->console() << "# Global Greens shear matrix takes " << abbr_global_shear_bytes << " " << space_vals[global_shear_ind] << "." << std::endl;

and the string array space_vals[] does not appear to be in scope on the child nodes. What is the right way to share that scope? The sim appears to run with these lines commented out; we'll run a big sim to improve confidence.

ericheien commented 9 years ago

From what I can tell the space_vals[] is in the scope of the child nodes since it's created at function start. You can try running with this, but I don't think it's the source of the problem. However, it might "solve" the problem by creating extra space that absorbs whatever memory overwriting is normally causing a crash.

ericheien commented 9 years ago

The message the Iranian student is getting is really weird, because there should never be accesses past the number of blocks. My best guess is that this would be related to the same memory corruption that's caused problems on our side, just manifesting itself differently.

markyoder commented 9 years ago

Sadly, Eric, with my limited understanding of MPI and the VQ architecture, I was hoping you'd have something different to say. I can still reproduce the segmentation fault using 8 processors, (non-)fix in place and all. Once the sim gets past the initialization, however, it seems to be stable. That said, the manifestation in the Mac environment may be less forgiving.

... and on another machine, where I think vq was running before, I'm getting an error for all runs (MPI and SPP modes, so maybe this is a good thing). In this case, however, I'm getting something more like:

vq: /home/myoder/Documents/Research/yoder/VC/vq/src/core/SimDataBlocks.h:47: Block& VCSimDataBlocks::getBlock(const BlockID&): Assertion `block_num<blocks.size()' failed.
[Umbasa:09437] * Process received signal *
[Umbasa:09437] Signal: Aborted (6)
[Umbasa:09437] Signal code: (-6)
[Umbasa:09437] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7f74124b8d40]
[Umbasa:09437] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7f74124b8cc9]
[Umbasa:09437] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f74124bc0d8]
[Umbasa:09437] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2fb86) [0x7f74124b1b86]
[Umbasa:09437] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2fc32) [0x7f74124b1c32]
[Umbasa:09437] [ 5] ../../build/src/vq(_ZN10Simulation15partitionBlocksEv+0xcc5) [0x467115]

Looking at Simulation.cpp::Simulation::partitionBlocks(), I'm not clear on the scope of the arrays {local_block_ids, global_block_ids, block_node_map}. local_block_ids is allocated with new in Simulation::distributeBlocks(), but these are declared in "core/CommPartition.h", and CommPartition is inherited by Simulation. Is the re-declaration of local_block_ids in distributeBlocks() correct? And if these arrays are declared in CommPartition.h, where are they allocated?

kwschultz commented 9 years ago

I made the change and successfully ran a 10kyr sim on Kapalua (the fault model is all CA traces from VQ/examples/fault_traces/ca_traces/, actually a copy of Mark's 5km model) with 4 processes. This is not necessarily evidence that it works, though: before this I ran an AllCal 3km sim (Michael's model file, but meshed with the VQ mesher) on multiple processes, and it had some successful runs but most of them would get hung up after 4kyr+. I'll try a few more, like a 50kyr run with aftershocks on Kapalua and a few more with newly meshed 3km models on my Mac, though I never got explicit memory errors like on Linux.

kwschultz commented 9 years ago

Full output from the graduate student:

user@PHP-06:~/Desktop/NW_3_1_50000_kasey$ mpirun -np 1 ./vq ./params.prm

***

* Virtual Quake *

* Version 1.2.0 *

* QuakeLib 1.2.0 *

* MPI process count : 1 *

* OpenMP not enabled *

***

Initializing blocks.

To gracefully quit, create the file quit_vq in the run directory or use a SIGINT (Control-C).

Reading Greens function data from file Green_NW_Iran.h5.

Greens function took 0.22002 seconds.

Greens shear matrix takes 61.6003 megabytes

Greens normal matrix takes 61.6003 megabytes

vq: /home/user/Desktop/vq-master3.3/src/core/SimDataBlocks.h:47: Block& VCSimDataBlocks::getBlock(const BlockID&): Assertion `block_num<blocks.size()' failed.
[PHP-06:14746] * Process received signal *
[PHP-06:14746] Signal: Aborted (6)
[PHP-06:14746] Signal code: (-6)
[PHP-06:14746] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36c30) [0x7f760c8c4c30]
[PHP-06:14746] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7f760c8c4bb9]
[PHP-06:14746] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f760c8c7fc8]
[PHP-06:14746] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2fa76) [0x7f760c8bda76]
[PHP-06:14746] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2fb22) [0x7f760c8bdb22]
[PHP-06:14746] [ 5] ./vq() [0x4292ea]
[PHP-06:14746] [ 6] ./vq(_ZN17UpdateBlockStress4initEP12SimFramework+0x51d) [0x448e5d]
[PHP-06:14746] [ 7] ./vq(_ZN12SimFramework4initEv+0x51e) [0x45379e]
[PHP-06:14746] [ 8] ./vq(_ZN10Simulation4initEv+0x29) [0x467bf9]
[PHP-06:14746] [ 9] ./vq(main+0x109b) [0x42a47b]
[PHP-06:14746] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f760c8afec5]
[PHP-06:14746] [11] ./vq() [0x42b782]

[PHP-06:14746] * End of error message *

mpirun noticed that process rank 0 with PID 14746 on node PHP-06 exited on signal 6 (Aborted).


I also ran the test suite included in the VQ source.

The following tests FAILED: 12 - run_P1_none_12000 (OTHER_FAULT) 14 - test_slip_P1_none_12000 (Failed) 15 - test_interevent_P1_none_12000 (Failed) 18 - run_P1_none_6000 (OTHER_FAULT) 20 - test_slip_P1_none_6000 (Failed) 21 - test_interevent_P1_none_6000 (Failed) 24 - run_P1_none_4000 (OTHER_FAULT) 26 - test_slip_P1_none_4000 (Failed) 27 - test_interevent_P1_none_4000 (Failed) 30 - run_P1_none_3000 (OTHER_FAULT) 32 - test_slip_P1_none_3000 (Failed) 33 - test_interevent_P1_none_3000 (Failed) 36 - run_P1_none_2000 (OTHER_FAULT) 38 - test_slip_P1_none_2000 (Failed) 39 - test_interevent_P1_none_2000 (Failed) 42 - run_P1_taper_12000 (OTHER_FAULT) 46 - run_P1_taper_6000 (OTHER_FAULT) 50 - run_P1_taper_4000 (OTHER_FAULT) 54 - run_P1_taper_3000 (OTHER_FAULT) 58 - run_P1_taper_2000 (OTHER_FAULT) 62 - run_P1_taper_renorm_12000 (OTHER_FAULT) 66 - run_P1_taper_renorm_6000 (OTHER_FAULT) 70 - run_P1_taper_renorm_4000 (OTHER_FAULT) 74 - run_P1_taper_renorm_3000 (OTHER_FAULT) 78 - run_P1_taper_renorm_2000 (OTHER_FAULT) 82 - run_P2_none_6000 (Failed) 84 - test_slip_P2_none_6000 (Failed) 85 - test_interevent_P2_none_6000 (Failed) 88 - run_P2_none_4000 (Failed) 90 - test_slip_P2_none_4000 (Failed) 91 - test_interevent_P2_none_4000 (Failed) 94 - run_P2_none_3000 (Failed) 96 - test_slip_P2_none_3000 (Failed) 97 - test_interevent_P2_none_3000 (Failed) 100 - run_P2_none_2000 (Failed) 102 - test_slip_P2_none_2000 (Failed) 103 - test_interevent_P2_none_2000 (Failed) 106 - run_P2_taper_6000 (Failed) 110 - run_P2_taper_4000 (Failed) 114 - run_P2_taper_3000 (Failed) 118 - run_P2_taper_2000 (Failed) 122 - run_P2_taper_renorm_6000 (Failed) 126 - run_P2_taper_renorm_4000 (Failed) 130 - run_P2_taper_renorm_3000 (Failed) 134 - run_P2_taper_renorm_2000 (Failed) 138 - run_P4_none_6000 (Failed) 140 - test_slip_P4_none_6000 (Failed) 141 - test_interevent_P4_none_6000 (Failed) 144 - run_P4_none_4000 (Failed) 146 - test_slip_P4_none_4000 (Failed) 147 - test_interevent_P4_none_4000 (Failed) 150 - run_P4_none_3000 (Failed) 152 - test_slip_P4_none_3000 (Failed) 153 - test_interevent_P4_none_3000 (Failed) 156 - run_P4_none_2000 (Failed) 158 - test_slip_P4_none_2000 (Failed) 159 - test_interevent_P4_none_2000 (Failed) 162 - run_P4_taper_6000 (Failed) 166 - run_P4_taper_4000 (Failed) 170 - run_P4_taper_3000 (Failed) 174 - run_P4_taper_2000 (Failed) 178 - run_P4_taper_renorm_6000 (Failed) 182 - run_P4_taper_renorm_4000 (Failed) 186 - run_P4_taper_renorm_3000 (Failed) 190 - run_P4_taper_renorm_2000 (Failed) 194 - run_two_none_6000 (OTHER_FAULT) 196 - test_two_slip_none_6000 (Failed) 199 - run_two_none_3000 (OTHER_FAULT) 201 - test_two_slip_none_3000 (Failed) 204 - run_two_taper_6000 (OTHER_FAULT) 208 - run_two_taper_3000 (OTHER_FAULT) 212 - run_two_taper_renorm_6000 (OTHER_FAULT) 216 - run_two_taper_renorm_3000 (OTHER_FAULT) 221 - run_gen_P1_green_3000 (Failed) 222 - run_full_P1_green_3000 (Failed) 227 - run_gen_P2_green_3000 (Failed) 228 - run_full_P2_green_3000 (Failed) 233 - run_gen_P4_green_3000 (Failed) 234 - run_full_P4_green_3000 (Failed) Errors while running CTest

markyoder commented 9 years ago

I get the same error. It looks like the ::init() function in the VCInitBlocks class is not executing. I'm hoping to have this narrowed down in the next day or two.


kwschultz commented 9 years ago

It looks like the other grad student is still getting an error:

user@PHP-06:~/Desktop/Fault_NW_3_1/RUN$ mpirun -np 1 ./vq ./params_G.d

***

* Virtual Quake *

* Version 1.2.0 *

* QuakeLib 1.2.0 *

* MPI process count : 1 *

* OpenMP not enabled *

***

Initializing blocks.

To gracefully quit, create the file quit_vq in the run directory or use a SIGINT (Control-C).

Calculating Greens function with the standard Okada class....0%....2%....3%....4%....5%....6%....7%....8%....10%....11%....12%....13%....14%....15%....16%....18%....19%....20%....21%....22%....23%....24%....25%....27%....28%....29%....30%....31%....32%....33%....35%....36%....37%....38%....39%....40%....41%....43%....44%....45%....46%....47%....48%....49%....51%....52%....53%....54%....55%....56%....57%....59%....60%....61%....62%....63%....64%....65%....67%....68%....69%....70%....71%....72%....73%....75%....76%....77%....78%....79%....80%....81%....83%....84%....85%....86%....88%....89%....90%....91%....92%....94%....95%....96%....97%....98%....99%

Greens function took 468.306 seconds.

Greens shear matrix takes 71.9531 megabytes

Greens normal matrix takes 71.9531 megabytes

Greens output file: Green_NW_Iran.h5

vq: /home/user/Desktop/vq-master_3_10/src/core/SimDataBlocks.h:47: Block& VCSimDataBlocks::getBlock(const BlockID&): Assertion `block_num<blocks.size()' failed.
[PHP-06:04232] * Process received signal *
[PHP-06:04232] Signal: Aborted (6)
[PHP-06:04232] Signal code: (-6)
[PHP-06:04232] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36c30) [0x7fc9ae17fc30]
[PHP-06:04232] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7fc9ae17fbb9]
[PHP-06:04232] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7fc9ae182fc8]
[PHP-06:04232] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2fa76) [0x7fc9ae178a76]
[PHP-06:04232] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2fb22) [0x7fc9ae178b22]
[PHP-06:04232] [ 5] ./vq() [0x42980a]
[PHP-06:04232] [ 6] ./vq(_ZN17UpdateBlockStress4initEP12SimFramework+0x51d) [0x44937d]
[PHP-06:04232] [ 7] ./vq(_ZN12SimFramework4initEv+0x51e) [0x453cee]
[PHP-06:04232] [ 8] ./vq(_ZN10Simulation4initEv+0x29) [0x468149]
[PHP-06:04232] [ 9] ./vq(main+0x109b) [0x42a99b]
[PHP-06:04232] [10] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7fc9ae16aec5]
[PHP-06:04232] [11] ./vq() [0x42bca2]

[PHP-06:04232] * End of error message *

mpirun noticed that process rank 0 with PID 4232 on node PHP-06 exited on signal 6 (Aborted).

markyoder commented 9 years ago

This is a different error from what I was getting. I'd been seeing, basically, an error that no blocks had been created; this is telling us that the sim is looking for a block off the end of the blocks array (block_num >= blocks.size() -- which could conceivably be size=0, but I don't think so in this case). This could also be due to recycling a fault model -- did he recreate his fault model before running? Otherwise, it might be related to the general seg-fault problem. Standby... can we get a copy of those fault traces?


markyoder commented 9 years ago

OK, so I think I've been making progress on this. Hopefully Eric can clarify some syntax for those of us less awesome at C++. Valgrind tells me that we are not properly deallocating memory in a bunch of places: we need to use "delete [] ary" when we declare and allocate an array like "int *X = new int[n];"... which we don't do in about a thousand places. Similarly, when we use malloc() or valloc(), we need to use free(), though I'm not sure if something more needs to be done for valloc(); I only found articles for malloc(). Fixing these is clearing up the valgrind complaints one by one.
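For reference, a minimal sketch (hypothetical names, not VQ code) of the allocation/deallocation pairings valgrind is complaining about:

#include <cstdlib>   // malloc, free

int main() {
    int n = 1000;

    // new[] must be paired with delete[]; plain "delete X" here is undefined
    // behavior and is exactly the kind of mismatch valgrind reports.
    int *X = new int[n];
    // ... use X ...
    delete [] X;        // not "delete X;"

    // A single object allocated with new is released with plain delete.
    double *y = new double(0.0);
    delete y;

    // malloc()/valloc() allocations must be released with free(); mixing them
    // with delete (or never freeing them) also shows up in the valgrind report.
    double *buf = static_cast<double *>(malloc(n * sizeof(double)));
    // ... use buf ...
    free(buf);

    return 0;
}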

One area of concern: in Simulation.cpp::Simulation::collectEventSweep(quakelib::ModelSweeps &sweeps) {...} (lines 746 or so), I'm a bit confused about how to handle the sweep_counts, sweep_offsets, and all_sweeps arrays. They appear to be declared/allocated differently for root and child nodes, so nominally they need to be deallocated differently. For child nodes, we set the pointer int *sweep_offsets to NULL, and it is subsequently involved in an MPI_Gather() call, so I presume the NULL value (actually a NULL address; should it be a value?) is handled on the other end.
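For context, the receive-side arguments of a rooted gather (recvbuf, recvcounts, displs) are significant only on the root rank, so passing NULL for them on child ranks is legal MPI usage. A minimal sketch of that pattern (hypothetical names and counts, not the VQ code):

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank contributes a different number of "sweep" entries.
    int local_count = rank + 1;
    std::vector<int> local(local_count, rank);

    // Receive-side arguments are only examined on root, so children may pass NULL.
    std::vector<int> counts, offsets, all;
    int *counts_p = NULL, *offsets_p = NULL, *all_p = NULL;

    // Root first learns how much each rank will send.
    if (rank == 0) { counts.resize(size); counts_p = counts.data(); }
    MPI_Gather(&local_count, 1, MPI_INT, counts_p, 1, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        offsets.resize(size, 0);
        for (int i = 1; i < size; ++i) offsets[i] = offsets[i - 1] + counts[i - 1];
        all.resize(offsets[size - 1] + counts[size - 1]);
        offsets_p = offsets.data();
        all_p = all.data();
    }

    // Children pass NULL for recvbuf/counts/offsets; only root's values matter.
    MPI_Gatherv(local.data(), local_count, MPI_INT,
                all_p, counts_p, offsets_p, MPI_INT, 0, MPI_COMM_WORLD);

    if (rank == 0) std::printf("root gathered %zu entries\n", all.size());
    MPI_Finalize();
    return 0;
}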

Is this it??? Having corrected (??) a bunch of memory-allocation new/delete bits, we're back to GreensInit.cpp, somewhere around line 139, where we write "Global Greens shear matrix takes...", etc. It looks like this MPI call might not be happening correctly. MPI_Reduce is supposed to collect the numbers for global_shear/normal_bytes, right? But it does not seem to collect those values correctly. To make memcheck happy, I initialize these to NaN upon declaration, and I get NaN again for those values on non-root nodes -- or is that by design?
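For what it's worth, MPI_Reduce only delivers the reduced result on the root rank; the output buffer on non-root ranks is not written, so seeing the NaN initializer there would be expected behavior rather than a bug. A minimal sketch (hypothetical variable names, not VQ code):

#include <mpi.h>
#include <cmath>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local_bytes = 1024.0 * (rank + 1);   // each rank's local matrix size
    double global_bytes = std::nan("");         // initialized like the VQ variables

    // The sum lands in global_bytes ONLY on rank 0; on every other rank the
    // output buffer is untouched, so it keeps its NaN initializer.
    MPI_Reduce(&local_bytes, &global_bytes, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    std::printf("rank %d: global_bytes = %f\n", rank, global_bytes);

    MPI_Finalize();
    return 0;
}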

markyoder commented 9 years ago

And it seemed that the heisenbug was fixed, but it's back. Almost all of the bits above are addressed in the most recent pull request, BUT we still get the heisenbug, at least for big models, long runs, and MPP mode.

The most likely candidate at this point is RunEvent.cpp::RunEvent::processBlocksSecondaryFailures(). Note the MPI_Send()/MPI_Recv() bits in the middle of this code block. Basically, we have an if_root()/not_root() separation, and both the root and child nodes do a bit of both sending and receiving. It seems to me that if one node gets ahead of the others in this send/receive + if-loop process, we could get a hang-up.

... and a quick note: at least on my secondary Ubuntu platform, heisen_hang appears to occur at the same place, when the events.h5 output file reaches 221.9 MB; as of this moment it has not been modified for a little more than 2.5 hours.

markyoder commented 9 years ago

... and to make this more and more confusing, it may be the case that there is no heisen_hang. The main parameters for observing this phenomenon are: block_length <= 3km, MPP or SPP, full CA model, and HDF5 output mode; it seems to occur after about 220MB of data are collected.

So it looked like this might be in the HDF5 write code, but I just finished a run with all of the above parameters, including HDF5 output, and it finished. It DID appear to hang (on what turns out to be the last (few) event(s)), but as any good scientist does, I went to have a swim and some lunch, and when I came back it was finished. Similarly, I finished a run on kapalua with np=8 in text mode. Of course, the np=8 kapalua run using HDF5 mode has remained hung for at least the last 6 hours or so, so there might still be a problem. So, the plan at this point:

1) There are a couple of places in the write-data-out code that could, on some platforms, cause the system to hang during a big HDF5 write. Namely, check whether child nodes might be trying to write to the output file; all HDF5 writing should be done by the root node... at this time... I think. Is the "write" process blocking correctly? (A sketch of the root-only write pattern follows below.)

2) There may be no problem. Large events take a LOT of time to process, so Exploding California might look like a hang in some cases when it's really just processing a massive event. Let's address the Exploding California problem; maybe heisen_hang will go away.
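Here is a minimal sketch of the root-only-write pattern from point 1 above. This is not VQ code: write_events_chunk() is a stand-in for the serial-HDF5 event writer, and the message layout is hypothetical; the point is only that children ship their data to root and never open the output file themselves.

#include <mpi.h>
#include <vector>
#include <cstdio>

// Stand-in for the serial-HDF5 append; in VQ this would be the HDF5 event writer.
static void write_events_chunk(const double *buf, int n) {
    std::printf("root wrote %d values (first = %f)\n", n, n ? buf[0] : 0.0);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Each rank pretends to have a few event records to flush.
    std::vector<double> local(3, 10.0 * rank);
    int local_n = static_cast<int>(local.size());

    if (rank == 0) {
        // Root writes its own data, then receives and writes each child's chunk.
        write_events_chunk(local.data(), local_n);
        for (int src = 1; src < size; ++src) {
            int n;
            MPI_Recv(&n, 1, MPI_INT, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            std::vector<double> tmp(n);
            MPI_Recv(tmp.data(), n, MPI_DOUBLE, src, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            write_events_chunk(tmp.data(), n);
        }
    } else {
        // Children never touch the output file; they only send their data to root.
        MPI_Send(&local_n, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        MPI_Send(local.data(), local_n, MPI_DOUBLE, 0, 1, MPI_COMM_WORLD);
    }

    MPI_Finalize();
    return 0;
}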

markyoder commented 9 years ago

Update: I'm not sure heisen_hang and Exploding California are actually related. Mac systems appear to hang independently of large events. As per the "m>10 events" ticket, we've implemented code to mitigate exploding_california (basically, imposing max/min constraints on GF values), but Mac OS still hangs... at least in MPI, sometimes. It hangs in both HDF5 and text output modes.

Summary: the system appears to hang in processStaticFailure(), during the MPI_Allgatherv() call in sim->distributeUpdateField().
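MPI_Allgatherv is a collective: every rank in the communicator has to enter the call (with consistent counts) before any rank can leave it, so a single rank that skips the call or takes a different branch leaves all the others stuck in it, which is consistent with the backtraces. A minimal sketch of that call (hypothetical values, not VQ code):

#include <mpi.h>
#include <vector>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // One "update field" value per rank, gathered to every rank.
    double local = static_cast<double>(rank);
    std::vector<double> global(size, 0.0);
    std::vector<int> counts(size, 1), displs(size);
    for (int i = 0; i < size; ++i) displs[i] = i;

    // If even one rank takes a branch that skips this call (e.g. it thinks the
    // event is finished while the others start another sweep), every other rank
    // blocks here forever -- which is what the hang looks like.
    MPI_Allgatherv(&local, 1, MPI_DOUBLE,
                   global.data(), counts.data(), displs.data(), MPI_DOUBLE,
                   MPI_COMM_WORLD);

    if (rank == 0) std::printf("gathered %d values\n", size);
    MPI_Finalize();
    return 0;
}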

This appears to be utterly stable on Linux platforms. Note that at compile time on the Linux systems (Mint 17 and the corresponding Ubuntu 14.x distros) we do NOT get the "OpenMPI not found" errors that we see during the Mac OS installations.

using the "lldb" debugger, backtrace on all "active" processes produces something like: (lldb) thread backtrace

a "thread list" command usually produces: Process 62302 stopped

and maybe on one process: Process 62301 stopped

i think the state is cycling on the processes, so the "different" message can be hard to catch.

markyoder commented 9 years ago

... and it looks like the problem is, or may be anyway, that there are nested MPI calls in the primary/secondary block failure model. In other words, it may occur that a secondary-failure loop on one process starts making calls for blocking, waiting, distributing, etc. at the same time that another process is doing the same for primary failure events. For now, let's try more MPI_Barrier() calls, and we'll see if we can clean it up better later...

markyoder commented 9 years ago

Still hanging on a child-node (I think) MPI_Recv() call. The solution may be as simple as using MPI_Ssend() instead of MPI_Send(), the former being the "synchronous" version of the latter. See: http://stackoverflow.com/questions/17582900/difference-between-mpi-send-and-mpi-ssend

That article describes cases where MPI_Send() is happy to return (because the message was buffered) even though the corresponding MPI_Recv() has not actually been posted yet, which can let a send/receive ordering bug go unnoticed until it surfaces somewhere else as a hang.
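A minimal sketch of the difference (not VQ code): MPI_Send() is allowed to complete as soon as the small message is buffered, so an unmatched send can go unnoticed until much later, while MPI_Ssend() does not complete until the matching receive has started, so an ordering bug shows up immediately as a hang at the offending send.

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int msg = 42;
    if (rank == 0) {
        // MPI_Send may return as soon as this small message is buffered, even if
        // rank 1 never posts a matching receive; the bug then shows up later, elsewhere.
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);

        // MPI_Ssend only completes once the matching receive has started, so a
        // missing or mismatched MPI_Recv makes the program hang right here, at the
        // line that is actually wrong.
        MPI_Ssend(&msg, 1, MPI_INT, 1, 1, MPI_COMM_WORLD);
        std::printf("rank 0: both sends matched\n");
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Recv(&msg, 1, MPI_INT, 0, 1, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        std::printf("rank 1: received both messages\n");
    }

    MPI_Finalize();
    return 0;
}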