geodynamics / vq

Virtual Quake is a boundary element code designed to investigate long term fault system behavior and interactions between faults through stress transfer.

Greens functions tests fail #71

Closed markyoder closed 9 years ago

markyoder commented 9 years ago

using the standard set of tests ("make test" from the build/ directory), we consistently get 5 failed tests:

The following tests FAILED:
222 - check_sum_P1_green_3000 (Failed)
229 - check_sum_P2_green_3000 (Failed)
230 - run_full_P2_green_3000 (Failed)
236 - check_sum_P4_green_3000 (Failed)
237 - run_full_P4_green_3000 (Failed)

note that the corresponding single-processor test run_full_P1_green_3000 does pass:

Start 223: run_full_P1_green_3000
223/238 Test #223: run_full_P1_green_3000 .................. Passed 0.19 sec

as do a number of other green tests on single and multiple processors.

first of all, what are these tests? next will be to determine if they're failing a lot or a little (aka, if g_1 ~ g_2, but not g_1==g_2, due to floating point errors or something like that). of course, all input is welcomed and invited and will be rewarded as best we can muster.

eheien commented 9 years ago

The green tests check whether Greens function values are calculated as expected. P1, P2, P4 indicate how many processors the calculation runs on (1, 2, or 4), so this means there's an issue (probably with file output) on multiple processors.

eheien commented 9 years ago

The relevant checking code is in examples/sum_greens.py, and it does allow some error tolerance on the values.

markyoder commented 9 years ago

As always, heroic. Thanks Eric.


markyoder commented 9 years ago

so looking at the test environment, in this case build/examples/GREENS_P{n}, if we hijack the configuration files and run in MPP mode, we get an aborted run:

vq: /home/myoder/Documents/Research/Yoder/VQ/vq/src/simulation/UpdateBlockStress.cpp:149: virtual SimRequest UpdateBlockStress::run(SimFramework*): Assertion `next_event_global.val < double(1.79769313486231570815e+308L)' failed.
[hatton:14294] *** Process received signal ***
[hatton:14294] Signal: Aborted (6)
[hatton:14294] Signal code: (-6)
[hatton:14294] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x36d40) [0x7f7b38f58d40]
[hatton:14294] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x39) [0x7f7b38f58cc9]
[hatton:14294] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f7b38f5c0d8]
[hatton:14294] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x2fb86) [0x7f7b38f51b86]
[hatton:14294] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x2fc32) [0x7f7b38f51c32]
[hatton:14294] [ 5] ../../src/vq(_ZN17UpdateBlockStress3runEP12SimFramework+0x75e) [0x44a86e]
[hatton:14294] [ 6] ../../src/vq(_ZN12SimFramework3runEv+0x2cd) [0x454e5d]
[hatton:14294] [ 7] ../../src/vq(main+0x10a3) [0x42aa13]
[hatton:14294] [ 8] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f7b38f43ec5]
[hatton:14294] [ 9] ../../src/vq() [0x42bd12]
[hatton:14294] *** End of error message ***

mpirun noticed that process rank 0 with PID 14294 on node hatton exited on signal 6 (Aborted).

we get a successful run in SPP mode (this is consistent with run_full_P1 completing successfully; the ..._P{>1} runs throw errors).

as for the Greens check-sums: i'm still a bit off from fully understanding it, but it looks like the normal Greens functions finish within parameters; the shear do not. "expected" sum(greens_functions) values are specified in the test code (CMakeLists.txt), specifically (6.9056016275796917e-08, -91753588.690448046) for (shear, normal). when i run the script manually, i get:

myoder@hatton:~/Documents/Research/Yoder/VQ/vq/examples$ python sum_greens.py ../build/examples/GREENS_P4/greens_3000.h5 6.9056016275796917e-08 -91753588.690448046
('Type', 'Expected', 'Actual', 'Error')
('Normal', 6.905601627579692e-08, 2.9693741173858417e-08, 0.57000500788711694)
('Shear', -91753588.69044805, -91753588.690418527, 3.2172256961644906e-13)

this is the case for P1, P2, and P4. note that "error" is the normalized error, (x_expected - x_actual)/max(x_expected, x_actual).
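for concreteness, here's a tiny Python sketch of that metric (the helper name is illustrative, not the actual code in sum_greens.py); plugging in the Normal row from the P4 output above reproduces the reported error:

```python
# hypothetical helper reproducing the normalized-error metric reported above;
# absolute values make it behave the same for the negative shear sums.
def normalized_error(expected, actual):
    """Difference scaled by the larger magnitude of the two values."""
    return abs(expected - actual) / max(abs(expected), abs(actual))

# the "Normal" row from the P4 run above:
err = normalized_error(6.905601627579692e-08, 2.9693741173858417e-08)
print(err)  # ~0.570005..., matching the reported 0.57000500788711694
```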

markyoder commented 9 years ago

so the astute reader will observe that i just chucked a bunch of commentary on this issue. 1) the check_sum errors can be chalked up to a misleading test (and result). the shear values are very very small, so they come out very close to zero, but still with fairly large relative error. so, i changed the test to fail only if linear AND geometric errors are triggered (a-b > x_0 AND a/b > x_1) -- or something like this; i may make minor revisions to the specifics.
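a minimal sketch of that combined pass/fail rule, with placeholder thresholds (the function name and the specific x_0, x_1 values are illustrative; the committed test may differ):

```python
# fail only when BOTH the linear (absolute) and geometric (relative) errors
# exceed their thresholds; a tiny sum with large relative error but negligible
# absolute error no longer trips the test. thresholds here are placeholders.
def check_sum_fails(expected, actual, abs_tol=1e-6, rel_tol=1e-3):
    linear_err = abs(expected - actual)
    # guard max() with a tiny floor so two exact zeros don't divide by zero
    geometric_err = linear_err / max(abs(expected), abs(actual), 1e-300)
    return linear_err > abs_tol and geometric_err > rel_tol

# the near-zero shear-style case: relative error ~0.57, absolute error ~4e-8
print(check_sum_fails(6.9056e-08, 2.9694e-08))  # False: passes the revised test
print(check_sum_fails(100.0, 50.0))             # True: fails both criteria
```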

that leaves the full_run errors. these (appear to) occur 1) only when running in MPI mode and 2) only when using normal stress (sim.greens.use_normal = true, or at least NOT sim.greens.use_normal = false in the config file). so for use_normal=false, which is how we usually run, everything is fine... or so far as we know at this time at least.

so, questions while chasing down the MPP use_normal=true case: there may be some confusion about when to use local vs. global blockID values; it's worth checking that this is being handled properly in the shear case as well. the problem appears to be that the variable "rhogd" (which, from UpdateBlockStress.cpp::UpdateBlockStress::init(), we see is set via sim->setRhogd(gid, rho_g_depth) -- hence the variable name) is not being properly set/assigned for MPP runs; presumably this suggests a global/local ID confusion or a missing/misguided MPI call.

markyoder commented 9 years ago

so we need to distribute rhogd[] values to colleague nodes when we redistribute stress values. specifically, see UpdateBlockStress.cpp:UpdateBlockStress::init() in the "#ifdef MPI_C_FOUND" section.

i think this section distributes the stress_drop values one by one via MPI_Bcast; maybe a similar framework is appropriate for the setRhogd() section above (aka, setRhogd() locally, then MPI_Bcast the value for MPI runs).
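here's an MPI-free Python sketch of that "set locally, then broadcast" pattern (the function names and round-robin ownership scheme are illustrative only; the real code would do this per value with MPI_Bcast in UpdateBlockStress::init()):

```python
# simulate the pattern without MPI: each rank computes rhogd only for the
# global block IDs it owns, then every value is "broadcast" from its owner
# so all ranks end up holding the full array.
def distribute_rhogd(num_blocks, num_ranks, owner_of, compute_rhogd):
    # each rank's local view; unset entries start as None
    local = [[None] * num_blocks for _ in range(num_ranks)]
    # step 1: each rank sets rhogd for the global IDs it owns
    for gid in range(num_blocks):
        local[owner_of(gid)][gid] = compute_rhogd(gid)
    # step 2: broadcast each value from its owner rank to all other ranks
    for gid in range(num_blocks):
        val = local[owner_of(gid)][gid]
        for rank in range(num_ranks):
            local[rank][gid] = val
    return local

ranks = distribute_rhogd(
    num_blocks=6, num_ranks=2,
    owner_of=lambda gid: gid % 2,            # round-robin ownership (assumed)
    compute_rhogd=lambda gid: 1000.0 + gid,  # stand-in for rho * g * depth
)
# after distribution, every rank holds every value:
print(all(None not in r for r in ranks))  # True
```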

markyoder commented 9 years ago

this issue will be resolved with the latest pull request.

markyoder commented 9 years ago

This (should be) resolved with the recent merge/pull request.