Closed qing42102 closed 4 years ago
@qing42102 Can you link the commit that can be built and run to produce this error? Also, if you have it available, please post the call stack trace at the point of failure.
@cwsmith Here's the commit link. You can run it with one of the simpler tests to produce the error:
-run_type test -refinement_limit 0.0 -bc_type dirichlet -interpolate 1 -petscspace_degree 1 -show_initial -dm_plex_print_fem 1
For the call stack trace, do you mean the `backtrace` from `gdb`? If so, I am not sure how to run with multiple processes in `gdb`.
@qing42102 Thank you. Yeah, the GDB `backtrace` (or `where`) command at the point of failure will print the function stack trace. In a multi-process MPI program it is a little trickier. On AiMOS you can generate core files and read them into GDB:
https://secure.cci.rpi.edu/wiki/clusters/DCS_Supercomputer/#debugging
I'm not 100% sure that `assert(...)` will trigger core file generation. We'll have to try it.
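On the core file question: a failed `assert` calls `abort()`, which raises `SIGABRT`, and the default action for that signal is to terminate with a core dump. Whether a file is actually written still depends on the process's core size limit and on how the system is configured. As a minimal sketch (POSIX only, and `enable_core_dumps` is a hypothetical helper name, not part of ex12 or Omega_h), the soft limit can be raised to the hard limit early in `main`:

```cpp
#include <sys/resource.h>

// Hypothetical helper: raise the soft core-file size limit to the hard
// limit for the current process. Returns true if both calls succeed.
// A hard limit of 0 (set by the administrator) still prevents core files.
bool enable_core_dumps() {
  struct rlimit lim;
  if (getrlimit(RLIMIT_CORE, &lim) != 0) {
    return false;  // could not query the current limit
  }
  lim.rlim_cur = lim.rlim_max;  // soft limit may always be raised up to the hard limit
  return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

Calling this before the mesh is created would have the same effect as raising the limit in the shell before launching the job, assuming the cluster does not cap the hard limit at zero.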
@cwsmith Here's the call stack I got:
#0 0x00007fffb31dfbf0 in raise () from /usr/lib64/libc.so.6
#1 0x00007fffb31e1f6c in abort () from /usr/lib64/libc.so.6
#2 0x0000000010087f10 in Omega_h::fail(char const*, ...) ()
#3 0x000000001004fba0 in int Omega_h::divide_no_remainder<int>(int, int) ()
#4 0x0000000010405b08 in Omega_h::bi_partition(std::shared_ptr<Omega_h::Comm>, Omega_h::Read<signed char>) ()
#5 0x000000001008f8c8 in Omega_h::inertia::recursively_bisect(std::shared_ptr<Omega_h::Comm>, double, Omega_h::Reals*, Omega_h::Reals*, Omega_h::Remotes*, Omega_h::inertia::Rib*) ()
#6 0x000000001011577c in Omega_h::Mesh::balance(bool) ()
#7 0x000000001004b2e0 in Omega_h::build_box(std::shared_ptr<Omega_h::Comm>, Omega_h_Family, double, double, double, int, int, int, bool) ()
#8 0x000000001000ef40 in CreateQuadMesh (
comm=0x7fffb3841290 <ompi_mpi_comm_world>, dm=0x7fffc3c09908,
options=0x7fffc3c09938)
at /gpfs/u/home/MPFS/MPFSzhqg/barn/omegahPetsc/ex12.cpp:553
#9 0x000000001000ffd8 in CreateMesh (comm=0x7fffb3841290 <ompi_mpi_comm_world>,
user=0x7fffc3c09938, dm=0x7fffc3c09908)
at /gpfs/u/home/MPFS/MPFSzhqg/barn/omegahPetsc/ex12.cpp:655
#10 0x000000001001628c in main (argc=14, argv=0x7fffc3c0a6a8)
at /gpfs/u/home/MPFS/MPFSzhqg/barn/omegahPetsc/ex12.cpp:1100
I think our guess is right that the partitioning across multiple processes failed.
How many processes was this? The recursive geometric partitioner used in Omega_h may only support partitioning to powers of 2.
@cwsmith That was 3 processes, so a power-of-2 restriction would explain the failure. I tried increasing the number of boxes from 2 by 2 to 20 by 20, and the partitioning worked for 2, 4, and 8 processes.
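The backtrace is consistent with this: `divide_no_remainder<int>` fails inside `bi_partition`, which is what one would expect if the process count is halved repeatedly and 3 does not halve evenly. A rough sketch of a guard that could run before the box mesh is built is below. Both function names are hypothetical, and the toy bisection loop only models the idea of repeated halving. It is not Omega_h's implementation.

```cpp
#include <stdexcept>

// True when n is a positive power of two (1, 2, 4, 8, ...).
bool is_power_of_two(int n) {
  return n > 0 && (n & (n - 1)) == 0;
}

// Toy model of recursive bisection (an assumption for illustration):
// halve the rank count until one rank remains and return the number of
// levels. Throws if any split would leave a remainder, which mirrors the
// divide_no_remainder failure seen in the backtrace.
int count_bisection_levels(int nranks) {
  if (nranks < 1) {
    throw std::invalid_argument("need at least one rank");
  }
  int levels = 0;
  while (nranks > 1) {
    if (nranks % 2 != 0) {
      throw std::runtime_error("rank count does not divide evenly");
    }
    nranks /= 2;
    ++levels;
  }
  return levels;
}
```

In ex12 the rank count would come from `MPI_Comm_size`, and a check like `is_power_of_two(size)` could print a readable error instead of aborting deep inside the partitioner.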
Given that the source of the problem was found, I'm closing the issue. Please reopen if needed.
Running ex12 with the box mesh on more than 2 processes produces this error for most of the tests:
This is probably related to Omega_h failing to partition the box mesh across more than 2 processes. Since the box mesh is not essential to the end goal, this issue is not urgent.