cwsmith / omegahPetsc

omegah+petsc
BSD 3-Clause "New" or "Revised" License

Running Box Mesh with Multiple Processes #2

Closed qing42102 closed 4 years ago

qing42102 commented 4 years ago

Running ex12 with the box mesh and more than 2 processes produces this error for most of the tests:

assertion a % b == 0 failed at /omega_h/src/Omega_h_scalar.hpp +282

This is probably caused by Omega_h failing to partition the box mesh across more than 2 processes. Since the box mesh is not essential to the end goal, this issue is not urgent.

cwsmith commented 4 years ago

@qing42102 Can you link the commit that can be built and run to produce this error? Also, if you have it available, please post the call stack trace at the point of failure.

qing42102 commented 4 years ago

@cwsmith Here's the commit link. You can run it with one of the simpler tests to produce the error:

-run_type test -refinement_limit 0.0 -bc_type dirichlet -interpolate 1 -petscspace_degree 1 -show_initial -dm_plex_print_fem 1

For the call stack trace, do you mean the backtrace from gdb? If so, I am not sure how to run with multiple processes in gdb.

cwsmith commented 4 years ago

@qing42102 Thank you. Yeah, the GDB backtrace (or where) command at the point of failure will print the function stack trace. In a multi-process MPI program it is a little trickier. On AiMOS you can generate core files and read them into GDB:

https://secure.cci.rpi.edu/wiki/clusters/DCS_Supercomputer/#debugging

I'm not 100% sure that assert(...) will trigger core file generation. We'll have to try it.
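One thing that may help here: the standard assert(...) macro ends in abort(), which raises SIGABRT, and on Linux the default action for SIGABRT is to terminate with a core dump. Whether a file is actually written still depends on the core-size resource limit and on how the system is configured. Below is a minimal sketch, assuming a POSIX system, of raising the soft core limit to the hard limit from inside the program. The helper name enable_core_dumps is hypothetical and is not part of ex12 or Omega_h.

```cpp
#include <sys/resource.h>  // getrlimit, setrlimit, RLIMIT_CORE (POSIX)

// Hypothetical helper: raise this process's soft core-file size limit
// to its hard limit, so that a SIGABRT is allowed to write a core file.
// Returns true if both the query and the update succeeded.
bool enable_core_dumps() {
  struct rlimit lim;
  if (getrlimit(RLIMIT_CORE, &lim) != 0) {
    return false;  // could not read the current limits
  }
  // An unprivileged process may always raise its soft limit up to
  // the hard limit; it cannot exceed the hard limit.
  lim.rlim_cur = lim.rlim_max;
  return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

Calling this once near the start of main, before the mesh is created, would be the natural place. If the hard limit itself is 0, no core file will be written regardless, so the limit set by the job environment still matters.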

qing42102 commented 4 years ago

@cwsmith Here's the call stack I got:

#0  0x00007fffb31dfbf0 in raise () from /usr/lib64/libc.so.6
#1  0x00007fffb31e1f6c in abort () from /usr/lib64/libc.so.6
#2  0x0000000010087f10 in Omega_h::fail(char const*, ...) ()
#3  0x000000001004fba0 in int Omega_h::divide_no_remainder<int>(int, int) ()
#4  0x0000000010405b08 in Omega_h::bi_partition(std::shared_ptr<Omega_h::Comm>, Omega_h::Read<signed char>) ()
#5  0x000000001008f8c8 in Omega_h::inertia::recursively_bisect(std::shared_ptr<Omega_h::Comm>, double, Omega_h::Reals*, Omega_h::Reals*, Omega_h::Remotes*, Omega_h::inertia::Rib*) ()
#6  0x000000001011577c in Omega_h::Mesh::balance(bool) ()
#7  0x000000001004b2e0 in Omega_h::build_box(std::shared_ptr<Omega_h::Comm>, Omega_h_Family, double, double, double, int, int, int, bool) ()
#8  0x000000001000ef40 in CreateQuadMesh (
    comm=0x7fffb3841290 <ompi_mpi_comm_world>, dm=0x7fffc3c09908, 
    options=0x7fffc3c09938)
    at /gpfs/u/home/MPFS/MPFSzhqg/barn/omegahPetsc/ex12.cpp:553
#9  0x000000001000ffd8 in CreateMesh (comm=0x7fffb3841290 <ompi_mpi_comm_world>, 
    user=0x7fffc3c09938, dm=0x7fffc3c09908)
    at /gpfs/u/home/MPFS/MPFSzhqg/barn/omegahPetsc/ex12.cpp:655
#10 0x000000001001628c in main (argc=14, argv=0x7fffc3c0a6a8)
    at /gpfs/u/home/MPFS/MPFSzhqg/barn/omegahPetsc/ex12.cpp:1100

I think our guess is right that the partitioning across multiple processes failed.

cwsmith commented 4 years ago

How many processes was this? The recursive geometric partitioner used in Omega_h may only support partitioning to powers of 2.
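To illustrate why a recursive bisection scheme would trip an a % b == 0 check, here is a toy sketch. It is not Omega_h's implementation: the functions below are simplified stand-ins that only track the number of ranks in each group, and divide_no_remainder throws an exception instead of aborting. The assumption is that each bisection step splits a group of ranks into two equal halves.

```cpp
#include <stdexcept>

// Simplified stand-in for a "divide with no remainder" check.
// Throws instead of aborting so callers can observe the failure.
int divide_no_remainder(int a, int b) {
  if (b == 0 || a % b != 0) {
    throw std::runtime_error("assertion a % b == 0 failed");
  }
  return a / b;
}

// Toy recursive bisection over a group of `nranks` ranks.
// Each step splits the group into two equal halves until every
// group holds exactly one rank. Returns the number of leaf groups.
int recursively_bisect(int nranks) {
  if (nranks < 1) {
    throw std::invalid_argument("nranks must be positive");
  }
  if (nranks == 1) {
    return 1;
  }
  int half = divide_no_remainder(nranks, 2);
  return recursively_bisect(half) + recursively_bisect(half);
}

// True if the toy bisection completes for this rank count.
bool can_bisect(int nranks) {
  try {
    recursively_bisect(nranks);
    return true;
  } catch (const std::exception&) {
    return false;
  }
}

// Cheap guard an application could apply before asking for a partition.
bool is_power_of_two(int n) {
  return n > 0 && (n & (n - 1)) == 0;
}
```

In this sketch can_bisect succeeds exactly when is_power_of_two does: 3 fails at the first split, and 6 fails one level down when a half of size 3 has to be split again. A similar guard on the communicator size before building the box mesh would turn the assertion into a clearer error message.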

qing42102 commented 4 years ago

@cwsmith That was 3 processes, so the power-of-2 restriction makes sense. I tried increasing the number of boxes from 2 by 2 to 20 by 20, and the partitioning worked for 2, 4, and 8 processes.

cwsmith commented 4 years ago

Given that the source of the problem was found, I'm closing the issue. Please reopen if needed.