STEllAR-GROUP / octotiger

Astrophysics program simulating the evolution of star systems based on the fast multipole method on adaptive Octrees
http://octotiger.stellar-group.org/
Boost Software License 1.0

Debugging Octo-Tiger on Grace Hopper #496

Open diehlpk opened 2 months ago

diehlpk commented 2 months ago

I just pushed a branch called verbose_debug. To enable the debugging output, set --verbose=1; to disable it, set --verbose=0. I've attached an example of the output. It prints a message at the beginning and end of each function, along with the start time and the time elapsed during the function's execution. When the message contains something like "(from root)", the code is inside a function that executes for every node, and only the root node is emitting output.
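For readers who want to picture how BEGIN/END output like this can be produced, here is a minimal sketch of a scope guard that prints a line on entry and another, with the elapsed time, on exit. It is not the code on the verbose_debug branch: `scoped_trace`, `TRACE_SCOPE` and `regrid_example` are hypothetical names, the `verbose` flag is passed in by hand rather than read from `--verbose`, and the leading `t=` timestamp is left out.

```cpp
#include <cassert>
#include <chrono>
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical helper (an assumption for illustration, not Octo-Tiger code).
// Prints "BEGIN: ..." when constructed and "END  : ... (N s elapsed)" when destroyed.
class scoped_trace {
public:
    scoped_trace(std::ostream& out, bool verbose, const std::string& label,
                 const char* file, int line, const char* function)
      : out_(out), verbose_(verbose),
        start_(std::chrono::steady_clock::now()) {
        std::ostringstream where;
        where << label << " (file: " << file << ", line: " << line
              << ", function: " << function << ")";
        where_ = where.str();
        // std::endl flushes, so the line is visible even if the program hangs next.
        if (verbose_) out_ << "BEGIN: " << where_ << std::endl;
    }

    ~scoped_trace() {
        if (!verbose_) return;
        const std::chrono::duration<double> dt =
            std::chrono::steady_clock::now() - start_;
        out_ << "END  : " << where_ << " (" << dt.count() << " s elapsed)"
             << std::endl;
    }

    scoped_trace(const scoped_trace&) = delete;
    scoped_trace& operator=(const scoped_trace&) = delete;

private:
    std::ostream& out_;
    bool verbose_;
    std::chrono::steady_clock::time_point start_;
    std::string where_;
};

// Convenience macro that records the call site automatically.
#define TRACE_SCOPE(out, verbose, label) \
    scoped_trace trace_scope_guard_((out), (verbose), (label), __FILE__, __LINE__, __func__)

// Example use: everything between the macro and the closing brace is timed.
inline void regrid_example(bool verbose) {
    TRACE_SCOPE(std::cout, verbose, "regrid");
    // ... the work being traced would go here ...
}
```

Flushing on every line matters for hang hunting: a BEGIN line that is still sitting in a buffer when the program stalls would point at the wrong place.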

diehlpk commented 2 months ago

@dmarce1 I tried to compile the new branch and got the following error:

2 errors found in build log:
     123    -- Octo-Tiger will use Kokkos Serial Execution Space for (Kokkos CPU) Hydro kernels!
     124    INFO Building with fp_contract=off
     125    -- Octo-Tiger max nf: 15
     126    -- Octo-Tiger minimal allowed theta: 0.34
     127    INFO Used Octo-Tiger commit: 02cf56d9bc2b4852022886f5cff6a39bb7438a07
     128    -- Configuring done
  >> 129    CMake Error at /users/diehlpk/spack/opt/spack/linux-sles15-neoverse_v2/gcc-12.3.0/hpx-1.9.1-4e54quutjtm4nz
            4y447r5kanti3odvn6/lib64/cmake/HPX/HPX_AddLibrary.cmake:235 (add_library):
     130      Cannot find source file:
     131    
     132        octotiger/verbose.hpp
     133    
     134      Tried extensions .c .C .c++ .cc .cpp .cxx .cu .mpp .m .M .mm .h .hh .h++
     135      .hm .hpp .hxx .in .txx .f .F .for .f77 .f90 .f95 .f03 .ispc
     136    Call Stack (most recent call first):
     137      CMakeLists.txt:349 (add_hpx_library)
     138    
     139    
  >> 140    CMake Error at /users/diehlpk/spack/opt/spack/linux-sles15-neoverse_v2/gcc-12.3.0/hpx-1.9.1-4e54quutjtm4nz
            4y447r5kanti3odvn6/lib64/cmake/HPX/HPX_AddLibrary.cmake:235 (add_library):
     141      No SOURCES given to target: octolib
     142    Call Stack (most recent call first):
     143      CMakeLists.txt:349 (add_hpx_library)
     144    
     145    
     146    CMake Generate step failed.  Build files cannot be regenerated correctly.

diehlpk commented 2 months ago

cc @G-071 and @JiakunYan

diehlpk commented 1 month ago

The code hangs here

New Omega = 9.687093e-01
t=21 END  : DWD step (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 396, function: execute_solver) (2.939855e+00 s elapsed)
TS 16:: t: 2.423595e+02, dt: 1.655967e-03, time_elapsed: 3.066656e+00, rotational_time: 2.347759e+02, x: 1.492980e+00, y: -4.062294e+00, z: -3.587781e-01, a: 3.727446e+00, ur: 2.053696e-06, ul: 2.037199e-06, vr: 6.826120e-01, vl: 6.800910e-01, dim: 0, ngrids: 8393, leafs: 7344, amr_boundaries: 5960
t=21 BEGIN: regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 458, function: execute_solver)
-----------------------------------------------
t=21 BEGIN: check for refinement (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 251, function: regrid)
t=21 END  : check for refinement (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 251, function: regrid) (3.230400e-02 s elapsed)
t=21 BEGIN: regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 259, function: regrid)
t=21 BEGIN: gather (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 261, function: regrid)
          (rebalancing 8489 nodes with 7428 leaves)
t=21 END  : gather (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 261, function: regrid) (4.129000e-03 s elapsed)
t=21 BEGIN: scatter (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 266, function: regrid)
t=21 END  : scatter (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 266, function: regrid) (1.547000e-02 s elapsed)
t=21 BEGIN: form tree connections (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 271, function: regrid)
          (6072 amr boundaries)
t=21 END  : form tree connections (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 271, function: regrid) (1.114730e-01 s elapsed)
t=21 BEGIN: solve gravity (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 276, function: regrid)
t=21 BEGIN: (root node) computing FMM (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 329, function: solve_gravity)
t=22 END  : (root node) computing FMM (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 329, function: solve_gravity) (3.010200e-02 s elapsed)
t=22 END  : solve gravity (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 276, function: regrid) (7.064400e-02 s elapsed)
t=22 END  : regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 259, function: regrid) (2.020480e-01 s elapsed)
t=22 END  : regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 458, function: execute_solver) (2.345370e-01 s elapsed)
t=22 END  : main execution loop iteration (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 363, function: execute_solver) (3.301262e+00 s elapsed)
t=22 BEGIN: main execution loop iteration (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 363, function: execute_solver)
t=22 BEGIN: DWD step (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 396, function: execute_solver)

dmarce1 commented 1 month ago

Patrick-

This narrows it down a bit. It looks like the hang is between entry into the main loop and the point where the distributed part of the solver kicks in. I may need to add some more debugging output to figure out exactly where it is, though. If so, I'll push something today.
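One way to read a trace like the one above is to list every BEGIN entry that never gets a matching END; those are the scopes the program was still inside when it stopped. Below is a rough sketch of that idea. The helper names `trace_key` and `unmatched_begins` are made up for illustration, and the parser assumes the exact `BEGIN:` / `END  :` layout seen in the pasted output (it would need adjusting if the format changes).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: extract "label (file: ..., line: ..., function: ...)"
// from a trace line, or return "" if the marker or location is missing.
inline std::string trace_key(const std::string& line, const std::string& marker) {
    const std::size_t m = line.find(marker);
    if (m == std::string::npos) return "";
    const std::size_t start = m + marker.size();
    // Labels may contain parentheses, e.g. "(root node) computing FMM",
    // so anchor on the " (file:" part rather than on the first '('.
    const std::size_t file = line.find(" (file:", start);
    if (file == std::string::npos) return "";
    const std::size_t close = line.find(')', file);
    if (close == std::string::npos) return "";
    return line.substr(start, close - start + 1);
}

// Returns the BEGIN entries without a matching END, outermost first.
// END lines whose BEGIN is not in the excerpt are ignored.
inline std::vector<std::string> unmatched_begins(const std::vector<std::string>& lines) {
    std::vector<std::string> open;
    for (const std::string& line : lines) {
        const std::string begin = trace_key(line, " BEGIN: ");
        if (!begin.empty()) {
            open.push_back(begin);
            continue;
        }
        const std::string end = trace_key(line, " END  : ");
        if (end.empty()) continue;
        // Close the most recent BEGIN with the same label and location.
        for (std::size_t i = open.size(); i-- > 0;) {
            if (open[i] == end) {
                open.erase(open.begin() + static_cast<std::ptrdiff_t>(i));
                break;
            }
        }
    }
    return open;
}
```

Applied to the output Patrick posted, this would leave "main execution loop iteration" and "DWD step" as the open scopes, which matches the reading that the hang happens after entering the main loop.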

Thanks, Dominic

On Thu, Sep 19, 2024, 11:20 Patrick Diehl @.***> wrote:

The code hangs here

diehlpk commented 1 month ago

Here is the new output:

diagnostics...
New Omega = 9.687093e-01
t=25 END  : DWD step (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 396, function: execute_solver) (2.890631e+00 s elapsed)
TS 16:: t: 2.423595e+02, dt: 1.655967e-03, time_elapsed: 3.017211e+00, rotational_time: 2.347759e+02, x: 1.492980e+00, y: -4.062294e+00, z: -3.587781e-01, a: 3.727446e+00, ur: 2.053696e-06, ul: 2.037199e-06, vr: 6.826120e-01, vl: 6.800910e-01, dim: 0, ngrids: 8393, leafs: 7344, amr_boundaries: 5960
t=25 BEGIN: regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 458, function: execute_solver)
-----------------------------------------------
t=25 BEGIN: check for refinement (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 251, function: regrid)
t=25 END  : check for refinement (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 251, function: regrid) (3.727400e-02 s elapsed)
t=25 BEGIN: regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 259, function: regrid)
t=25 BEGIN: gather (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 261, function: regrid)
          (rebalancing 8489 nodes with 7428 leaves)
t=25 END  : gather (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 261, function: regrid) (1.043100e-02 s elapsed)
t=25 BEGIN: scatter (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 266, function: regrid)
t=25 END  : scatter (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 266, function: regrid) (1.588200e-02 s elapsed)
t=25 BEGIN: form tree connections (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 271, function: regrid)
          (6072 amr boundaries)
t=26 END  : form tree connections (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 271, function: regrid) (1.025130e-01 s elapsed)
t=26 BEGIN: solve gravity (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 276, function: regrid)
t=26 BEGIN: (root node) computing FMM (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 329, function: solve_gravity)
t=26 END  : (root node) computing FMM (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 329, function: solve_gravity) (4.937500e-02 s elapsed)
t=26 END  : solve gravity (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 276, function: regrid) (7.951700e-02 s elapsed)
t=26 END  : regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_1.cpp, line: 259, function: regrid) (2.084010e-01 s elapsed)
t=26 END  : regrid (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 458, function: execute_solver) (2.457240e-01 s elapsed)
t=26 END  : main execution loop iteration (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 363, function: execute_solver) (3.262951e+00 s elapsed)
t=26 BEGIN: main execution loop iteration (file: /users/diehlpk/compile/octotiger/src/node_server_actions_3.cpp, line: 363, function: execute_solver)

dmarce1 commented 1 month ago

Is this running without SILO output enabled? I think the problem may be SILO related. If it is being run with SILO output, can you please re-run it with disable_output=on?

dmarce1 commented 1 month ago

I think I have the bug narrowed down to diagnostics(). It is likely in the section of code from node_server_actions_2.cpp shown below. I have added some more debug output, which will hopefully let us narrow it down further.

EDIT: My bet is that this is in all_hydro_bounds. If so, verbose debugging output may not narrow it down any further than which kind of boundary exchange is involved. There are three kinds: a) the restrict step, which updates refined cells from their children; b) the decomp step, which exchanges ghost cells between grids on the same level; and c) the AMR step, which interpolates ghost cells at AMR boundaries.

diagnostics_t node_server::diagnostics(const diagnostics_t &diags) {
    if (is_refined) {
        auto rc = hpx::async(hpx::annotated_function([&]() {
            return child_diagnostics(diags);
        }, "diagnostics::return_child_diagnostics"));
        all_hydro_bounds();
        auto diags = GET(rc);
        return diags;
    } else {
        all_hydro_bounds();
        return local_diagnostics(diags);
    }
}
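The refined branch above has two things in flight: the asynchronous child diagnostics and the boundary exchange. If both were exposed as futures, a timed wait could tell which of the two never completes. The sketch below shows that idea with the standard library's `std::future` swapped in for HPX futures. `ready_within` and `stalled_phase` are hypothetical helpers, and whether the HPX futures in Octo-Tiger can be polled the same way is an assumption that has not been checked here.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <string>

// Returns true if the future becomes ready before the timeout expires.
// A deferred future (std::launch::deferred) is reported as not ready.
inline bool ready_within(std::future<void>& f, std::chrono::milliseconds timeout) {
    return f.wait_for(timeout) == std::future_status::ready;
}

// Hypothetical helper: names the first phase that is still running after
// the timeout, or returns "" if both phases finished in time.
inline std::string stalled_phase(std::future<void>& child_diagnostics,
                                 std::future<void>& hydro_bounds,
                                 std::chrono::milliseconds timeout) {
    if (!ready_within(hydro_bounds, timeout)) return "all_hydro_bounds";
    if (!ready_within(child_diagnostics, timeout)) return "child_diagnostics";
    return "";
}
```

The futures are taken by reference and left untouched, so the caller can still call `get()` on them afterwards. A timeout only says a phase is slow or stuck; it cannot tell the two apart, so the value would need to be well above the normal runtime of a step.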