AMReX-Astro / MAESTROeX

A C++ low Mach number stellar hydrodynamics code
https://amrex-astro.github.io/MAESTROeX/
BSD 3-Clause "New" or "Revised" License

wdconvect stopped early & segmentation fault on GPU #454

Closed andrewsilver1997 closed 3 months ago

andrewsilver1997 commented 5 months ago

I'm running the wdconvect problem with only OMP enabled. The input file is inputs_3d_C.128. The simulation seems very unstable and occasionally stops partway through with the following message: amrex::Abort::0::ERROR: ncenter invalid in Diag() !!!

And when I run the simulation with both GPU and OMP enabled, I get a segmentation fault.

zingale commented 5 months ago

okay, just to be clear, this is in MAESTROeX and not MAESTRO, correct?

It does appear that we are missing some OMP reductions in the diag code.
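To illustrate what a missing OMP reduction looks like, here is a minimal standalone sketch. It is not the MAESTROeX diagnostics code: the function names `max_abs` and `count_near_zero` are hypothetical, and it assumes the problem is of the usual kind, where several threads update one shared accumulator inside a parallel loop. Without a `reduction` clause, concurrent updates can be lost and a count or maximum can come out wrong from run to run.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a diagnostic that scans all cells for a maximum.
// Racy pattern (not shown): threads write to a shared `vmax` with no
// synchronization. Here `reduction(max : vmax)` gives each thread a private
// copy that OpenMP combines when the loop ends.
double max_abs(const std::vector<double>& cells) {
    assert(!cells.empty());
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(cells.size());
    double vmax = 0.0;
#pragma omp parallel for reduction(max : vmax)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        vmax = std::max(vmax, std::abs(cells[i]));
    }
    return vmax;
}

// Hypothetical counter, analogous to counting the cells that match some
// condition. A shared `++count` without `reduction(+ : count)` is a data race.
int count_near_zero(const std::vector<double>& cells, double tol) {
    const std::ptrdiff_t n = static_cast<std::ptrdiff_t>(cells.size());
    int count = 0;
#pragma omp parallel for reduction(+ : count)
    for (std::ptrdiff_t i = 0; i < n; ++i) {
        if (std::abs(cells[i]) < tol) {
            ++count;
        }
    }
    return count;
}
```

Built without OpenMP the pragmas are ignored and the loops run serially, so the results should match the threaded build; a mismatch between the two builds is a common sign of a missing reduction.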

andrewsilver1997 commented 5 months ago

yes, it's MAESTROeX. So it's unfixable at this point?

zingale commented 5 months ago

we can fix it. Give me a bit. There seem to be a few issues.

zingale commented 5 months ago

we haven't really been running with OpenMP much, so it is not tested as well as MPI + CUDA / GPUs.

zingale commented 5 months ago

in the meantime, you can just remove the OpenMP pragma before the MFIter loop in MaestroDiag.cpp:

https://github.com/AMReX-Astro/MAESTROeX/blob/e714207d0d102435e876bc33b3a8c573c982ad70/Source/MaestroDiag.cpp#L129

zingale commented 5 months ago

note: you should not run with GPUs and OpenMP.

GPU support is via CUDA, so you would compile with OpenMP disabled.

PR #455 should fix MPI + OpenMP if you can test it out.

zingale commented 3 months ago

I believe that this is fixed. Reopen if you still have issues.

andrewsilver1997 commented 1 month ago

Hi, now I'm running with OMP only, but I get a segfault. This is the backtrace.0.0 file:

Host Name: node104

=== If no file names and line numbers are shown below, one can run addr2line -Cpfie my_exefile my_line_address to convert my_line_address (e.g., 0x4a6b) into file name and line number. Or one can use amrex/Tools/Backtrace/parse_bt.py.

=== Please note that the line number reported by addr2line may not be accurate. One can use readelf -wl my_exefile | grep my_line_address' to find out the offset for that line.

0: ./Maestro3d.gnu.OMP.ex() [0x73da50] amrex::BLBackTrace::print_backtrace_info(_IO_FILE*) /scratch/p310347/DVR-time-prediction/data/MAESTROeX/Exec/science/wdconvect/../../../external/amrex/Src/Base/AMReX_BLBackTrace.cpp:200:36

1: ./Maestro3d.gnu.OMP.ex() [0x7438df] amrex::BLBackTrace::handler(int) /scratch/p310347/DVR-time-prediction/data/MAESTROeX/Exec/science/wdconvect/../../../external/amrex/Src/Base/AMReX_BLBackTrace.cpp:100:15

2: /lib64/libc.so.6(+0x4e5b0) [0x7fb2493835b0]

3: ./Maestro3d.gnu.OMP.ex() [0x4c3b07] std::vector<std::filesystem::__cxx11::path::_Cmpt, std::allocator >::~vector() /usr/include/c++/8/bits/stl_vector.h:567:15

4: ./Maestro3d.gnu.OMP.ex() [0x4c06b9]
    std::__cxx11::basic_string<char, std::char_traits, std::allocator >::_M_is_local() const inlined at /usr/include/c++/8/bits/basic_string.h:224:6 in Maestro::Evolve() /usr/include/c++/8/bits/basic_string.h:215:26
    std::__cxx11::basic_string<char, std::char_traits, std::allocator >::_M_dispose() /usr/include/c++/8/bits/basic_string.h:224:6
    std::__cxx11::basic_string<char, std::char_traits, std::allocator >::~basic_string() /usr/include/c++/8/bits/basic_string.h:661:9
    std::filesystem::__cxx11::path::~path() /usr/include/c++/8/bits/fs_path.h:209:5
    std::filesystem::__cxx11::path::_Cmpt::~_Cmpt() /usr/include/c++/8/bits/fs_path.h:644:16
    void std::_Destroy<std::filesystem::__cxx11::path::_Cmpt>(std::filesystem::__cxx11::path::_Cmpt) /usr/include/c++/8/bits/stl_construct.h:98:7
    void std::_Destroy_aux::destroy<std::filesystem::__cxx11::path::_Cmpt>(std::filesystem::__cxx11::path::_Cmpt, std::filesystem::__cxx11::path::_Cmpt) /usr/include/c++/8/bits/stl_construct.h:108:19
    void std::_Destroy<std::filesystem::__cxx11::path::_Cmpt>(std::filesystem::__cxx11::path::_Cmpt, std::filesystem::__cxx11::path::_Cmpt) /usr/include/c++/8/bits/stl_construct.h:137:11
    void std::_Destroy<std::filesystem::__cxx11::path::_Cmpt, std::filesystem::__cxx11::path::_Cmpt>(std::filesystem::__cxx11::path::_Cmpt, std::filesystem::__cxx11::path::_Cmpt, std::allocator<std::filesystem::__cxx11::path::_Cmpt>&) /usr/include/c++/8/bits/stl_construct.h:206:15
    std::vector<std::filesystem::__cxx11::path::_Cmpt, std::allocator >::~vector() /usr/include/c++/8/bits/stl_vector.h:567:15
    std::filesystem::__cxx11::path::~path() /usr/include/c++/8/bits/fs_path.h:209:5
    Maestro::Evolve() /scratch/p310347/DVR-time-prediction/data/MAESTROeX/Exec/science/wdconvect/../../../Source/MaestroEvolve.cpp:125:36

5: ./Maestro3d.gnu.OMP.ex() [0x424d0d] main /scratch/p310347/DVR-time-prediction/data/MAESTROeX/Exec/science/wdconvect/../../../Source/main.cpp:63:52

6: /lib64/libc.so.6(__libc_start_main+0xe5) [0x7fb24936f7e5] __libc_start_main /usr/src/debug/glibc-2.28-251.el8.2.x86_64/csu/../csu/libc-start.c:336:3

7: ./Maestro3d.gnu.OMP.ex() [0x434cfe] _start at ??:?

zingale commented 1 month ago

can you share the job_info file that is output in any of the plotfile directories? This is failing at:

 125         if (std::filesystem::exists("plot_and_continue")) {
 126             remove("plot_and_continue");
 127             do_plotfile = true;
 128         }

which is odd.
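One way to narrow this down is to exercise the same check outside MAESTROeX. The sketch below is a standalone approximation, not the code in MaestroEvolve.cpp: `consume_sentinel` and `touch` are hypothetical helper names, and it uses the `std::error_code` overloads so that failures are reported instead of thrown. Note that with GCC 8, which the include paths in the backtrace suggest, `std::filesystem` typically also needs `-lstdc++fs` at link time.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

// Hypothetical helper: returns true only if the sentinel file existed and
// was removed. Mirrors the exists-then-remove pattern from the snippet above.
bool consume_sentinel(const std::string& name) {
    std::error_code ec;
    const std::filesystem::path p{name};
    if (!std::filesystem::exists(p, ec) || ec) {
        return false;  // file is absent, or its status could not be queried
    }
    const bool removed = std::filesystem::remove(p, ec);
    return removed && !ec;
}

// Hypothetical helper: creates an empty file to act as the sentinel.
void touch(const std::string& name) {
    std::ofstream f(name);
    assert(f.good());
}
```

Calling `touch("plot_and_continue")` followed by `consume_sentinel("plot_and_continue")` from a small `main`, built with the same compiler and flags as the executable, should return true. If even that crashes, the problem more likely sits in the toolchain or link line than in MAESTROeX itself.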

andrewsilver1997 commented 1 month ago

it started to work after I updated the code to the latest version. thank you:)