Closed: @shashankNREL closed this issue 3 years ago.
@shashankNREL Are you certain the code doesn't crash on CPU? Currently, the ABLBoundaryPlane won't work if the finer levels are not touching the boundary.
@gantech has a patch to workaround this for his IEA work.
@sayerhs and @gantech: The patch was useful and I was able to run the regression test on GPUs.
I started looking into this when I was trying to run a case on Summit with ABL as inflow and ALM. There I got the following errors:
Regrid mesh ... time elapsed = 0.012923277
Grid summary:
Level 0 128 grids 33554432 cells 100 % of domain
smallest grid: 64 x 64 x 64 biggest grid: 64 x 64 x 64
Level 1 150 grids 1627648 cells 0.6063461304 % of domain
smallest grid: 16 x 16 x 8 biggest grid: 32 x 32 x 16
terminate called after throwing an instance of 'std::runtime_error'
what(): GPU last error detected in file /gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/submods/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 1000: an illegal memory access was encountered
[f09n15:57760] *** Process received signal ***
[f09n15:57760] Signal: Aborted (6)
[f09n15:57760] Signal code: (-6)
[f09n15:57760] [ 0] [0x2000000504d8]
[f09n15:57760] [ 1] /lib64/libc.so.6(abort+0x2b4)[0x200001412094]
[f09n15:57760] [ 2] /sw/summit/gcc/7.4.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x1c4)[0x200001120614]
[f09n15:57760] [ 3] /sw/summit/gcc/7.4.0/lib64/libstdc++.so.6(+0xab364)[0x20000111b364]
[f09n15:57760] [ 4] /sw/summit/gcc/7.4.0/lib64/libstdc++.so.6(+0xa9778)[0x200001119778]
[f09n15:57760] [ 5] /sw/summit/gcc/7.4.0/lib64/libstdc++.so.6(__gxx_personality_v0+0x52c)[0x20000111a94c]
[f09n15:57760] [ 6] /sw/summit/gcc/7.4.0/lib64/libgcc_s.so.1(+0xc0a4)[0x20000139c0a4]
[f09n15:57760] [ 7] /sw/summit/gcc/7.4.0/lib64/libgcc_s.so.1(_Unwind_RaiseException+0x370)[0x20000139c770]
[f09n15:57760] [ 8] /sw/summit/gcc/7.4.0/lib64/libstdc++.so.6(__cxa_throw+0x68)[0x20000111b8c8]
[f09n15:57760] [ 9] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x11a9d4bc]
[f09n15:57760] [10] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x11a9d12c]
[f09n15:57760] [11] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x10050b5c]
[f09n15:57760] [12] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x1049342c]
[f09n15:57760] [13] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x104903b8]
[f09n15:57760] [14] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x1048aa68]
[f09n15:57760] [15] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x104777c0]
[f09n15:57760] [16] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x10075fd8]
[f09n15:57760] [17] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x100f344c]
[f09n15:57760] [18] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x100ef794]
[f09n15:57760] [19] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x1005aab8]
[f09n15:57760] [20] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x11dcf730]
[f09n15:57760] [21] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x100453c4]
[f09n15:57760] [22] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x10045d4c]
[f09n15:57760] [23] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x1003dc30]
[f09n15:57760] [24] /lib64/libc.so.6(+0x25200)[0x2000013f5200]
And so, with a little probing using addr2line on the backtrace addresses, I found where the problem happens:
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x1003dc30 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x000000001003dc30
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/main.cpp:70 (discriminator 8)
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x10045d4c -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x0000000010045d4c
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/incflo.cpp:225
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x100453c4 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x00000000100453c4
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/incflo.cpp:158 (discriminator 4)
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x11dcf730 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x0000000011dcf730
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/submods/amrex/Src/AmrCore/AMReX_AmrCore.cpp:104
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x1005aab8 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x000000001005aab8
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/incflo_regrid.cpp:18
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x100ef794 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x00000000100ef794
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/core/FieldRepo.cpp:42
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x100f344c -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x00000000100f344c
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/core/Field.H:345
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x10075fd8 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x0000000010075fd8
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/core/Field.cpp:172
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x104777c0 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x00000000104777c0
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/wind_energy/ABLFillInflow.cpp:39
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x1048aa68 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x000000001048aa68
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/wind_energy/ABLBoundaryPlane.cpp:571
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x104903b8 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x00000000104903b8
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/submods/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:1486
So the culprit seems to be ABLBoundaryPlane.cpp, specifically a ParallelFor in the routine populate_data (https://github.com/Exawind/amr-wind/blob/319ad956f0f85fbe2c41fbec97b69f7b63a190c9/amr-wind/wind_energy/ABLBoundaryPlane.cpp#L569).
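If I read it right, that device kernel ends up indexing boundary-plane data that was only ever read and allocated for level 0. Here is a standalone sketch of that pattern with hypothetical names (not the actual amr-wind data structures); in this sketch the launch box is clipped to the region where plane data exists, which is exactly the kind of guard that seems to be missing:

// Hypothetical sketch, not amr-wind code: copying boundary-plane data
// inside an amrex::ParallelFor. If the plane FArrayBox only covers the
// level-0 box but the kernel is launched over a finer-level box, the
// data(i,j,k) reads run past the allocation, which CUDA reports as
// "an illegal memory access was encountered".
#include <AMReX.H>
#include <AMReX_Box.H>
#include <AMReX_FArrayBox.H>
#include <AMReX_GpuLaunch.H>

int main(int argc, char* argv[])
{
    amrex::Initialize(argc, argv);
    {
        // Plane data allocated for the coarse (level 0) boundary box only
        amrex::Box coarse_bx(amrex::IntVect(0), amrex::IntVect(15));
        amrex::FArrayBox plane(coarse_bx, 1);
        auto data = plane.array();
        amrex::ParallelFor(
            coarse_bx, [=] AMREX_GPU_DEVICE(int i, int j, int k) noexcept {
                data(i, j, k, 0) = 1.0; // fill level-0 plane data
            });

        // A finer-level box such as the ones created by the regrid
        amrex::Box fine_bx(amrex::IntVect(0), amrex::IntVect(31));
        amrex::FArrayBox dest(fine_bx, 1);
        auto out = dest.array();

        // Clip the launch box to where plane data actually exists;
        // launching over all of fine_bx would index data out of bounds.
        const amrex::Box valid_bx = fine_bx & coarse_bx;
        amrex::ParallelFor(
            valid_bx, [=] AMREX_GPU_DEVICE(int i, int j, int k) noexcept {
                out(i, j, k, 0) = data(i, j, k, 0);
            });
    }
    amrex::Finalize();
    return 0;
}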
@marchdf and @sayerhs : Can you tell me what could be going wrong here?
The static refinement is not close to any of the inflow planes:
####################
# Mesh
####################
amr.n_cell = 512 512 128 # Grid cells at coarsest AMRlevel
amr.max_level = 2 # Max AMR level in hierarchy
geometry.prob_lo = 0.0 0.0 0.0
geometry.prob_hi = 5120.0 5120.0 1280.0
amr.max_grid_size = 128
tagging.labels = static
tagging.static.type = CartBoxRefinement
tagging.static.static_refinement_def = static_box.txt
and the static_box.txt is:
2
1
1600.0 1600.0 40.0 2200.0 2200.0 400.0
1
1700.0 1700.0 60.0 1900.0 1900.0 300.0
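For reference, the way I understand the CartBoxRefinement file format, the layout is:

<number of refinement levels>
<number of boxes on level 0>
<xlo> <ylo> <zlo> <xhi> <yhi> <zhi>    (one line per box)
<number of boxes on level 1>
...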
I applied Ganesh's fix here as well.
@shashankNREL can you share the local modifications to your build, that is, the patches from Ganesh?
@sayerhs: @gantech had mentioned that he set the number of levels within ABLBoundaryPlane.cpp to 1, so I did the following. Please let me know if I missed something.
--- a/amr-wind/wind_energy/ABLBoundaryPlane.cpp
+++ b/amr-wind/wind_energy/ABLBoundaryPlane.cpp
-    if (m_repo.num_active_levels() > 1) {
-        amrex::Abort("Not supporting multi-level input mode yet.");
-    }
+    // if (m_repo.num_active_levels() > 1) {
+    //     amrex::Abort("Not supporting multi-level input mode yet.");
+    // }
     amrex::Print() << "Reading input NetCDF file: " << m_filename << std::endl;
     auto ncf = ncutils::NCFile::open_par(
@@ -435,7 +435,8 @@ void ABLBoundaryPlane::read_header()
     // Sanity check the input file time
     AMREX_ALWAYS_ASSERT(m_in_times[0] <= m_time.current_time());
-    const int nlevels = m_repo.num_active_levels();
+    // const int nlevels = m_repo.num_active_levels();
+    const int nlevels = 1;
     m_in_data.resize(6);
     for (auto& plane_grp : ncf.all_groups()) {
         int normal, face_dir;
@@ -508,7 +509,8 @@ void ABLBoundaryPlane::read_file()
         m_filename, NC_NOWRITE | NC_NETCDF4 | NC_MPIIO,
         amrex::ParallelContext::CommunicatorSub(), MPI_INFO_NULL);
-    const int nlevels = m_repo.num_active_levels();
+    // const int nlevels = m_repo.num_active_levels();
+    const int nlevels = 1;
     for (amrex::OrientationIter oit; oit; ++oit) {
         auto ori = oit();
         if (not m_in_data.is_populated(ori)) continue;
@shashankNREL ABLBoundaryData::populate_field is called in ABLFillInflow::fillpatch operations. So we will want to skip filling boundary planes if (lev > 0). Can you add an early return before line 538?

// Add before line 538
if (lev > 0) return;
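Something like this trimmed, hypothetical stand-in shows where the guard would sit (the real populate_data signature and body are different, of course):

// Hypothetical sketch, not the actual amr-wind source.
#include <cstdio>

void populate_data(const int lev, const double time)
{
    // Boundary-plane input data only exists for the coarsest level, so
    // skip the time interpolation and device fill entirely when the
    // fillpatch machinery asks for a finer level.
    if (lev > 0) return;

    std::printf("filling level %d boundary planes at t = %g\n", lev, time);
    // ... interpolate plane data in time and launch the ParallelFor ...
}

int main()
{
    populate_data(0, 0.5); // level 0: planes get filled
    populate_data(1, 0.5); // finer level: early return, nothing touched
    return 0;
}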
I am curious why @gantech's case worked if he didn't have this logic. Is this only crashing on GPUs and not on CPUs?
@sayerhs: Due to Eagle queue issues I couldn't test this earlier, but I was able to test it yesterday and it worked for me. I'll just finish the GPU test on Summit for completeness and then close this issue.
Thanks again for your help with this issue.
I am trying to run the regression test case abl_bndry_input on GPUs on Eagle. One key difference is that I added 2 levels of refinement far from the boundary through static refinement. As soon as the code initiates a refinement, it crashes. I am running the main branch. Following is the error message that I see: somehow NetCDF complains about "No group found" after running a few time steps. On Summit the error shows up as

what(): GPU last error detected in file /gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/submods/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 1000: misaligned address
[f12n08:40078] *** Process received signal ***
@marchdf @sayerhs @jrood-nrel: Have you seen this kind of error before? Can you help me fix this issue?
Thanks