Exawind / amr-wind

AMReX-based structured wind solver
https://exawind.github.io/amr-wind

Boundary input file feature on GPU crashing #414

Closed: shashankNREL closed this issue 3 years ago

shashankNREL commented 3 years ago

I am trying to run the regression test case abl_bndry_input on GPUs on Eagle. One key difference is that I added two levels of refinement far from the boundary through static refinement. As soon as the code initiates a refinement, it crashes. I am running the main branch.

AMR-Wind version :: 319ad956
AMR-Wind Git SHA :: 319ad956f0f85fbe2c41fbec97b69f7b63a190c9
AMReX version    :: 21.05-20-gfb0c16e34b93

Below is the error message I see. The error comes from NetCDF complaining "No group found", and it appears after running a few time steps.

Regrid mesh ... time elapsed = 0.003602950135
Grid summary:
  Level 0   8 grids  110592 cells  100 % of domain
            smallest grid: 16 x 16 x 16  biggest grid: 32 x 32 x 32
  Level 1   8 grids  64000 cells  7.233796296 % of domain
            smallest grid: 16 x 16 x 16  biggest grid: 24 x 24 x 24
  Level 2   8 grids  64000 cells  0.904224537 % of domain
            smallest grid: 16 x 16 x 16  biggest grid: 24 x 24 x 24

For godunov_type select between plm, ppm, ppm_nolim, weno_js, and weno_z: it defaults to ppm
For godunov_type select between plm, ppm, ppm_nolim, weno_js, and weno_z: it defaults to ppm
Step: 7 dt: 0.4 Time: 2.9 to 3.3
CFL: 0.768155 (conv: 0.768008 diff: 0 src: 0.0106265 )

NetCDF: No group found.

terminate called after throwing an instance of 'std::runtime_error'
  what():  Encountered NetCDF error; aborting
MPT ERROR: Rank 0(g:0) received signal SIGABRT/SIGIOT(6).
    Process ID: 8057, Host: r104u37, Program: /lustre/eaglefs/scratch/syellapa/Wind/WRF/BplaneTest/amr-wind/build/amr_wind
    MPT Version: HPE MPT 2.22  03/31/20 16:12:29

MPT: --------stack traceback-------
MPT: Attaching to program: /proc/8057/exe, process 8057
MPT: [New LWP 8102]
MPT: [New LWP 8078]
MPT: [New LWP 8077]
MPT: [Thread debugging using libthread_db enabled]

On Summit the error shows up as:

what():  GPU last error detected in file /gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/submods/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 1000: misaligned address
[f12n08:40078] *** Process received signal ***

@marchdf @sayerhs @jrood-nrel : Have you seen this kind of error before? Can you help me fix this issue?

Thanks

sayerhs commented 3 years ago

@shashankNREL Are you certain the code doesn't crash on CPU? Currently, ABLBoundaryPlane won't work if the refined levels do not touch the boundary.

https://github.com/Exawind/amr-wind/blob/319ad956f0f85fbe2c41fbec97b69f7b63a190c9/amr-wind/wind_energy/ABLBoundaryPlane.cpp#L438
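
For reference, the guard in question lives in read_header() and aborts as soon as more than one level is active (reconstructed here from the patch quoted later in this thread, not copied from the repository):

// amr-wind/wind_energy/ABLBoundaryPlane.cpp, in ABLBoundaryPlane::read_header()
if (m_repo.num_active_levels() > 1) {
    amrex::Abort("Not supporting multi-level input mode yet.");
}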

@gantech has a patch to work around this for his IEA work.

shashankNREL commented 3 years ago

@sayerhs and @gantech: The patch was useful and I was able to run the regression test on GPUs.

I started looking into this because I was trying to run a case on Summit with ABL inflow and ALM. There I got the following errors:

Regrid mesh ... time elapsed = 0.012923277
Grid summary:
  Level 0   128 grids  33554432 cells  100 % of domain
            smallest grid: 64 x 64 x 64  biggest grid: 64 x 64 x 64
  Level 1   150 grids  1627648 cells  0.6063461304 % of domain
            smallest grid: 16 x 16 x 8  biggest grid: 32 x 32 x 16

terminate called after throwing an instance of 'std::runtime_error'
  what():  GPU last error detected in file /gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/submods/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H line 1000: an illegal memory access was encountered
[f09n15:57760] *** Process received signal ***
[f09n15:57760] Signal: Aborted (6)
[f09n15:57760] Signal code:  (-6)
[f09n15:57760] [ 0] [0x2000000504d8]
[f09n15:57760] [ 1] /lib64/libc.so.6(abort+0x2b4)[0x200001412094]
[f09n15:57760] [ 2] /sw/summit/gcc/7.4.0/lib64/libstdc++.so.6(_ZN9__gnu_cxx27__verbose_terminate_handlerEv+0x1c4)[0x200001120614]
[f09n15:57760] [ 3] /sw/summit/gcc/7.4.0/lib64/libstdc++.so.6(+0xab364)[0x20000111b364]
[f09n15:57760] [ 4] /sw/summit/gcc/7.4.0/lib64/libstdc++.so.6(+0xa9778)[0x200001119778]
[f09n15:57760] [ 5] /sw/summit/gcc/7.4.0/lib64/libstdc++.so.6(__gxx_personality_v0+0x52c)[0x20000111a94c]
[f09n15:57760] [ 6] /sw/summit/gcc/7.4.0/lib64/libgcc_s.so.1(+0xc0a4)[0x20000139c0a4]
[f09n15:57760] [ 7] /sw/summit/gcc/7.4.0/lib64/libgcc_s.so.1(_Unwind_RaiseException+0x370)[0x20000139c770]
[f09n15:57760] [ 8] /sw/summit/gcc/7.4.0/lib64/libstdc++.so.6(__cxa_throw+0x68)[0x20000111b8c8]
[f09n15:57760] [ 9] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x11a9d4bc]
[f09n15:57760] [10] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x11a9d12c]
[f09n15:57760] [11] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x10050b5c]
[f09n15:57760] [12] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x1049342c]
[f09n15:57760] [13] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x104903b8]
[f09n15:57760] [14] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x1048aa68]
[f09n15:57760] [15] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x104777c0]
[f09n15:57760] [16] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x10075fd8]
[f09n15:57760] [17] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x100f344c]
[f09n15:57760] [18] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x100ef794]
[f09n15:57760] [19] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x1005aab8]
[f09n15:57760] [20] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x11dcf730]
[f09n15:57760] [21] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x100453c4]
[f09n15:57760] [22] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x10045d4c]
[f09n15:57760] [23] /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind[0x1003dc30]
[f09n15:57760] [24] /lib64/libc.so.6(+0x25200)[0x2000013f5200]

With a little probing via addr2line, I found where the problem happens:

[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x1003dc30 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x000000001003dc30
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/main.cpp:70 (discriminator 8)
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x10045d4c -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x0000000010045d4c
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/incflo.cpp:225
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x100453c4 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x00000000100453c4
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/incflo.cpp:158 (discriminator 4)
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x11dcf730 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x0000000011dcf730
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/submods/amrex/Src/AmrCore/AMReX_AmrCore.cpp:104
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x1005aab8 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x000000001005aab8
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/incflo_regrid.cpp:18
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x100ef794 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x00000000100ef794
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/core/FieldRepo.cpp:42
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x100f344c -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x00000000100f344c
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/core/Field.H:345
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x10075fd8 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x0000000010075fd8
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/core/Field.cpp:172
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x104777c0 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x00000000104777c0
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/wind_energy/ABLFillInflow.cpp:39
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x1048aa68 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x000000001048aa68
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/amr-wind/wind_energy/ABLBoundaryPlane.cpp:571
[syellapa@login2.summit Inflow-ALM]$ addr2line -a 0x104903b8 -e /gpfs/alpine/cfd142/proj-shared/shashank/Exec-Inflow-ALM/bin/amr_wind
0x00000000104903b8
/gpfs/alpine/cfd142/scratch/syellapa/WRF/amr-wind/submods/amrex/Src/Base/AMReX_GpuLaunchFunctsG.H:1486

So the culprit seems to be ABLBoundaryPlane.cpp, specifically a ParallelFor in the routine populate_data (https://github.com/Exawind/amr-wind/blob/319ad956f0f85fbe2c41fbec97b69f7b63a190c9/amr-wind/wind_energy/ABLBoundaryPlane.cpp#L569). @marchdf and @sayerhs: Can you tell me what could be going wrong here?
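
The error would be consistent with a kernel launched over a fine-level box indexing boundary-plane data that only covers level 0. Below is a self-contained sketch of that pattern (illustrative only: the box sizes and variable names are made up for this demo and are not from amr-wind):

#include <AMReX.H>
#include <AMReX_FArrayBox.H>
#include <AMReX_GpuLaunch.H>

int main(int argc, char* argv[])
{
    amrex::Initialize(argc, argv);
    {
        // Boundary-plane data sized for the coarse level only.
        amrex::Box plane_bx(amrex::IntVect(0, 0, 0), amrex::IntVect(15, 15, 0));
        amrex::FArrayBox plane(plane_bx, 1);
        plane.setVal<amrex::RunOn::Device>(1.0);

        // The region a level-1 fillpatch would hand us: twice the extent.
        amrex::Box fine_bx(amrex::IntVect(0, 0, 0), amrex::IntVect(31, 31, 0));
        amrex::FArrayBox dest(fine_bx, 1);

        auto src = plane.const_array();
        auto dst = dest.array();
        amrex::ParallelFor(
            fine_bx, [=] AMREX_GPU_DEVICE(int i, int j, int k) noexcept {
                // For i or j > 15 this reads outside plane's allocation;
                // on a GPU this surfaces as the "illegal memory access"
                // (or "misaligned address") reported above.
                dst(i, j, k) = src(i, j, k);
            });
    }
    amrex::Finalize();
}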

shashankNREL commented 3 years ago

The static refinement is not close to any of the inflow planes:

####################
#   Mesh
####################
amr.n_cell           = 512 512 128            # Grid cells at coarsest AMR level
amr.max_level        = 2                        # Max AMR level in hierarchy
geometry.prob_lo     = 0.0 0.0 0.0
geometry.prob_hi     = 5120.0 5120.0 1280.0
amr.max_grid_size    = 128

tagging.labels = static
tagging.static.type = CartBoxRefinement
tagging.static.static_refinement_def = static_box.txt

and the static_box.txt is (first line: the number of levels to refine; then, for each level, the number of boxes followed by each box's lo and hi corners):

2
1
1600.0 1600.0 40.0 2200.0 2200.0 400.0
1
1700.0 1700.0 60.0 1900.0 1900.0 300.0

and I applied Ganesh's fix here as well.

sayerhs commented 3 years ago

@shashankNREL Can you share the local modifications in your build, i.e., the patches from Ganesh?

shashankNREL commented 3 years ago

@sayerhs: @gantech had mentioned that he set the number of levels within ABLBoundaryPlane.cpp to 1, so I did the following. Please let me know if I missed something.

--- a/amr-wind/wind_energy/ABLBoundaryPlane.cpp
+++ b/amr-wind/wind_energy/ABLBoundaryPlane.cpp
-    if (m_repo.num_active_levels() > 1) {
-        amrex::Abort("Not supporting multi-level input mode yet.");
-    }
+    // if (m_repo.num_active_levels() > 1) {
+    //     amrex::Abort("Not supporting multi-level input mode yet.");
+    // }

     amrex::Print() << "Reading input NetCDF file: " << m_filename << std::endl;
     auto ncf = ncutils::NCFile::open_par(
@@ -435,7 +435,8 @@ void ABLBoundaryPlane::read_header()
     // Sanity check the input file time
     AMREX_ALWAYS_ASSERT(m_in_times[0] <= m_time.current_time());

-    const int nlevels = m_repo.num_active_levels();
+    // const int nlevels = m_repo.num_active_levels();
+    const int nlevels = 1;
     m_in_data.resize(6);
     for (auto& plane_grp : ncf.all_groups()) {
         int normal, face_dir;
@@ -508,7 +509,8 @@ void ABLBoundaryPlane::read_file()
             m_filename, NC_NOWRITE | NC_NETCDF4 | NC_MPIIO,
             amrex::ParallelContext::CommunicatorSub(), MPI_INFO_NULL);

-        const int nlevels = m_repo.num_active_levels();
+        // const int nlevels = m_repo.num_active_levels();
+        const int nlevels = 1;
         for (amrex::OrientationIter oit; oit; ++oit) {
             auto ori = oit();
             if (not m_in_data.is_populated(ori)) continue;
sayerhs commented 3 years ago

@shashankNREL ABLBoundaryPlane::populate_data is called in ABLFillInflow::fillpatch operations, so we will want to skip filling boundary planes when lev > 0. Can you add an early return here, before line 538?

https://github.com/Exawind/amr-wind/blob/319ad956f0f85fbe2c41fbec97b69f7b63a190c9/amr-wind/wind_energy/ABLBoundaryPlane.cpp#L530-L538

// Add before line 538
if (lev > 0) return;
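
In context, the early return would sit at the top of populate_data, before any plane data is touched. A sketch (the signature below is approximated for illustration, not copied from the source):

void ABLBoundaryPlane::populate_data(
    const int lev, const amrex::Real time, Field& fld, amrex::MultiFab& mfab) const
{
    // Boundary-plane data is only stored for level 0; bail out on fine
    // levels instead of letting the ParallelFor index past the plane.
    if (lev > 0) return;

    // ... existing time interpolation and boundary fill ...
}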

I am curious why @gantech's case worked if he didn't have this logic. Is this only crashing on GPUs and not on CPUs?

shashankNREL commented 3 years ago

@sayerhs: Due to Eagle queue issues I couldn't test this earlier, but I was able to test it yesterday and it worked for me. I'll just finish the GPU test on Summit for completeness and close this issue.

Thanks again for your help with this issue.