E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Summit GPU debugging: ne4 (2 tasks on 2 nodes) - DEBUG mode issues #1680

Closed sarats closed 6 months ago

sarats commented 2 years ago

v1 with ne4 using 2 tasks with one task per node gets past initialization but still gets stuck.

This is the backtrace for the two processes.

cc @bartgol @jgfouca @oksanaguba @whannah1

(gdb) bt
#0  Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<double***, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >, void>::extent<unsigned int> (this=0x7ffffbc11dd0, r=@0x7ffffbc10d00: 2) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/externals/ekat/extern/kokkos/core/src/impl/Kokkos_ViewMapping.hpp:3093
#1  0x00000000128c4dcc in Kokkos::Impl::view_verify_operator_bounds<2u, Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<double***, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >, void>, int> (map=..., i=@0x7ffffbc10d70: 35)
    at /gpfs/alpine/cli115/scratch/sarat/repos/scream/externals/ekat/extern/kokkos/core/src/impl/Kokkos_ViewMapping.hpp:3888
#2  0x00000000128c1e1c in Kokkos::Impl::view_verify_operator_bounds<1u, Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<double***, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >, void>, int, int> (map=..., i=@0x7ffffbc10dd0: 6)
    at /gpfs/alpine/cli115/scratch/sarat/repos/scream/externals/ekat/extern/kokkos/core/src/impl/Kokkos_ViewMapping.hpp:3888
#3  0x00000000128bdfbc in Kokkos::Impl::view_verify_operator_bounds<0u, Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<double***, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >, void>, int, int, int> (map=..., i=@0x7ffffbc112b0: 2976)
    at /gpfs/alpine/cli115/scratch/sarat/repos/scream/externals/ekat/extern/kokkos/core/src/impl/Kokkos_ViewMapping.hpp:3888
#4  0x00000000128b9088 in Kokkos::Impl::view_verify_operator_bounds<Kokkos::HostSpace, Kokkos::View<double***, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >, Kokkos::Impl::ViewMapping<Kokkos::ViewTraits<double***, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >, void>, int, int, int> (tracker=..., map=...)
    at /gpfs/alpine/cli115/scratch/sarat/repos/scream/externals/ekat/extern/kokkos/core/src/impl/Kokkos_ViewMapping.hpp:3967
#5  0x000000001374fc58 in Kokkos::View<double***, Kokkos::LayoutRight, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace> >::operator()<int, int, int> (i2=<optimized out>, i1=<optimized out>, i0=<optimized out>, this=<optimized out>) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/externals/ekat/extern/kokkos/core/src/Kokkos_View.hpp:963
#6  scream::spa::SPAFunctions<double, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::update_spa_data_from_file (spa_data_file_name=..., time_index=1, nswbands=14, nlwbands=16, spa_horiz_interp=..., spa_data=...) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/physics/spa/spa_functions_impl.hpp:631
#7  0x00000000137499f8 in scream::spa::SPAFunctions<double, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::update_spa_timestate (spa_data_file_name=..., nswbands=14, nlwbands=16, ts=..., spa_horiz_interp=..., time_state=..., spa_beg=..., spa_end=...)
    at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/physics/spa/spa_functions_impl.hpp:704
#8  0x000000001373d338 in scream::SPA::initialize_impl (this=0x3608f240) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/physics/spa/atmosphere_prescribed_aerosol.cpp:189
#9  0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x3608f240, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#10 0x00000000137d0888 in scream::AtmosphereProcessGroup::initialize_impl (this=0x2c75caf0, run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process_group.cpp:186
#11 0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x2c75caf0, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#12 0x00000000137d0888 in scream::AtmosphereProcessGroup::initialize_impl (this=0x2b2ba0a0, run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process_group.cpp:186
#13 0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x2b2ba0a0, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#14 0x00000000137d0888 in scream::AtmosphereProcessGroup::initialize_impl (this=0x3267eb40, run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process_group.cpp:186
#15 0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x3267eb40, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#16 0x000000001265db50 in scream::control::AtmosphereDriver::initialize_atm_procs (this=0x35dbafe0) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/control/atmosphere_driver.cpp:872
#17 0x000000001024ac70 in <lambda()>::operator()(void) const (__closure=0x7ffffbc136b0) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:190
#18 0x0000000010252c04 in _GLOBAL__N__e19a5351_28_scream_cxx_f90_interface_cpp_babe8b2b::fpe_guard_wrapper<scream_init_atm(int, int, int, int)::<lambda()> >(const <lambda()> &) (f=...) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:51
#19 0x000000001024ad6c in scream_init_atm (run_start_ymd=10101, run_start_tod=0, case_start_ymd=10101, case_start_tod=0) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:164
#20 0x0000000010246684 in atm_comp_mct::atm_init_mct (eclock=..., cdata=..., x2a=..., a2x=..., nlfilename=..., _nlfilename=6) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/atm_comp_mct.F90:174
#21 0x000000001006823c in component_mod::component_init_cc (eclock=..., comp=..., comp_init=0x102444e0 <atm_comp_mct::atm_init_mct>, infodata=..., nlfilename=..., seq_flds_x2c_fluxes=..., seq_flds_c2x_fluxes=..., _nlfilename=6, _seq_flds_x2c_fluxes=0, _seq_flds_c2x_fluxes=0)
    at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/component_mod.F90:248
#22 0x00000000100473f0 in cime_comp_mod::cime_init () at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/cime_comp_mod.F90:1438
#23 0x000000001005fa40 in cime_driver () at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/cime_driver.F90:122
#24 0x000000001005fbf0 in main (argc=1, argv=0x7ffffbc24b7a) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/cime_driver.F90:23
#25 0x0000200006114078 in generic_start_main.isra () from /lib64/power9/libc.so.6
#26 0x0000200006114264 in __libc_start_main () from /lib64/power9/libc.so.6
#27 0x0000000000000000 in ?? ()

---------------------------------------------------------------
Second process
---------------------------------------------------------------

(gdb) bt
#0  0x00002000060b2f44 in pthread_spin_lock () from /lib64/power9/libpthread.so.0
#1  0x000020000ac4114c in mlx5_poll_cq_ex_1 () from /lib64/libmlx5-rdmav2.so
#2  0x000020000a1c9bfc in PAMI_Context_advancev () from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-6depextb6p6ulrvmehgtbskbmcsyhtdi/container/../lib/pami_port/libpami.so.3
#3  0x0000200008ff77d0 in mca_pml_pami_progress_wait () from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-6depextb6p6ulrvmehgtbskbmcsyhtdi/container/../lib/spectrum_mpi/mca_pml_pami.so
#4  0x000020000900aa10 in mca_pml_pami_recv () from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-6depextb6p6ulrvmehgtbskbmcsyhtdi/container/../lib/spectrum_mpi/mca_pml_pami.so
#5  0x0000200005f0c4b0 in ompi_coll_base_bcast_intra_basic_linear () from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-6depextb6p6ulrvmehgtbskbmcsyhtdi/container/../lib/libmpi_ibm.so.3
#6  0x000020000a7658e8 in mca_coll_ibm_bcast () from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-6depextb6p6ulrvmehgtbskbmcsyhtdi/container/../lib/spectrum_mpi/mca_coll_ibm.so
#7  0x0000200005eb1158 in PMPI_Bcast () from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/spectrum-mpi-10.4.0.3-20210112-6depextb6p6ulrvmehgtbskbmcsyhtdi/container/../lib/libmpi_ibm.so.3
#8  0x00002000001b6e80 in ncmpi_open () from /sw/summit/spack-envs/base/opt/linux-rhel8-ppc64le/gcc-9.3.0/parallel-netcdf-1.12.2-wr65dxzaz6topsdmlgzw2xyzn7w6uvs7/lib/libpnetcdf.so.4
#9  0x00002000000a583c in ncmpi_open (comm=0x3d04f180, path=0x4fb09360 "/gpfs/alpine/cli115/world-shared/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_20220428.nc", omode=<optimized out>, info=0x200005fc8eb8 <ompi_mpi_info_null>, ncidp=0x5254d368) at lib/darshan-pnetcdf.c:157
#10 0x00000000125ab138 in PIOc_openfile_retry (iosysid=2049, ncidp=0x50e475d8, iotype=0x22584d60 <__scream_scorpio_interface_MOD_pio_iotype>, filename=0x4fb09360 "/gpfs/alpine/cli115/world-shared/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_20220428.nc", mode=0, retry=1)
    at /gpfs/alpine/cli115/scratch/sarat/repos/scream/externals/scorpio/src/clib/pioc_support.c:3519
#11 0x00000000125abc88 in openfile_int (iosysid=2049, ncidp=0x50e475d8, iotype=0x22584d60 <__scream_scorpio_interface_MOD_pio_iotype>, filename=0x4fb09360 "/gpfs/alpine/cli115/world-shared/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_20220428.nc", mode=0, retry=1)
    at /gpfs/alpine/cli115/scratch/sarat/repos/scream/externals/scorpio/src/clib/pioc_support.c:3737
#12 0x00000000125a0124 in PIOc_openfile (iosysid=2049, ncidp=0x50e475d8, iotype=0x22584d60 <__scream_scorpio_interface_MOD_pio_iotype>, filename=0x4fb09360 "/gpfs/alpine/cli115/world-shared/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_20220428.nc", mode=0)
    at /gpfs/alpine/cli115/scratch/sarat/repos/scream/externals/scorpio/src/clib/pio_file.c:35
#13 0x00000000124ca74c in piolib_mod::pio_openfile (iosystem=..., file=..., iotype=1, fname=..., mode=0, _fname=110) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/externals/scorpio/src/flib/piolib_mod.F90:1001
#14 0x000000001288b744 in scream_scorpio_interface::eam_pio_openfile (pio_file=0x50e474d0, fname=..., _fname=110) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/io/scream_scorpio_interface.F90:717
#15 0x00000000128880a0 in scream_scorpio_interface::get_pio_atm_file (filename=..., pio_file=0x50e474d0, purpose=1, _filename=110) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/io/scream_scorpio_interface.F90:1247
#16 0x000000001289230c in scream_scorpio_interface::register_file (filename=..., file_purpose=1, _filename=110) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/io/scream_scorpio_interface.F90:204
#17 0x0000000012894480 in scream_scorpio_interface_iso_c2f::register_file_c2f (filename_in=0x50e3fd90, purpose=1) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/io/scream_scorpio_interface_iso_c2f.F90:57
#18 0x0000000012892480 in scream::scorpio::register_file (filename=..., mode=scream::scorpio::Read) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/io/scream_scorpio_interface.cpp:49
#19 0x00000000128a5980 in scream::AtmosphereInput::AtmosphereInput (this=0x7fffd254d650, comm=..., params=...) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/io/scorpio_input.cpp:24
#20 0x000000001374dac0 in scream::spa::SPAFunctions<double, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::update_spa_data_from_file (spa_data_file_name=..., time_index=2, nswbands=14, nlwbands=16, spa_horiz_interp=..., spa_data=...) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/physics/spa/spa_functions_impl.hpp:502
#21 0x0000000013749a4c in scream::spa::SPAFunctions<double, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::update_spa_timestate (spa_data_file_name=..., nswbands=14, nlwbands=16, ts=..., spa_horiz_interp=..., time_state=..., spa_beg=..., spa_end=...)
    at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/physics/spa/spa_functions_impl.hpp:706
#22 0x000000001373d338 in scream::SPA::initialize_impl (this=0x483690f0) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/physics/spa/atmosphere_prescribed_aerosol.cpp:189
#23 0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x483690f0, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#24 0x00000000137d0888 in scream::AtmosphereProcessGroup::initialize_impl (this=0x4439eb50, run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process_group.cpp:186
#25 0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x4439eb50, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#26 0x00000000137d0888 in scream::AtmosphereProcessGroup::initialize_impl (this=0x43dfb170, run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process_group.cpp:186
#27 0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x43dfb170, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#28 0x00000000137d0888 in scream::AtmosphereProcessGroup::initialize_impl (this=0x45adfea0, run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process_group.cpp:186
#29 0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x45adfea0, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#30 0x000000001265db50 in scream::control::AtmosphereDriver::initialize_atm_procs (this=0x44ce57c0) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/control/atmosphere_driver.cpp:872
#31 0x000000001024ac70 in <lambda()>::operator()(void) const (__closure=0x7fffd254f160) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:190
#32 0x0000000010252c04 in _GLOBAL__N__e19a5351_28_scream_cxx_f90_interface_cpp_babe8b2b::fpe_guard_wrapper<scream_init_atm(int, int, int, int)::<lambda()> >(const <lambda()> &) (f=...) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:51
#33 0x000000001024ad6c in scream_init_atm (run_start_ymd=10101, run_start_tod=0, case_start_ymd=10101, case_start_tod=0) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:164
#34 0x0000000010246684 in atm_comp_mct::atm_init_mct (eclock=..., cdata=..., x2a=..., a2x=..., nlfilename=..., _nlfilename=6) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/atm_comp_mct.F90:174
#35 0x000000001006823c in component_mod::component_init_cc (eclock=..., comp=..., comp_init=0x102444e0 <atm_comp_mct::atm_init_mct>, infodata=..., nlfilename=..., seq_flds_x2c_fluxes=..., seq_flds_c2x_fluxes=..., _nlfilename=6, _seq_flds_x2c_fluxes=0, _seq_flds_c2x_fluxes=0)
    at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/component_mod.F90:248
#36 0x00000000100473f0 in cime_comp_mod::cime_init () at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/cime_comp_mod.F90:1438
#37 0x000000001005fa40 in cime_driver () at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/cime_driver.F90:122
#38 0x000000001005fbf0 in main (argc=1, argv=0x7fffd2554b7a) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/cime_driver.F90:23
#39 0x0000200006114078 in generic_start_main.isra () from /lib64/power9/libc.so.6
#40 0x0000200006114264 in __libc_start_main () from /lib64/power9/libc.so.6
#41 0x0000000000000000 in ?? ()
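
One way to read the two backtraces side by side is to pull out where each rank sits inside update_spa_data_from_file. The sketch below is a hypothetical helper (it is not part of SCREAM or gdb); the two frame strings are abridged copies of frames #6 and #20 above, with template arguments and directory prefixes shortened.

```python
import re

# Abridged copies of the update_spa_data_from_file frames from the two
# backtraces (template arguments and directory prefixes shortened).
RANK_A = (
    "#6  scream::spa::SPAFunctions<...>::update_spa_data_from_file "
    "(spa_data_file_name=..., time_index=1, nswbands=14, nlwbands=16, "
    "spa_horiz_interp=..., spa_data=...) at spa_functions_impl.hpp:631"
)
RANK_B = (
    "#20 0x000000001374dac0 in scream::spa::SPAFunctions<...>::update_spa_data_from_file "
    "(spa_data_file_name=..., time_index=2, nswbands=14, nlwbands=16, "
    "spa_horiz_interp=..., spa_data=...) at spa_functions_impl.hpp:502"
)

# Capture the time_index argument plus the file name and line after " at ".
FRAME = re.compile(
    r"update_spa_data_from_file \(.*?time_index=(\d+).*? at \S*?([\w.]+):(\d+)"
)


def spa_position(frame_line):
    """Return (time_index, source_file, line) for a matching gdb frame, else None."""
    match = FRAME.search(frame_line)
    if match is None:
        return None
    return int(match.group(1)), match.group(2), int(match.group(3))


print(spa_position(RANK_A))  # (1, 'spa_functions_impl.hpp', 631)
print(spa_position(RANK_B))  # (2, 'spa_functions_impl.hpp', 502)
```

Here the first process is still at time_index=1 in a bounds-checked View access (line 631), while the second has moved on to time_index=2 and is sitting in the broadcast inside ncmpi_open (reached from line 502). That would be consistent with one rank waiting in a collective for the other rather than with a missing file, though the backtraces alone do not confirm it.
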

sarats commented 2 years ago

I did check and the following file exists on the compute node.

$ ls -l /gpfs/alpine/cli115/world-shared/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_20220428.nc
-rw-rw-r-- 1 acmetest cli115 9914470292 Apr 27 17:41 /gpfs/alpine/cli115/world-shared/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_20220428.nc

PeterCaldwell commented 2 years ago

I notice you're trying to read an ne30 file during an ne4 simulation. I think this is expected: I think we remap an ne30 file to ne4 to confirm that horizontal remapping is correct. But it's worth checking with @AaronDonahue, who implemented this part of the code. Aaron, can you take a quick look at this stack trace and see if you have other ideas about what could be wrong?

I think you can bypass this horiz remap step by using an ne4 SPA file instead and setting SPARemapFile to none in namelist_scream.xml.

PeterCaldwell commented 2 years ago

It's also weird that we're having so much trouble on Summit, but have been doing Perlmutter GPU runs for a long while. Is there a particular case you want @ndkeen to try, to confirm the problem is isolated to Summit?

PeterCaldwell commented 2 years ago

The ne4 SPA file to use is /gpfs/alpine/cli115/world-shared/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne4_20220428.nc

sarats commented 2 years ago

All of us have been trying to just do the basic ne4 test run after checking out master.

But there is something specific in how Summit fails when we have 2 tasks on the same node vs. 2 tasks on different nodes.

./create_newcase --compset F2010-SCREAMv1 --res ne4_ne4 --case scr-ne4-nt2 --compiler gnugpu --queue debug --walltime 00:30
./xmlchange DEBUG=TRUE
./xmlchange NTASKS=2
./case.setup
./case.build
./case.submit

AaronDonahue commented 2 years ago

Based on the stack trace, it looks like it errors when trying to open the file. @sarats, can you confirm the files SPA is looking for are actually there? You can check the path by looking in run/data/scream_input.yaml and looking in the SPA section.

sarats commented 2 years ago

As I noted above, this file /gpfs/alpine/cli115/world-shared/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_20220428.nc is there. Do you want to check something else?

AaronDonahue commented 2 years ago

yes, can you check if the mapping file is also present?

AaronDonahue commented 2 years ago

oh wait, yeah, this is looking for the data file. So it should work...

AaronDonahue commented 2 years ago

what are the permissions on that file?

sarats commented 2 years ago

Permissions seem ok.

      spa:
        SPA Remap File: /gpfs/alpine/cli115/world-shared/e3sm/inputdata/atm/scream/init/map_ne30_to_ne4_mono_20220502.nc
        SPA Data File: /gpfs/alpine/cli115/world-shared/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_20220428.nc

[sarat@batch1 data ]$ ls /gpfs/alpine/cli115/world-shared/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_20220428.nc -l
-rw-rw-r-- 1 acmetest cli115 9914470292 Apr 27 17:41 /gpfs/alpine/cli115/world-shared/e3sm/inputdata/atm/scream/init/spa_file_unified_and_complete_ne30_20220428.nc
[sarat@batch1 data ]$ ls -l /gpfs/alpine/cli115/world-shared/e3sm/inputdata/atm/scream/init/map_ne30_to_ne4_mono_20220502.nc
-rw-rw-r-- 1 acmetest cli115 16979268 May  2 19:38 /gpfs/alpine/cli115/world-shared/e3sm/inputdata/atm/scream/init/map_ne30_to_ne4_mono_20220502.nc

sarats commented 2 years ago

When I change the data file to the one @PeterCaldwell pointed to above, the program crashes at:

(gdb) bt
#0  0x0000200006133618 in raise () from /lib64/power9/libc.so.6
#1  0x0000200006113a2c in abort () from /lib64/power9/libc.so.6
#2  0x0000200003fbba28 in __gnu_cxx::__verbose_terminate_handler () at /gpfs/alpine/scratch/belhorn/stf007/builds/gcc-build-9.3.0-3/gcc-9.3.0/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x0000200003fb7004 in __cxxabiv1::__terminate (handler=<optimized out>) at /gpfs/alpine/scratch/belhorn/stf007/builds/gcc-build-9.3.0-3/gcc-9.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x0000200003fb70d0 in std::terminate () at /gpfs/alpine/scratch/belhorn/stf007/builds/gcc-build-9.3.0-3/gcc-9.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:58
#5  0x0000200003fb75a8 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x200004166bf0 <typeinfo for std::logic_error>, dest=0x200003fd7070 <std::logic_error::~logic_error()>) at /gpfs/alpine/scratch/belhorn/stf007/builds/gcc-build-9.3.0-3/gcc-9.3.0/libstdc++-v3/libsupc++/eh_throw.cc:95
#6  0x000000001374e200 in scream::spa::SPAFunctions<double, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::update_spa_data_from_file (spa_data_file_name=..., time_index=1, nswbands=14, nlwbands=16, spa_horiz_interp=..., spa_data=...) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/physics/spa/spa_functions_impl.hpp:518
#7  0x00000000137499f8 in scream::spa::SPAFunctions<double, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::update_spa_timestate (spa_data_file_name=..., nswbands=14, nlwbands=16, ts=..., spa_horiz_interp=..., time_state=..., spa_beg=..., spa_end=...)
    at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/physics/spa/spa_functions_impl.hpp:704
#8  0x000000001373d338 in scream::SPA::initialize_impl (this=0x40954ea0) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/physics/spa/atmosphere_prescribed_aerosol.cpp:189
#9  0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x40954ea0, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#10 0x00000000137d0888 in scream::AtmosphereProcessGroup::initialize_impl (this=0x37a2c970, run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process_group.cpp:186
#11 0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x37a2c970, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#12 0x00000000137d0888 in scream::AtmosphereProcessGroup::initialize_impl (this=0x383d0e40, run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process_group.cpp:186
#13 0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x383d0e40, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#14 0x00000000137d0888 in scream::AtmosphereProcessGroup::initialize_impl (this=0x3e600480, run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process_group.cpp:186
#15 0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x3e600480, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#16 0x000000001265db50 in scream::control::AtmosphereDriver::initialize_atm_procs (this=0x40a198b0) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/control/atmosphere_driver.cpp:872
#17 0x000000001024ac70 in <lambda()>::operator()(void) const (__closure=0x7fffe9acb6a0) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:190
#18 0x0000000010252c04 in _GLOBAL__N__e19a5351_28_scream_cxx_f90_interface_cpp_babe8b2b::fpe_guard_wrapper<scream_init_atm(int, int, int, int)::<lambda()> >(const <lambda()> &) (f=...) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:51
#19 0x000000001024ad6c in scream_init_atm (run_start_ymd=10101, run_start_tod=0, case_start_ymd=10101, case_start_tod=0) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:164
#20 0x0000000010246684 in atm_comp_mct::atm_init_mct (eclock=..., cdata=..., x2a=..., a2x=..., nlfilename=..., _nlfilename=6) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/atm_comp_mct.F90:174
#21 0x000000001006823c in component_mod::component_init_cc (eclock=..., comp=..., comp_init=0x102444e0 <atm_comp_mct::atm_init_mct>, infodata=..., nlfilename=..., seq_flds_x2c_fluxes=..., seq_flds_c2x_fluxes=..., _nlfilename=6, _seq_flds_x2c_fluxes=0, _seq_flds_c2x_fluxes=0)
    at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/component_mod.F90:248
#22 0x00000000100473f0 in cime_comp_mod::cime_init () at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/cime_comp_mod.F90:1438
#23 0x000000001005fa40 in cime_driver () at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/cime_driver.F90:122
#24 0x000000001005fbf0 in main (argc=1, argv=0x7fffe9ad4ba6) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/cime_driver.F90:23
#25 0x0000200006114078 in generic_start_main.isra () from /lib64/power9/libc.so.6
#26 0x0000200006114264 in __libc_start_main () from /lib64/power9/libc.so.6
#27 0x0000000000000000 in ?? ()

Second process

(gdb) bt
#0  0x0000200006133618 in raise () from /lib64/power9/libc.so.6
#1  0x0000200006113a2c in abort () from /lib64/power9/libc.so.6
#2  0x0000200003fbba28 in __gnu_cxx::__verbose_terminate_handler () at /gpfs/alpine/scratch/belhorn/stf007/builds/gcc-build-9.3.0-3/gcc-9.3.0/libstdc++-v3/libsupc++/vterminate.cc:95
#3  0x0000200003fb7004 in __cxxabiv1::__terminate (handler=<optimized out>) at /gpfs/alpine/scratch/belhorn/stf007/builds/gcc-build-9.3.0-3/gcc-9.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:48
#4  0x0000200003fb70d0 in std::terminate () at /gpfs/alpine/scratch/belhorn/stf007/builds/gcc-build-9.3.0-3/gcc-9.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:58
#5  0x0000200003fb75a8 in __cxxabiv1::__cxa_throw (obj=<optimized out>, tinfo=0x200004166bf0 <typeinfo for std::logic_error>, dest=0x200003fd7070 <std::logic_error::~logic_error()>) at /gpfs/alpine/scratch/belhorn/stf007/builds/gcc-build-9.3.0-3/gcc-9.3.0/libstdc++-v3/libsupc++/eh_throw.cc:95
#6  0x000000001374e200 in scream::spa::SPAFunctions<double, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::update_spa_data_from_file (spa_data_file_name=..., time_index=1, nswbands=14, nlwbands=16, spa_horiz_interp=..., spa_data=...) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/physics/spa/spa_functions_impl.hpp:518
#7  0x00000000137499f8 in scream::spa::SPAFunctions<double, Kokkos::Device<Kokkos::Cuda, Kokkos::CudaSpace> >::update_spa_timestate (spa_data_file_name=..., nswbands=14, nlwbands=16, ts=..., spa_horiz_interp=..., time_state=..., spa_beg=..., spa_end=...)
    at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/physics/spa/spa_functions_impl.hpp:704
#8  0x000000001373d338 in scream::SPA::initialize_impl (this=0x59c56ea0) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/physics/spa/atmosphere_prescribed_aerosol.cpp:189
#9  0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x59c56ea0, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#10 0x00000000137d0888 in scream::AtmosphereProcessGroup::initialize_impl (this=0x53b521e0, run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process_group.cpp:186
#11 0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x53b521e0, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#12 0x00000000137d0888 in scream::AtmosphereProcessGroup::initialize_impl (this=0x540299f0, run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process_group.cpp:186
#13 0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x540299f0, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#14 0x00000000137d0888 in scream::AtmosphereProcessGroup::initialize_impl (this=0x55211b70, run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process_group.cpp:186
#15 0x00000000137b53f4 in scream::AtmosphereProcess::initialize (this=0x55211b70, t0=..., run_type=scream::RunType::Initial) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/share/atm_process/atmosphere_process.cpp:43
#16 0x000000001265db50 in scream::control::AtmosphereDriver::initialize_atm_procs (this=0x59679e30) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/control/atmosphere_driver.cpp:872
#17 0x000000001024ac70 in <lambda()>::operator()(void) const (__closure=0x7ffffd93e9f0) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:190
#18 0x0000000010252c04 in _GLOBAL__N__e19a5351_28_scream_cxx_f90_interface_cpp_babe8b2b::fpe_guard_wrapper<scream_init_atm(int, int, int, int)::<lambda()> >(const <lambda()> &) (f=...) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:51
#19 0x000000001024ad6c in scream_init_atm (run_start_ymd=10101, run_start_tod=0, case_start_ymd=10101, case_start_tod=0) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/scream_cxx_f90_interface.cpp:164
#20 0x0000000010246684 in atm_comp_mct::atm_init_mct (eclock=..., cdata=..., x2a=..., a2x=..., nlfilename=..., _nlfilename=6) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/mct_coupling/atm_comp_mct.F90:174
#21 0x000000001006823c in component_mod::component_init_cc (eclock=..., comp=..., comp_init=0x102444e0 <atm_comp_mct::atm_init_mct>, infodata=..., nlfilename=..., seq_flds_x2c_fluxes=..., seq_flds_c2x_fluxes=..., _nlfilename=6, _seq_flds_x2c_fluxes=0, _seq_flds_c2x_fluxes=0)
    at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/component_mod.F90:248
#22 0x00000000100473f0 in cime_comp_mod::cime_init () at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/cime_comp_mod.F90:1438
#23 0x000000001005fa40 in cime_driver () at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/cime_driver.F90:122
#24 0x000000001005fbf0 in main (argc=1, argv=0x7ffffd944ba6) at /gpfs/alpine/cli115/scratch/sarat/repos/scream/driver-mct/main/cime_driver.F90:23
#25 0x0000200006114078 in generic_start_main.isra () from /lib64/power9/libc.so.6
#26 0x0000200006114264 in __libc_start_main () from /lib64/power9/libc.so.6
#27 0x0000000000000000 in ?? ()
AaronDonahue commented 2 years ago

@sarats, did you set the following?

      spa:
        SPA Remap File: none

Otherwise it will try to remap the ne4 data using an ne30->ne4 map.

sarats commented 2 years ago

Oh ok, that might explain this error message then:

14: 1: terminate called after throwing an instance of 'std::logic_error'
14: 1:   what():  /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/physics/spa/spa_functions_impl.hpp:518: FAIL:
14: 1: ncol==spa_horiz_interp.source_grid_ncols
14: 1: ERROR update_spa_data_from_file: Number of columns in remap data (866 doesn't match the SPA data file (48602).
14: 1:
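
For reference, the two column counts in that message match the standard cubed-sphere GLL grids: with np=4 the number of unique columns is 6 * ne^2 * (np-1)^2 + 2, which gives 866 for ne4 and 48602 for ne30. So the mismatch is consistent with mixing ne4 and ne30 inputs. A quick sketch to check the arithmetic (the function name is only for illustration and is not taken from scream):

```python
def gll_column_count(ne: int, np: int = 4) -> int:
    """Unique GLL nodes on a cubed sphere with ne x ne elements per face.

    Each of the 6 * ne**2 elements contributes (np - 1)**2 unique nodes once
    shared edges and corners are counted only once; the +2 comes from the
    Euler characteristic of the sphere.
    """
    return 6 * ne**2 * (np - 1) ** 2 + 2


for ne in (4, 30):
    print(f"ne{ne}: {gll_column_count(ne)} columns")
```

This only confirms which grids the two numbers belong to; it does not say which input (data file or remap file) was on which grid.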
oksanaguba commented 2 years ago

Can I ask why there is discussion about ne4 IC on Summit? I think Luca already ran on 1 rank (though did he report T<0?), and I was able to run SMS.ne4_ne4.F2010-SCREAMv1 on 1 rank and saw

Atmosphere step = 119
  model time = 0001-01-05 23:00:00

[EAMXX] Finalize ...
[EAMXX] Finalize ... done!

So I assume it was successful?

AaronDonahue commented 2 years ago

This is 2 ranks right? Maybe there is something wrong with multiple ranks...

oksanaguba commented 2 years ago

@sarats, can you run ./preview_run in your folder? For me, NRES and tasks-per-resource are messed up every time.

oksanaguba commented 2 years ago

But 2 ranks (1 per node) running manually seem to work (I think Luca did the same earlier):

jsrun -X 1 --nrs 2 --rs_per_host 1 --tasks_per_rs 1 -d plane:1 --cpu_per_rs 7 --gpu_per_rs 1 --bind packed:smt:1 --latency_priority gpu-cpu --stdio_mode prepended   --smpiargs="-gpu" /gpfs/alpine/cli115/proj-shared/onguba/e3sm_scratch/SMS.ne4_ne4.F2010-SCREAMv1.summit_gnugpu.20220525_170033_r56ez7/bld/e3sm.exe
bash-4.4$ ls -la atm.log.220525-193250 
-rw------- 1 onguba cli115 8333 May 25 19:46 atm.log.220525-193250
bash-4.4$ tail atm.log.220525-193250 
  model time = 0001-01-05 21:00:00

Atmosphere step = 118
  model time = 0001-01-05 22:00:00

Atmosphere step = 119
  model time = 0001-01-05 23:00:00

[EAMXX] Finalize ...
[EAMXX] Finalize ... done!
sarats commented 2 years ago

jsrun -X 1 -n 2 -r 1 -a 1 -c 7 -g 1 --stdio_mode prepended <path>/e3sm.exe
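
For anyone comparing this with the long-form command above: as far as I know from the jsrun options, the short flags map to the long ones as -n/--nrs, -r/--rs_per_host, -a/--tasks_per_rs, -c/--cpu_per_rs and -g/--gpu_per_rs (worth double-checking against `jsrun --help`). A small sketch that rewrites one form into the other; the helper name and the `./e3sm.exe` path are placeholders, and the naive token replacement assumes no flag value happens to look like a short flag:

```python
import shlex

# Short-to-long jsrun flag names for the resource options used in this thread.
SHORT_TO_LONG = {
    "-n": "--nrs",
    "-r": "--rs_per_host",
    "-a": "--tasks_per_rs",
    "-c": "--cpu_per_rs",
    "-g": "--gpu_per_rs",
}


def expand_jsrun_flags(command: str) -> str:
    """Return the command with known short flags replaced by long ones."""
    tokens = shlex.split(command)
    expanded = [SHORT_TO_LONG.get(tok, tok) for tok in tokens]
    return " ".join(shlex.quote(tok) for tok in expanded)


short_form = "jsrun -X 1 -n 2 -r 1 -a 1 -c 7 -g 1 --stdio_mode prepended ./e3sm.exe"
print(expand_jsrun_flags(short_form))
```

Both forms request 2 resource sets, one per host, each with 1 task, 7 cores and 1 GPU; the long-form command above additionally sets binding, distribution and `--smpiargs="-gpu"`, which this one does not.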

This is a DEBUG build.

With the ne4 data file, the crash occurs because a kernel requests a team size that is too large. The exception seems to come from the handler in scream/components/scream/src/dynamics/homme/atmosphere_dynamics.cpp:428.

4: 1: Kokkos::Impl::ParallelFor< Cuda > requested too large team size.

4: 0: WARNING: SPA Remap File has been set to 'NONE', assuming that SPA data and simulation are on the same grid - skipping horizontal interpolation
4: 1: WARNING: SPA Remap File has been set to 'NONE', assuming that SPA data and simulation are on the same grid - skipping horizontal interpolation
4: 0: Create Pool
4: 0: Tesla V100-SXM2-16GB
4: 0: INFORM: Automatically inserting fence() after every parallel_for
4: 0: [EAMXX] initialize_atm_procs ... done!
4: 0: [EAMxx::init] resolution-dependent device memory footprint: 70.230680MB
4: 0: [EAMxx::init] resolution-dependent host memory footprint: 26.815656MB
4: 0: Atmosphere step = 0
4: 0:   model time = 0001-01-01 00:00:00
4: 0:
4: 0: 4: 1: terminate called after throwing an instance of 'std::logic_error'
4: 1:   what():  /gpfs/alpine/cli115/scratch/sarat/repos/scream/components/scream/src/dynamics/homme/atmosphere_dynamics.cpp:428: FAIL:
4: 1: false
4: 1: Kokkos::Impl::ParallelFor< Cuda > requested too large team size.
4: 1: Traceback functionality not available
4: 1:
sarats commented 2 years ago

I found this related issue: https://github.com/E3SM-Project/HOMMEXX/issues/293

Note that some folks saw similar errors in Trilinos only when doing debug build/runs. https://github.com/kokkos/kokkos/issues/4095

That might explain why some folks might get a successful run if they are not using DEBUG.

ndkeen commented 2 years ago

Or https://github.com/E3SM-Project/scream/issues/1485

sarats commented 2 years ago

Yes, that's similar to the workaround the Trilinos folks used as well: cap the team size at 512.
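
A minimal sketch of that kind of cap, written in Python only to show the logic. The real change would live in the C++ code that sets up the Kokkos team policy. The function name and the `kernel_max` argument are hypothetical; `kernel_max` stands in for whatever limit Kokkos reports for the kernel in question (e.g. via `TeamPolicy::team_size_max`).

```python
def capped_team_size(requested: int, kernel_max: int, cap: int = 512) -> int:
    """Clamp a requested team size to both the kernel limit and a fixed cap."""
    if requested < 1 or kernel_max < 1 or cap < 1:
        raise ValueError("team sizes must be positive")
    return min(requested, kernel_max, cap)


# A request of 1024 threads per team is reduced to the 512 cap.
print(capped_team_size(requested=1024, kernel_max=1024))
# A kernel that only supports 256 threads per team wins over the cap.
print(capped_team_size(requested=1024, kernel_max=256))
```

The fixed cap is a blunt workaround. Querying the per-kernel limit at run time would adapt to DEBUG and non-DEBUG builds without a hard-coded number.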

sarats commented 2 years ago

So, I can confirm that we can get a successful run with a non-DEBUG build with 2 ranks on 2 nodes. It even finalized properly and exited. Anyway, one mystery solved regarding why DEBUG builds didn't work.

Even 2 ranks on a single node worked with HOMMEXX_MPI_ON_DEVICE set to OFF, so there seems to be an issue with GPU-to-GPU peer communication.

A minor thing to note is the creation of the 'fort.98' file.

$ cat fort.98
P3_INIT (reading/creating look-up tables) ...

oksanaguba commented 2 years ago

Just in case it gets lost on Slack: the PAMI error shows up with standalone homme when using the modules from CIME. With a different gnu/cuda, standalone homme works on 1 node, on 2 nodes with 12 resources, etc.

I am trying to find a combination of gnu/cuda such that scream builds with it and homme does not have PAMI errors. So far, the problem is building ekat.

bartgol commented 2 years ago

@sarats when you did these experiments, were you always using the same modules? And were they the current CIME defaults (gnu/9.3 and cuda/11.5)? Or were they the ones you changed to in #1690?

bartgol commented 2 years ago

> Can I ask why there is discussion about ne4 IC on Summit? I think Luca already ran on 1 rank (though did he report T<0?)

After updating master last week (to get the subcycling fix), I was able to run (interactively) with 2+ ranks on 2+ nodes, by specifying correct resources to jsrun and disabling MPI_ON_DEVICE in homme. I thought I was also able to run on 2 nodes with 12 ranks using case.submit. All of this was with Summit's default modules (gcc/9.3 and cuda/11.5).

However, today I can't run using case.submit. Still have to try manually via interactive nodes.

bartgol commented 6 months ago

Closing b/c stale (and we probably don't run on Summit anymore)