E3SM-Project / v3atm

Fork of E3SM for testing v3 atm changes

aggressive outputting seems to cause segmentation faults #46

Closed (mahf708 closed this issue 1 year ago)

mahf708 commented 1 year ago

Steps to reproduce on chrysalis

  1. Take the latest runscript used by @golaz and co from the coupled team (internal link: https://acme-climate.atlassian.net/l/cp/AsKf90WL).
  2. Apply the following edits to the run script (the patch is shown below).
patch for run script:

```diff
152,154c152,154
< nhtfrq = 0,-24,-6,-6,-3,-24,0
< mfilt = 1,30,120,120,240,30,1
< avgflag_pertape = 'A','A','I','A','A','A','I'
---
> nhtfrq = 0,-24,-6,-6,-3,-24,0,-27,-1
> mfilt = 1,30,120,120,240,30,1,240,240
> avgflag_pertape = 'A','A','I','A','A','A','I','I','I'
156c156
< fincl1 = 'extinct_sw_inp','extinct_lw_bnd7','extinct_lw_inp','CLD_CAL', 'TREFMNAV', 'TREFMXAV'
---
> fincl1 = 'extinct_sw_inp','extinct_lw_bnd7','extinct_lw_inp','CLD_CAL', 'TREFMNAV', 'TREFMXAV', 'cdnc', 'lwp', 'iwp', 'lcc', 'icc', 'clt', 'cod', 'ccn', 'ttop', 'OMEGA500', 'OMEGA700', 'TH7001000', 'U850', 'V850', 'SOLIN', 'FSNT', 'FSNTOA', 'FSUTOA', 'FSUTOA_d1', 'FSUTOAC', 'FSUTOAC_d1', 'FLUT', 'FLNT', 'FLUTC', 'FLNTC', 'FSNSC', 'FSDSC', 'CLDLOW_CAL', 'CLDMED_CAL', 'CLDHGH_CAL'
162a163,165
> fincl8 = 'cdnc', 'lwp', 'iwp', 'lcc', 'icc', 'clt', 'cod', 'ccn', 'ttop', 'OMEGA500', 'OMEGA700', 'TH7001000', 'U850', 'V850', 'SOLIN', 'FSNT', 'FSNTOA', 'FSUTOA', 'FSUTOA_d1', 'FSUTOAC', 'FSUTOAC_d1', 'FLUT', 'FLNT', 'FLUTC', 'FLNTC', 'FSNSC', 'FSDSC', 'CLDLOW_CAL', 'CLDMED_CAL', 'CLDHGH_CAL', 'AODVIS', 'FSNT', 'FLNT', 'FSNTC', 'FLNTC', 'FSNT_d1', 'FLNT_d1', 'FSNTC_d1', 'FLNTC_d1', 'FSNS', 'FLNS', 'FSNSC', 'FLNSC', 'FSNS_d1', 'FLNS_d1', 'FSNSC_d1', 'FLNSC_d1', 'CLDHGH', 'CLDMED', 'CLDLOW', 'T', 'TS', 'TREFHT', 'BURDEN1', 'BURDEN2', 'BURDEN3', 'BURDEN4', 'BURDEN5', 'BURDENSO4', 'BURDENSEASALT', 'BURDENDUST', 'AODSS', 'AODSO4', 'AODDUST'
> fincl9 = 'PS','Q','T','Z3','CLOUD','CONCLD','CLDICE','CLDLIQ','FREQR','REI','REL','PRECT','TMQ','PRECC','TREFHT','QREFHT','OMEGA','CLDTOT','LHFLX','SHFLX','FLDS','FSDS','FLNS','FSNS','FLNSC','FSDSC','FSNSC','AODVIS','AODABS','LS_FLXPRC','LS_FLXSNW','LS_REFFRAIN','ZMFLXPRC','ZMFLXSNW','CCN1','CCN2','CCN3','CCN4','CCN5','num_a1','num_a2','num_a3','num_a4','so4_a1','so4_a2','so4_a3','AREL','TGCLDLWP','AQRAIN','ANRAIN','FREQR','PRECL','RELHUM'
> fincl9lonlat='262.5e_36.6n','204.6e_71.3n','147.4e_2.0s','166.9e_0.5s','130.9e_12.4s','331.97e_39.09n'
177,181c180,186
< ! history_aero_optics = .true.
< ! history_aerosol = .true.
< ! history_amwg = .true.
< ! history_budget = .true.
< ! history_verbose = .true.
---
> history_aero_optics = .true.
> history_aerosol = .true.
> history_amwg = .true.
> history_budget = .true.
> do_aerocom_ind3 = .true.
> cosp_llidar_sim = .true.
> history_verbose = .true.
285c290
< ! rad_diag_1 = 'A:H2OLNZ:H2O','N:O2:O2','N:CO2:CO2','A:O3:O3','A:N2OLNZ:N2O','A:CH4LNZ:CH4','N:CFC11:CFC11','N:CFC12:CFC12'
---
> rad_diag_1 = 'A:H2OLNZ:H2O','N:O2:O2','N:CO2:CO2','A:O3:O3','A:N2OLNZ:N2O','A:CH4LNZ:CH4','N:CFC11:CFC11','N:CFC12:CFC12'
```
  3. Ensure that fetch_code and the other settings are configured correctly.
  4. Run the script.

What actually happens

The simulation segfaults right after 2001-10-01. I ran it on chrysalis seven times, before and after the machine maintenance, and the segfault occurred at exactly the same spot every time. An example snippet of the logs is below.

example signal 11 and backtrace:

```
1609: imp_sol : @ (lchnk,lev,col) = 7010 66 3 failed
1609: 1 times
1865: imp_sol: Time step 1.8000000000000E+03 failed to converge @ (lchnk,lev,col,nstep) = 7266 65 3295344
1865: imp_sol : @ (lchnk,lev,col) = 7266 65 3 failed
1865: 1 times
1728: [chr-0254:1434860:0:1434860] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
2688: [chr-0430:3248581:0:3248581] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
1024: [chr-0161:3164882:0:3164882] Caught signal 11 (Segmentation fault: address not mapped to object at address 0x28)
1728: ==== backtrace (tid:1434860) ====
1728:  0 0x0000000000012b20 .annobin_sigaction.c()  sigaction.c:0
1728:  1 0x000000000000b6d3 do_bcast()  ???:0
1728:  2 0x000000000000d140 vmc_bcast()  ???:0
1728:  3 0x000000000000320a hmca_mcast_vmc_bcast()  ???:0
1728:  4 0x0000000000008ae3 hmca_bcol_ucx_p2p_bcast_mcast()  ???:0
1728:  5 0x000000000004adc1 hmca_coll_ml_parallel_bcast()  ???:0
1728:  6 0x0000000000101b23 mca_coll_hcoll_bcast()  /tmp/svcbuilder/spack-stage-openmpi-4.1.1-qiqkjbudgcwkdbgw6p5rdrujcu4davb2/spack-src/ompi/mca/coll/hcoll/coll_hcoll_ops.c:59
1728:  7 0x00000000000aae54 PMPI_Bcast()  /tmp/svcbuilder/spack-stage-openmpi-4.1.1-qiqkjbudgcwkdbgw6p5rdrujcu4davb2/spack-src/ompi/mpi/c/profile/pbcast.c:114
1728:  8 0x0000000000051524 ompi_bcast_f()  /tmp/svcbuilder/spack-stage-openmpi-4.1.1-qiqkjbudgcwkdbgw6p5rdrujcu4davb2/spack-src/ompi/mpi/fortran/mpif-h/profile/pbcast_f.c:80
1728:  9 0x0000000004a6763f shr_mpi_mod_mp_shr_mpi_bcastr0_()  /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis/code/share/util/shr_mpi_mod.F90:676
1728: 10 0x000000000485ede2 seq_infodata_mod_mp_seq_infodata_exchange_()  /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis/code/driver-mct/shr/seq_infodata_mod.F90:2557
1728: 11 0x0000000000456ade component_mod_mp_component_exch_()  /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis/code/driver-mct/main/component_mod.F90:902
1728: 12 0x000000000043854a cime_comp_mod_mp_cime_run_.A()  /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis/code/driver-mct/main/cime_comp_mod.F90:3986
1728: 13 0x0000000000455b80 MAIN__()  /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis/code/driver-mct/main/cime_driver.F90:153
1728: 14 0x0000000000425822 main()  ???:0
1728: 15 0x0000000000023493 __libc_start_main()  ???:0
1728: 16 0x000000000042572e _start()  ???:0
1728: =================================
1024: ==== backtrace (tid:3164882) ====
1024:  0 0x0000000000012b20 .annobin_sigaction.c()  sigaction.c:0
1024:  1 0x000000000000b6d3 do_bcast()  ???:0
1024:  2 0x000000000000d140 vmc_bcast()  ???:0
1024:  3 0x000000000000320a hmca_mcast_vmc_bcast()  ???:0
1024:  4 0x0000000000008ae3 hmca_bcol_ucx_p2p_bcast_mcast()  ???:0
1024:  5 0x000000000004adc1 hmca_coll_ml_parallel_bcast()  ???:0
1024:  6 0x0000000000101b23 mca_coll_hcoll_bcast()  /tmp/svcbuilder/spack-stage-openmpi-4.1.1-qiqkjbudgcwkdbgw6p5rdrujcu4davb2/spack-src/ompi/mca/coll/hcoll/coll_hcoll_ops.c:59
1024:  7 0x00000000000aae54 PMPI_Bcast()  /tmp/svcbuilder/spack-stage-openmpi-4.1.1-qiqkjbudgcwkdbgw6p5rdrujcu4davb2/spack-src/ompi/mpi/c/profile/pbcast.c:114
1024:  8 0x0000000000051524 ompi_bcast_f()  /tmp/svcbuilder/spack-stage-openmpi-4.1.1-qiqkjbudgcwkdbgw6p5rdrujcu4davb2/spack-src/ompi/mpi/fortran/mpif-h/profile/pbcast_f.c:80
1024:  9 0x0000000004a6763f shr_mpi_mod_mp_shr_mpi_bcastr0_()  /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis/code/share/util/shr_mpi_mod.F90:676
1024: 10 0x000000000485ede2 seq_infodata_mod_mp_seq_infodata_exchange_()  /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis/code/driver-mct/shr/seq_infodata_mod.F90:2557
1024: 11 0x0000000000456ade component_mod_mp_component_exch_()  /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis/code/driver-mct/main/component_mod.F90:902
1024: 12 0x000000000043854a cime_comp_mod_mp_cime_run_.A()  /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis/code/driver-mct/main/cime_comp_mod.F90:3986
1024: 13 0x0000000000455b80 MAIN__()  /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis/code/driver-mct/main/cime_driver.F90:153
1024: 14 0x0000000000425822 main()  ???:0
1024: 15 0x0000000000023493 __libc_start_main()  ???:0
1024: 16 0x000000000042572e _start()  ???:0
1024: =================================
2688: ==== backtrace (tid:3248581) ====
2688:  0 0x0000000000012b20 .annobin_sigaction.c()  sigaction.c:0
2688:  1 0x000000000000b6d3 do_bcast()  ???:0
2688:  2 0x000000000000d140 vmc_bcast()  ???:0
2688:  3 0x000000000000320a hmca_mcast_vmc_bcast()  ???:0
2688:  4 0x0000000000008ae3 hmca_bcol_ucx_p2p_bcast_mcast()  ???:0
2688:  5 0x000000000004adc1 hmca_coll_ml_parallel_bcast()  ???:0
2688:  6 0x0000000000101b23 mca_coll_hcoll_bcast()  /tmp/svcbuilder/spack-stage-openmpi-4.1.1-qiqkjbudgcwkdbgw6p5rdrujcu4davb2/spack-src/ompi/mca/coll/hcoll/coll_hcoll_ops.c:59
2688:  7 0x00000000000aae54 PMPI_Bcast()  /tmp/svcbuilder/spack-stage-openmpi-4.1.1-qiqkjbudgcwkdbgw6p5rdrujcu4davb2/spack-src/ompi/mpi/c/profile/pbcast.c:114
2688:  8 0x0000000000051524 ompi_bcast_f()  /tmp/svcbuilder/spack-stage-openmpi-4.1.1-qiqkjbudgcwkdbgw6p5rdrujcu4davb2/spack-src/ompi/mpi/fortran/mpif-h/profile/pbcast_f.c:80
2688:  9 0x0000000004a6763f shr_mpi_mod_mp_shr_mpi_bcastr0_()  /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis/code/share/util/shr_mpi_mod.F90:676
2688: 10 0x000000000485ede2 seq_infodata_mod_mp_seq_infodata_exchange_()  /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis/code/driver-mct/shr/seq_infodata_mod.F90:2557
2688: 11 0x0000000000456ade component_mod_mp_component_exch_()  /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis/code/driver-mct/main/component_mod.F90:902
2688: 12 0x000000000043854a cime_comp_mod_mp_cime_run_.A()  /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis/code/driver-mct/main/cime_comp_mod.F90:3986
2688: 13 0x0000000000455b80 MAIN__()  /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis/code/driver-mct/main/cime_driver.F90:153
2688: 14 0x0000000000425822 main()  ???:0
2688: 15 0x0000000000023493 __libc_start_main()  ???:0
2688: 16 0x000000000042572e _start()  ???:0
2688: =================================
srun: error: chr-0254: task 1728: Segmentation fault (core dumped)
srun: Terminating job step 283445.0
0: slurmstepd: error: *** STEP 283445.0 ON chr-0095 CANCELLED AT 2023-02-11T19:26:41 ***
2161: forrtl: error (78): process killed (SIGTERM)
2161: Image              PC                Routine            Line      Source
2161: libpnetcdf.so.3.0  000015555406B6BC  for__signal_handl  Unknown   Unknown
2161: libpthread-2.28.s  0000155551099B20  Unknown            Unknown   Unknown
2161: libmpi.so.40.30.1  00001555518D173C  ompi_request_defa  Unknown   Unknown
2161: libmpi.so.40.30.1  0000155551906DE2  MPI_Wait           Unknown   Unknown
2161: libmpi_mpifh.so.4  0000155551E9987F  mpi_wait           Unknown   Unknown
```

The major difference between the scripts

We are interested in verbose output to test hypotheses related to the cloud portions of the radiative forcing. In my modification, I turn on a lot of output, increasing both the number of fields and the output frequency; see, e.g., tapes 0, 7, and 8.

One possible explanation

In email conversation with @susburrows and @crterai, problematic memory allocation was floated as a potential cause. Consistent with this, the memory logging appears to show growing usage (e.g., see ${CASEROOT}/run/memory.*.log; a snippet is below, where TOD is the time of day, VSZ_* is the virtual memory size, and RSS_* is the resident set size).

snippet of memory log:

```
#TOD, VSZ_CPL_N_0, RSS_CPL_N_0, VSZ_ATM_N_0, RSS_ATM_N_0, VSZ_LND_N_0, RSS_LND_N_0, VSZ_ICE_N_0, RSS_ICE_N_0, VSZ_OCN_N_0, RSS_OCN_N_0, VSZ_GLC_N_0, RSS_GLC_N_0, VSZ_ROF_N_0, RSS_ROF_N_0, VSZ_WAV_N_0, RSS_WAV_N_0, VSZ_IAC_N_0, RSS_IAC_N_0
19850100.00000, 131572.891, 62806.785, 131572.891, 62806.785, 131572.891, 62806.785, 131572.891, 62806.785, 131572.891, 62806.785, 131572.891, 62806.785, 131572.891, 62806.785, 131572.891, 62806.785, 131572.891, 62806.785
19850102.00000, 136404.164, 68137.703, 136404.164, 68137.703, 136404.164, 68137.703, 136404.164, 68137.703, 136404.164, 68137.703, 136404.164, 68137.703, 136404.164, 68137.703, 136404.164, 68137.703, 136404.164, 68137.703
19850104.00000, 139010.285, 70708.500, 139010.285, 70708.500, 139010.285, 70708.500, 139010.285, 70708.500, 139010.285, 70708.500, 139010.285, 70708.500, 139010.285, 70708.500, 139010.285, 70708.500, 139010.285, 70708.500
19850104.00000, 140429.457, 72175.055, 140429.457, 72175.055, 140429.457, 72175.055, 140429.457, 72175.055, 140429.457, 72175.055, 140429.457, 72175.055, 140429.457, 72175.055, 140429.457, 72175.055, 140429.457, 72175.055
19850104.00000, 140933.492, 72639.461, 140933.492, 72639.461, 140933.492, 72639.461, 140933.492, 72639.461, 140933.492, 72639.461, 140933.492, 72639.461, 140933.492, 72639.461, 140933.492, 72639.461, 140933.492, 72639.461
...
...
...
20011030.00000, 183698.875, 114674.051, 183698.875, 114674.051, 183698.875, 114674.051, 183698.875, 114674.051, 183698.875, 114674.051, 183698.875, 114674.051, 183698.875, 114674.051, 183698.875, 114674.051, 183698.875, 114674.051
20011032.00000, 184499.785, 115677.332, 184499.785, 115677.332, 184499.785, 115677.332, 184499.785, 115677.332, 184499.785, 115677.332, 184499.785, 115677.332, 184499.785, 115677.332, 184499.785, 115677.332, 184499.785, 115677.332
20011100.00000, 185152.148, 116386.137, 185152.148, 116386.137, 185152.148, 116386.137, 185152.148, 116386.137, 185152.148, 116386.137, 185152.148, 116386.137, 185152.148, 116386.137, 185152.148, 116386.137, 185152.148, 116386.137
20011102.00000, 185302.902, 116538.180, 185302.902, 116538.180, 185302.902, 116538.180, 185302.902, 116538.180, 185302.902, 116538.180, 185302.902, 116538.180, 185302.902, 116538.180, 185302.902, 116538.180, 185302.902, 116538.180
20011104.00000, 184920.223, 116123.035, 184920.223, 116123.035, 184920.223, 116123.035, 184920.223, 116123.035, 184920.223, 116123.035, 184920.223, 116123.035, 184920.223, 116123.035, 184920.223, 116123.035, 184920.223, 116123.035
20011104.00000, 184979.785, 116104.832, 184979.785, 116104.832, 184979.785, 116104.832, 184979.785, 116104.832, 184979.785, 116104.832, 184979.785, 116104.832, 184979.785, 116104.832, 184979.785, 116104.832, 184979.785, 116104.832
20011104.00000, 184918.605, 116070.320, 184918.605, 116070.320, 184918.605, 116070.320, 184918.605, 116070.320, 184918.605, 116070.320, 184918.605, 116070.320, 184918.605, 116070.320, 184918.605, 116070.320, 184918.605, 116070.320
20011106.00000, 184852.578, 116088.887, 184852.578, 116088.887, 184852.578, 116088.887, 184852.578, 116088.887, 184852.578, 116088.887, 184852.578, 116088.887, 184852.578, 116088.887, 184852.578, 116088.887, 184852.578, 116088.887
20011108.00000, 183274.434, 114475.215, 183274.434, 114475.215, 183274.434, 114475.215, 183274.434, 114475.215, 183274.434, 114475.215, 183274.434, 114475.215, 183274.434, 114475.215, 183274.434, 114475.215, 183274.434, 114475.215
20011108.00000, 183286.434, 114478.395, 183286.434, 114478.395, 183286.434, 114478.395, 183286.434, 114478.395, 183286.434, 114478.395, 183286.434, 114478.395, 183286.434, 114478.395, 183286.434, 114478.395, 183286.434, 114478.395
20011108.00000, 183548.172, 114717.840, 183548.172, 114717.840, 183548.172, 114717.840, 183548.172, 114717.840, 183548.172, 114717.840, 183548.172, 114717.840, 183548.172, 114717.840, 183548.172, 114717.840, 183548.172, 114717.840
20011110.00000, 184601.762, 115791.066, 184601.762, 115791.066, 184601.762, 115791.066, 184601.762, 115791.066, 184601.762, 115791.066, 184601.762, 115791.066, 184601.762, 115791.066, 184601.762, 115791.066, 184601.762, 115791.066
```
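For anyone poking at these logs, a quick way to eyeball the growth is to pull out the time stamp and one RSS column. This is a minimal sketch, assuming the comma-separated layout shown above (TOD in column 1, RSS_ATM_N_0 in column 5) and the memory.*.log files under ${CASEROOT}/run; adjust the column index as needed.

```bash
# Minimal sketch: print the time stamp (column 1) and the ATM resident set
# size (column 5 in the header above) from the coupler memory logs, skipping
# the header line, to check whether RSS keeps growing over the run.
for f in ${CASEROOT}/run/memory.*.log; do
  echo "== $f =="
  awk -F', *' '!/^#/ { print $1, $5 }' "$f" | tail -n 20
done
```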

Potential workarounds

If the issue is indeed due to memory accumulation, a potential workaround is splitting jobs into smaller pieces (shorter wall times, more resubmits). The runs can be continued after the segfault using CONTINUE_RUN="TRUE"; a sketch of this workaround follows the log snippet below. Copying @wlin7, who helped me submit the continuation runs, for awareness and visibility. See below for the memory log after continuing.

snippet of memory log after continuing:

```
20011106.00000, 184810.023, 116035.113, 184810.023, 116035.113, 184810.023, 116035.113, 184810.023, 116035.113, 184810.023, 116035.113, 184810.023, 116035.113, 184810.023, 116035.113, 184810.023, 116035.113, 184810.023, 116035.113
20011108.00000, 183811.734, 114961.238, 183811.734, 114961.238, 183811.734, 114961.238, 183811.734, 114961.238, 183811.734, 114961.238, 183811.734, 114961.238, 183811.734, 114961.238, 183811.734, 114961.238, 183811.734, 114961.238
20011108.00000, 183831.734, 114967.109, 183831.734, 114967.109, 183831.734, 114967.109, 183831.734, 114967.109, 183831.734, 114967.109, 183831.734, 114967.109, 183831.734, 114967.109, 183831.734, 114967.109, 183831.734, 114967.109
20011108.00000, 183907.996, 115031.527, 183907.996, 115031.527, 183907.996, 115031.527, 183907.996, 115031.527, 183907.996, 115031.527, 183907.996, 115031.527, 183907.996, 115031.527, 183907.996, 115031.527, 183907.996, 115031.527
20011110.00000, 184742.074, 115928.895, 184742.074, 115928.895, 184742.074, 115928.895, 184742.074, 115928.895, 184742.074, 115928.895, 184742.074, 115928.895, 184742.074, 115928.895, 184742.074, 115928.895, 184742.074, 115928.895
20000100.00000, 134003.371, 64337.293, 134003.371, 64337.293, 134003.371, 64337.293, 134003.371, 64337.293, 134003.371, 64337.293, 134003.371, 64337.293, 134003.371, 64337.293, 134003.371, 64337.293, 134003.371, 64337.293
20000102.00000, 135633.641, 67162.969, 135633.641, 67162.969, 135633.641, 67162.969, 135633.641, 67162.969, 135633.641, 67162.969, 135633.641, 67162.969, 135633.641, 67162.969, 135633.641, 67162.969, 135633.641, 67162.969
20000104.00000, 138140.230, 69686.070, 138140.230, 69686.070, 138140.230, 69686.070, 138140.230, 69686.070, 138140.230, 69686.070, 138140.230, 69686.070, 138140.230, 69686.070, 138140.230, 69686.070, 138140.230, 69686.070
20000104.00000, 139673.371, 71218.363, 139673.371, 71218.363, 139673.371, 71218.363, 139673.371, 71218.363, 139673.371, 71218.363, 139673.371, 71218.363, 139673.371, 71218.363, 139673.371, 71218.363, 139673.371, 71218.363
```
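For completeness, here is a minimal sketch of that workaround using standard CIME commands from the case directory; the STOP_OPTION/STOP_N/RESUBMIT values below are placeholders, not the settings from the actual production script.

```bash
# Minimal sketch of the workaround: run in shorter segments with automatic
# resubmits, and restart from the last restart files after a crash.
# The segment length and resubmit count below are placeholders.
cd ${CASEROOT}

# Split the simulation into shorter segments that CIME resubmits automatically.
./xmlchange STOP_OPTION=nmonths,STOP_N=6,RESUBMIT=10

# After a segfault, pick up from the most recent restart files.
./xmlchange CONTINUE_RUN=TRUE
./case.submit
```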
sarats commented 1 year ago

One thing to check: do you have an old version of the machine file configuration, or is it in sync with upstream master?

mahf708 commented 1 year ago

It's from the NGD_v3atm branch: https://github.com/E3SM-Project/v3atm/tree/NGD_v3atm. Not sure if that answers your questions though! I am pretty new here...

sarats commented 1 year ago

The crashes are apparently in the MPI layer during the info exchange. I wonder if you are asking it to allocate a big chunk of memory?

sarats commented 1 year ago

This branch is 310 commits ahead, 3300 commits behind master.

I think someone familiar with this branch's development history needs to take a look.

mahf708 commented 1 year ago

> One thing to check: do you have an old version of the machine file configuration, or is it in sync with upstream master?

This may also explain why case.st_archive isn't working automatically for me (it requests 43 debug nodes on chrysalis) even though it works manually 😄

Btw, this specific run is at /lcrc/group/e3sm/ac.ngmahfouz/E3SMv3_dev/run.20230209.amip.v3atm.base.chrysalis; you can change base to base2, dmsX, ssaX, and dustX (where X is 2 or 4, and later 8) for my other cases. I would examine base2, because base got messed up somehow and refused to continue, if that helps. The only key change is the aggressive outputting (at least the only key change by design, that is).

wlin7 commented 1 year ago

> This branch is 310 commits ahead, 3300 commits behind master.
>
> I think someone familiar with this branch's development history needs to take a look.

@sarats, we chose to leave this branch well behind master so that the out-of-sync NGD development codes could be put together for the assessment simulations. That was certainly a big compromise, made to avoid delaying the assessment effort for the proposed NGD features.

sarats commented 1 year ago

I understand. I just meant that someone who knows what changes are in this branch would have better context for understanding the root cause of this issue. Unfortunately, I don't have the context or the cycles to debug this.

wlin7 commented 1 year ago

Unfortunately, for this issue I don't know the cause any better. We are relying on @mahf708 to dig deeper.

sarats commented 1 year ago

This branch is definitely using old machine configs as I don't see any cmake macro files in https://github.com/E3SM-Project/v3atm/tree/NGD_v3atm/cime_config/machines.

config_compilers.xml is no longer used in E3SM master.

You may want to check the config_machines.xml entry for Chrysalis, as well as the specific compiler's cmake macros, against https://github.com/E3SM-Project/E3SM/tree/master/cime_config/machines. Example: https://github.com/E3SM-Project/E3SM/blob/master/cime_config/machines/cmake_macros/intel_chrysalis.cmake
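One way to see how far the branch's machine files have drifted is to diff them against upstream. This is a sketch, assuming a local clone of the fork and an `upstream` remote that has not been added yet; it is not a command sequence from this thread.

```bash
# Sketch: compare this branch's machine configuration with upstream E3SM master.
# The remote name "upstream" is arbitrary; run from a clone of the v3atm fork.
git remote add upstream https://github.com/E3SM-Project/E3SM.git
git fetch upstream master

# Differences here (e.g., no cmake_macros directory, a stale config_machines.xml
# entry for chrysalis, or a leftover config_compilers.xml) indicate old configs.
git diff upstream/master -- cime_config/machines/ | less
ls cime_config/machines/cmake_macros/ 2>/dev/null || echo "no cmake_macros on this branch"
```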

susburrows commented 1 year ago

@wlin7 @sarats - while Naser (@mahf708) indeed has strong troubleshooting skills, he only joined the project a month ago (as a postdoc), and his work focuses primarily on diagnostics rather than model development. I don't think we should put the responsibility for diagnosing this issue on him. When I brought this up during the v3 integration call on Friday, @crterai suggested that the best approach might be to test batches of code from this branch for memory leaks before they are merged to master, rather than trying to diagnose/isolate the issue on this branch.

mahf708 commented 1 year ago

Yeah, I cannot reasonably spend more time on this, especially given other urgent matters (I am also supposed to be doing science, can you believe it?) and how far this branch's infrastructure lags both v3atm master and E3SM master. I do highly recommend that we bring the infrastructure parts of this branch up to date as soon as possible. I suspect there is more than one issue related to the outdated submodules.

I updated one submodule in #49, specifically to help report IO issues better so others can fix them. I will soon write a response to your questions about submodules, @wlin7, because they illustrate the stark pros and cons of relying on submodules in projects...
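For anyone following up on the submodule question, a quick way to see which submodules are out of sync is sketched below; it only inspects state and does not change any pinned versions.

```bash
# Sketch: inspect submodule state on this branch. A leading "+" in the status
# output means the checked-out commit differs from the one the superproject
# records; stale entries here are candidates for the updates discussed above.
git submodule status --recursive

# Make the working tree match exactly what the branch pins.
git submodule update --init --recursive
```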

mahf708 commented 1 year ago

Having said that, @wlin7 is working on rebasing this branch onto master, and a full integration effort will be underway any moment. So we can simply keep these items in mind as the integration/rebasing effort unfolds.

mahf708 commented 1 year ago

Closing this per internal notes at: https://acme-climate.atlassian.net/wiki/spaces/ATMOS/pages/3699310593/2023-03-03+Meeting+notes+-+v3+Integration?NO_SSR=1