Closed: gdicker1, 3 months ago

This issue is intended to capture the work needed and the issues being experienced when running EarthWorks on Perlmutter. It can be closed once there is a reliable initial state on Perlmutter. This includes:
- [ ] Appropriate machine configuration (ccs_configs)
Currently I'm having an issue with this. After following the instructions provided (here), I get errors during my `case.setup` step:
```
./case.setup
ERROR: module command /usr/share/lmod/lmod/libexec/lmod python purge failed with message:
Unloading the cpe module is insufficient to restore the system defaults.
Please run 'source /opt/cray/pe/cpe/23.12/restore_lmod_system_defaults.[csh|sh]'.
ERROR: case.setup failed
--- End loop for EWv21_PmtrDbg_FHS94.mpasa120.perlmutter_ew_debug.nvhpc.64 ---
```
I think I can solve this by removing the `purge` and `rm` commands from the `perlmutter_ew_debug` entry.
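As a sketch of where that change would go (the `config_machines.xml` location under ccs_config is an assumption about the repo layout; the `source` line comes straight from the error message above):

```sh
# Hypothetical: locate the module-command block for this machine entry so the
# "purge"/"rm" commands can be removed; the file path is an assumption.
grep -n -B 2 -A 10 'perlmutter_ew_debug' ccs_config/machines/config_machines.xml

# If a purge has already broken the Cray PE defaults in the current shell,
# the error message itself gives the recovery step:
source /opt/cray/pe/cpe/23.12/restore_lmod_system_defaults.sh
```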
- [x] Modules (nvhpc software stack)
I think we do have this; it's just something to keep track of and update.
To check this, an example test (FHS94 on the mpasa120_mpasa120 grid) should be able to (see the sketch after this list):
- [x] run the `create_newcase` script
- [x] run `case.setup`
- [x] run `case.build`
- [ ] run `case.submit`

(It's fine if these steps require some small edits, as long as they are in case-specific files like `user_nl_cam`.)
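A sketch of that sequence as it could be run end to end (the case name is made up, and the `--compset`/`--res`/`--compiler` values are inferred from the case names in this thread rather than a verified recipe):

```sh
# Hypothetical test workflow; compset, grid, machine, and compiler are taken
# from the case names mentioned in this issue. --run-unsupported may be
# needed for compset/grid combinations without scientific support.
./cime/scripts/create_newcase --case EW_FHS94_test \
  --compset FHS94 --res mpasa120_mpasa120 \
  --machine perlmutter_ew_debug --compiler nvhpc --run-unsupported
cd EW_FHS94_test
./case.setup
./case.build
./case.submit
```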
Right now, `case.submit` failed due to missing MPAS-A partition files in the `check_input_data` step (i.e., I didn't copy over files that I knew weren't in a CESM input data source).
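A sketch of the workaround I have in mind for the partition files (the destination subdirectory and file names follow common MPAS naming but are assumptions, not verified paths):

```sh
# check_input_data can fetch anything hosted on the CESM input-data servers:
./check_input_data --download

# The MPAS-A graph partition files are not hosted there, so they have to be
# copied into the shared input-data tree by hand. The source path and exact
# file names below are hypothetical.
cp /path/to/mpasa120.graph.info.part.* \
  /global/cfs/cdirs/m4180/inputdata/atm/mpas/   # destination subdir is a guess
```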
- [x] A generally accessible input data space (at least for all EW developers)
See this comment in #34
- [ ] run `case.submit`
@supreethms1809 I think I need some help here. I created a case on Perlmutter (using `--machine perlmutter_ew_debug` and `-i /global/cfs/cdirs/m4180/inputdata`), but things fail with an MPI abort error when running the compset. The run fails early (during init) with no output to `drv.log` (the only other file in the run directory).
From `/pscratch/sd/g/gdicker/2024Mar08-164113_EWv21_PmtrDbg_FHS94.mpasa120.perlmutter_ew_debug.nvhpc.64/run/cesm.log.22715130.240308-170402` on Perlmutter:

```
... # repeated (t_initf) output per thread
26: (t_initf) profile_ovhd_measurement= F
26: (t_initf) profile_add_detail= F
26: (t_initf) profile_papi_enable= F
5: (t_initf) Read in prof_inparm namelist from: drv_in
5: (t_initf) Using profile_disable= F
5: (t_initf) profile_timer= 4
5: (t_initf) profile_depth_limit= 4
5: (t_initf) profile_detail_limit= 2
5: (t_initf) profile_barrier= F
5: (t_initf) profile_outpe_num= 1
5: (t_initf) profile_outpe_stride= 0
5: (t_initf) profile_single_file= F
5: (t_initf) profile_global_stats= T
5: (t_initf) profile_ovhd_measurement= F
5: (t_initf) profile_add_detail= F
5: (t_initf) profile_papi_enable= F
... # repeated MPI_ABORT output per thread
5: --------------------------------------------------------------------------
5: MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
5: Proc: [[38668,0],0]
5: Errorcode: 1
5:
5: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
5: You may or may not see output from other processes, depending on
5: exactly when Open MPI kills them.
5: --------------------------------------------------------------------------
srun: error: nid002180: tasks 5,14-16,18-19,23,26,39,42,47,50,59: Exited with exit code 1
srun: Terminating StepId=22715130.0
0: slurmstepd: error: *** STEP 22715130.0 ON nid002180 CANCELLED AT 2024-03-09T01:04:16 ***
srun: error: nid002180: tasks 0-4,6-13,17,20-22,24-25,27-38,40-41,43-46,48-49,51-58,60-63: Terminated
srun: Force Terminated StepId=22715130.0
```
I haven't tried a run with DEBUG=true because I think NVHPC dies in general when we turn that on for EW/CESM.
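For anyone who does want to try it, the standard CIME toggle would be the following (a sketch; whether NVHPC survives the debug build is exactly the open question):

```sh
./xmlchange DEBUG=TRUE    # standard CIME debug switch
./case.build --clean-all  # force a full rebuild so the debug flags take effect
./case.build
```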
Closing due to lack of progress/interest. This can be re-opened later.