EarthWorksOrg / EarthWorks


EarthWorks porting to Perlmutter #33

Closed: gdicker1 closed this issue 3 weeks ago

gdicker1 commented 5 months ago

This issue is intended to capture the work needed and the issues being experienced when running EarthWorks on Perlmutter. This issue can be closed once there is a reliable initial state on Perlmutter. This includes:

  • [ ] Appropriate machine configuration (ccs_configs)
  • [x] Modules (nvhpc software stack)
  • [x] A generally accessible input data space (at least by all EW developers)

To check this, an example test (FHS94 on the mpasa120_mpasa120 grid) should be able to:

  • [x] run create_newcase script
  • [x] run case.setup
  • [x] run case.build
  • [ ] run case.submit

(It's fine if these steps require some small edits, as long as they are in case-specific files like user_nl_cam.)
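For reference, a minimal sketch of that end-to-end check using the standard CIME case workflow. The compset, grid, machine, compiler, and inputdata path are the ones discussed in this issue; the case name/location and the cime/scripts working directory are placeholders, and a project/account flag may also be needed on Perlmutter.

# Minimal sketch of the FHS94 check (case name and paths are placeholders)
cd cime/scripts
./create_newcase --case $SCRATCH/EW_FHS94_test \
    --compset FHS94 --res mpasa120_mpasa120 \
    --machine perlmutter_ew_debug --compiler nvhpc \
    -i /global/cfs/cdirs/m4180/inputdata
cd $SCRATCH/EW_FHS94_test
./case.setup
./case.build
./case.submit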

gdicker1 commented 5 months ago
  • [ ] Appropriate machine configuration (ccs_configs)

Currently I'm having an issue with this. After following the instructions provided (here), I get errors during my case.setup step:

./case.setup
ERROR: module command /usr/share/lmod/lmod/libexec/lmod python purge  failed with message:
Unloading the cpe module is insufficient to restore the system defaults.
Please run 'source /opt/cray/pe/cpe/23.12/restore_lmod_system_defaults.[csh|sh]'.
ERROR: case.setup failed
--- End loop for EWv21_PmtrDbg_FHS94.mpasa120.perlmutter_ew_debug.nvhpc.64 ---

I think I can solve this by removing the "purge" and "rm" commands from the perlmutter_ew_debug entry.
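For anyone hitting the same error, the Lmod behavior can be reproduced outside of CIME; the restore-script path below is taken straight from the message above (CPE 23.12), so treat it as version-specific rather than a general fix.

# Reproduce the Lmod complaint on a Perlmutter login node, outside of CIME
module purge
# Lmod should print the same warning about the cpe module; the script it suggests
# restores the system default module set for this CPE version:
source /opt/cray/pe/cpe/23.12/restore_lmod_system_defaults.sh

This only confirms where the error comes from; the actual change being tried is dropping the "purge" and "rm" commands from the machine entry.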

gdicker1 commented 5 months ago
  • [x] Modules (nvhpc software stack)

I think we do have this, just something to keep track of and update.
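One lightweight way to keep track of the stack (a suggestion, not something already in place): snapshot the modules a case actually loads. This assumes the .env_mach_specific.sh script that CIME writes into the case directory.

# From inside a case directory: load the case's modules, then record them
source .env_mach_specific.sh
module -t list 2>&1 | sort > modules_nvhpc_$(date +%Y%m%d).txt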

gdicker1 commented 5 months ago

To check this, an example test (FHS94 on the mpasa120_mpasa120 grid) should be able to:

  • [x] run create_newcase script
  • [x] run case.setup
  • [x] run case.build
  • [ ] run case.submit

Right now, case.submit fails during the check_input_data step due to missing MPAS-A partition files (i.e. files I knew weren't in a CESM input data source, so I hadn't copied them over).
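A sketch of the workaround I'd expect here, assuming the partition files already exist on another machine; the source host, the atm/mpas subdirectory, and the case path are placeholders, and only the Perlmutter inputdata path is the real one used in this issue.

# Copy the MPAS-A partition files by hand, then let CIME re-check and submit
rsync -av other_machine:/path/to/inputdata/atm/mpas/ \
    /global/cfs/cdirs/m4180/inputdata/atm/mpas/
cd /path/to/case          # placeholder for the case directory
./check_input_data        # should now find the partition files
./case.submit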

gdicker1 commented 5 months ago

Related PR: https://github.com/EarthWorksOrg/ccs_config_cesm/pull/17

gdicker1 commented 5 months ago
  • [x] A generally accessible input data space (at least by all EW developers)

See this comment in #34

gdicker1 commented 5 months ago
  • [ ] run case.submit

@supreethms1809 I think I need some help here. I created a case on Perlmutter (using --machine perlmutter_ew_debug and -i /global/cfs/cdirs/m4180/inputdata), but things fail with an MPI abort error when running the compset. The run fails early (during init) with no output to drv.log (the only other file in the run directory).

From file: "/pscratch/sd/g/gdicker/2024Mar08-164113_EWv21_PmtrDbg_FHS94.mpasa120.perlmutter_ew_debug.nvhpc.64/run/cesm.log.22715130.240308-170402" on Perlmutter

... # repeated (t_initf) output per thread
26:  (t_initf)       profile_ovhd_measurement=  F
26:  (t_initf)       profile_add_detail=        F
26:  (t_initf)       profile_papi_enable=       F
 5:  (t_initf) Read in prof_inparm namelist from: drv_in
 5:  (t_initf) Using profile_disable=           F
 5:  (t_initf)       profile_timer=                       4
 5:  (t_initf)       profile_depth_limit=                 4
 5:  (t_initf)       profile_detail_limit=                2
 5:  (t_initf)       profile_barrier=           F
 5:  (t_initf)       profile_outpe_num=                   1
 5:  (t_initf)       profile_outpe_stride=                0
 5:  (t_initf)       profile_single_file=       F
 5:  (t_initf)       profile_global_stats=      T
 5:  (t_initf)       profile_ovhd_measurement=  F
 5:  (t_initf)       profile_add_detail=        F
 5:  (t_initf)       profile_papi_enable=       F
 ... # repeated MPI_ABORT output per thread
 5: --------------------------------------------------------------------------
 5: MPI_ABORT was invoked on rank 0 in communicator MPI COMMUNICATOR 3 CREATE FROM 0
 5:   Proc: [[38668,0],0]
 5:   Errorcode: 1
 5:
 5: NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
 5: You may or may not see output from other processes, depending on
 5: exactly when Open MPI kills them.
 5: --------------------------------------------------------------------------
srun: error: nid002180: tasks 5,14-16,18-19,23,26,39,42,47,50,59: Exited with exit code 1
srun: Terminating StepId=22715130.0
 0: slurmstepd: error: *** STEP 22715130.0 ON nid002180 CANCELLED AT 2024-03-09T01:04:16 ***
srun: error: nid002180: tasks 0-4,6-13,17,20-22,24-25,27-38,40-41,43-46,48-49,51-58,60-63: Terminated
srun: Force Terminated StepId=22715130.0

I haven't tried a run with DEBUG=true because I think NVHPC dies in general when we turn that on for EW/CESM.
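If it helps, a sketch of how to get more information without a full debug build; INFO_DBUG is the usual CIME setting for driver verbosity, and whether it surfaces anything before this abort is an assumption.

# From the case directory: bump driver verbosity and resubmit (no rebuild needed)
./xmlchange INFO_DBUG=2
./case.submit
# The heavier option is a full debug build, which NVHPC has struggled with for EW:
# ./xmlchange DEBUG=TRUE
# ./case.build --clean-all && ./case.build && ./case.submit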

gdicker1 commented 3 weeks ago

Closing due to lack of progress/interest. This can be re-opened later.