NCAR / DART

Data Assimilation Research Testbed
https://dart.ucar.edu/
Apache License 2.0

Modify WRF-DART Tutorial scripting for Derecho #627

Closed braczka closed 6 months ago

braczka commented 8 months ago

Use case

Modify WRF-DART tutorial code to work with Derecho.

Is your feature request related to a problem?

Scripting is currently compatible with the decommissioned Cheyenne; however, both systems use PBS queueing systems, so modification should be minimal.

Describe your preferred solution

1) Adapt queuing system options for Derecho
2) Adapt mpi run commands
3) Adapt WRF and DART processor layout
4) Document system environment that works with precompiled Derecho compatible WRF executables and any other steps that depart from Cheyenne
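As a rough illustration of items 1 to 3, here is a minimal sh sketch that prints a PBS header and launch line for a given node count and tasks per node. It assumes 128 cores per Derecho CPU node, the `main` queue, and an `mpiexec` that accepts `-n` and `-ppn`; the function name `pbs_header`, the account code, the walltime and the `./wrf.exe` target are placeholders, not values from the tutorial scripts (which are written in csh).

```shell
#!/bin/sh
# Sketch only: build a PBS header plus launch line from a processor layout.
# CORES_PER_NODE, the queue name and the walltime are assumptions to verify
# against current Derecho documentation before use.
CORES_PER_NODE=128

pbs_header() {
    nodes=$1      # number of nodes requested
    ppn=$2        # MPI tasks per node
    account=$3    # project code (hypothetical placeholder)
    if [ "$ppn" -gt "$CORES_PER_NODE" ]; then
        echo "error: $ppn tasks per node exceeds $CORES_PER_NODE cores" >&2
        return 1
    fi
    # Heredoc body is literal text; only the ${...} and $((...)) parts expand.
    cat <<EOF
#PBS -A ${account}
#PBS -q main
#PBS -l walltime=00:30:00
#PBS -l select=${nodes}:ncpus=${CORES_PER_NODE}:mpiprocs=${ppn}
mpiexec -n $((nodes * ppn)) -ppn ${ppn} ./wrf.exe
EOF
}
```

Calling `pbs_header 2 64 P00000000` prints a header requesting two nodes with 64 MPI tasks each and a matching 128-task launch line. Keeping the layout in one place means the select line and the task count passed to mpiexec cannot drift apart.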

Describe any alternatives you have considered

None.

braczka commented 8 months ago

Initial testing suggests that modifying the system environment through module load commands in c-shell scripts does not work on Derecho. The initial workaround was to set up the environment through the log-in .tcshrc script instead.

Initial testing of PBS scripting with a WRF job produces output and error files for each task within a single ensemble member. Prior submissions on Cheyenne produced only a single output/error file pair. This could be a non-issue, as the WRF simulations are successful.

braczka commented 8 months ago

Update: Tutorial works OK until step:

./driver.csh 2017042706 param.csh >& run.out &

The DART filter step completes successfully; however, the subsequent WRF model advance after the update fails for some ensemble members. Still investigating the cause. Possible reasons include use of a newer WRF version (4.0) and use of a pre-compiled wrfda executable. The WRF-DART tutorial was designed with WRF 3.9.1.
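To narrow down which members fail, one option is to scan each member's WRF log for the "SUCCESS COMPLETE WRF" line that WRF writes on a normal finish. The sketch below assumes one directory per member named `advance_temp*` containing `rsl.out.0000`; that layout and the function name `list_failed_members` are assumptions for illustration, so adjust the glob to match the actual run directory.

```shell
#!/bin/sh
# Sketch only: print the member directories whose WRF advance did not finish.
# A member counts as failed if rsl.out.0000 is missing or lacks the
# "SUCCESS COMPLETE WRF" line.
list_failed_members() {
    root=$1    # directory holding the per-member advance directories (assumed layout)
    for dir in "$root"/advance_temp*; do
        [ -d "$dir" ] || continue    # skips the unexpanded glob when nothing matches
        if ! grep -q "SUCCESS COMPLETE WRF" "$dir/rsl.out.0000" 2>/dev/null; then
            basename "$dir"
        fi
    done
}
```

Running `list_failed_members "$RUN_DIR"` (with `RUN_DIR` as a placeholder for the run directory) gives a short list of members to inspect before digging into individual rsl.error files.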

braczka commented 8 months ago

Csh script submission errors: Inserting the command 'source /etc/profile.d/z00_modules.csh' into the csh script within a PBS submission allows the module load command to work. Cisl-help recommended avoiding the usage of module load in any non-PBS csh scripts for now. Both these issues should be addressed in the next downtime (Feb 5-7).

To isolate the cause of the WRFv4.0 model advance failure, I switched back to WRFv3.9.1, which has been used/recommended to run the tutorial. Source code for the WRF, WPS and WRFDA builds on Derecho is located here:

WRF_DM_SRC_DIR = /glade/work/bmraczka/WRF/WRFV3.9.1.1.TAR.gz
WPS_SRC_DIR = /glade/work/bmraczka/WRF/WPSV3.9.1.TAR.gz
VAR_SRC_DIR = /glade/work/bmraczka/WRF/WRFDA_V3.9.1.tar.gz

Following guidance from cislhelp, I switched from the standard intel compiler to gnu to compile WRF on Derecho. The upgrade in GCC has led to some build errors that require certain compiler settings as a workaround, as documented here: https://forum.mmm.ucar.edu/threads/how-to-fix-rank-mismatch-between-actual-argument-at-1-and-actual-argument-at-2-scalar-and-rank-1.14995/
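The two workaround flags used in the build steps below, -fallow-argument-mismatch and -fallow-invalid-boz, were introduced in GCC 10, so older gfortran versions do not accept them. A minimal sh sketch of choosing the flags from a version string follows; the function name `gnu_compat_flags` is hypothetical and the version is passed in by hand rather than queried from the compiler.

```shell
#!/bin/sh
# Sketch only: print the gfortran compatibility flags needed for a given
# GCC version string. Prints nothing for versions before 10, which do not
# recognise these options.
gnu_compat_flags() {
    version=$1              # e.g. "12.2.0", as in the gcc/12.2.0 module
    major=${version%%.*}    # text before the first dot
    if [ "$major" -ge 10 ]; then
        echo "-fallow-argument-mismatch -fallow-invalid-boz"
    fi
}
```

For example, `gnu_compat_flags 12.2.0` prints both flags, while `gnu_compat_flags 9.3.0` prints nothing.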

To successfully build the required WRF, WPS and WRFDA executables I did the following:

>> module --force purge
>> module load ncarenv/23.09 gcc/12.2.0 udunits/2.2.28 ncview/2.1.9 ncarcompilers/1.0.0 craype/2.7.23 cray-mpich/8.1.27 hdf5-mpi/1.12.2 netcdf-mpi/4.9.2
>> cd {WRF_directory}
>> ./configure    # Choose gnu dmpar option (34), then option 1 to generate configure.wrf

! Edits to configure.wrf file
 FCBASEOPTS = $(FCBASEOPTS_NO_G) $(FCDEBUG) -fallow-argument-mismatch  -fallow-invalid-boz
...
LDFLAGS = $(OMP) $(FCFLAGS) $(LDFLAGS_LOCAL) -ltirpc

>> ./compile em_real  >& compile.log
>> cd {WPS_directory}
>> ./configure    # Choose option 1 for gfortran (serial)

! Edits to configure.wps
FFLAGS              = -ffree-form -O -fconvert=big-endian -frecord-marker=4 -fallow-argument-mismatch -fallow-invalid-boz
F77FLAGS            = -ffixed-form -O -fconvert=big-endian -frecord-marker=4 -fallow-argument-mismatch -fallow-invalid-boz

Edit the WPS install file ~/ungrib/src/ngl/g2/intmath.f

The solution posted on GitHub here (https://github.com/wrf-model/WPS/pull/119/files) resolves the error: Argument of 'iand' have different type parameters

>> ./compile >& compile.log
>> cd {WRFDA_directory}
>> ./configure wrfda    # Choose option 34 (dmpar) GNU (gfortran/gcc)

Edit configure.wrf file

  FCBASEOPTS = $(FCBASEOPTS_NO_G) $(FCDEBUG) -fallow-argument-mismatch -fallow-invalid-boz
  FCOPTIM = -O2 -ftree-vectorize -funroll-loops -fallow-argument-mismatch

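Make a manual edit in {WRFDA_directory}/var/da/da_monitor/da_rad_diags.f90 so that instid, file_prefix, start_date and end_date are declared before the namelist statement (their original declarations further down are commented out). This avoids the "Symbol ... must be declared before the namelist is declared" error: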

   integer                                :: nproc, cycle_period
   integer, parameter                     :: maxnum = 20
   character(len=20), dimension(maxnum)   :: instid
   character(len=6)                       :: file_prefix
   character(len=10)                      :: start_date, end_date

   namelist /record1/ nproc, instid, file_prefix, start_date, end_date, cycle_period
           ! nproc: number of processors used when writing out inv files
           ! instid, eg dmsp-16-ssmis
           ! file_prefix, inv or oma
           ! start_date, end_date, eg 2006100100, 2006102800
           ! cycle_period (hours) between dates, eg 6 or 12
   integer, parameter                     :: maxlvl = 100
   integer                                :: nml_unit = 87
   integer                                :: nlev, ilev, ich
   integer                                :: nlev_rtm, nlev_mdl
!  character(len=20), dimension(maxnum)   :: instid
!  character(len=6)                       :: file_prefix
!  character(len=10)                      :: start_date, end_date

./compile all_wrfvar >& compile.log

Check that all executables exist at the end of each compile step, following the documentation: https://www2.mmm.ucar.edu/wrf/users/docs/user_guide_V3/user_guide_V3.9/users_guide_chap2.htm#_Required_Compilers_and_1
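The executable check can be scripted. The sh sketch below reports which of a list of expected files exist and are executable in a build directory; the function name `check_executables` and the directory variables in the comments are hypothetical, and the file lists are typical outputs of these builds rather than a complete inventory.

```shell
#!/bin/sh
# Sketch only: report which expected executables are present in a directory.
# Returns 0 if all are present and executable, non-zero otherwise.
check_executables() {
    dir=$1
    shift
    missing=0
    for exe in "$@"; do
        if [ -x "$dir/$exe" ]; then
            echo "ok       $dir/$exe"
        else
            echo "MISSING  $dir/$exe"
            missing=$((missing + 1))
        fi
    done
    [ "$missing" -eq 0 ]
}

# Example calls (directory variables are placeholders):
#   check_executables "$WRF_DIR/main" wrf.exe real.exe ndown.exe tc.exe
#   check_executables "$WPS_DIR" geogrid.exe ungrib.exe metgrid.exe
#   check_executables "$WRFDA_DIR/var/build" da_wrfvar.exe
```

Because the function returns non-zero when anything is missing, it can gate later steps, for example `check_executables "$WRF_DIR/main" wrf.exe real.exe || exit 1`.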

braczka commented 8 months ago

Successfully ran full WRF-DART Tutorial on Derecho. All output statistics/diagnostics looked nearly identical to the previous Cheyenne intel compiler example provided in the WRF-DART web diagnostics section. I used the gfortran compiler for WRF executables (as described in previous comments) and also with the DART build. For the DART build I used the mkmf.template.gfortran as template and edited the following line:

FFLAGS = -O2 -ffree-line-length-none -fallow-argument-mismatch -fallow-invalid-boz $(INCS)
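The same kind of edit recurs for FCBASEOPTS in configure.wrf and FFLAGS in configure.wps, so it may help to script it. The sh sketch below appends a flag to the end of a variable's assignment line unless the flag is already there; the function name `add_flag` is hypothetical, and it appends after `$(INCS)` rather than before it, which differs in order from the hand-edited line above.

```shell
#!/bin/sh
# Sketch only: append FLAG to the "VAR = ..." line of a makefile-style file.
# Does nothing if an assignment line for VAR already contains the flag.
# Commented-out assignments (lines starting with '#') are not matched.
add_flag() {
    file=$1
    var=$2
    flag=$3
    if grep -E "^[[:space:]]*${var}[[:space:]]*=" "$file" | grep -F -q -- "$flag"; then
        return 0
    fi
    # On matching lines, replace end-of-line with " FLAG".
    sed "/^[[:space:]]*${var}[[:space:]]*=/ s/\$/ ${flag}/" "$file" > "$file.tmp" \
        && mv "$file.tmp" "$file"
}
```

For example, on a working copy of the template (the file name here is a placeholder), `add_flag mkmf.template FFLAGS -fallow-argument-mismatch` followed by `add_flag mkmf.template FFLAGS -fallow-invalid-boz` adds both flags once, and rerunning either command leaves the file unchanged.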

Because the tutorial code often uses nco and ncl commands, and current Derecho environment makes it challenging to load these modules using csh scripting, this necessitated insertion of:

   source /etc/profile.d/z00_modules.csh
   module load nco
   module load ncl 

within PBS portion of init_ensemble_var.csh script. Because driver.csh also requires nco commands in non-PBS scripting I also inserted module load nco and ncl commands within my home directory .tcshrc to generate the proper environment.
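To find which tutorial scripts would need this workaround, a quick check is to flag any csh script that calls module load without sourcing the modules init file first. The sh sketch below does that with grep; the function name `needs_module_init` is hypothetical, and the check only looks for the two strings, so it does not verify that the source line comes before the module load.

```shell
#!/bin/sh
# Sketch only: succeed (return 0) if a script uses "module load" but never
# references z00_modules.csh, meaning it may need the workaround.
needs_module_init() {
    script=$1
    # No "module load" line at all: nothing to fix.
    grep -q '^[[:space:]]*module[[:space:]]\{1,\}load' "$script" || return 1
    # Has "module load"; needs the fix only if the init file is never mentioned.
    ! grep -q 'z00_modules\.csh' "$script"
}
```

A loop such as `for s in *.csh; do needs_module_init "$s" && echo "$s"; done` then lists the candidate scripts.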

Based on this, I can generate a PR to update the WRF-DART tutorial csh scripting itself, and also provide improved documentation on how to generate the correct environment on Derecho. Not sure if I should wait on issuing PR given the system will be undergoing changes during the Feb 5-7th downtime. Probably will issue a draft PR and wait until system is more stable before trying to merge.

WRF source code for this build and simulation is located here:

WRF_DM_SRC_DIR    = /glade/work/bmraczka/WRF/WRFv3.9.1.1       
WPS_SRC_DIR       = /glade/work/bmraczka/WRF/WPSv3.9.1                   
VAR_SRC_DIR       = /glade/work/bmraczka/WRF/WRFDAv3.9.1          

My WRF-DART tutorial example (Derecho, gfortran, WRFv3.9.1) is located here:

/glade/derecho/scratch/bmraczka/WRFv3.9.1_DART_Tutorial/

My prior example (Derecho, precompiled intel executables, WRFv4.0) is located here:

/glade/derecho/scratch/bmraczka/WRFv3.9.1_DART_Tutorial/

I am circling back to the WRFv4.0 case to figure out why it failed on Derecho. Possible causes: the newer WRF version, the hybrid-coordinate system, or an intel compiler issue.

braczka commented 8 months ago

Typo fix:

Prior example (Derecho, precompiled intel executables, WRFv4.0) located here:

/glade/derecho/scratch/bmraczka/WRF_DART_Tutorial/

braczka commented 7 months ago

The csh module load issues were resolved during the Feb 5-7th downtime. Module load commands now work both in directly executed csh scripts and in PBS submissions, so the subsequent PR will not include the temporary csh-related fixes mentioned earlier in this issue.

hkershaw-brown commented 6 months ago

fixed by #636