ESCOMP / mizuRoute

Reach-based river routing model
http://escomp.github.io/mizuRoute/
GNU General Public License v3.0

ERP tests failing to run #406

Closed ekluzek closed 1 year ago

ekluzek commented 1 year ago

Exact restart tests with a change in thread count (ERP) are failing to run. The failure occurs when opening the restart file, for example in:

ERP_D_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default

Other compilers fail as well. ERS tests work, and ERP tests without mizuRoute work too.

nmizukami commented 1 year ago

Hi Erik, was this test run after merging PR #401, which is supposed to fix this? If so, is there still a problem finding the restart file?

ekluzek commented 1 year ago

This is with that merge. That solved #388 which allows ERI tests to work. Here I don't think the problem is finding the restart file, but it seems to have trouble when it tries to open the restart file. This ERP test runs with 1 MPI task and 25 threads, and then does the restart with half as many threads. A restart test where the number of threads doesn't change works fine, and starting up with either number of threads seems to be fine.

nmizukami commented 1 year ago

OK, I will need to take a close look at this one, as well as the amazon grid one.

ekluzek commented 1 year ago

OK, at first it seemed to have trouble only when running with MPI for a single task: I thought it was working with mpi-serial for both the intel and gnu compilers. I was wrong, though, and it fails for both MPI and non-MPI.

ERP_D_Mmpi-serial_P1x25.5x5_amazon.I2000Clm50Sp.cheyenne_intel.mizuroute-default

It might be useful to have a multi-task MPI test to make sure you can do restarts with a differing number of MPI tasks even though we know it will change answers because of #256.

ekluzek commented 1 year ago

My previous comment was actually incorrect, and it does fail for both MPI and non-MPI. I'll update the comment above...

ekluzek commented 1 year ago

The traceback for ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default looks like this:

Abort with message No such file or directory in file /glade/scratch/vanderwb/hpci-stack/220919-1520/4744/pio-2.5.9/src/clib/pioc_support.c at line 2832
Obtained 10 stack frames.
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0x111f749]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0x111dd7f]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0x111dd02]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0x111e626]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe(PIOc_openfile+0x16) [0x11197e6]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0x10d5a4f]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0xec0679]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0xeb0ba0]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0xe850d0]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0xddb6cc]
forrtl: error (76): Abort trap signal
Image              PC                Routine            Line        Source             
cesm.exe           000000000130AC5B  Unknown               Unknown  Unknown
libpthread-2.22.s  00002B6AF2016C00  Unknown               Unknown  Unknown
libc-2.22.so       00002B6AF35512A7  gsignal               Unknown  Unknown
libc-2.22.so       00002B6AF355267A  abort                 Unknown  Unknown
cesm.exe           000000000111DD84  Unknown               Unknown  Unknown
cesm.exe           000000000111DD02  Unknown               Unknown  Unknown
cesm.exe           000000000111E626  Unknown               Unknown  Unknown
cesm.exe           00000000011197E6  Unknown               Unknown  Unknown
cesm.exe           00000000010D5A4F  Unknown               Unknown  Unknown
cesm.exe           0000000000EC0679  pio_utils_mp_open         333  pio_utils.f90
cesm.exe           0000000000EB0BA0  historyfile_mp_op         301  historyFile.f90
cesm.exe           0000000000E850D0  write_simoutput_p         521  write_simoutput_pio.f90
cesm.exe           0000000000DDB6CC  rtmmod_mp_route_i         257  RtmMod.F90
cesm.exe           0000000000DC7C6B  rof_comp_nuopc_mp         531  rof_comp_nuopc.F90
libesmf.so         00002B6AEEDD6A40  _ZN5ESMCI6FTable1     Unknown  Unknown
libesmf.so         00002B6AEEDDA9EB  ESMCI_FTableCallE     Unknown  Unknown
libesmf.so         00002B6AEF424B3A  _ZN5ESMCI3VMK5ent     Unknown  Unknown
libesmf.so         00002B6AEF441105  _ZN5ESMCI2VM5ente     Unknown  Unknown
libesmf.so         00002B6AEEDD80FA  c_esmc_ftablecall     Unknown  Unknown

rof.log ends on...

(OPNFIL): Successfully opened file ./rpointer.rof on unit= 99
Reading restart data.....
(GETFIL): attempting to find local file ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist.mizuroute.r.2000-01-07-00000.nc
(GETFIL): using ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist.mizuroute.r.2000-01-07-00000.nc in current working directory

Line 333 of pio_utils is the open of the restart file:


    ierr = pio_openfile(pioIoSystem, pioFileDesc, iotype, trim(fname), mode)
    if(ierr/=pio_noerr)then; message=trim(message)//'Could not open netCDF'; return; endif

ekluzek commented 1 year ago

Note that the 25-thread and 12-thread ERS tests run as expected, so the thread count itself is not what breaks reading the restart files.

They fail the comparison because of #390, but they do run:

ERS_D_Mmpi-serial_P1x12.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default COMPARE_base_rest
ERS_D_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default COMPARE_base_rest

ekluzek commented 1 year ago

When I compare the restart namelists for the ERP test with those of the 12-thread ERS test, the comparison looks correct to me: the only difference is the case name.

diff -wbcr CaseDocs/ /glade/work/erik/ctsm_worktrees/mizuRoute/cime/scripts/cases/ERS_D_Mmpi-serial_P1x12.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/CaseDocs/ | less
(ctsm_pylib) case2/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist> pwd
/glade/work/erik/ctsm_worktrees/mizuRoute/cime/scripts/cases/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/case2/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist

nmizukami commented 1 year ago

301  historyFile.f90
521  write_simoutput_pio.f90

This is trying to open the history file, not the restart file, which seems incorrect. But the message said the file is not there, so is the rpointer file wrong?

ekluzek commented 1 year ago

Ahhh, you are right: the problem is that the history file isn't there. The restart file is, but the history file isn't being copied over like it should be.

ekluzek commented 1 year ago

OK, to get this to work, a string variable needs to be added to the restart file that contains the name(s) of the history file(s) that need to be read in. The name of that variable needs to be added to the config_archive.xml file as well. The history file name is the same as the one that gets put into the mizuroute.rpointer file. For clm this variable is called locfnh and it's added to the archive as...

<rest_history_varname>locfnh</rest_history_varname>

I propose something longer and more descriptive like "restart_history_filenames".

ekluzek commented 1 year ago

Actually, when gauge data is output, the filenames should include both the history file and the gauge file (so hfileout and hfileout_gage).

ekluzek commented 1 year ago

I've got this working in #391 (writing the filenames to the restart file and having them copied over). However, the ERP tests still fail because the restart tries to read a history file dated 2000-01-12-00000.nc rather than 2000-01-07-00000.nc. The restart pointer file also lists the Jan 12th date for the history file. This means something is going wrong with the logic for reading in the history file at restart.

nmizukami commented 1 year ago

Hmm... does the test stop on 2000-01-07 but generate the 2000-01-12 history file? Does the test produce daily or monthly history files? I may need to know the configuration used for the test to make a better guess.

ekluzek commented 1 year ago

@nmizukami I found the problem. This might be a case that you didn't think about. In CESM, especially for testing, we have cases where you write out restart files before the end of the run. So you output restarts for day 7 but run until day 12, at which point the history file is updated but the restart file is not.

So when I make the following change I get it to work...

diff --git a/route/build/src/write_simoutput_pio.f90 b/route/build/src/write_simoutput_pio.f90
index 0d9d5862..98f61b50 100644
--- a/route/build/src/write_simoutput_pio.f90
+++ b/route/build/src/write_simoutput_pio.f90
@@ -108,8 +108,8 @@ SUBROUTINE main_new_file(ierr, message)
     end if

     ! update history files
-    call io_rpfile('w', ierr, cmessage)
-    if(ierr/=0)then; message=trim(message)//trim(cmessage); return; endif
+    !call io_rpfile('w', ierr, cmessage)
+    !if(ierr/=0)then; message=trim(message)//trim(cmessage); return; endif

  END SUBROUTINE main_new_file

Is it OK to just get rid of that call to io_rpfile? Or is this case important for standalone? If it's important for standalone there could be an if block around it.

Once I have that change above the test works as I expect it to.

nmizukami commented 1 year ago

Thanks, Erik, for finding this. I can see that. It seems there is no need for this call in standalone either. So the rpointer file needs to have 2000-01-07 for both the history file and the restart file?

ekluzek commented 1 year ago

Yes, the rpointer file needs to have them consistent: 2000-01-07 for both history and restart file. In this case at least. Depending on how often the history file is written the history file date could be behind the restart file -- but never in front of it.

I'll just remove those lines then.
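The consistency rule above (the history file date may lag the restart file date but never lead it) can be sketched as a small check. This is only an illustration in Python, not mizuRoute or CIME code: the function names are made up, and the `.h.` history file name in the usage note is an assumed pattern. Only the trailing `YYYY-MM-DD-SSSSS.nc` stamp is taken from the file names in this thread.

```python
import re

# Trailing time stamp as seen in the restart file name in this thread:
# ...mizuroute.r.2000-01-07-00000.nc  (YYYY-MM-DD-SSSSS)
_STAMP = re.compile(r"(\d{4})-(\d{2})-(\d{2})-(\d{5})\.nc$")


def file_stamp(filename):
    """Return (year, month, day, seconds) parsed from the end of a file name.

    Comparing plain tuples avoids any assumption about the model calendar.
    """
    match = _STAMP.search(filename)
    if match is None:
        raise ValueError(f"no time stamp found in {filename!r}")
    return tuple(int(part) for part in match.groups())


def pointer_is_consistent(restart_file, history_files):
    """True if no history file is dated later than the restart file."""
    restart_stamp = file_stamp(restart_file)
    return all(file_stamp(name) <= restart_stamp for name in history_files)
```

With the dates from this issue, a pointer listing a restart file stamped 2000-01-07-00000 next to a history file stamped 2000-01-12-00000 (for example a hypothetical `case.mizuroute.h.2000-01-12-00000.nc`) would be flagged as inconsistent, while 2000-01-07 for both would pass.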

nmizukami commented 1 year ago

OK. After these lines are removed, io_rpfile (for writing) is called only in restart_output, which is called from main_restart (the main restart-writing routine). So the rpointer file is updated only when a restart file is written, and the history file name is picked up at that time.
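To make the before/after behavior concrete, here is a toy Python model of the pointer updates. It is not the Fortran logic: the function name, the day numbering, and the assumptions that new history files open on days 7 and 12 and that a new history file opens before the restart is written on the same day are all chosen only to mimic the ERP case discussed above.

```python
def final_pointer(n_days, restart_day, new_history_days, update_on_new_history):
    """Simulate which days end up recorded in the pointer file.

    update_on_new_history=True mimics the old behavior (pointer rewritten
    whenever a new history file is opened); False mimics the behavior after
    the call is removed (pointer written only with the restart file).
    """
    pointer = {"restart": None, "history": None}
    current_history = None
    for day in range(1, n_days + 1):
        if day in new_history_days:
            current_history = day
            if update_on_new_history:
                pointer["history"] = current_history
        if day == restart_day:
            # Restart write: both entries are recorded together.
            pointer = {"restart": day, "history": current_history}
    return pointer


old = final_pointer(12, 7, {7, 12}, update_on_new_history=True)
new = final_pointer(12, 7, {7, 12}, update_on_new_history=False)
print(old)  # {'restart': 7, 'history': 12}
print(new)  # {'restart': 7, 'history': 7}
```

Under these assumptions the old behavior leaves the pointer naming a day-12 history file next to a day-7 restart file, which is the mismatch the ERP test tripped over, while the new behavior keeps the two entries consistent.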