Closed ekluzek closed 1 year ago
Hi Erik, was this test done after merging PR #401, which is supposed to fix this? If so, is there still a problem finding the restart file?
This is with that merge. That solved #388 which allows ERI tests to work. Here I don't think the problem is finding the restart file, but it seems to have trouble when it tries to open the restart file. This ERP test runs with 1 MPI task and 25 threads, and then does the restart with half as many threads. A restart test where the number of threads doesn't change works fine, and starting up with either number of threads seems to be fine.
OK, I will need to take a close look at this one, as well as the amazon grid one.
OK, it seems to have trouble only when running with MPI for a single task. When I run with mpi-serial I thought it was working for both intel and gnu compilers -- but I was wrong and it fails for both MPI and non-MPI.
ERP_D_Mmpi-serial_P1x25.5x5_amazon.I2000Clm50Sp.cheyenne_intel.mizuroute-default
It might be useful to have a multi-task MPI test to make sure you can do restarts with a differing number of MPI tasks even though we know it will change answers because of #256.
My previous comment was actually incorrect, and it does fail for both MPI and non-MPI. I'll update the comment above...
The traceback for ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default looks like this:
Abort with message No such file or directory in file /glade/scratch/vanderwb/hpci-stack/220919-1520/4744/pio-2.5.9/src/clib/pioc_support.c at line 2832
Obtained 10 stack frames.
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0x111f749]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0x111dd7f]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0x111dd02]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0x111e626]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe(PIOc_openfile+0x16) [0x11197e6]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0x10d5a4f]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0xec0679]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0xeb0ba0]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0xe850d0]
/glade/scratch/erik/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/bld/case2bld/cesm.exe() [0xddb6cc]
forrtl: error (76): Abort trap signal
Image PC Routine Line Source
cesm.exe 000000000130AC5B Unknown Unknown Unknown
libpthread-2.22.s 00002B6AF2016C00 Unknown Unknown Unknown
libc-2.22.so 00002B6AF35512A7 gsignal Unknown Unknown
libc-2.22.so 00002B6AF355267A abort Unknown Unknown
cesm.exe 000000000111DD84 Unknown Unknown Unknown
cesm.exe 000000000111DD02 Unknown Unknown Unknown
cesm.exe 000000000111E626 Unknown Unknown Unknown
cesm.exe 00000000011197E6 Unknown Unknown Unknown
cesm.exe 00000000010D5A4F Unknown Unknown Unknown
cesm.exe 0000000000EC0679 pio_utils_mp_open 333 pio_utils.f90
cesm.exe 0000000000EB0BA0 historyfile_mp_op 301 historyFile.f90
cesm.exe 0000000000E850D0 write_simoutput_p 521 write_simoutput_pio.f90
cesm.exe 0000000000DDB6CC rtmmod_mp_route_i 257 RtmMod.F90
cesm.exe 0000000000DC7C6B rof_comp_nuopc_mp 531 rof_comp_nuopc.F90
libesmf.so 00002B6AEEDD6A40 _ZN5ESMCI6FTable1 Unknown Unknown
libesmf.so 00002B6AEEDDA9EB ESMCI_FTableCallE Unknown Unknown
libesmf.so 00002B6AEF424B3A _ZN5ESMCI3VMK5ent Unknown Unknown
libesmf.so 00002B6AEF441105 _ZN5ESMCI2VM5ente Unknown Unknown
libesmf.so 00002B6AEEDD80FA c_esmc_ftablecall Unknown Unknown
rof.log ends on...
(OPNFIL): Successfully opened file ./rpointer.rof on unit= 99
Reading restart data.....
(GETFIL): attempting to find local file ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist.mizuroute.r.2000-01-07-00000.nc
(GETFIL): using ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist.mizuroute.r.2000-01-07-00000.nc in current working directory
Line 333 of pio_utils is the open of the restart file:
ierr = pio_openfile(pioIoSystem, pioFileDesc, iotype, trim(fname), mode)
if(ierr/=pio_noerr)then; message=trim(message)//'Could not open netCDF'; return; endif
Note that the 25- and 12-thread ERS tests run as expected, so the thread count by itself is not a problem for reading the restart files.
They fail the comparison because of #390, but do work...
ERS_D_Mmpi-serial_P1x12.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default COMPARE_base_rest
ERS_D_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default COMPARE_base_rest
When I compare the restart namelists for the ERP test against the ERS 12-thread test, they look correct to me; the only difference is the casename.
diff -wbcr CaseDocs/ /glade/work/erik/ctsm_worktrees/mizuRoute/cime/scripts/cases/ERS_D_Mmpi-serial_P1x12.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/CaseDocs/ | less
(ctsm_pylib) case2/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist> pwd
/glade/work/erik/ctsm_worktrees/mizuRoute/cime/scripts/cases/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist/case2/ERP_Mmpi-serial_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default.GC.ctsm51d114mizuchlist
301 historyFile.f90
521 write_simoutput_pio.f90
This is trying to open the history file, not the restart file. That seems to be incorrect. But the message said the file is not there? So is the rpointer wrong?
Ahhh, you are right the problem is that the history file isn't there. The restart file is, but it's not copying over the history file like it should.
OK, to get this to work, a string variable needs to be added to the restart file that contains the name(s) of the history file(s) that need to be read in. The name of that variable needs to be added to the config_archive.xml file as well. The history file name is the same as the name of the history file that gets put into the mizuroute.rpointer file. For clm this variable is called locfnh and it's added to the archive as...
<rest_history_varname>locfnh</rest_history_varname>
I propose something longer and more descriptive like "restart_history_filenames".
Actually, when gauge data is output, the filenames should include both the history file and the gauge file (so hfileout and hfileout_gage).
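To make the idea above concrete, here is a minimal Python sketch of what the copy step has to do once the names are recorded in the restart file. It is only an illustration: `history_files_to_copy` is a hypothetical helper, the recorded names are passed in as a plain list (how they are actually stored in the netCDF variable is not shown), and this is not how CIME implements its archiving.

```python
from pathlib import Path


def history_files_to_copy(recorded_names, run_dir):
    """Split the names recorded in a restart file into (present, missing) paths.

    recorded_names: hypothetical list of strings, e.g. the history file name
    and, when gauge output is on, the gauge file name.
    """
    present, missing = [], []
    for name in recorded_names:
        name = name.strip()  # fixed-width strings are often blank padded
        if not name:
            continue  # skip empty slots, e.g. no gauge file in this run
        path = Path(run_dir) / name
        (present if path.is_file() else missing).append(path)
    return present, missing
```

Anything that ends up in `missing` would reproduce the "No such file or directory" abort seen in the traceback when the restart run tries to open it.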
I've got this working in #391 (writing the filenames to the restart file and having them copied over), however the ERP tests still fail because it's trying to read in the history file with a date of 2000-01-12-00000.nc rather than 2000-01-07-00000.nc. The Jan/12th date is on the restart pointer file as well for the history file. This means something is going wrong with the logic for reading in the history file at restart.
Hmm... does the test stop on 2000-01-07 but generate the 2000-01-12 history file? Does the test produce daily or monthly history files? I may need to know the configuration used for the test to make a better guess.
@nmizukami I found the problem. This might be a case that you didn't think about. For CESM, especially for testing, we have cases where you write out restart files before the end of the run. So you output restarts for day 7, but run until day 12, at which point the history file is updated. But the restart file hasn't been updated.
So when I make the following change I get it to work...
diff --git a/route/build/src/write_simoutput_pio.f90 b/route/build/src/write_simoutput_pio.f90
index 0d9d5862..98f61b50 100644
--- a/route/build/src/write_simoutput_pio.f90
+++ b/route/build/src/write_simoutput_pio.f90
@@ -108,8 +108,8 @@ SUBROUTINE main_new_file(ierr, message)
end if
! update history files
- call io_rpfile('w', ierr, cmessage)
- if(ierr/=0)then; message=trim(message)//trim(cmessage); return; endif
+ !call io_rpfile('w', ierr, cmessage)
+ !if(ierr/=0)then; message=trim(message)//trim(cmessage); return; endif
END SUBROUTINE main_new_file
Is it OK to just get rid of that call to io_rpfile? Or is this case important for standalone? If it's important for standalone there could be an if block around it.
Once I have that change above the test works as I expect it to.
Thanks Erik for finding this. I can see that. It seems that there is no need for this for standalone either. So the rpointer file needs to have 2000-01-07 for both the history file and the restart file?
Yes, the rpointer file needs to have them consistent: 2000-01-07 for both history and restart file. In this case at least. Depending on how often the history file is written the history file date could be behind the restart file -- but never in front of it.
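The rule above (history date may lag the restart date but never lead it) can be checked mechanically. This is a rough Python sketch that assumes both file names end in a `YYYY-MM-DD-SSSSS.nc` stamp like the ones quoted in this thread; `file_stamp` and `pointer_is_consistent` are hypothetical names, not part of mizuRoute or CIME.

```python
import re

# Trailing stamp of the form YYYY-MM-DD-SSSSS.nc (assumed naming convention).
_STAMP = re.compile(r"(\d{4})-(\d{2})-(\d{2})-(\d{5})\.nc$")


def file_stamp(filename):
    """Return (year, month, day, seconds) parsed from the end of a file name."""
    match = _STAMP.search(filename)
    if match is None:
        raise ValueError(f"no timestamp found in {filename!r}")
    return tuple(int(group) for group in match.groups())


def pointer_is_consistent(restart_file, history_file):
    """True when the history file is not dated after the restart file."""
    return file_stamp(history_file) <= file_stamp(restart_file)
```

With the dates from the failing ERP test, a restart stamped 2000-01-07-00000 paired with a history file stamped 2000-01-12-00000 is rejected, while 2000-01-07-00000 for both passes.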
I'll just remove those lines then.
OK. After these lines are removed, io_rpfile (for writing) is called only in restart_output, which is called in main_restart (the main restart-writing routine). So the rpointer file is updated only when a restart file is written, and the history file name is picked up at the time the restart file is written.
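For anyone reading this later, the before/after behavior can be mimicked with a toy Python model. It is a sketch under simplifying assumptions, not the mizuRoute logic: the event list, its ordering, and the `final_pointer` function are hypothetical, and the pointer is reduced to a dict with one restart stamp and one history stamp.

```python
def final_pointer(events, rewrite_on_new_history):
    """Replay (kind, stamp) events and return the toy pointer contents.

    kind is "history" (a new history file is started) or "restart"
    (a restart file is written). rewrite_on_new_history=True mimics the
    removed io_rpfile('w') call in main_new_file.
    """
    history = None
    pointer = {}
    for kind, stamp in events:
        if kind == "history":
            history = stamp
            if rewrite_on_new_history and pointer:
                # Old behavior: the pointer picks up the newer history file
                # while still naming the older restart file.
                pointer["history"] = stamp
        elif kind == "restart":
            # Pointer is written together with the restart file.
            pointer = {"restart": stamp, "history": history}
    return pointer


# Restart written on day 7, run continues to day 12 (as in the ERP test).
events = [("history", "2000-01-07"), ("restart", "2000-01-07"), ("history", "2000-01-12")]
old = final_pointer(events, rewrite_on_new_history=True)
new = final_pointer(events, rewrite_on_new_history=False)
```

In this toy model `old` ends with the history entry at 2000-01-12 and the restart at 2000-01-07, which is the mismatch that broke the ERP test, while `new` keeps both at 2000-01-07.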
Exact restart tests with a change in threadcount are failing to run. It fails on opening the restart file on
ERP_D_P1x25.5x5_amazon_r05.I2000Clm50SpMizGs.cheyenne_intel.mizuroute-default
Other compilers fail as well. ERS tests work. And ERP tests without mizuRoute work as well.