E3SM-Project / scream

Fork of E3SM used to develop exascale global atmosphere model written in C++
https://e3sm-project.github.io/scream/

Problems with restarting a cime screamv1 case #1544

Closed ndkeen closed 2 years ago

ndkeen commented 2 years ago

Using master as of April 13th. Here is a script that creates a case which runs 4 steps and writes a restart:

#! /bin/csh
set compset=F2000SCREAMv1
set mach=cori-knl
set res=ne4_ne4
set compiler=intel
set account=e3sm
set q=debug
set wt="0:19:00"

set wdir="$CSCRATCH/e3sm_scratch/$mach"
set cn="$wdir/f4.$compset.$res.$compiler.DEBUG.4s.restarttest"

create_newcase --case $cn --res $res --mach $mach --compiler $compiler --compset $compset --project $account --walltime=$wt --queue=$q

cd $cn

./xmlchange --file env_run.xml STOP_OPTION="nsteps"
./xmlchange --file env_run.xml STOP_N="4"
./xmlchange --file env_run.xml REST_OPTION="nsteps"
./xmlchange --file env_run.xml REST_N="4"

./xmlchange --file env_build.xml DEBUG="TRUE"

case.setup
case.build
ls -l bld/*.exe*

set sbase=/global/cfs/cdirs/e3sm/ndk/ne4restarttest
set sin=si.ne4.noout.wr.yaml

ls -l run/data/scream_input.yaml
ls -l $sbase/$sin
cp run/data/scream_input.yaml run/data/scream_input.yaml-original
cp $sbase/$sin run/data/scream_input.yaml

ls -l $sbase/*dat8
cp $sbase/*dat8 run/data

case.submit -a="-t $wt --qos=$q "

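This appears to have worked: rpointer.atm names the model restart file, and the restart files are present in the run directory.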

cori05% cat run/rpointer.atm 
model_restart.INSTANT.Steps_x4.0001-01-01.002000.r.nc

cori05% lr run/*.r.*
-rw-rw-r-- 1 ndk ndk   244528 Apr 13 16:36 run/f4.F2000SCREAMv1.ne4_ne4.intel.DEBUG.4s.restarttest.cice.r.0001-01-01-01200.nc
-rw-rw-r-- 1 ndk ndk 23197148 Apr 13 16:36 run/f4.F2000SCREAMv1.ne4_ne4.intel.DEBUG.4s.restarttest.elm.r.0001-01-01-01200.nc
-rw-rw-r-- 1 ndk ndk 15188908 Apr 13 16:36 run/model_restart.INSTANT.Steps_x4.0001-01-01.002000.r.nc
-rw-rw-r-- 1 ndk ndk  1717292 Apr 13 16:36 run/f4.F2000SCREAMv1.ne4_ne4.intel.DEBUG.4s.restarttest.cpl.r.0001-01-01-01200.nc
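
The manual check above (comparing `rpointer.atm` against the files in `run/`) could be scripted. The following is a minimal sketch in POSIX sh rather than csh; the function name `check_rpointer` is hypothetical, and it assumes each non-blank line of the rpointer file holds exactly one restart file name relative to the run directory.

```sh
#!/bin/sh
# Hypothetical helper: check that every file listed in an rpointer file
# exists in the run directory.
# Usage: check_rpointer RUN_DIR [RPOINTER_NAME]
check_rpointer() {
    run_dir=$1
    rpointer=${2:-rpointer.atm}
    status=0

    # An absent or empty rpointer file is treated as a failure.
    if [ ! -s "$run_dir/$rpointer" ]; then
        echo "missing or empty: $run_dir/$rpointer"
        return 1
    fi

    # Assumption: one restart file name per non-blank line.
    while IFS= read -r name || [ -n "$name" ]; do
        [ -z "$name" ] && continue
        if [ -f "$run_dir/$name" ]; then
            echo "ok: $name"
        else
            echo "not found: $name"
            status=1
        fi
    done < "$run_dir/$rpointer"

    return $status
}
```

Calling `check_rpointer run` from the case directory would report each listed file and return non-zero if any of them is absent.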

And then to try reading from the restart (and running 4 more steps to write another restart), I did this:

   cd case_dir
   cp /global/cfs/cdirs/e3sm/ndk/ne4restarttest/si.ne4.noout.rwr.yaml run/data/scream_input.yaml
   xmlchange CONTINUE_RUN=TRUE
   case.submit

Which fails with:

26:   what():  /global/cscratch1/sd/ndk/wacmy/s49-apr12/components/scream/src/share/io/scream_output_manager.cpp:355: FAIL:
26: found
26: Error! Output restart requested, but no history restart file found in 'rpointer.atm'.
26:    restart file name root: model_restart
26:    rpointer content:
26:

case: /global/cscratch1/sd/ndk/e3sm_scratch/cori-knl/f4.F2000SCREAMv1.ne4_ne4.intel.DEBUG.4s.restarttest

PeterCaldwell commented 2 years ago

"History restart files" are restart files for our output. We don't need them in this case because all of our output is instantaneous, rather than averaged over some interval that was only half completed when we wrote our output. So the error check that fired here doesn't seem to make sense. The error says "Output restart requested", so maybe we can just not request output restart (at least as a temporary fix)?
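
To illustrate the distinction drawn above, here is a toy sketch in POSIX sh (not SCREAM code). The function name `average_with_restart` and its arguments are hypothetical. It averages a list of integer values over an output interval while a restart happens partway through; carrying the partial sum and count across the restart stands in for what a history restart file would provide.

```sh
#!/bin/sh
# Toy illustration only; all names here are hypothetical.
# average_with_restart "VALUES" RESTART_AFTER KEEP_STATE
#   VALUES        space-separated integers, one per model step
#   RESTART_AFTER step number after which the run is restarted
#   KEEP_STATE    "yes" carries the partial sum/count across the restart
average_with_restart() {
    values=$1
    restart_after=$2
    keep_state=$3
    sum=0
    count=0
    step=0

    for v in $values; do
        step=$((step + 1))
        sum=$((sum + v))
        count=$((count + 1))
        if [ "$step" -eq "$restart_after" ] && [ "$keep_state" != yes ]; then
            # Restart without saved output state: the accumulation is lost.
            sum=0
            count=0
        fi
    done

    # Nothing was accumulated after the restart, so there is no average.
    [ "$count" -gt 0 ] || return 1
    echo $((sum / count))
}
```

With the values `10 20 30 40` and a restart after step 2, `average_with_restart "10 20 30 40" 2 yes` prints 25, while the same call with `no` prints 35. An instantaneous output at step 4 would be 40 in both cases, which is why it needs no saved output state.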

bartgol commented 2 years ago

As I told @ndkeen privately, the error message is misleading. The AD prints this error when the correct model restart file is not found in the rpointer file. Interestingly, the rpointer file generated by scream does contain the correct model restart file name.

I will have to debug this.
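
As a way to test the lookup outside the model, the sketch below searches an rpointer file for an entry matching a given file name root and suffix. It is written in POSIX sh and is not the logic in `scream_output_manager.cpp`; the function name `find_restart_entry` is hypothetical, and the `.r.nc` suffix is taken from the file name shown in `rpointer.atm` above.

```sh
#!/bin/sh
# Hypothetical helper, not the SCREAM implementation.
# find_restart_entry ROOT SUFFIX RPOINTER_FILE
# Prints the first line that starts with ROOT and ends with SUFFIX,
# and returns non-zero if no such line exists.
find_restart_entry() {
    root=$1
    suffix=$2
    rpointer=$3

    [ -f "$rpointer" ] || return 1

    while IFS= read -r line || [ -n "$line" ]; do
        case $line in
            "$root"*"$suffix")
                printf '%s\n' "$line"
                return 0
                ;;
        esac
    done < "$rpointer"

    return 1
}
```

For example, `find_restart_entry model_restart .r.nc run/rpointer.atm` would print the matching file name if the entry is present. An empty rpointer file, like the empty "rpointer content" shown in the error, yields a non-zero return.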

bartgol commented 2 years ago

I can reproduce this on mappy. Working on it.