JessicaMeixner-NOAA opened this issue 6 months ago
@jiandewang I've noticed that in recent 1/4 deg runs we only have:
MOM.res_1.nc MOM.res_2.nc MOM.res_3.nc MOM.res.nc
but before we used to have:
MOM.res_4.nc
as well.
Is this a recent model change?
@JessicaMeixner-NOAA at the very beginning of UFS runs we used the CPC ocean DA, which was run on GAEA. It had 5 restart files. But for UFS runs on HERA and WCOSS2 there will be only 4 restart files. The total number of files depends on the machine and the netCDF library, so there is nothing wrong with your runs.
Thanks for letting me know. It makes things very hard for the workflow if we do not know exactly how many files to expect. I think we're consistently at 4 total right now, so I'll modify the workflow accordingly for now. We can explore other options when needed, or if @aerorahul has better advice for a path forward.
I don't have any advice at the moment, but this makes it very difficult to manage, maintain, and plan for. Is there anything we can do to force the data into N files regardless of machine and netCDF library? We are using the same library version on pretty much every platform with spack-stack, so this shouldn't be that variable.
The highlighted section of code already checks for existence before copying, so it should be able to handle 4 or 5 files as-is. The only issue now is that if one is missing you won't get any notice of it.
Agree that having a variable number of files in general is a huge pain.
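For reference, a minimal sketch of a copy loop that checks for existence and warns about missing files (the ${restart_dir} and ${com_dir} variables and the file list are placeholders, not the actual forecast_postdet.sh names):

```bash
# Hypothetical sketch: copy whichever MOM6 restart files exist and warn about the rest.
mom6_restart_files=("MOM.res.nc" "MOM.res_1.nc" "MOM.res_2.nc" "MOM.res_3.nc" "MOM.res_4.nc")
for restart_file in "${mom6_restart_files[@]}"; do
  if [[ -f "${restart_dir}/${restart_file}" ]]; then
    /bin/cp -p "${restart_dir}/${restart_file}" "${com_dir}/${restart_file}"
  else
    echo "WARNING: expected MOM6 restart '${restart_file}' not found in '${restart_dir}'"
  fi
done
```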
@WalterKolczynski-NOAA actually the forecast job fails if MOM.res_4.nc isn't there: /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_03/C384iaucold02/cold02/COMROOT/cold02/logs/2021063000/gdasfcst.log
+ forecast_postdet.sh[507]: for mom6_restart_file in "${mom6_restart_files[@]}"
+ forecast_postdet.sh[508]: restart_file=20210630.060000.MOM.res_4.nc
+ forecast_postdet.sh[509]: /bin/cp -p /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_03/C384iaucold02/TMP/RUNDIRS/cold02/gdasfcst.2021063000/restart/MOM6_RESTART/20210630.060000.MOM.res_4.nc /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_03/C384iaucold02/cold02/COMROOT/cold02/gdas.20210630/00//model_data/ocean/restart/20210630.060000.MOM.res_4.nc
/bin/cp: cannot stat '/scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_03/C384iaucold02/TMP/RUNDIRS/cold02/gdasfcst.2021063000/restart/MOM6_RESTART/20210630.060000.MOM.res_4.nc': No such file or directory
+ forecast_postdet.sh[1]: postamble exglobal_forecast.sh 1714154554 1
+ preamble.sh[70]: set +x
End exglobal_forecast.sh at 18:20:24 with error code 1 (time elapsed: 00:17:50)
Okay, looks like that bit is in MOM6_out(), which does not check if files exist like the IC staging in MOM6_postdet() does. So it seems the forecast is completing successfully and then only dying when it goes to copy the output.
Yes, that's my interpretation too.
So far I'm able to get past the first half cycle, so making progress. I'll keep others up to date next week. We'll likely want to wait for open PRs to be merged before opening one up for this anyway.
My working branch is here: https://github.com/JessicaMeixner-NOAA/global-workflow/tree/feature/c384wcda
It has updates for CICE & MOM6 for copying in restart files. It also changed the number of restarts to 3. Tests are running and there should be a PR soon after the test runs longer and additional checks are made.
gfsfcst is failing due to running with IAU and the copying of restarts at the end of the run.
@aerorahul is there a way to easily turn off the copying of check-point restarts for the gfsfcst so that we can continue to make runs for evaluation of the science while #1776 is still being addressed? When I ran with updates to enable this in early March, I don't remember this being an issue.
I would suggest we add a logical COPY_CHECKPOINT_RESTARTS = YES|NO in config.fcst and set it to NO for RUN=gfs. In forecast_postdet.sh, wrap the copying of the checkpoint restarts in the conditional:
if [[ "${COPY_CHECKPOINT_RESTARTS:-}" == "YES" ]]; then
# copy logic here
fi
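A minimal sketch of the config.fcst side of this, using the variable name proposed above (COPY_CHECKPOINT_RESTARTS is not an existing variable):

```bash
# Hypothetical addition to config.fcst: copy checkpoint restarts by default,
# but disable it for the gfs forecast while the IAU checkpoint issue is open.
export COPY_CHECKPOINT_RESTARTS="YES"
if [[ "${RUN}" == "gfs" ]]; then
  export COPY_CHECKPOINT_RESTARTS="NO"
fi
```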
I wonder if another option would be to set restart_interval_gfs=$FHMAX_GFS (and maybe make it configurable with the yaml inputs)? This way we would also save the run time of writing these restarts. That being said, I don't know if setting it to FHMAX_GFS would give correct restart output either; I would have to test that. If that doesn't work, I can implement the COPY_CHECKPOINT_RESTARTS option as you suggest.
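For illustration, the idea would be something like the following (treating restart_interval_gfs as a shell config variable; where it actually lives and the FHMAX_GFS default are assumptions):

```bash
# Hypothetical sketch: write restarts only at the end of the gfs forecast,
# so no intermediate checkpoint restarts are produced or copied.
export FHMAX_GFS=${FHMAX_GFS:-120}
export restart_interval_gfs=${FHMAX_GFS}
```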
If restart_interval_gfs = FHMAX is set, there won't be restarts in the middle, and any failure would require starting from the beginning. If that's acceptable, it's a solution, though I wouldn't advocate for it: if a failure happens towards the end, one would end up re-running a long forecast at high resolution and using up a lot more resources. If the problem was with the code, the failure will be reproduced, but if the failure was due to machine instability, using a new allocation will hopefully get it through relatively cheaply.
At this time, I'm not sure we can write consistent restart outputs, so restarting when things fail due to machine issues is not even an option. This is intended to be a temporary stop-gap measure so that work can progress while the check-point restart capability for DOIAU=YES is being worked on.
I think one can write consistent restart outputs when DOIAU=NO. This was tested when the ability to restart was added to the coupled model in PR #2510. The staggered restarts are a known issue when IAU is ON, and work is being done in the model to make that consistent.
In this scenario, I'm running the gfs forecasts with DOIAU=YES.
Yes, I understand that. However, the restart checkpoints are a requirement. Adding COPY_CHECKPOINT_RESTARTS allows us to override the current limitation in the model that prevents running gfs forecasts with DOIAU=YES, while retaining a restart-on-failure capability.
@aerorahul I just mean as a temporary solution instead of adding the COPY_CHECKPOINT_RESTARTS option. It may or may not work. If it doesn't work as a temporary workaround, I'll add the COPY_CHECKPOINT_RESTARTS option so we can make progress while we wait for the model updates for IAU.
So I believe I misrepresented the error above. Looking to add a "COPY_CHECKPOINT_RESTARTS" option, I actually could not find a place where the check-point restarts are written. Based on how MOM6 is currently writing restarts, I am currently trying to run with restart_interval_gfs=FHMAX+3
@aerorahul is the need then to create a flag COPY_FINAL_RESTARTS that either copies the last restart or skips it?
Having restart_interval_gfs=FHMAX+3 caused issues in the atm model's copying of files.
My plan now will be to add a COPY_FINAL_RESTARTS yes/no switch.
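A minimal sketch of how such a switch might wrap the end-of-run restart copy (COPY_FINAL_RESTARTS is the proposed name; ${restart_dir} and ${com_dir} are stand-ins for the actual run-directory and COM paths):

```bash
# Hypothetical sketch: only copy the final restart files to COM when the switch is on.
if [[ "${COPY_FINAL_RESTARTS:-YES}" == "YES" ]]; then
  /bin/cp -p "${restart_dir}/"*.MOM.res*.nc "${com_dir}/"
else
  echo "COPY_FINAL_RESTARTS=NO; skipping copy of final restart files"
fi
```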
What new functionality do you need?
We are working to set-up some C384mx025 S2S cycled experiments. We plan to cold-start the first half cycle, which will not have IAU, and then use IAU after that.
Right now, if DOIAU is no, when running the cold start half cycle for C384, the forecast fails in the final stages because there are only 3 MOM.res_${num}.nc restart files and the scripts assume 4: https://github.com/NOAA-EMC/global-workflow/blob/develop/ush/forecast_postdet.sh#L382-L384
Moreover, the first half cycle only copies over 1 restart file for MOM6, so you do not have the restarts needed to start with IAU for the next cycle. Changes similar to those made here are needed: https://github.com/JessicaMeixner-NOAA/global-workflow/commit/94e7fc7033c88f6b51b45f4a02fe4fb3d69bc87a except they will need to be refactored to match recent updates in the develop branch.
What are the requirements for the new functionality?
Be able to run C384mx025 S2S cycled experiment with first half cycle being a cold start without IAU and then continue running with IAU.
Acceptance Criteria
Be able to run C384mx025 S2S cycled experiment with first half cycle being a cold start without IAU and then continue running with IAU.
Suggest a solution (optional)
Need to figure out why there are only 3 MOM6 restarts for the run I'm currently working on and not 4, or add flexibility in the scripts to grab any of the restarts if we do not know a priori how many there will be, and copy the additional restarts to COM in the first half cycle in the same way the atm model does (see the sketch below). There have been lots of updates since I last successfully did this while waiting for features to be merged, so there might be other issues as well.
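As an illustration of the "grab whatever restarts exist" idea, a sketch that copies however many MOM6 restart files were produced instead of assuming a fixed count (variable names are placeholders, not the actual forecast_postdet.sh names):

```bash
# Hypothetical sketch: copy every MOM6 restart file that was actually written,
# and fail loudly only if none were found at all.
shopt -s nullglob
mom6_restarts=( "${restart_dir}"/*.MOM.res*.nc )
if (( ${#mom6_restarts[@]} == 0 )); then
  echo "FATAL ERROR: no MOM6 restart files found in '${restart_dir}'"
  exit 1
fi
for restart_file in "${mom6_restarts[@]}"; do
  /bin/cp -p "${restart_file}" "${com_dir}/$(basename "${restart_file}")"
done
```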