NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Be able to run S2S cold start w/out IAU for first half-cycle, warm start+IAU afterwards #2546

Open JessicaMeixner-NOAA opened 2 months ago

JessicaMeixner-NOAA commented 2 months ago

What new functionality do you need?

We are working to set up some C384mx025 S2S cycled experiments. We plan to cold-start the first half cycle, which will not have IAU, and then use IAU after that.

Right now, if DOIAU is NO, the forecast for the C384 cold-start half cycle fails in its final stages because there are only 3 MOM.res_${num}.nc restart files while the scripts assume 4: https://github.com/NOAA-EMC/global-workflow/blob/develop/ush/forecast_postdet.sh#L382-L384

Moreover, the first half cycle only copies over 1 restart file for MOM6, so you do not have the restarts needed to start the next cycle with IAU. Changes similar to those made here will be needed: https://github.com/JessicaMeixner-NOAA/global-workflow/commit/94e7fc7033c88f6b51b45f4a02fe4fb3d69bc87a, though they will have to be refactored to match recent updates in the develop branch.

What are the requirements for the new functionality?

Be able to run C384mx025 S2S cycled experiment with first half cycle being a cold start without IAU and then continue running with IAU.

Acceptance Criteria

Be able to run C384mx025 S2S cycled experiment with first half cycle being a cold start without IAU and then continue running with IAU.

Suggest a solution (optional)

Need to figure out why the run I'm currently working on produces only 3 MOM6 restarts instead of 4, or add flexibility in the scripts to grab whichever restarts exist when we do not know the count a priori, and copy the additional restarts to COM in the first half cycle the same way the atm model does (see the sketch below). There have been many updates since I last successfully did this while waiting for features to be merged, so there might be other issues as well.
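
A minimal sketch of the glob-based approach, assuming the restart files follow the <timestamp>.MOM.res*.nc naming shown later in this thread; restart_date, DATA, and COMOUT_OCEAN_RESTART are placeholders, not necessarily the variables forecast_postdet.sh actually uses:

# Sketch: copy whichever MOM6 restart files exist instead of assuming a fixed count.
shopt -s nullglob   # an unmatched glob expands to nothing rather than the literal pattern
for restart_file in "${DATA}/MOM6_RESTART/${restart_date}".MOM.res*.nc; do
    echo "Copying $(basename "${restart_file}") to COM"
    cp -p "${restart_file}" "${COMOUT_OCEAN_RESTART}/"
done
shopt -u nullglob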

JessicaMeixner-NOAA commented 2 months ago

@jiandewang I've noticed that in recent 1/4 deg runs we only have MOM.res_1.nc, MOM.res_2.nc, MOM.res_3.nc, and MOM.res.nc, but we used to have MOM.res_4.nc as well.

Is this a recent model change?

jiandewang commented 2 months ago

@JessicaMeixner-NOAA at the very beginning of UFS runs we used CPC ocean DA, which ran on GAEA and produced 5 restart files. For UFS runs on HERA and WCOSS2 there will be only 4 restart files. The total number of files depends on the machine and the netCDF library, so there is nothing wrong with your runs.

JessicaMeixner-NOAA commented 2 months ago

Thanks for letting me know. It makes things very hard for the workflow if we do not know exactly how many restart files to expect. I think we're consistently at 4 total right now, so I'll modify the workflow accordingly for now. We can explore other options when needed or if @aerorahul has better advice for a path forward.

aerorahul commented 2 months ago

I don't have any advice at the moment, but this makes things very difficult to manage, maintain, and plan for. Is there anything we can do to force the data into N files regardless of machine and netCDF library? We are using the same library version on pretty much every platform with spack-stack, so this shouldn't be that variable.

WalterKolczynski-NOAA commented 2 months ago

The highlighted section of code already checks for existence before copying, so it should be able to handle 4 or 5 files as-is. The only issue is that if one is missing, you won't get any notice of it.

Agree that having a variable number of files in general is a huge pain.

JessicaMeixner-NOAA commented 2 months ago

@WalterKolczynski-NOAA actually the forecast job fails if MOM.res_4.nc wasn't there: /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_03/C384iaucold02/cold02/COMROOT/cold02/logs/2021063000/gdasfcst.log

+ forecast_postdet.sh[507]: for mom6_restart_file in "${mom6_restart_files[@]}"
+ forecast_postdet.sh[508]: restart_file=20210630.060000.MOM.res_4.nc
+ forecast_postdet.sh[509]: /bin/cp -p /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_03/C384iaucold02/TMP/RUNDIRS/cold02/gdasfcst.2021063000/restart/MOM6_RESTART/20210630.060000.MOM.res_4.nc /scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_03/C384iaucold02/cold02/COMROOT/cold02/gdas.20210630/00//model_data/ocean/restart/20210630.060000.MOM.res_4.nc
/bin/cp: cannot stat '/scratch1/NCEPDEV/climate/Jessica.Meixner/cycling/iau_03/C384iaucold02/TMP/RUNDIRS/cold02/gdasfcst.2021063000/restart/MOM6_RESTART/20210630.060000.MOM.res_4.nc': No such file or directory
+ forecast_postdet.sh[1]: postamble exglobal_forecast.sh 1714154554 1
+ preamble.sh[70]: set +x
End exglobal_forecast.sh at 18:20:24 with error code 1 (time elapsed: 00:17:50)

WalterKolczynski-NOAA commented 2 months ago

Okay, looks like that bit is in MOM6_out(), which does not check if files exist like the IC staging in MOM6_postdet() does. So it seems the forecast is completing successfully and then only dying when it goes to copy the output.

JessicaMeixner-NOAA commented 2 months ago

Okay, looks like that bit is in MOM6_out(), which does not check if files exist like the IC staging in MOM6_postdet() does. So it seems the forecast is completing successfully and then only dying when it goes to copy the output.

Yes, that's my interpretation too.
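
For reference, a hedged sketch of the kind of existence guard being discussed for the copy loop in MOM6_out(); the loop and file names follow the log excerpt above, while restart_date, DATA, and COMOUT_OCEAN_RESTART are illustrative placeholders rather than the exact variables in forecast_postdet.sh:

for mom6_restart_file in "${mom6_restart_files[@]}"; do
    restart_file="${restart_date}.${mom6_restart_file}"
    if [[ -f "${DATA}/MOM6_RESTART/${restart_file}" ]]; then
        cp -p "${DATA}/MOM6_RESTART/${restart_file}" "${COMOUT_OCEAN_RESTART}/${restart_file}"
    else
        # Warn instead of aborting so a 3-file vs. 4-file layout does not fail the job,
        # while still leaving a trace in the log that a file was absent.
        echo "WARNING: ${restart_file} not found in MOM6_RESTART; skipping copy"
    fi
done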

JessicaMeixner-NOAA commented 2 months ago

So far I'm able to get past the first half cycle, so I'm making progress. I'll keep others up to date next week. We'll likely want to wait for open PRs to be merged before opening one for this anyway.

JessicaMeixner-NOAA commented 2 months ago

My working branch is here: https://github.com/JessicaMeixner-NOAA/global-workflow/tree/feature/c384wcda

It has updates for CICE & MOM6 for copying in restart files, and it also changes the number of restarts to 3. Tests are running, and there should be a PR soon once the tests have run longer and additional checks have been made.

JessicaMeixner-NOAA commented 2 months ago

gfsfcst is failing due to running with IAU and the copying of restarts at the end of the run.

@aerorahul is there an easy way to turn off the copying of checkpoint restarts for the gfsfcst so that we can continue to make runs for scientific evaluation while #1776 is still being addressed? When I ran with updates to enable this in early March, I don't remember this being an issue.

aerorahul commented 2 months ago

gfsfcst is failing due to running with IAU and the copying of restarts at the end of the run.

@aerorahul is there an easy way to turn off the copying of checkpoint restarts for the gfsfcst so that we can continue to make runs for scientific evaluation while #1776 is still being addressed? When I ran with updates to enable this in early March, I don't remember this being an issue.

I would suggest we add a logical COPY_CHECKPOINT_RESTARTS = YES|NO in config.fcst and set it to NO for RUN=gfs.
In forecast_postdet.sh, wrap the copying of the checkpoint restarts in the conditional:

if [[ "${COPY_CHECKPOINT_RESTARTS:-}" == "YES" ]]; then
    # copy logic here
fi
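
On the config side, a sketch of the corresponding default, assuming config.fcst is sourced with RUN already set (as elsewhere in the workflow); the exact placement is an assumption:

# In config.fcst: copy checkpoint restarts by default, but skip them for RUN=gfs
export COPY_CHECKPOINT_RESTARTS="YES"
if [[ "${RUN}" == "gfs" ]]; then
    export COPY_CHECKPOINT_RESTARTS="NO"
fi
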
JessicaMeixner-NOAA commented 2 months ago

I wonder if another option would be to set restart_interval_gfs=$FHMAX_GFS (and maybe make it configurable via the yaml inputs)? This way we would also save the run time spent writing these restarts. That being said, I don't know whether setting it to FHMAX_GFS would give correct restart output either; I would have to test that. If it doesn't work, I can implement the COPY_CHECKPOINT_RESTARTS option as you suggest.

aerorahul commented 2 months ago

If restart_interval_gfs = FHMAX is set, there won't be restarts in the middle, and any failure would have to start over from the beginning. If that's acceptable, it's a solution, though I wouldn't advocate for it: if a failure happens towards the end, one ends up rerunning a long, high-resolution forecast and using a lot more resources. If the problem is with the code, the failure will be reproduced anyway, but if it is due to machine instability, restarting from a checkpoint under a new allocation would hopefully get it through relatively cheaply.

JessicaMeixner-NOAA commented 2 months ago

At this time, I'm not sure we can write consistent restart output, so restarting after a machine failure is not even an option. This is intended as a temporary stop-gap so that work can progress while the checkpoint-restart capability for DOIAU=YES is being worked on.

aerorahul commented 2 months ago

I think one can write consistent restart output when DOIAU=NO. This was tested when the ability to restart was added to the coupled model in PR #2510. The staggered restarts are a known issue when IAU is on, and work is being done in the model to make them consistent.

JessicaMeixner-NOAA commented 2 months ago

In this scenario, I'm running the gfs forecasts with DOIAU=YES.

aerorahul commented 2 months ago

In this scenario, I'm running the gfs forecasts with DOIAU=YES.

Yes, I understand that. However, the restart checkpoints are a requirement. Adding COPY_CHECKPOINT_RESTARTS lets us override the current limitation in the model that is preventing gfs forecasts from running with DOIAU=YES, while retaining a restart-on-failure capability.

JessicaMeixner-NOAA commented 2 months ago

@aerorahul I just mean as a temporary solution instead of adding the COPY_CHECKPOINT_RESTARTS option. It may or may not work. If it doesn't work as a temporary workaround, I'll add the COPY_CHECKPOINT_RESTARTS option so we can make progress while we wait for the model updates for IAU.

JessicaMeixner-NOAA commented 2 months ago

So I believe I misrepresented the error above. While looking to add a "COPY_CHECKPOINT_RESTARTS" option, I actually could not find the place where the checkpoint restarts are written. Based on how MOM6 currently writes restarts, I am now trying to run with restart_interval_gfs=FHMAX+3.

@aerorahul, is the need then to create a flag COPY_FINAL_RESTARTS that either copies the last restart or skips it?

JessicaMeixner-NOAA commented 2 months ago

Having restart_interval_gfs=FHMAX+3 caused issues in the copying of atm model files.

My plan now is to add a COPY_FINAL_RESTARTS yes/no switch.
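
A minimal sketch of how such a switch could look, mirroring the COPY_CHECKPOINT_RESTARTS suggestion above; its placement in forecast_postdet.sh and the surrounding copy logic are assumptions:

# In config.fcst (or config.base): keep the final restarts by default
export COPY_FINAL_RESTARTS="YES"

# In forecast_postdet.sh, around the end-of-run restart copies:
if [[ "${COPY_FINAL_RESTARTS:-YES}" == "YES" ]]; then
    :  # existing copy of the final MOM6/CICE/ATM restart files to COM goes here
else
    echo "COPY_FINAL_RESTARTS=NO; skipping copy of final restart files to COM"
fi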