NOAA-EMC / godas

Low-res coupled UFS #337

Closed guillaumevernieres closed 2 years ago

guillaumevernieres commented 2 years ago

Description

This issue is an attempt at finishing the work that @DeniseWorthen did a few years back (UFS issue #289). Here's what she can give us and what the issue is:

What needs to be done

kestonsmith-noaa commented 2 years ago

OK. Let me know if you have a starting point in mind. Also, is the MOM6-CICE coupling intended to be two-way or just MOM6 -> CICE6? - Keston

guillaumevernieres commented 2 years ago

OK. Let me know if you have a starting point in mind. Also, is the MOM6-CICE coupling intended to be two-way or just MOM6 -> CICE6? - Keston

@kestonsmith-noaa There was an attempt a few months ago, but it was abandoned. We should start from where they left off.

kestonsmith-noaa commented 2 years ago

OK, sounds good. Is there a GitHub repository or link to that work?

guillaumevernieres commented 2 years ago

I think so, but I forgot where. I'll organize a meeting with the person(s) who worked on this issue. I edited the description, but just in case, here's the issue link: https://github.com/ufs-community/ufs-weather-model/issues/289

DeniseWorthen commented 2 years ago

I have a sandbox on orion here: /work/noaa/stmp/dworthen/stmp/dworthen/FV3_RT/cpld_c48/cpld_control_c48

I've added the following to the MOM_input:

! === module MOM ===
VERBOSITY = 9                   ! default = 2
                                ! Integer controlling level of messaging
                                !   0 = Only FATAL messages
                                !   2 = Only FATAL, WARNING, NOTE [default]
                                !   9 = All)
DEBUG = True                    !   [Boolean] default = False
                                ! If true, write out verbose debugging data.

The error file (err) shows the following:

12: h-point: mean=   3.9560357017408208E+33 min=  -1.8301177695709252E+00 max=   9.9692099683868690E+36 Post extract_sfc SST
12: h-point: c=     47514 Post extract_sfc SST
12: h-point: mean=   3.9560357017408208E+33 min=   0.0000000000000000E+00 max=   9.9692099683868690E+36 Post extract_sfc SSS
12: h-point: c=     36038 Post extract_sfc SSS

The model fails with

 2: [Orion-01-42:453600:0:453600] Caught signal 8 (Floating point exception: floating-point overflow)
 2: ==== backtrace (tid: 453600) ====
 2:  0 0x0000000007efa676 nst_module_mp_cool_skin_()  /work/noaa/marine/dworthen/ufs_c48/FV3/ccpp/physics/physics/module_nst_model.f90:863

which I believe is caused by the _FillValue (E+36) appearing in non-masked areas of the ocean.

To use a MOM6 restart, change the bottom of the input.nml (i.e., MOM_input_nml) to use input_filename = 'r' and add the MOM6 restart to the INPUT directory.
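For anyone scanning a long err file for the same symptom, here is a minimal sketch that flags the "h-point" debug lines whose max equals the netCDF default float fill value (9.9692099683868690E+36). The function name and the tolerance are illustrative assumptions, and the sample lines are copied from the output above; this is not part of MOM6 or the UFS tooling.

```python
import re

# netCDF's default fill value for 32-bit floats (NC_FILL_FLOAT). It matches the
# max= value printed in the debug output quoted in this comment.
NC_FILL_FLOAT = 9.9692099683868690e36

# Matches lines of the form "h-point: mean= <x> min= <y> max= <z> <label>".
HPOINT = re.compile(
    r"h-point:\s+mean=\s*(\S+)\s+min=\s*(\S+)\s+max=\s*(\S+)\s+(.*)$"
)

def find_fill_value_fields(lines, tol=1e-6):
    """Return the labels of fields whose max is (nearly) the netCDF fill value."""
    flagged = []
    for line in lines:
        m = HPOINT.search(line)
        if m is None:
            continue  # e.g. the "h-point: c= ..." checksum lines
        vmax = float(m.group(3))
        label = m.group(4).strip()
        if abs(vmax - NC_FILL_FLOAT) <= tol * NC_FILL_FLOAT:
            flagged.append(label)
    return flagged

sample = [
    "12: h-point: mean=   3.9560357017408208E+33 min=  -1.8301177695709252E+00 max=   9.9692099683868690E+36 Post extract_sfc SST",
    "12: h-point: c=     47514 Post extract_sfc SST",
    "12: h-point: mean=   3.9560357017408208E+33 min=   0.0000000000000000E+00 max=   9.9692099683868690E+36 Post extract_sfc SSS",
]
print(find_fill_value_fields(sample))
```

In practice the lines would come from reading the err file in the run directory rather than from a hard-coded list.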

The sandbox can be copied and run from another location using sbatch job_card. Note the executable (fv3.exe) and required module files (modules.fv3) are sym linked to my current build directory. If rebuilding is required, let me know and I can explain how to do it from your own UWM checkout.

hyunchul386 commented 2 years ago

Thanks a lot for the sandbox. I will check it and let you know if rebuilding is required.

hyunchul386 commented 2 years ago

I could reproduce the errors from my sandbox.

guillaumevernieres commented 2 years ago

OK. Let me know if you have a starting point in mind. Also, is the MOM6-CICE coupling intended to be two-way or just MOM6 -> CICE6? - Keston

@kestonsmith-noaa There was an attempt a few months ago, but it was abandoned. We should start from where they left off.

@kestonsmith-noaa : I tend to read 1/2 of the words and extrapolate the intent of a post, bad bad me, sorry! Yes, it's a two-way coupling between all components.

guillaumevernieres commented 2 years ago

I could reproduce the errors from my sandbox.

Very cool! Thanks @DeniseWorthen and @hyunchul386 .

guillaumevernieres commented 2 years ago

I can't believe @hyunchul386 or @DeniseWorthen didn't click the boxes!!!! ... So I had to do it, sorry.

DeniseWorthen commented 2 years ago

@guillaumevernieres That's what managers are for :-)

guillaumevernieres commented 2 years ago

@hyunchul386 , are you planning to work on this today as well?

hyunchul386 commented 2 years ago

@guillaumevernieres Yes, the one-day run for 2021-03-22 is running in my sandbox with a new ocean IC.

guillaumevernieres commented 2 years ago

@hyunchul386 That is totally crazy!!!!

hyunchul386 commented 2 years ago

@guillaumevernieres The one-day run completed. That does not mean the results are correct, only that the run finished. The remaining issues are whether the results are reasonable and whether the memory needs re-tuning. Currently the run uses 20 tasks, and the wall-clock time is more than 80 min for a one-day run.

DeniseWorthen commented 2 years ago

@hyunchul386 The compilation was in debug mode; to test the timing you would need to recompile w/o debug=on. Do you want to try that?

hyunchul386 commented 2 years ago

@DeniseWorthen Thank you. I'll check it.

DeniseWorthen commented 2 years ago

@hyunchul386 Do you want the instructions for re-compiling in debug mode, or do you know how to do that already?

hyunchul386 commented 2 years ago

@DeniseWorthen I think I need to rebuild for this run. Would you let me know how to do the rebuild?

DeniseWorthen commented 2 years ago

To compile the model you can do this:

git clone https://github.com/ufs-community/ufs-weather-model.git ufs-weather-model
cd ufs-weather-model
git submodule update --init --recursive
cd tests

To compile in debug mode:

./compile.sh orion.intel '-DAPP=S2S -DDEBUG=ON -DCCPP_SUITES=FV3_GFS_v17_coupled_p8' '' YES NO 2>&1 | tee compile.log

To compile in non-debug mode:

./compile.sh orion.intel '-DAPP=S2S -DCCPP_SUITES=FV3_GFS_v17_coupled_p8' '' YES NO 2>&1 | tee compile.log

The first "YES" means to build cleanly. You can change this to "NO" if you're making code changes and want to test them. It will then only rebuild what is needed.

The final "NO" controls whether to clean afterwards. Generally leave this as NO so that you don't need to rebuild from scratch each time.
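As a way to keep the two variants straight, here is a small sketch that assembles the same command lines from flags. The helper name and its keyword arguments are made up for illustration; only the resulting arguments (machine, CMake options, empty build name, clean-before, clean-after) come from the commands above.

```python
import shlex

def compile_command(machine="orion.intel", debug=False,
                    suites="FV3_GFS_v17_coupled_p8",
                    clean_before=True, clean_after=False):
    """Build the argument list for ./compile.sh as used in this thread."""
    make_opts = ["-DAPP=S2S"]
    if debug:
        make_opts.append("-DDEBUG=ON")
    make_opts.append(f"-DCCPP_SUITES={suites}")

    def yes_no(flag):
        return "YES" if flag else "NO"

    # The empty string is the (unused) build-name argument from the commands above.
    return ["./compile.sh", machine, " ".join(make_opts), "",
            yes_no(clean_before), yes_no(clean_after)]

# Debug build from scratch, then an incremental non-debug rebuild.
print(shlex.join(compile_command(debug=True)))
print(shlex.join(compile_command(debug=False, clean_before=False)))
```

The printed strings can be pasted into a shell from the tests directory, with `2>&1 | tee compile.log` appended if a log is wanted.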

hyunchul386 commented 2 years ago

@DeniseWorthen Thank you, I'll check it

guillaumevernieres commented 2 years ago

... Doing my part and clicking one more box!

DeniseWorthen commented 2 years ago

@hyunchul386 I forgot the final step. After the build, the tests directory will contain both fv3.exe and modules.fv3. You'll need to remove the sym-links to my build in the sandbox, then either copy those two files from your build into the sandbox or sym-link them.
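The relinking step can be sketched as follows. The function name and the directory layout in the demo are hypothetical; only the two file names (fv3.exe and modules.fv3) come from this thread. The demo runs in a temporary directory so nothing real is touched.

```python
import os
import tempfile

def relink(build_tests_dir, sandbox_dir, names=("fv3.exe", "modules.fv3")):
    """Point the sandbox's links at files from a new build's tests directory."""
    for name in names:
        target = os.path.join(build_tests_dir, name)
        link = os.path.join(sandbox_dir, name)
        if not os.path.exists(target):
            raise FileNotFoundError(f"build output missing: {target}")
        # lexists() is true even for a dangling sym-link, which exists() misses.
        if os.path.lexists(link):
            os.remove(link)
        os.symlink(target, link)

def demo():
    """Exercise relink() in a throwaway directory tree (all paths hypothetical)."""
    names = ("fv3.exe", "modules.fv3")
    with tempfile.TemporaryDirectory() as tmp:
        old_build = os.path.join(tmp, "old_build")
        new_build = os.path.join(tmp, "new_build", "tests")
        sandbox = os.path.join(tmp, "sandbox")
        for d in (old_build, new_build, sandbox):
            os.makedirs(d)
        for name in names:
            for d in (old_build, new_build):
                open(os.path.join(d, name), "w").close()
            # The sandbox starts out linked to someone else's build.
            os.symlink(os.path.join(old_build, name), os.path.join(sandbox, name))
        relink(new_build, sandbox)
        return [os.readlink(os.path.join(sandbox, n)) == os.path.join(new_build, n)
                for n in names]

print(demo())
```

Sym-linking rather than copying means a later incremental rebuild is picked up by the sandbox without another copy step.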

hyunchul386 commented 2 years ago

@DeniseWorthen Got it. Thanks a lot.

DeniseWorthen commented 2 years ago

@hyunchul386 One more item. Turn off the verbosity and debug settings in the MOM_input.

hyunchul386 commented 2 years ago

@DeniseWorthen Okay, thanks. By the way, the results of the debug run look normal at a glance. Just FYI, attached is a quick-and-dirty check of the previous debug-mode run from Ferret: UFS-c48o5d25k_SST

hyunchul386 commented 2 years ago

Just an update on the non-debug run: as expected, the run time is drastically reduced, from 6183 s to 478 s for a one-day run. @DeniseWorthen Thanks a lot.
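For scale, the two timings above work out to roughly a 13x speedup. A quick check of the arithmetic, using only the two numbers reported here (the throughput figure is a derived illustration, not a measured benchmark):

```python
debug_seconds = 6183      # one-day run, debug build
nondebug_seconds = 478    # one-day run, non-debug build

speedup = debug_seconds / nondebug_seconds
# Simulated days per wall-clock day at the non-debug rate.
sim_days_per_wall_day = 86400 / nondebug_seconds

print(f"speedup: {speedup:.1f}x")
print(f"debug run: {debug_seconds / 60:.0f} min, non-debug run: {nondebug_seconds / 60:.1f} min")
print(f"throughput: {sim_days_per_wall_day:.0f} simulated days per wall-clock day")
```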

guillaumevernieres commented 2 years ago

Just an update on the non-debug run: as expected, the run time is drastically reduced, from 6183 s to 478 s for a one-day run. @DeniseWorthen Thanks a lot.

How many nodes are you using @hyunchul386 ?

hyunchul386 commented 2 years ago

@guillaumevernieres The run uses 1 node with 20 tasks.

guillaumevernieres commented 2 years ago

I'm going to ask for a few more features that we will need for the DA @hyunchul386 :

1 - a diag/history collection for MOM6 and CICE on the native grid containing snapshots of the DA variables every hour. Maybe @kestonsmith-noaa or @DeniseWorthen can guide us on the CICE side.

2 - save intermittent restarts for all the components. It's just 1 flag, but it needs to be tested.

3 - split the restart into a MOM6 io_layout of 2,2 to simulate what we need to do for the 1/4 deg model

I'll add radio boxes in the description of course :)

DeniseWorthen commented 2 years ago

I can help w/ the restart writing for the coupled system and the history writing for CICE. What are the needed DA variables for CICE?

hyunchul386 commented 2 years ago

It seems that the Ice DA variables are "hsnon, hicen, cicen".

DeniseWorthen commented 2 years ago

By the "n" do you mean you want these variables by thickness category? Normally we write out the composite values (i.e., added up over all thickness categories). So snow thickness by category, ice thickness by category and ice concentration by category?

guillaumevernieres commented 2 years ago

That's correct @DeniseWorthen , we need the sea-ice variables per category. We do currently aggregate these variables, but this is probably going to change soon-ish.

kestonsmith-noaa commented 2 years ago

Do we want the other CICE state variables as well, i.e. sice001, sice002, ..., sice007, and qice00x?

guillaumevernieres commented 2 years ago

Do we want the other CICE state variables as well, i.e. sice001, sice002, ..., sice007, and qice00x?

Good point @kestonsmith-noaa . At this point, all we need from CICE are the intermittent restarts, then. No point writing a history file which would have 90% of the content of a restart.

DeniseWorthen commented 2 years ago

So what you want is just hourly restart files.

Do you need the full instructions for restarting the coupled model, or do you already have that set up somehow?

hyunchul386 commented 2 years ago

Yes, would you give the instructions for restarting the coupled model?

guillaumevernieres commented 2 years ago

So what you want is just hourly restart files.

Do you need the full instructions for restarting the coupled model, or do you already have that set up somehow?

There's 2 things here:

DeniseWorthen commented 2 years ago

@guillaumevernieres So all the models can write checkpoint restarts; the components are controlled w/ restart_n, but FV3 is controlled with restart_interval in the model_configure. There is this reference here that should help.

I can work on adding a c48/5deg control and restart test to the RTs if that is what you need.

guillaumevernieres commented 2 years ago

I can work on adding a c48/5deg control and restart test to the RTs if that is what you need.

Yes, that is what we need @DeniseWorthen .

DeniseWorthen commented 2 years ago

@hyunchul386 Could you provide me the location of the MOM6 restart you used?

hyunchul386 commented 2 years ago

@DeniseWorthen the location is /work/noaa/stmp/hlee/stmp/hlee/FV3_RT/cpld_c48/cpld_control_c48

DeniseWorthen commented 2 years ago

@guillaumevernieres I'm setting up the control/restart tests now. I think there may be an issue writing the MOM6 restart on the first hour from a "cold start".

In this case, cold start means not having actual FV3 and CICE6 native ICs. For both these components, the model starts up using ICs from other sources. So they're not complete---for example, most IC fields for CICE6 are filled w/ zero. Within the fast loop, FV3 calculates the missing CICE6 fluxes on the first time-step. For MOM6 though, we use a lagged startup. That means that MOM6 doesn't advance on the first coupling timestep. Instead, we advance two times on the second coupling timestep. After that point we advance normally.

Long story short, for the "cold start", MOM6 currently can't write a restart at the first hour, because the coupling timestep is also 1-hour and MOM6 doesn't advance until hour=2. MOM6 is currently failing when I try to write restarts on a one-hour interval.
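The lagged startup described above can be pictured with a toy clock model. This is only an illustration of the bookkeeping (no advance on coupling step 1, two advances on step 2, one per step afterwards); the function names are invented and nothing here reflects the actual MOM6 or CMEPS code.

```python
def ocean_hours_after_each_step(n_steps, coupling_hours=1, lagged=True):
    """Return the ocean model time (hours) at the end of each coupling step."""
    ocean_time = 0
    times = []
    for step in range(1, n_steps + 1):
        if lagged and step == 1:
            advances = 0   # ocean does not advance on the first coupling step
        elif lagged and step == 2:
            advances = 2   # ocean catches up by advancing twice
        else:
            advances = 1   # normal stepping
        ocean_time += advances * coupling_hours
        times.append(ocean_time)
    return times

def restart_hours_ocean_can_write(n_steps, restart_every,
                                  coupling_hours=1, lagged=True):
    """Hours at which a restart is requested AND the ocean clock is in sync."""
    ok = []
    ocean = ocean_hours_after_each_step(n_steps, coupling_hours, lagged)
    for step, ocean_time in enumerate(ocean, start=1):
        coupler_time = step * coupling_hours
        if coupler_time % restart_every == 0 and ocean_time == coupler_time:
            ok.append(coupler_time)
    return ok

print(ocean_hours_after_each_step(4))          # ocean clock per coupling step
print(restart_hours_ocean_can_write(6, 1))     # hourly restarts, lagged start
print(restart_hours_ocean_can_write(6, 1, lagged=False))
```

In this toy picture the hour-1 restart is the only one that cannot be written with a lagged start, because the ocean clock is still at hour 0 when the coupler reaches hour 1.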

guillaumevernieres commented 2 years ago

@DeniseWorthen : I'm not too concerned about cold starting. Is that a requirement for the RT? If so, I would suggest just making it work for the RT requirement and being done with it.

hyunchul386 commented 2 years ago

@DeniseWorthen Just FYI, my run gives 3-hourly restart files and 1-hourly MOM6 diag files: /work/noaa/stmp/hlee/stmp/hlee/FV3_RT/cpld_c48/cpld_control_c48

I am not sure about the CICE diag/history files, because CICE diagnostics are controlled differently from MOM/FV3.

DeniseWorthen commented 2 years ago

@hyunchul386 This is an issue w/ the restart file, not the diag file.

@guillaumevernieres The cold start is not a requirement for the RT per se, it is because we don't have the actual ICs for FV3 and CICE6 that we do the lagged startup. I need to fix MOM6 being able to write a restart (even if it doesn't advance) or we need to provide FV3 and CICE6 ICs---and our "control" run would then actually be a restart.

guillaumevernieres commented 2 years ago

@hyunchul386 This is an issue w/ the restart file, not the diag file.

@guillaumevernieres The cold start is not a requirement for the RT per se, it is because we don't have the actual ICs for FV3 and CICE6 that we do the lagged startup. I need to fix MOM6 being able to write a restart (even if it doesn't advance) or we need to provide FV3 and CICE6 ICs---and our "control" run would then actually be a restart.

@DeniseWorthen , couldn't we do a short forecast offline, dump fv3/cice/mom6 restarts and use that to build the RT?

DeniseWorthen commented 2 years ago

@guillaumevernieres What length of forecast would you want for the warmstart+restart tests?

The warmstart test will use a staged IC for FV3, CICE, MOM6 and CMEPS. Currently I've set it up to run a 6 hr forecast, and write restarts for FV3, MOM6, CICE6 and CMEPS every hour. Does that setup work, or do you need something else (longer/shorter)?

The restart test will start from one of the warmstart test's checkpoint restarts. So, it could start from the first hour restarts, or the 5th hour restarts. It will run from whatever the restart hour is out to 6 hours.

The baseline will be compared between the warmstart and the warmstart+restart.

All components will also write final restarts.
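To make the warmstart/restart pairing concrete, here is a rough sketch of the bookkeeping. The function names, the dictionary layout, and the idea of comparing per-file checksums are assumptions for illustration only; the actual comparison is whatever the RT system does.

```python
def restart_test_plan(forecast_hours=6, restart_interval=1, restart_from=3):
    """Describe a restart test that starts from one warmstart checkpoint."""
    # Checkpoint restarts written before the end of the warmstart run.
    checkpoints = list(range(restart_interval, forecast_hours, restart_interval))
    if restart_from not in checkpoints:
        raise ValueError(f"no checkpoint restart at hour {restart_from}")
    return {
        "checkpoints": checkpoints,
        "restart_from": restart_from,
        "restart_run_hours": forecast_hours - restart_from,
    }

def same_as_baseline(warmstart_outputs, restart_outputs):
    """True if every warmstart output has an identical counterpart after restart.

    Both arguments map a (hypothetical) output file name to its checksum.
    """
    return all(restart_outputs.get(name) == checksum
               for name, checksum in warmstart_outputs.items())

plan = restart_test_plan(restart_from=5)
print(plan)

# Hypothetical checksums for the hour-6 outputs of both runs.
warm = {"ocn_final_restart": "abc123", "ice_final_restart": "def456"}
rest = {"ocn_final_restart": "abc123", "ice_final_restart": "def456"}
print(same_as_baseline(warm, rest))
```

With a 6 hr forecast and hourly checkpoints, the restart run can begin anywhere from hour 1 to hour 5, and both runs end at hour 6 so their final outputs are directly comparable.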

guillaumevernieres commented 2 years ago

6 hr forecast is perfect @DeniseWorthen .