NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

gdasgldas task fails with Restart Tile Space Mismatch #622

Closed BrettHoover-NOAA closed 1 year ago

BrettHoover-NOAA commented 2 years ago

Expected behavior gdasgldas task should complete successfully and global-workflow should continue to cycle fv3gdas

Current behavior gdasgldas task is failing on the first cycle in which the task is not skipped (the first 00z analysis period once enough data has been produced to trigger the task).

Machines affected This error is being expressed on Orion.

To Reproduce I am seeing this bug in a test of global-workflow being conducted on Orion in the following directories:

expid: /work/noaa/da/bhoover/para/bth_test
code: /work/noaa/da/bhoover/global-workflow
ROTDIR: /work/noaa/stmp/bhoover/ROTDIRS/bth_test
RUNDIR: /work/noaa/stmp/bhoover/RUNDIRS/bth_test

This run is initialized on 2020082200, and designed to terminate 2 weeks later on 2020090500.

Experiment setup: /work/noaa/da/bhoover/global-workflow/ush/rocoto/setup_expt.py --pslot bth_test --configdir /work/noaa/da/bhoover/global-workflow/parm/config --idate 2020082200 --edate 2020090500 --comrot /work/noaa/stmp/bhoover/ROTDIRS --expdir /work/noaa/da/bhoover/para --resdet 384 --resens 192 --nens 80 --gfs_cyc 1

Workflow setup: /work/noaa/da/bhoover/global-workflow/ush/rocoto/setup_workflow.py --expdir /work/noaa/da/bhoover/para/bth_test

Initial conditions: /work/noaa/da/cthomas/ICS/2020082200/

The error is found in the gdasgldas task on 2020082600.

Log file: /work/noaa/stmp/bhoover/ROTDIRS/bth_test/logs/2020082600/gdasgldas.log

Context I am a new Orion user and a member of the satellite DA group; this run is only to familiarize myself with the process of carrying out an experiment. No code changes have been made for this run. I followed the directions for cloning and building the global-workflow, and for setting up a cycled experiment, from the available wiki:

https://github.com/NOAA-EMC/global-workflow/wiki/

I did not create the initial condition files; they were produced for me. The global-workflow repository was cloned on January 25, 2022 (d3028b9d8268028226f9c27800fcd6655e9e4bb8).

The task fails with the following error in the log-file:

0: NOAH Restart File Used: noah.rst
0: 1 1536 768 389408
0: Restart Tile Space Mismatch, Halting..
0: endrun is being called
0: application called MPI_Abort(MPI_COMM_WORLD, 1) - process 0

The dimension size of 389408 is suspicious, since earlier in the log a different dimension size is referenced, e.g.:

0: MSG: maketiles -- Size of Grid Dimension: 398658 ( 0 )

When I search for "389408" in the log-file, it appears in only two places: once in the Restart Tile Space Mismatch error, and once in the output of exec/gldas_rst, where the results of a FAST_BYTESWAP are reported:

216.121 + /work/noaa/da/bhoover/global-workflow/exec/gldas_rst
216.121 + 1>& 1 2>& 2
FAST_BYTESWAP ALGORITHM HAS BEEN USED AND DATA ALIGNMENT IS CORRECT FOR 4
1536 768 4 9440776
 2 tmp0_10cmdown      GLDAS STC1
 3 tmp10_40cmdown     GLDAS STC2
 4 tmp40_100cmdown    GLDAS STC3
 5 tmp100_200cmdown   GLDAS STC4
 6 soill0_10cmdown    GLDAS SLC1
 7 soill10_40cmdown   GLDAS SLC2
 8 soill40_100cmdown  GLDAS SLC3
 9 soill100_200cmdown GLDAS SLC4
10 soilw0_10cmdown    GLDAS SMC1
11 soilw10_40cmdown   GLDAS SMC2
12 soilw40_100cmdown  GLDAS SMC3
13 soilw100_200cmdown GLDAS SMC4
15 landsfc
18 vtypesfc
71 tmpsfc             GLDAS SKNT
72 weasdsfc           GLDAS SWE
79 cnwatsfc           GLDAS CMC
88 snodsfc            GLDAS SNOD
 1 1536 768 389408
216.602 + err=0

I believe that the error is related to the difference in tile-size between these two values.
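For context on why the two counts matter: the tile space appears to be derived from the land-sea mask, so the expected tile count is tied to the number of land points. A minimal, purely illustrative sketch (the function and the masks below are hypothetical, not the actual LIS/GLDAS code):

```python
def count_land_tiles(mask):
    """Count grid cells flagged as land (mask value 1)."""
    return sum(1 for row in mask for cell in row if cell == 1)

# Two masks that differ in even one cell yield different tile counts,
# which is the kind of mismatch the "Restart Tile Space Mismatch"
# check would trap (e.g. 398658 vs 389408 in the log above).
fix_mask = [
    [1, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]                                   # 8 land points
restart_mask = [row[:] for row in fix_mask]
restart_mask[3][3] = 1              # one water cell reclassified as land

assert count_land_tiles(fix_mask) != count_land_tiles(restart_mask)
```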

Detailed Description I have proposed no change or addition to the code for this run.

Additional Information Prior gdasgldas tasks in the run, from initialization to 2020082600, all completed, but only because they were skipped: either the analysis was for a non-00z period, or the requisite number of cycles had not yet been completed to trigger the task. There are no successful gdasgldas tasks in this run that I can compare against the failed one. I have conferred with more experienced EMC users of fv3gdas and the cause of the problem is not obvious.

Possible Implementation I have no implementation plan to offer.

CatherineThomas-NOAA commented 2 years ago

@HelinWei-NOAA

HelinWei-NOAA commented 2 years ago

see the email from Dave

Hi Helin and Jun,

I am finding with a recent upgrade of the global workflow that there is a mismatch between the land-sea mask algorithms of the UFS and the GLDAS. This is resulting in a failure of the gdasgldas job, specifically when the Land Information System (LIS) executable is run. The error reported is 'Restart Tile Space Mismatch, Halting..', which can be found in sorc/gldas_model.fd/lsms/noah.igbp/noahrst.F:104. I have only verified this error for a C192/C96 run on 2020080500, where NCH = 97296 and LIS%D%NCH = 99582.

With the recent upgrade, the UFS is modifying the input land sea mask during the first gdasfcst and outputting this modified mask to the tiled surface restart files. Thus, all future forecasts, analyses, etc use this modified mask until the GLDAS reads its own from $FIX/fix_gldas/FIX_T382/lmask_gfs_T382.bfsa.

So I have a few questions. First, the UFS's modification of the land-sea mask is expected, correct? Secondly, should a new fix file be created for the GLDAS with the modified land-sea mask or is the UFS-modified land-sea mask time dependent and thus not fixed? Lastly, should I expect this to be an issue at all resolutions?


Since GLDAS will not be included in the next operational implementation, we need someone to decide whether we should spend more time on this task.

yangfanglin commented 2 years ago

Is the cycling working without gdasgldas ? Have the user tried at the operational resolutions (C768/C384) ?

BrettHoover-NOAA commented 2 years ago

The cycling appears to fail when gdasgldas fails. I have not run at operational resolution, the current run is an initial test-run I'm doing as a new global-workflow user on Orion and the initial conditions were provided to me at C384/C192.


CatherineThomas-NOAA commented 2 years ago

@yangfanglin The experiment had no errors when running in the first few days before GLDAS turns on, so I assume that it would run if GLDAS was turned off altogether.

If as @HelinWei-NOAA says that GLDAS will not be used going forward, should we turn it off now in our experiments? The DA team is still running atmosphere-only cases since we have non-coupling data upgrades to worry about.

yangfanglin commented 2 years ago

@CatherineThomas-NOAA If the cycling at C384/C192 resolutions with gdasgldas turned on is working on WCOSS, there are likely issues related to the setup on Orion. I had some discussion with Daryl about the use of gdasgldas in future systems. We can discuss this issue further with Daryl offline.

CatherineThomas-NOAA commented 2 years ago

@AndrewEichmann-NOAA Has your WCOSS experiment progressed to the point where the gldas step is run? If so, did it fail or run without issue?

AndrewEichmann-NOAA commented 2 years ago

@CatherineThomas-NOAA No - I ran into rstprod access issues and am waiting for a response from helpdesk

CatherineThomas-NOAA commented 2 years ago

Thanks @AndrewEichmann-NOAA. I can run a short warm start test on WCOSS.

CatherineThomas-NOAA commented 2 years ago

The warm start test on WCOSS ran the gldas step without failure. Now that Hera is back, I can try a quick test there as well.

CatherineThomas-NOAA commented 2 years ago

> Now that Hera is back, I can try a quick test there as well.

Global-workflow develop does not build on Hera (#561), so I can't run this test at this time.

DavidHuber-NOAA commented 2 years ago

I ran into this issue with the S4 and Jet ports, which I reported to Helin. I have since turned off GLDAS altogether and everything has run OK out to 10 days.

Below is more of the thread between Helin, @junwang-noaa, and myself:

David,

GLDAS should use the same land-sea mask as UFS. If the land-sea mask can be changed during the forecast, that certainly will bring up some issues for GLDAS.

Helin

Dave,

The model does check/modify the land-sea mask according to the input land/lake fraction and soil type values, this is applied to both non-fractional grid and fractional grid. The changes are required to make sure the model has consistent land sea mask, and soil types. The new land sea mask is output in the model sfc file. I expect you do not change the oro data, the soil type data during your run, so this land sea mask won't change. In other words, I think you need to create lmask_gfs_T382.bfsa once with this land sea mask in the history file.

Jun

arunchawla-NOAA commented 2 years ago

@HelinWei-NOAA @barlage and @yangfanglin should Jun's suggestion be followed, that the mask be generated for GLDAS from the history file? It would be good to know if the same issue is seen on other platforms.

yangfanglin commented 2 years ago

@arunchawla-NOAA Cathy reported that "the warm start test on WCOSS ran the gldas step without failure", so the failures on other platforms might be a porting issue. @CatherineThomas-NOAA Cathy, can you confirm? What resolution did you test on WCOSS? I assume you were using the GFS.v16* tag instead of the UFS, right?

WalterKolczynski-NOAA commented 2 years ago

@BrettHoover-NOAA I'm unable to access either the code or experiment directories.

I ran a test on Orion a few days ago and didn't have any issue, so this probably isn't a port issue. I'm setting up another test just to be sure.

CatherineThomas-NOAA commented 2 years ago

My test on WCOSS was C384/C192 with warm start initial conditions from the PR #500 test. This was using the head of develop global-workflow at the time (d3028b9), compiling and running with atmosphere only. Since then, I've run new tests on Hera and Orion with the recent update to develop (97ebc4d) and ran into no issues with gldas.

@BrettHoover-NOAA It's possible this issue got fixed inadvertently with the recent develop update. It could also be related to the set of ICs that you started from. How about you try to replicate the test I ran first? I'll point you to my initial conditions offline.

WalterKolczynski-NOAA commented 2 years ago

I was able to run a 6½-cycle warm-start from a fresh clone overnight without issue. I'm also using the C384 ICs Cathy produced for PR #500.

BrettHoover-NOAA commented 2 years ago

@CatherineThomas-NOAA I was able to complete your test with the new develop (97ebc4d) and warm-start ICs on Orion, and I ran into no problems, gdasgldas appears to finish successfully.

CatherineThomas-NOAA commented 2 years ago

@BrettHoover-NOAA Great to hear. It looks like your Orion environment is working properly. Maybe to round out this set of tests you could try the warm-start 2020083000 ICs but with the original workflow that you cloned, assuming you still have it.

BrettHoover-NOAA commented 2 years ago

@CatherineThomas-NOAA I have that test running right now, I'll report back ASAP

BrettHoover-NOAA commented 2 years ago

@CatherineThomas-NOAA The warm-start test with the original workflow also finished successfully.

CatherineThomas-NOAA commented 2 years ago

@BrettHoover-NOAA Great! There may have been an incompatibility with the other ICs then.

@HelinWei-NOAA @DavidHuber-NOAA Is the land-sea mask problem that you mentioned early documented elsewhere? Can this issue be closed?

HelinWei-NOAA commented 2 years ago

@CatherineThomas-NOAA No. It hasn't been documented elsewhere. But I have let Fanglin and Mike know this issue. IMO this issue can be closed now.


DavidHuber-NOAA commented 2 years ago

@CatherineThomas-NOAA I'm also OK with this issue being closed.

DavidHuber-NOAA commented 2 years ago

I gave this a fresh cold start test (C192/C96) on Orion over the weekend and received the same error. Initial conditions were generated on Hera (/scratch1/NESDIS/nesdis-rdo2/David.Huber/ufs_utils/util/gdas_init), outputting them here: /scratch1/NESDIS/nesdis-rdo2/David.Huber/output/192. UFS_Utils was checked out using the same hash as the global workflow checkout script (04ad17e2).

These were then transferred to Orion where a test ran from 2020073118 through 2020080500, where gdasgldas failed with the same message ("Restart Tile Space Mismatch, Halting.."). The global workflow hash used was 64b1c1e and can be found here: /work/noaa/nesdis-rdo2/dhuber/gw_dev. Logs from the run can be found here: /work/noaa/nesdis-rdo2/dhuber/para/com/test_gldas/logs.

A comparison of the land surface mask (lsmsk) between the IC tile1 surface file and the tile1 restart file shows a difference.

I also created initial conditions for C384/C192 and compared the land surface mask against Cathy's tile 1 restart surface file, which shows no difference.

Lastly, I copied the C384/C192 ICs over to Orion and executed just the gdasfcst job, then compared the lsmsk field as before and there was a difference.

It is this modification by the UFS that triggers problems for GLDAS, and I think any further tests could be limited to just a single half-cycle run of gdasfcst. I tracked this modification to the addLsmask2grid subroutine, which is tied to the ocean fraction, which in turn is set in the orographic fix files, which are identical on Orion and Hera. So I am at a loss as to why these differ between warm and cold starts. Is this expected behavior, and if so, should GLDAS be turned off for cold starts?
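The mask comparisons described above can be sketched as follows. This is a hypothetical illustration with synthetic arrays standing in for the real fields; in practice the lsmsk field would be read from the tiled NetCDF files (e.g. with netCDF4 or xarray):

```python
# Illustrative sketch: diff the land-sea mask between an IC surface file
# and a post-forecast restart file. The 2D lists below are synthetic
# stand-ins for the real tiled fields.
def mask_diff(ic_mask, restart_mask):
    """Return (i, j) indices where the two masks disagree."""
    return [(i, j)
            for i, row in enumerate(ic_mask)
            for j, (a, b) in enumerate(zip(row, restart_mask[i]))
            if a != b]

ic = [[1, 1, 0],
      [0, 0, 0]]
restart = [[1, 1, 0],
           [0, 1, 0]]   # the forecast reclassified one cell as land

diffs = mask_diff(ic, restart)
# A non-empty diff here corresponds to the lsmsk difference observed
# between cold-start ICs and the first gdasfcst restart.
```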

KateFriedman-NOAA commented 2 years ago

All, is there guidance on whether users should turn off GLDAS when running cold-started experiments for now? @jkhender just hit the same error on Orion with a cold-started experiment using global-workflow develop as of 2/2/22. Thanks!

jkhender commented 2 years ago

correction - my experiment is running on Hera

DavidHuber-NOAA commented 2 years ago

This is what I would suggest. I don't think that a fix file update will work for everyone since warm starts seem to be using the current fix files without a problem, implying that an update would result in a mismatch for those users (testing could confirm this). Alternatively, two sets of fix files could be created, one for warm starts and one for cold, but that would require some scripting to know which to link to.
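The "two sets of fix files" idea could be handled by a small selection helper. The directory names and the start-type flag below are hypothetical, purely to illustrate the kind of scripting that would be needed, and are not actual global-workflow variables:

```python
# Hypothetical sketch: choose a GLDAS fix directory based on whether the
# experiment was cold- or warm-started. All names are illustrative only.
def gldas_fix_dir(start_type, fix_root="/path/to/fix"):
    subdirs = {"cold": "fix_gldas_ufs_mask", "warm": "fix_gldas_gfs_mask"}
    if start_type not in subdirs:
        raise ValueError(f"unknown start type: {start_type}")
    return f"{fix_root}/{subdirs[start_type]}"
```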

KateFriedman-NOAA commented 2 years ago

> correction - my experiment is running on Hera

Oops, thanks for correcting @jkhender !

> Alternatively, two sets of fix files could be created, one for warm starts and one for cold, but that would require some scripting to know which to link to.

So this is related to a fix file issue? Sorry, been out of the loop on this. Thanks!

DavidHuber-NOAA commented 2 years ago

> So this is related to a fix file issue? Sorry, been out of the loop on this. Thanks!

I believe so, yes. See the email thread between Jun, Helin, and myself here and here.

AndrewEichmann-NOAA commented 2 years ago

I've now encountered this on WCOSS, on the fourth full 00Z cycle into a run. I modified the rocoto script to skip over it.

rotdir: /gpfs/dell1/ptmp/Andrew.Eichmann/baily
expdir: /gpfs/dell2/emc/modeling/save/Andrew.Eichmann/para/baily

global-workflow is my EFOSI fork, branched off master at d7319f19aceca6ae6d7ce9b06c6eb731832d1de1 (Feb 2).

KateFriedman-NOAA commented 2 years ago

@AndrewEichmann-NOAA For clarification...you skipped the failing GLDAS job in your fourth full 00z cycle but kept GLDAS on for following cycles? If so, did the GLDAS work in the next cycles? Thanks!

AndrewEichmann-NOAA commented 2 years ago

@KateFriedman-NOAA I skipped the fourth full 00z cycle, then tried turning GLDAS off in config.base, which had no effect (it was still in the workflow), so I regenerated the xml file to remove it from the workflow (this was after I wrote the comment above). Basically I didn't run it at all, so I don't know whether it would have worked properly on later cycles.

KateFriedman-NOAA commented 2 years ago

@AndrewEichmann-NOAA Gotcha, thanks for the clarification!

All, it sounds like we should advise that users don't run with GLDAS if they are cold-starting an experiment, especially given the note from @HelinWei-NOAA that we won't be using the GLDAS moving forward. @yangfanglin @CatherineThomas-NOAA @WalterKolczynski-NOAA Any objections to this guidance? We have users with stalled experiments because of this cold-start problem. Thanks!

CatherineThomas-NOAA commented 2 years ago

@KateFriedman-NOAA No objection from me.

RussTreadon-NOAA commented 2 years ago

What guidance do we have for developers who cold start an experiment AND want to run GLDAS? The JEDI GDAS team wants to mimic all operational job steps apart from the analysis step, which will be JEDI VAR. As this is just a prototype, we are running at C96L127. After cycling enough days to run the 00Z GLDAS, the gdasgldas job fails with the restart tile space mismatch error reported in this issue.

Is there an offline executable we can run to make a cold start experiment look like a warm start experiment? What is the target date for turning off the GLDAS in operations? I don't view turning off GLDAS as a solution.

RussTreadon-NOAA commented 2 years ago

As requested by g-w staff, this issue has been reopened.

RussTreadon-NOAA commented 2 years ago

An Orion C96L127 parallel running ufs model tag release/P8a was used to create a test set of fix/fix_gldas/FIX_T190 files.

A 00Z gdasgldas job using /work/noaa/global/glopara/fix_NEW/fix_gldas/FIX_T190 was submitted. This job failed with the error Restart Tile Space Mismatch. The test FIX_T190 files were swapped in and the 00Z gdasgldas job rerun. The job ran to completion without any error messages.

yangfanglin commented 2 years ago

@RussTreadon-NOAA, what are the differences between the original FIX_T190 files and the test FIX_T190 files? Sorry if this has been discussed and documented somewhere else.

RussTreadon-NOAA commented 2 years ago

@HelinWei-NOAA created the FIX_T190 files using output from the Orion C96L127 parallel.

yangfanglin commented 2 years ago

@HelinWei-NOAA Thanks Helin for creating the new files. Are you planning to create the new fix files for other resolutions as well ?

HelinWei-NOAA commented 2 years ago

@yangfanglin Yes. As long as the sfc nemsio files for other resolutions can be provided.

HelinWei-NOAA commented 2 years ago

@KateFriedman-NOAA My plan is to add a flag "GLDAS_FIX" to config.base, set to either UFS or GFS.

run GLDAS to spin up land ICs:

export DO_GLDAS="YES"
export gldas_cyc=00
export GLDAS_FIX="UFS"
if [ $GLDAS_FIX = "UFS" ]; then
  export FIXgldas=<the place of the new gldas fixed fields based on UFS>
fi

What do you think? Do you prefer to move this section to config.gldas?

KateFriedman-NOAA commented 2 years ago

@HelinWei-NOAA I would prefer GLDAS-specific settings go into config.gldas. Perhaps gldas_cyc can go in there too. @WalterKolczynski-NOAA thoughts?

Questions:

1) Where will these UFS-FIXgldas files live? Can we add them into the main FIX_DIR alongside the GFS-FIXgldas files? Can you explain how they will differ between GFS and UFS? We currently have the following for GLDAS in FIX_DIR:

[Kate.Friedman@v71a1 fix_gldas]$ pwd
/gpfs/dell2/emc/modeling/noscrub/emc.glopara/git/fv3gfs/fix_NEW/fix_gldas
[Kate.Friedman@v71a1 fix_gldas]$ ll
total 16
drwxr-xr-x 2 emc.glopara emcmodel 2048 Dec 13  2019 FIX_T1534
drwxr-xr-x 2 emc.glopara emcmodel 2048 Dec 13  2019 FIX_T190
drwxr-xr-x 2 emc.glopara emcmodel 2048 Dec 13  2019 FIX_T382
drwxr-xr-x 2 emc.glopara emcmodel 2048 Dec 13  2019 FIX_T766

2) How will we/users know when to switch to the UFS set instead of the GFS set? Can we use another switch to determine that? Is this UFS set for the coupled system? If so, perhaps it could key off of the APP setting?

HelinWei-NOAA commented 2 years ago

@WalterKolczynski-NOAA @KateFriedman-NOAA I am very confused about two directories under global-workflow/fix. fix_fv3_fracoro is a copy of fix_fv3_gmted2010 but with the latest data used by UFS PT. How does the workflow determine which one will be used? We should only use one of them for consistency. But now it seems to me that FIX_SFC is linked to fix_fv3_fracoro and FIXfv3 is linked to fix_fv3_gmted2010.

KateFriedman-NOAA commented 2 years ago

> @WalterKolczynski-NOAA @KateFriedman-NOAA I am very confused about two directories under global-workflow/fix. fix_fv3_fracoro is a copy of fix_fv3_gmted2010 but with the latest data used by UFS PT. How does the workflow determine which one will be used? We should only use one of them for consistency. But now it seems to me that FIX_SFC is linked to fix_fv3_fracoro and FIXfv3 is linked to fix_fv3_gmted2010.

@HelinWei-NOAA The following forecast job child scripts set the mentioned FIX paths:

Tagging @yangfanglin to explain fix file usage further and check we are being consistent/correct.

JessicaMeixner-NOAA commented 2 years ago

The fix_fv3_fracoro files should not simply be a copy of fix_fv3_gmted2010. They have the fractional masks (i.e. a grid cell can have both ocean and land). @shansun6 and others added the capability for the fractional grid and would likely be the best people to provide more details on the contents of the fracoro directory.

HelinWei-NOAA commented 2 years ago

@KateFriedman-NOAA Thanks for your explanation. @yangfanglin In the current setting, the new dataset, including the VIIRS-based vegetation type, can only be used when the fractional grid is used. I am wondering if it should also be used (if it can, @shansun6) without the fractional grid. You turned the fractional grid off in your cycled test run, but gldas still failed. We need to figure this out.

HelinWei-NOAA commented 2 years ago

@JessicaMeixner-NOAA When I said a copy, I meant the structure is the same (number of files, file names). The only difference is that the oro data has more information (lake fraction, lake depth), and all the other fixed fields are based on this oro data.

HelinWei-NOAA commented 2 years ago

@shansun6 If the dataset created for the fractional grid can't be used for runs without the fractional grid, you should probably create the two datasets, with and without the fractional grid, from the same raw data at the same time for consistency.