NOAA-EMC / GSI

Gridpoint Statistical Interpolation
GNU Lesser General Public License v3.0

GSI convergence problems in scout runs in 2020 #755

Closed: jderber-NOAA closed this issue 2 months ago

jderber-NOAA commented 3 months ago

Jeff Whitaker's group is reporting convergence issues in their 3dvar C96L127 atm-only scout run starting in 2020.

"Initial cost function = 5.330609621864649467E+06 Initial gradient norm = 6.047187774679628015E+07 cost,grad,step,b,step? = 1 0 5.330609621864649467E+06 6.047187774679628015E+07 1.325505088896680807E-09 0.000000000000000000E+00 SMALL cost,grad,step,b,step? = 1 1 5.337598182749103755E+06 4.183988366714026779E+07 1.268375413166407207E-09 4.731939262212094266E-01 SMALL PCGSOI: WARNING * Stopping inner iteration Penalty increase or constant 1 1 0.100131102470077527E+01 0.100000000000000000E+01

I've tried various things to get around this:

1) different initial times (from ops and/or replay) in 2020 and 2021 - no impact
2) zero initial bias correction or bias correction from ops - no impact
3) leaving out various observing systems (no radiances, no sat winds, no gps, etc.) - no impact"

Examining runs to determine the source of the issues.

jderber-NOAA commented 3 months ago

The script I was provided did not work properly on Hera. The issue appeared to be in the loading of the modules. I replaced those lines with what I normally use for running the GSI:

    . /apps/lmod/lmod/init/ksh
    module purge
    module use /scratch1/NCEPDEV/da/Russ.Treadon/git/gsi/develop/modulefiles
    module load gsi_hera.intel
    module list

This appeared to make it run.

The second issue was that the output files gsitest_hera.err and gsitest_hera.out were not being deleted, so output from the latest run was being appended to them. This creates some confusion, especially when the job did not run properly. So now I am deleting these files before running the test scripts.

jderber-NOAA commented 3 months ago

Examining the stepsizes predicted by each term within stpcalc indicates that the problem is coming from the winds and the radiances. There is also a short stepsize from the background term. This indicates that there is a problem with the gradient being calculated from the winds and radiances. I will try turning off these two observation types to see if it minimizes properly. If it does, it will be necessary to look more closely at the gradients generated from these data to see why they create such large values.

jderber-NOAA commented 3 months ago

A closer look at the output shows that airs_aqua, metop-a iasi, metop-b iasi, npp atms, n20 atms, npp cris-fsr, n20 cris-fsr, and metop-c amsua are the suspicious obs. For the winds, I am not seeing anything particularly suspicious, and the wind signal may be coming from the radiances. So I will turn off these radiances first.

jderber-NOAA commented 3 months ago

Didn't help much. Trying to turn off all amsu-a instruments.

jderber-NOAA commented 3 months ago

Notes.

If you start with a smaller stepsize (1.e-6), the minimization runs the full number of iterations. However, the stepsizes are very small and there is not a lot of reduction in the total penalty. This indicates that the minimization algorithm is probably OK; the problem is probably just very poorly conditioned. Need to determine the reason for the poor conditioning.

  1. Turn off all bias correction - no significant change.
  2. Turn off satellite error covariances - no significant change.
  3. Use observation variances from the input file rather than prepbufr - no significant change.
  4. Remove moisture constraint - no significant change.
  5. Remove all sat. obs (except gps bending) - no significant change.
  6. Remove gps bending + above - no significant change.
  7. Remove all winds + above - seems to minimize properly.
  8. As in 5, plus removing sat winds and profiler winds - same result as in 5.
  9. All data, with all winds removed - same result as in 5.

jderber-NOAA commented 3 months ago

Seeing some strange things in the search direction for winds. Attempting to print out intermediate values as search direction is being calculated to see where strange values appear.

jderber-NOAA commented 3 months ago

It looks to me like there is an inconsistency between the background errors and the analysis resolution. JCAP=188 - I have never seen that resolution run before - maybe you run that all the time. Does NLAT=194, NLON=384 work for this JCAP? I would suggest trying to run the analysis at the operational resolution with the operational input files. I think that may converge properly, further indicating an issue with the resolution of the analysis or the input stats files.

jswhit commented 3 months ago

We're using global_berror.l127y194.f77. I just checked the global workflow and it uses JCAP=190 for C96 using that berror file. Don't know why we have it set to 188 - but I will try 190 and see what happens.
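For reference, these resolution settings normally live in the &GRIDOPTS block of the GSI namelist (gsiparm.anl). A minimal sketch of the combination being tested here, assuming the standard global-workflow namelist layout; only JCAP, NLAT, and NLON are taken from this thread:

```fortran
 &GRIDOPTS
   JCAP=190, NLAT=194, NLON=384,
 /
```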

jswhit commented 3 months ago

Same problem with JCAP=190. I wonder if we need to regenerate the berror file for C96 using the backgrounds and analyses we have already generated for the scout run. The current berror file is simply interpolated from the operational C384 file.

jderber-NOAA commented 3 months ago

Jeff, thanks for doing the experiment! I would think horizontal interpolation would be OK. You're not doing any vertical interp, right? I may need to print out more stuff.

John

jswhit commented 3 months ago

no vertical interp, just horizontal

jderber-NOAA commented 3 months ago

It looks like the problem is just very poorly conditioned (i.e., the eigenvalues of the Hessian are far from each other and from 1). This can happen if the background errors are strange, if there are very small obs errors for a few obs, or if there are many similar observations very close together. The first two of these do not appear to be true. Making modifications to the duplicate checking for wind obs to see if this helps (all radiances turned off). The first try (with an error in the code: a forgotten abs) seems to be better.
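As an aside on the forgotten abs: in a proximity/duplicate test the separations have to be compared in absolute value, otherwise any pair whose difference happens to come out negative passes a "< tolerance" test automatically. A generic, hypothetical illustration (not the actual GSI wind duplicate check):

```fortran
! Generic illustration, not GSI code: two obs 3 hours apart with a 0.5 hour
! tolerance.  Without abs() the negative difference always passes the
! "< tol" test, so the pair is wrongly treated as a duplicate.
program abs_in_dup_check
  implicit none
  real :: t1, t2, tol
  t1 = 0.0
  t2 = 3.0
  tol = 0.5
  print *, 'without abs, duplicate? ', (t1 - t2) < tol      ! T (wrong)
  print *, 'with abs,    duplicate? ', abs(t1 - t2) < tol   ! F (correct)
end program abs_in_dup_check
```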

jswhit commented 3 months ago

I've found that the solution for this case is sensitive to the number of MPI tasks used. On Hercules, using 8 nodes and 10 MPI tasks per node, the error occurs. Changing the layout to 5 MPI tasks per node allows the minimization to converge (although I then get a segfault when trying to write the analyses, presumably from running out of memory).

jderber-NOAA commented 3 months ago

Sounds like it might be a threading issue. I am back to the drawing board, trying to print out a bunch of stuff to see what is happening.

jswhit commented 3 months ago

I've got the 2020 stream running again by allocating 32 80-core Hercules nodes (with 4 MPI tasks per node) to the GSI. Reducing the node count to 20 or below results in the convergence error.

jderber-NOAA commented 3 months ago

Jeff,

I think you are onto the issue. The original script you gave me used 16 nodes and 40 tasks/node with 8 threads.

I think the number of tasks/node times the number of threads should be less than or equal to the total number of processors on a node. I don't think the nodes have 320 processors (I think it is more like 40/node). With the binding and the oversubscription of the processors, I think this is what is causing the issues.

I have a test in the queue using fewer tasks per node (5 tasks/node * 8 threads = 40 processors on a node), but it doesn't seem to be running. Will let you know my results.

John

jderber-NOAA commented 3 months ago

Still not working right for me. Will continue to look for the issue with grid2sub and sub2grid for the u,v and sf,vp transforms.

jderber-NOAA commented 3 months ago

Looks like the s2guv%rdispls_s array is being corrupted somewhere. Have to find where the corruption occurs.
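For context on why corrupting that array is so damaging: rdispls-type arrays tell the MPI all-to-all exchange where each task's contribution should land in the receive buffer, so garbage displacements scatter data to the wrong places or crash the exchange. A generic, self-contained sketch of that role (plain MPI, not the GSI s2guv structure):

```fortran
! Generic MPI example (not GSI code): every task sends one integer to every
! other task.  The rdispls array tells MPI_Alltoallv where each sender's
! contribution goes in the receive buffer; if those values were overwritten
! with garbage, the received data would land in the wrong places or the
! exchange would fail outright.
program alltoallv_demo
  use mpi
  implicit none
  integer :: ierr, rank, ntasks, i
  integer, allocatable :: sendbuf(:), recvbuf(:)
  integer, allocatable :: scounts(:), sdispls(:), rcounts(:), rdispls(:)

  call MPI_Init(ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, ntasks, ierr)

  allocate(sendbuf(ntasks), recvbuf(ntasks))
  allocate(scounts(ntasks), sdispls(ntasks), rcounts(ntasks), rdispls(ntasks))

  sendbuf = rank                      ! send my rank to everyone
  scounts = 1;  rcounts = 1           ! one integer per task
  do i = 1, ntasks
     sdispls(i) = i - 1               ! zero-based offsets into sendbuf
     rdispls(i) = i - 1               ! and into recvbuf
  end do

  call MPI_Alltoallv(sendbuf, scounts, sdispls, MPI_INTEGER, &
                     recvbuf, rcounts, rdispls, MPI_INTEGER, &
                     MPI_COMM_WORLD, ierr)

  if (rank == 0) print *, 'rank 0 received: ', recvbuf   ! 0, 1, ..., ntasks-1
  call MPI_Finalize(ierr)
end program alltoallv_demo
```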

jderber-NOAA commented 3 months ago

I think I have solved the problem! A test is waiting to run. It looks like one of the radiance covariance files (I think AIRS sea; correction: cris-fsr_npp sea) is inconsistent, with more active channels (coun=100) than nch_chan (92) (around line 463 of correlated_obsmod.F90). Because of this, the indxRf array (dimensioned nch_chan) goes out of bounds and messes up some of the all-to-all communication arrays. Everything goes downhill from there.

The best solution is to remove the inconsistency between the definition of the nch_active input variable and the number of active channels (iuse_rad > 0). We should also put a check in the correlated_obsmod routine for this case and print out a warning message (and stop?).

It is late and I will be busy most of tomorrow. So later tomorrow I will give more details.
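A minimal sketch of the kind of check proposed above, using the names from this thread (coun, nch_active); this is illustrative only, not the actual correlated_obsmod.F90 code, and the subroutine name and interface are made up:

```fortran
! Illustrative only: a consistency check of the kind proposed above.
! coun       = number of active channels (iuse_rad > 0) counted from satinfo
! nch_active = number of channels the covariance file claims are active
subroutine check_active_channels(isis, coun, nch_active, iret)
  implicit none
  character(len=*), intent(in)  :: isis              ! instrument/platform id
  integer,          intent(in)  :: coun, nch_active
  integer,          intent(out) :: iret              ! 0 = consistent, 1 = mismatch

  iret = 0
  if (coun /= nch_active) then
     write(6,*) 'CORRELATED_OBSMOD: WARNING: ', trim(isis), &
                ' has ', coun, ' active channels but nch_active = ', &
                nch_active, '; indxRf would go out of bounds.'
     iret = 1    ! caller can stop, or skip correlated errors for this type
  end if
end subroutine check_active_channels
```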

jderber-NOAA commented 3 months ago

That should be cris-fsr_npp sea, not AIRS sea, above. My run failed; I suspect my quick fix for getting around the issue. Will do more later.

jderber-NOAA commented 3 months ago

Finally getting correct results with everything converging properly.

Turns out the fix is very simple.

In line 425 of correlated_obsmod.F90, the dimension of the indxRf array should be changed from nch_active to coun:

 allocate(indxRf(coun),indxR(nch_active),Rcov(nctot,nctot))

For one satellite instrument (cris-fsr_npp), coun is greater than nch_active; in other cases it is less than or equal. With the current code, the setting of the values in indxRf goes outside the array and can (depending on the processor layout) overwrite other variables. In Jeff's case it was overwriting indexes for the all-to-all communication. When he changed the number of processors, the layout changed and the code was able to finish successfully (with the correct answer?).
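To make the failure mode concrete, here is a standalone sketch (not GSI code; 92 and 100 are the nch_active and coun values quoted above for cris-fsr_npp) of what happens when coun entries are written into an array allocated with only nch_active elements:

```fortran
! Standalone illustration, not GSI code.  Writing coun entries into an array
! allocated with nch_active elements is out of bounds: built with runtime
! checks (ifort -check bounds / gfortran -fcheck=bounds) this aborts at
! i = 93; built without them, the extra writes land in memory the array does
! not own, which is how the all-to-all index arrays were silently corrupted
! in the real run.
program indxrf_overflow_demo
  implicit none
  integer, parameter :: nch_active = 92, coun = 100
  integer, allocatable :: indxRf(:)
  integer :: i

  allocate(indxRf(nch_active))
  do i = 1, coun                 ! loop runs past the end of indxRf
     indxRf(i) = i
  end do
  print *, 'finished loop; last index written = ', coun
end program indxrf_overflow_demo
```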

I will prepare a PR with this change. Jeff, could you (or one of your group) review it? Since it is one line, the review should be trivial. I will also ask Cathy if she can review.

jderber-NOAA commented 3 months ago

Unfortunately, while trying to create a PR, my push to the repository seems to no longer work. Suggestions for how to move forward?

jack-woollen commented 3 months ago

@jderber-NOAA You could make a fork of develop with the change and create the PR with that. Then someone with permission could complete the push to the main develop. In the meantime, @jswhit is using a fork with some other pending changes to run the scout runs. If the only change needed is the one liner you mention above, I'll just push that into our test fork so Jeff can check it out.

jderber-NOAA commented 3 months ago

Jack, thanks! I think the change could wait for the other scout run changes, as long as it does not wait too long. It is only the one-line change in the issue. It only happens now when processing the npp cris-fsr correlated error. Note that it does this whether or not the data is used, if it is in the satinfo file.

John

jack-woollen commented 3 months ago

Thanks John and Jeff, the fix is pushed in the fork.

jswhit commented 2 months ago

@jderber-NOAA your fix is working great - and as a bonus I can use many fewer processors now. I will create a PR with just this change so we can get it merged into develop ASAP. I would think that this bug has got to be adversely impacting the operational GDAS.