NOAA-EMC / RDASApp

Regional DAS
GNU Lesser General Public License v2.1

Efficiency issues in EnKF solvers for MPAS-JEDI due to vertical localization #122

Open SamuelDegelia-NOAA opened 1 month ago

SamuelDegelia-NOAA commented 1 month ago

When assimilating the full set of RAP mesonet obs, I found a significant slowdown in LETKF for MPAS-JEDI. Using 120 MPI tasks, the LETKF took ~70 min to run.

I found that this is primarily due to vertical localization. After disabling vertical localization, the LETKF runs in ~24 min. The bottleneck now is the hofx calculation, which could be sped up by running LETKF in split observer/solver mode for each member separately and then concatenating the hofx files together. I do not think this will have any effect on the vertical localization issue, though.
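For reference, here is roughly how the observer/solver split looks in the yaml. This is only a sketch based on the LocalEnsembleDA driver options as I understand them, so the names should be checked against the MPAS-JEDI build:

```yaml
# Sketch only: driver options for splitting LETKF into two passes
# (option names assumed from the oops LocalEnsembleDA driver).

# Pass 1 (observer): compute and write hofx, skip the solver
driver:
  run as observer only: true
---
# Pass 2 (solver): read the hofx written in pass 1 instead of recomputing it
driver:
  read HX from disk: true
```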

It seems that this is a known issue due to the 3d iterator. Some discussion of the problem is here, but no solution is available yet other than using GETKF for model-space localization (which has other known efficiency issues).

guoqing-noaa commented 1 month ago

Just curious, is the total time of observer+solver much smaller than running LETKF in one step directly?

SamuelDegelia-NOAA commented 1 month ago

@guoqing-noaa, I will work on using this split method for MPAS-JEDI and get back to you. I only ever tried it for FV3-JEDI and did not notice any significant difference in the timing. But I was mainly working on it there to see if the split method could solve the halo observation distribution issue in #51 and did not pay very close attention to the cost.

guoqing-noaa commented 1 month ago

Great! Thanks, @SamuelDegelia-NOAA !

Do you have any instructions on how to run LETKF? A few GSL folks (@chunhuazhou, @hongli-wang, @HaidaoLin-NOAA, etc.) may want to learn and get some hands-on experience with it.

SamuelDegelia-NOAA commented 1 month ago

I do not really have any direct instructions, but the yaml file included in RDASApp (https://github.com/NOAA-EMC/RDASApp/blob/develop/rrfs-test/testinput/rrfs_mpasjedi_2022052619_letkf.yaml) works to run LETKF for our older test case. No additional fix files or changes are needed in the run directory compared to running EnVar with setup_experiment.sh. Other than changing the yaml file, the only difference compared to EnVar is that you change the executable from mpasjedi_variational.x to mpasjedi_enkf.x.

Also, I have a directory set up where I have been running my LETKF validation experiments if anyone wants to take a look there. See /scratch1/NCEPDEV/da/Samuel.Degelia/enkf_validation/DRIVER_mpasjedi_singleob.sh for a script I have been running that sets up the LETKF run directory, runs a single observation test, and then does some quick validation plots.

guoqing-noaa commented 1 month ago

Thank you! @SamuelDegelia-NOAA

SamuelDegelia-NOAA commented 1 month ago

There is currently a problem with running LETKF in the split observer/solver mode. The hofx file created by the observer step has all missing values for the DerivedObsError group. This leads to zero increments when the solver is run. It might be a regional issue since GDAS does not report this problem. Manually filling in DerivedObsError with the values from EffectiveError0 using a python script fixes this problem and leads to non-zero increments. I need to find out why DerivedObsError is empty though so that I do not need to run this intermediate step.
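For anyone who wants to reproduce the intermediate fix, here is a rough sketch of the kind of python script I mean. The group/variable layout assumes an IODA-style netCDF4 hofx file, and the file name is just a placeholder:

```python
import netCDF4 as nc

HOFX_FILE = "obsout_letkf_observer.nc4"  # placeholder name for the observer-stage hofx file

with nc.Dataset(HOFX_FILE, "r+") as ds:
    src = ds.groups["EffectiveError0"]   # error values written by the observer step
    dst = ds.groups["DerivedObsError"]   # group that comes back all missing
    for name, var in src.variables.items():
        if name in dst.variables:
            # overwrite the missing DerivedObsError values with EffectiveError0
            dst.variables[name][:] = var[:]
```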

Some timing stats from running the split observer/solver mode with RAP mesonet obs:

SamuelDegelia-NOAA commented 1 month ago

@guoqing-noaa For the mesonet test, the split observer/solver run (still with all members at once) was 38 seconds faster than running everything together. This is for the test that took ~70 minutes due to vertical localization, so it's a pretty insignificant difference overall. However, the split observer/solver method would allow us to use different obs distributions for each component, so we have the option to optimize further and potentially widen the runtime gap compared to running all in one.

SamuelDegelia-NOAA commented 1 month ago

Sadly I found that using the split observer/solver for LETKF results in zero analysis increments. A comment in ufo/src/ufo/obslocalization/ObsVertLocalization.h states that vertical localization breaks the split mode and thus we need to run both as one step. However, I was able to slightly speed up LETKF by tuning the halo size (now at 1000 km) and disabling computation of h(x) for the analysis (do posterior observer: false).
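For reference, the two changes map onto the yaml roughly like this (a sketch only; the Halo distribution keys in particular should be verified against the ioda version in the build):

```yaml
# Sketch of the LETKF tuning described above (option names assumed, verify before use)
driver:
  do posterior observer: false      # skip recomputing h(x) for the analysis
observations:
  observers:
  - obs space:
      distribution:
        name: Halo
        halo size: 1000e3           # halo radius in meters (1000 km)
```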

Here are the runtime results:

Even after these changes, LETKF is still very slow due to the 3d geometry iterator needed to compute vertical localization.

Since the obs-space vertical localization is the primary source of the slowdown, I also evaluated the efficiency of the GETKF solver, which uses modulated ensemble members to emulate model-space localization (i.e., no need for the 3d iterator). This also means I was able to run in split observer/solver mode. I used the RoundRobin distribution for the observer, which distributes obs evenly across all PEs and is the most efficient option here. During the solver, I used the Halo obs distribution.
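The GETKF setup looks roughly like the sketch below. The localization values are placeholders and the option names follow my reading of the oops GETKF configuration, so treat this as illustrative only:

```yaml
# Sketch of a GETKF configuration with model-space vertical localization
# (values are placeholders, option names assumed from oops)
local ensemble DA:
  solver: GETKF
  vertical localization:
    fraction of retained variance: 0.95
    lengthscale: 2.0
    lengthscale units: levels
observations:
  observers:
  - obs space:
      distribution:
        name: RoundRobin            # observer pass: spread obs evenly across PEs
        # (the solver pass would use the Halo distribution instead, as above)
```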

The runtimes for GETKF are also given in the above table. Without any optimizations, GETKF is even slower than LETKF due to needing to compute hofx for each modulated ensemble member (12 modulated members for each of the 30 real members = 360 total). But when using do posterior observer: false and a smaller halo size during the solver, GETKF becomes very efficient.

The analysis increments between the two methods are generally similar but with some differences especially near the northern boundary of the domain. See below. I plan to do some additional diagnostics to understand which solver is better.

Due to the vertical localization issues in LETKF, I suggest using GETKF for now. The GETKF is much faster now and the model-space localization will also be beneficial when we begin assimilating radiance data since we would no longer need any obs height information like in LETKF.

[Figure: increment_comp — comparison of LETKF and GETKF analysis increments]
SamuelDegelia-NOAA commented 1 month ago

The timing stats in my last post for "GETKF, optimized" include the effects of both reducing the halo size and disabling the posterior hofx calculation. To provide a little more information, I also isolated the impact of the posterior hofx calculation by keeping the smaller halo size and only toggling that option:

GETKF, posterior hofx OFF: 6.2 min
GETKF, posterior hofx ON: 20.8 min

The significantly increased cost from running the posterior observer compared to the prior observer is likely due to needing to use the halo obs distribution during the solver. The posterior hofx is always computed as part of the solver (which needs halo obs distribution), whereas the prior hofx can be run separately (which can use the more efficient RoundRobin distribution).

To recover the diagnostic info lost by disabling the posterior observer, we could add an additional step that runs a standalone hofx calculation after GETKF finishes, which could use RoundRobin.
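A minimal sketch of what that extra step might look like, assuming a standalone hofx application is used (e.g. mpasjedi_hofx3d.x, if that is the executable name in this build) and that its yaml takes the usual observations block:

```yaml
# Sketch only: obs space block for a standalone posterior hofx pass after GETKF
observations:
  observers:
  - obs space:
      name: mesonet               # placeholder obs space name
      distribution:
        name: RoundRobin          # no solver involved, so the cheaper distribution works
```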