NOAA-EMC / RDASApp

Regional DAS
GNU Lesser General Public License v2.1

Problems with halo observation distribution for regional LETKF #51

Open SamuelDegelia-NOAA opened 2 months ago

SamuelDegelia-NOAA commented 2 months ago

I am experiencing an issue with the regional LETKF crashing when using the Halo observation distribution. This first post is just background on the issue; I will discuss potential causes/solutions in a follow-up post. Here is the error message:

 0: MSONET airTemperature nlocs = 312314, nobs = 298824, min = 0, max = 19, avg = 7
 0: MSONET dewPointTemperature nlocs = 312314, nobs = 298824, min = 29, max = 29, avg = 29
 0: MSONET specificHumidity nlocs = 312314, nobs = 298824, min = 29, max = 29, avg = 29
 0: MSONET stationPressure nlocs = 312314, nobs = 298824, min = 29, max = 29, avg = 29
 0: MSONET virtualTemperature nlocs = 312314, nobs = 298824, min = 29, max = 29, avg = 29
 0: MSONET windEastward nlocs = 312314, nobs = 298824, min = 29, max = 29, avg = 29
 0: MSONET windNorthward nlocs = 312314, nobs = 298824, min = 29, max = 29, avg = 29
 1: Assertion failed: numRecognizedFlags == gnlocs in print, line 195 of /work/noaa/wrfruc/sdegelia/RDASApp/bundle/ufo/src/ufo/filters/QCmanager.cc
 3: Assertion failed: numRecognizedFlags == gnlocs in print, line 195 of /work/noaa/wrfruc/sdegelia/RDASApp/bundle/ufo/src/ufo/filters/QCmanager.cc
22: Assertion failed: numRecognizedFlags == gnlocs in print, line 195 of /work/noaa/wrfruc/sdegelia/RDASApp/bundle/ufo/src/ufo/filters/QCmanager.cc

This error occurs when using the full set of RAP mesonet observations created from bufr2ioda. We did not experience this crash using the smaller set of observations that are included as part of the LETKF ctest in RDASApp. Also, this was tested using FV3-JEDI, but I would suspect that the error also occurs with MPAS-JEDI.

Looking through the code for RDASApp/bundle/ufo/src/ufo/filters/QCmanager.cc, it seems that gnlocs is the total number of observation locations read in from the observation file, and numRecognizedFlags is the summed number of observations after being distributed to the different PEs. So there is a mismatch in the observation counts after distribution. Commenting out the assertion at L195 of QCmanager.cc removes the error and allows LETKF to complete successfully. But I worry that this "solution" could cause issues elsewhere and does not address the underlying problem. Additionally, per a suggestion from @TingLei-NOAA, I did find that using the InefficientDistribution observation distribution instead of Halo succeeds but is extremely slow. InefficientDistribution replicates all observations on each PE and is likely not feasible for eventual real-time applications.
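For reference, the failing check boils down to a global count consistency test. Here is a simplified Python sketch of that logic (an illustration only, not the actual C++ in QCmanager.cc):

```python
def check_counts(gnlocs, per_pe_recognized):
    """gnlocs: total locations read from the obs file.
    per_pe_recognized: per-PE counts of locations with a recognized
    QC flag after distribution."""
    num_recognized = sum(per_pe_recognized)
    assert num_recognized == gnlocs, (
        f"Assertion failed: numRecognizedFlags ({num_recognized}) "
        f"!= gnlocs ({gnlocs})")

# Every location assigned to some PE: the check passes.
check_counts(100, [40, 35, 25])

# Locations outside every halo are assigned to no PE, so the sum
# falls short and the check fails, as in the log above.
try:
    check_counts(100, [40, 35, 20])
except AssertionError as e:
    print(e)   # Assertion failed: numRecognizedFlags (95) != gnlocs (100)
```

When some locations are never assigned to any PE, the summed count falls short of gnlocs and the assertion fires on whichever ranks detect the mismatch.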

Other discussions of this error can be found in UFO issues #2032 and #2212. Those issues mention solving the problem by making sure the halo size is set, but my tests already set a halo size and I still get the crash. The issues also discuss changing the halo size, but I tried various sizes without success.
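For context, the Halo distribution is requested through the obs space YAML, roughly like this (key names and the halo size value follow typical ioda configs and may differ by version, so treat this as an assumed sketch):

```yaml
obs space:
  name: MSONET
  distribution:
    name: Halo
    halo size: 250e3   # meters; illustrative value
```

In some older ioda versions, `distribution` was a plain string with `halo size` given alongside it rather than nested under it.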

SamuelDegelia-NOAA commented 2 months ago

Since the mismatch was on the order of ~1000 observations, I suspected that the error might be due to observations outside of the domain. Running LETKF with observation subsets seems to support this hypothesis:

  1. Original LETKF run with all MSONET observations = CRASH (attachment: increment_airTemperature_crash_fullobs)

  2. Rerun with a small subset of MSONET observations, many outside the domain = CRASH (attachment: increment_airTemperature)

  3. Rerun with a large number of MSONET observations, all inside the domain = SUCCESS (attachment: increment_airTemperature_small_subdomain_WORKS)

It seems that this is not a problem with the number of observations, but with the domain check and the distribution of observations in a regional LETKF (possibly the order in which these steps occur). We could solve this by adding a domain check during bufr2ioda, but that would obviously not be ideal.
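The bufr2ioda domain check mentioned above could, in principle, be as simple as a bounding-box filter applied before the IODA file is written. A minimal Python sketch (the function names, box, and obs tuples are all hypothetical, and a real regional grid would need a proper geometry test rather than a lat/lon box):

```python
def inside_domain(lat, lon, bounds):
    """Return True if (lat, lon) falls inside a lat/lon bounding box."""
    lat_min, lat_max, lon_min, lon_max = bounds
    return lat_min <= lat <= lat_max and lon_min <= lon <= lon_max

def filter_obs(obs, bounds):
    """obs: list of (lat, lon, value) tuples; keep only in-domain obs."""
    return [o for o in obs if inside_domain(o[0], o[1], bounds)]

conus_box = (20.0, 55.0, -135.0, -60.0)   # illustrative CONUS-ish box
obs = [(39.5, -98.3, 287.1),              # Kansas mesonet: kept
       (48.9, 2.3, 291.0)]                # European mesonet: dropped
print(filter_obs(obs, conus_box))         # [(39.5, -98.3, 287.1)]
```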

I plan to open an issue shortly in the fv3-jedi repository to get additional thoughts.

SamuelDegelia-NOAA commented 2 months ago

Per some discussion with @TingLei-NOAA, we determined that test #3 above is not realistic because it removes too many observations, including many that are inside the domain. Below are results from a more realistic test that uses a GSI diag file to remove only the bufr2ioda observations outside the domain (a sort of pre-processing domain check). This run completed successfully using the Halo observation distribution, providing more evidence that the domain check could be a problem for the regional LETKF.

  1. Rerun with pre-processing domain check via GSI = SUCCESS (attachment: increment_airTemperature_gsi_domaincheck_success)

TingLei-NOAA commented 2 months ago

@SamuelDegelia-NOAA Good job! This strengthens my confidence in your finding that obs outside the domain cause the regional LETKF to fail with the Halo distribution, something other JEDI users/developers (including myself) have also encountered. On the one hand, a domain check (rejecting obs outside the domain) lets LETKF succeed and could be a solution for regional LETKF applications. On the other hand, whether and how the IODA Halo distribution should handle obs outside the domain may need more discussion/investigation. It would be very helpful if your findings could initiate further discussion with the UFO/IODA developers.

SamuelDegelia-NOAA commented 2 months ago

Thank you @TingLei-NOAA for the feedback. Do you think this issue would be better to bring up in the UFO repository or in fv3-jedi?

TingLei-NOAA commented 2 months ago

Yes. Initiating a discussion in, say, the UFO repository with your findings will help reach a final solution for this general issue.

SamuelDegelia-NOAA commented 2 months ago

Just providing an update on this issue. After some additional investigation, I found that the problem likely occurs when obs fall outside every halo region. This can happen when obs are very far from the model domain, such as the mesonet obs in Europe. I discussed the problem with Dan, and he found that obs outside any halo region do not get assigned to any PE, which corrupts the counts in QCmanager. There is no easy fix at the moment without rewriting parts of the IODA obs distribution code. It was also recommended not to disable the ASSERT check since it is fairly low level.
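A toy model of this failure mode, assuming a simplified picture of the Halo distribution (each PE keeps obs within some distance of its subdomain center; the real IODA code is more involved): an obs far from every subdomain is kept by no PE, so the per-PE counts no longer sum to gnlocs.

```python
import math

def kept_by_pe(obs_ll, center_ll, halo_deg):
    """Crude planar distance test in degrees, for illustration only."""
    return math.hypot(obs_ll[0] - center_ll[0],
                      obs_ll[1] - center_ll[1]) <= halo_deg

pe_centers = [(-100.0, 40.0), (-90.0, 40.0)]        # two PE subdomains
obs = [(-98.0, 39.0), (-91.0, 41.0), (2.3, 48.9)]   # last obs is in Europe

per_pe = [sum(kept_by_pe(o, c, 5.0) for o in obs) for c in pe_centers]
gnlocs = len(obs)
print(per_pe, sum(per_pe), gnlocs)   # [1, 1] 2 3 -- one obs owned by no PE
```

The European obs falls outside both halos, so the summed per-PE count (2) is short of gnlocs (3), which is exactly the mismatch the QCmanager assertion detects.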

There are two possible workarounds: (1) performing an offline domain check, or (2) running LETKF in split mode, where the observer and solver are run separately. The observer can use the efficient RoundRobin obs distribution while the solver continues to use the Halo distribution. Workaround (2) would be preferred.
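Workaround (2) would look roughly like the following pair of config fragments (the driver keys are assumed from typical oops LocalEnsembleDA setups and should be checked against the oops version in use):

```yaml
# Observer pass: cheap RoundRobin distribution; H(x) is written to disk.
driver:
  run as observer only: true
obs space:
  distribution:
    name: RoundRobin

# Solver pass: reads the saved H(x); Halo is used for the local analyses.
driver:
  read HX from disk: true
obs space:
  distribution:
    name: Halo
    halo size: 250e3
```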

However, when testing the LETKF in split mode, I am running into a problem where the analysis increments are zero. The observer component runs correctly and outputs H(x) values that look normal. The solver component then reads these in and prints the correct obs count, but it completes in ~5 seconds and produces no increments. I am now discussing the problem with Travis Elless to see if we can track down the issue.