bug: inflation files when using 'no copy back' variables

hkershaw-brown commented 3 years ago

:bug: Your bug may already be reported! Please search on the issue tracker before creating a new issue.

Describe the bug

This came up when reviewing the clm documentation. Here is the section of documentation:

The ‘NO_COPY_BACK’ designation has some side effects when it comes to state-space inflation (inf_flavor 2,4 or 5 - >‘VARYING_SS_INFLATION’,’RELAXATION_TO_PRIOR_SPREAD’, or ‘ENHANCED_SS_INFLATION’ - respectively). State-space >inflation requires an inflation value for everything in the DART state. If the variable has been designated as ‘NO_COPY_BACK’ >the DART write routine (when called from filter) simply skips the variable and nothing is written. This is a problem for >inflation files that need to adapt.

The solution is to run PROGRAM fill_inflation_restart to create an initial inflation file with inflation values of 1.0 (i.e. no >inflation). fill_inflation_restart has been specially designed to output inflation values for every variable in the DART state. The >idea is to copy the input inflation file to the output inflation file name before each assimilation cycle. No new values will be >written for the variables designated ‘NO_COPY_BACK’, the original values will persist.

Error Message

not sure, but I think incorrect sized inflation files

Which model(s) are you working with?

CLM, but this would affect any model that uses 'NO COPY BACK'

Version of DART

v9.11.8

Have you modified the DART code?

No

Build information

Independent of build

nancycollins commented 3 years ago

there are 2 issues here. one is scientific and one is implementation related.

scientific: when we discussed this before i don't remember any decision on what to do with fields that were needed to compute forward operators (so they were added to the state) but they weren't going to be written back so not updated by filter.

one option is to set the initial inflation to 1.0 and never change it. or set an initial inflation value (how much?) to match the rest of the state. inflation is needed when the spread of the ensemble is too small - so the answer might depend on what the field is. if it has enough spread to compute a forward operator value with enough ensemble spread then no inflation is fine. if you're using adaptive inflation, it is going to compute new inflation values but the field isn't going to be updated so it isn't clear you want to save those.

implementation: inflation files are written in parallel with the exact same code that writes output files, so fields not being written back to the output files aren't written to the inflation files either. this can be a good thing if you want to keep the inflation for non-updated fields at 1.0 and use adaptive inflation values for other fields. only the updated fields get updated inflation values.

the issue is that on read every field needs a value and if you're creating new inflation files (instead of updating existing files) they won't have all the fields . so the solution was to create an inflation file with values for all the input fields (using fill_inflation_restart, for example) and then copying it to where the output inflation is going to be and letting filter only update the adaptive parts.

i don't see any other logical way to handle this. it doesn't seem like a bad solution. if you tried to write extra fields to only the inflation files you'd be complicating the parallel i/o code, and it's not clear you want to write new values over old values anyway.

does anyone else remember issues with this that i'm forgetting?

nancycollins commented 3 years ago

p.s. this affects any models that allow missing values in the state. these are rare, fortunately.

hkershaw-brown commented 3 years ago

I guess I remember the IO code dealing with inflation files differently.

I think this affects any model that has NO COPY BACK, e.g. WRF. missing_r8 in the state is another issue on top.

In adding this as a GitHub issue, I was thinking this would just be a note to keep in mind for future development. For example, the a hybrid with a giant ensemble might have copies that you don't want to treat the same as regular ensemble copies. Or you do actually want to look at what the inflation is doing for science reasons. Note I don't know if this is important, but at the moment we are letting the model dictate what the inflation files look like.

hkershaw-brown commented 3 years ago

p.s. this affects any models that allow missing values in the state. these are rare, fortunately.

maybe this is for issue #277

nancycollins commented 3 years ago

i did get them mixed up. for this one:

this does affect any models that have NO_COPY_BACK as an option.

nancycollins commented 3 years ago

I guess I remember the IO code dealing with inflation files differently.

there are slight differences higher in the calling stack, e.g. inflation copy values aren't clamped and the filenames are fixed. but by the time you get down to the code that collects the ensemble member fields onto separate tasks and calls the netcdf write code, every output file looks the same - ensemble members, ensemble mean, ensemble sd, inflation mean and inflation sd.

mgharamti commented 3 years ago

Just a comment from me: If the user doesn't want DART to impact these variables during the update using a "no-copy-back" flag, then they shouldn't be inflated either. Inflation is in essence a mechanism to aid the update and so if the update won't happen then inflation shouldn't happen either. Inflation will change the value of the ensemble variable and the user doesn't want that. So, to me if a "no-copy-back" is requested for variable x then we DON'T inflate variable x. The only question remains is how to implement that efficiently in the code or the script without a lot of intrusive action.

nancycollins commented 3 years ago

something we've talked about for a long time is having a different mechanism for fields which are only needed to compute a forward operator but aren't going to be changed by the assimilation.

in addition to interpolating state values, there are 4 categories of non-state data that can be needed by forward operators:

single values (e.g. parameters) which are constant for all ensemble members. typically those are read into module global variables in the model_mod or forward operator code. if forward operator code, they should be read once (first time code is executed) to avoid doing i/o for each observation.
fields with values which are constant for all ensemble members (e.g. terrain height or land/water flags). these are also typically read into module global storage in the model_mod or forward operator module. if these get large (e.g. a 3d field instead of 2d) there could be memory problems because each task has a copy of them. spreading them out across tasks like we do for the ensemble members would solve this problem but you'd have to have a get_data() function and mpi windows in order to access parts of the field on other tasks during the forward operator phase of execution. we have also talked about spreading this kind of data across all tasks on the same node. that would minimize communication overhead between nodes and still use less memory per task. since this data is read-only and not updated it could be replicated on each node without problems. mpi groups might work well here.
fields with values which vary by ensemble member. these we put into the same ensemble structure as the fields that will be updated by the assimilation because we have no other per-ensemble-member structure to handle large arrays. we have talked in the past about maybe creating a separate ensemble handle for these, so only the fields which are going to be updated by the assimilation could be passed to the filter_assim() routine. or order the state vector values so items 1-N are updated and items N+1 to model_size are not. or something else less clunky.
per-obs metadata. these values are usually read into module global arrays in the forward operator module as each obs is read at the start of execution. these arrays can get large for large obs_seq files and they're replicated on each task. being able to read in only part of the obs_seq on each task would help here. some forward operators reallocate the arrays if they fill up; others make you give a max size in the namelist and fail if there are more obs. (also use more memory if you have fewer.)

hkershaw-brown commented 3 years ago

External forward operators

nancycollins commented 3 years ago

there is the additional issue that just setting them NO_COPY_BACK doesn't automatically prevent those fields from getting updated by the assimilation. the user would have to take the additional step of having the model_mod get_close() routine return very large distances from an obs to those fields, or use the obs_impact tool to avoid updating those fields. if the values are updated, the diagnostic posterior obs values will be computed using updated field values which are not going to be written and saved.

braczka commented 3 years ago

Just a comment from me: If the user doesn't want DART to impact these variables during the update using a "no-copy-back" flag, then they shouldn't be inflated either. Inflation is in essence a mechanism to aid the update and so if the update won't happen then inflation shouldn't happen either. Inflation will change the value of the ensemble variable and the user doesn't want that. So, to me if a "no-copy-back" is requested for variable x then we DON'T inflate variable x. The only question remains is how to implement that efficiently in the code or the script without a lot of intrusive action.

To echo what was said before, one way to approach this would be to set the initial inflation to 1.0 and the "no-copy-back" variables should never be inflated. In fact, it seems like a bad idea in general to set the initial inflation to anything but 1.0. Applying an initial inflation >1.0, gives a spatially uniform inflation value to the entire spatial domain, even if an observation is not nearby -- which seems unnecessary. I assume there are situations where setting an initial inflation to anything other than 1.0 is useful, otherwise this would not be an option in fill_inflation_restart? I guess I don't have enough experience to be aware of them...

Also, to make sure I am clear about this -- is DART purposefully designed such that inflation is applied to a state variable that is just being used by the forward operator, or is that just a side effect of the code implementation? I understand that Inflation is designed/intended to be applied to the (prognostic) variables which are also updated during the filter step, but am still unclear about the forward operator variables.

hkershaw-brown commented 3 years ago

so the other way of looking at this is that the inflation files (edit: when created from scratch by DART) are the correct size, but the state vector is too big.

state vector (ensemble size x variables to be inflated and updated) non-state (ensemble size x variables neither inflated nor updated)

At the moment the non-state is lumped in with the state in DART.

Are we doing more get_close calculations than we need to?
Are model_mods with NO_COPY_BACK variables using get_close to force no update?
Does this matter for science if the number of close state elements is higher than it should be, e.g. adaptive localization?
Are we looping around state we don't need to in STATE_UPDATE: do j = 1, num_close_states (there is a broadcast for each j)

Does inflating the non-state give you incorrect forward operators? maybe? Pseudo code:

  read state
  call filter_ensemble_inflate
  calculate prior fwd operator
  call assimilation (obs inflation in here)
  calculate posterior fwd
  write state

Is non-state just extra work doing inflation calculations you don't need? I think this depends on whether the inflation algorithms look across the ensemble only, or across the ensemble and across the state.
Am I missing something in the code that is dealing with state / non-state? e.g forcing the non-state to be not updated in get_close, get_state_meta_data, or setting inflation values for the non-state variable (Brett's comment above)

nancycollins commented 3 years ago

see my answers/opinions to your (very good) questions below:

At the moment the non-state is lumped in with the state in DART.

Are we doing more get_close calculations than we need to?

yes

Are model_mods with NO_COPY_BACK variables using get_close to force no update?

probably not. CLM has no model-specific get_close routine so it's using the one in the location mod which has no knowledge of NO_COPY_BACK fields. i know one model that avoids updates in get_close and it isn't even marked as NO_COPY_BACK (but maybe should be?) is POP which has a field in the state which indicates land/ocean grid points. we explicitly look for that in get_close and set the distance to be huge. using the obs_impact tool (which is newer than the POP code) would at least skip calling update code for dry land in the state loop. it wouldn't keep the land field values out of the get_close list, but the loop would cycle sooner with obs_impact, i think. i'm not sure if you'd see a detectable performance improvement unless you could skip the broadcast. (i haven't traced the code path to see if that's possible.)

Does this matter for science if the number of close state elements is higher than it should be, e.g. adaptive localization?

it might - but the current adaptive localization mechanism is imprecise in many ways. for openers, it only computes in 2d rather than 3d. also, it assumes observations are distributed evenly in space. see comments and code in function revised_distance(orig_dist, newcount, oldcount, base, cutfloor) in assim_tools_mod.f90 for more info on how this works.

Are we looping around state we don't need to in STATE_UPDATE: do j = 1, num_close_states (there is a broadcast for each j)

yes

Does inflating the non-state give you incorrect forward operators? maybe? Pseudo code:
  read state
  call filter_ensemble_inflate
  calculate prior fwd operator
  call assimilation (obs inflation in here)
  calculate posterior fwd
  write state

if people set up the inflation files so non-state data always has 1.0 as the inflation factor then no. even if the inflation code computes an updated value it isn't going to be used until the following assimilation step. if it isn't written out, it's discarded. otherwise, the answer to your question is yes.

where you do get "wrong" forward operator values is since the default is that the assimilation updates the NO_COPY_BACK values just like everything else, the prior forward operator values are fine (assuming no inflation or inflation of 1.0) but the posterior forward operator values (which are diagnostic only) are based on updated values and so aren't consistent with the original values.

Is non-state just extra work doing inflation calculations you don't need? I think this depends on whether the inflation algorithms look across the ensemble only, or across the ensemble and across the state.

if adaptive inflation is being used, then yes, extra work is done to compute values which could be skipped since they aren't written out and used in subsequent cycles.

Am I missing something in the code that is dealing with state / non-state? e.g forcing the non-state to be not updated in get_close, get_state_meta_data, or setting inflation values for the non-state variable (Brett's comment above)

this is the crux question. no, you're not missing anything. the way this was implemented was to create the smallest footprint in the code. nothing in the main dart code knows about non-state fields except the i/o code that doesn't write them back. the rest of the system is unaware of anything related to non-state values. as you point out, if you did this "right" there would be a lot of locations in the code that might have to know about non-state values.

hkershaw-brown commented 3 years ago

what is force_copy_bock for? Looks like it was intended to overwrite no_copy_back for particular ensemble copies.

if ( do_prior_inflate ) & call set_io_copy_flag(file_info, STAGE_COPIES(PRIORINF_MEAN), WRITE_COPY, & inherit_units=.false., force_copy_back=force_copy)

nancycollins commented 3 years ago

it think it was added when we made the fill_inflation_restart utility. we needed the i/o routines to write everything that was in the state regardless of what the copy back flag was set to for filter.

hkershaw-brown commented 3 years ago

I'm not super sure I buy that, but I don't think I understand the stages to write and copies yet. Note for later: filter setting by copy: https://github.com/NCAR/DART/blob/3bfe1baff9ce35fd29e0c830dd4f96b979d43751/assimilation_code/modules/assimilation/filter_mod.f90#L2494

filter setting by stage: https://github.com/NCAR/DART/blob/3bfe1baff9ce35fd29e0c830dd4f96b979d43751/assimilation_code/modules/assimilation/filter_mod.f90#L2685

nancycollins commented 3 years ago

i wonder if some of the complexity is related to:

if you put 'input' as an output stage, it computes and outputs the ensemble mean and sd, since the input ensemble members are presumably already in the input file. for multi-file output, it creates two new files. for combination/single file output, it adds the mean and sd variables to the netcdf file.
the namelist lets you control how many ensemble members are written to the diagnostic files, where diagnostic is pretty much any stage except output. (preassim, postassim, etc). for the output stage, all members are always written because they're needed to advance the whole ensemble as you cycle.
we do still allow 'write all stages at end' although i don't know how many users have tried it. it is a memory/io-time tradeoff. for 1/10th degree pop and other huge models, it was faster to write all members, all stages at once in parallel at the end rather than write out preassim, postassim, etc during the run. note: there is probably still a lurking bug in that code somewhere. tim was using this feature with the PGI compiler on another system and it crashed when trying to write all members/all stages at the end. we never could track it down and it went away when he turned this off.

hkershaw-brown commented 2 years ago

notes on copies and stages to write as I am reading the RTPS stuff for issue #353. The RTPS copy is outside the if (write_all_stages_at_end) then statement in filter_mod https://github.com/NCAR/DART/blob/8873fcd7b254baf9867c65fc1745279a7cff66b3/assimilation_code/modules/assimilation/filter_mod.f90#L2386

There is a force_copy https://github.com/NCAR/DART/blob/8873fcd7b254baf9867c65fc1745279a7cff66b3/assimilation_code/modules/assimilation/filter_mod.f90#L2504-L2506 what is this for if not forcing the write of a variable? The write_variables routine appears to be checking for this force:

https://github.com/NCAR/DART/blob/2d3f1534fd4956d839a637d1eeb7b5570c603e50/assimilation_code/modules/io/direct_netcdf_mod.f90#L1551

related #106 🐳

hkershaw-brown commented 1 year ago

same problem for perturb_single_instance

creates files that do not have 'NO_COPY_BACK' variables

hkershaw-brown commented 10 months ago

related #560

NCAR / DART