Feature request: DART should allow users to indicate what netcdf dims are not part of state variables

nancycollins commented 3 years ago

Use case

our current netcdf read code makes hardcoded assumptions about 2 dimension names. users who inadvertently use those names may not be able to use our read routines without altering their files. it's possible this could be avoided with a few lines of code change and a new namelist item.

if a user has a netcdf variable that uses 'member' as a dimension and it isn't our "single file" defined format (where all ensemble members, inflation, etc are in a single netcdf file), that dimension will be skipped and the variables will be added to the dart state with 1 less dimension. e.g. real temperature(member, xdim, ydim) will read in as a 2d variable, xdim by ydim. the user might be expecting to interpolate in a 3d field between members (where member is significant to their model and unrelated to ensemble members).

if a user has a variable with "time" as a dimension, e.g. real temperature(time, lon, lat) and wanted to be able to read this in as a 3d variable and interpolate in time, there is no way to disable our special handling of the time dimension.

finally, if a user has a different dimension name that is not part of the field, e.g. real temperature(leapfrog, lon, lat, vert) where leapfrog is 2 for a leapfrog integration scheme, and they want to read in a 3d array of (lon, lat, vert), there is no way to indicate that one of the dimensions should not be used as a state dimension.

this may also require that the user specify which index to use when reading along the non-state dimension. it could default to the last entry.
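a minimal sketch of that index choice, in python rather than DART's fortran. `pick_index` and its `requested` argument are hypothetical names used only for illustration, and 1-based indexing is assumed because that is the fortran convention.

```python
def pick_index(dim_size, requested=None):
    """Return the 1-based index to read along a non-state dimension.

    Hypothetical helper: if the user gives no index, fall back to the
    last entry, as suggested in this issue.
    """
    if requested is None:
        return dim_size  # default: last entry along the dimension
    if not 1 <= requested <= dim_size:
        raise ValueError(
            f"index {requested} is outside 1..{dim_size} for this dimension"
        )
    return requested


# leapfrog dimension of size 2, no index given: read the last entry
print(pick_index(2))
# user explicitly asks for the first entry
print(pick_index(2, requested=1))
```

an out-of-range request raises an error here instead of being clamped, so a bad setting is reported rather than silently changed.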

Is your feature request related to a problem?

the read code creates fields in the model state by reading in netcdf variables and using the sizes and dimensions to compute the size of each variable. it silently skips 'time' and 'member' if they are used as dimension names, which in many cases is the right choice. but there is no way to augment or disable this behavior when it isn't the right choice, and in that case the read fails in mysterious ways.
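to make the problem concrete, here is a small python sketch that mimics the skipping behavior described above. it is not the code in state_structure_mod.f90; the function names and dimension sizes are made up for illustration, and exact (case-sensitive) name matching is an assumption.

```python
# Sketch only: imitates the behavior described in this issue,
# not the actual DART implementation.
HARDCODED_SKIP = ("time", "member")  # the two names mentioned in the issue


def state_dims(dims):
    """Return the (name, size) pairs that would be kept as state dimensions."""
    return [(name, size) for name, size in dims if name not in HARDCODED_SKIP]


def state_size(dims):
    """Number of state elements: product of the kept dimension sizes."""
    total = 1
    for _, size in state_dims(dims):
        total *= size
    return total


# temperature(member, xdim, ydim): 'member' is dropped without warning
member_case = [("member", 4), ("xdim", 10), ("ydim", 20)]
print(state_dims(member_case), state_size(member_case))

# temperature(leapfrog, lon, lat, vert): 'leapfrog' cannot be dropped
leapfrog_case = [("leapfrog", 2), ("lon", 3), ("lat", 4), ("vert", 5)]
print(state_dims(leapfrog_case), state_size(leapfrog_case))
```

the first case loses a dimension the user may have wanted, and the second keeps a dimension the user wanted to exclude. neither can be changed without editing the file or the code.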

Describe your preferred solution

in state_structure_mod.f90, lines 590-599 are where "non-state" dimensions are skipped on variable read.

they are currently hardcoded to be only 'time' and 'member'.

'member' should only be skipped if this is a single-file netcdf file that has all ensemble members written in a single variable (a consolidated/combined/single-file netcdf file where we have defined the dim names, variable names, and other format issues).

'time' is usually right to be skipped, but if someone wanted to read in a time series of 2d fields and interpolate in time as part of their model_mod code, they can't.

the "non-state" dimension names should default to 'time', also 'member' for single-file input, and then have a namelist where other dimension names can be added or the defaults overridden.
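one possible shape for that logic, sketched in python. the function name and the `extra` and `override` arguments are hypothetical stand-ins for whatever namelist items would actually be added; nothing here exists in DART today.

```python
def non_state_dims(single_file, extra=(), override=None):
    """Return the dimension names to skip when building the state.

    single_file: True for the consolidated single-file format
    extra:       additional names to skip (hypothetical namelist item)
    override:    if given, replaces the defaults entirely (hypothetical)
    """
    if override is not None:
        return tuple(override)

    names = ["time"]
    if single_file:
        # 'member' is only a non-state dimension in the single-file format
        names.append("member")
    for name in extra:
        if name not in names:
            names.append(name)
    return tuple(names)


# multi-file input: 'member' is treated as an ordinary state dimension
print(non_state_dims(single_file=False))
# user adds a model-specific dimension to skip
print(non_state_dims(single_file=False, extra=("leapfrog",)))
# user wants to interpolate in time, so nothing is skipped
print(non_state_dims(single_file=False, override=()))
```

keeping `override` separate from `extra` lets a user turn off the 'time' default without having to restate it when they only want to add a name.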

Describe any alternatives you have considered

in the past we've been able to change the dimension names on files because they were constructed for testing or we had control over what the netcdf file dimension names were, or we told users to use the NCO utilities to remove or rename the dimensions. this may be sufficient and this issue isn't pressing enough to make this change.

but if the i/o code is going to be refactored to isolate the "single-file" code paths and to pull all the netcdf i/o dependent code out of the state structure code so a different i/o file format could possibly be supported, then this is one issue to look at. so far this hasn't caused major problems that i'm aware of, but it could be a simple fix to extend the functionality and generalize the code.

full disclosure: there are other places in the write code, and possibly in the io_dims vs dims routines, that also know about these magic names so i may be overly optimistic about how easy this is to fix.

hkershaw-brown commented 3 years ago

also single-file does not work with multiple domains.

hkershaw-brown commented 3 years ago

Nancy, do you know which models you had to do this for?

> in the past we've been able to change the dimension names on files because they were constructed for testing or we had control over what the netcdf file dimension names were, or we told users to use the NCO utilities to remove or rename the dimensions. this may be sufficient and this issue isn't pressing enough to make this change.

nancycollins commented 3 years ago

i have a clear but useless memory of testing a bug fix for a model_mod by constructing an input file that gave me errors because of this issue, but i don't remember which model. i have been running some tests on L96 where i'm changing the member and time dims to different names, and it keeps working no matter what i change, which is not what i expected either. i will continue to try to construct a reasonable test case here. (i clearly need more coffee.)

nancycollins commented 3 years ago

this is a slight tangent, but when i was dumping our 'consolidated format' netcdf files, i noticed that we declare a 'member' dimension with a size (number of ensemble members) but we don't declare a 'member' variable. the variable would be where we assign values of the dimension items. apparently netcdf defaults to starting at 0 and going up to (number of ensemble members - 1). it would seem more natural to create a member variable and start it at 1 up to num_ens but i don't know if that would break the matlab diagnostics or obs_diag or anything else.

hkershaw-brown commented 2 years ago

note on IO refactoring (see #309 11.) filenames.
