Correct grid information in netCDF output

leifdenby commented 6 years ago

Currently MONC's output files don't contain grid-position information, i.e. positions for the spatial coordinates (x, y, etc). This is required for CF compliance. Also, variables which aren't colocated (for example the velocity components) are currently defined on the same grid in the output files, e.g.

netcdf diagnostics_ts_25.0 {
dimensions:
        x = 66 ;
        y = 66 ;
        z = 76 ;
variables:
        ...
        double w(time_series_600_1800.0, x, y, z) ;
        double u(time_series_600_1800.0, x, y, z) ;
}

To remedy this, grid information (staggering and positions) for each scalar field should be communicated to the MONC IO server, and coordinates for each scalar field need to be written to the output file.
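
To make the second half of that concrete, here is a minimal sketch of writing separate coordinate variables for two x-grids with the netCDF Fortran 90 interface. It is only an illustration, not MONC's IO server code: the file name, grid size, spacing and the check helper are made up, and which of x and xn holds the staggered positions is an assumption.

```fortran
program staggered_coords_sketch
   use netcdf
   implicit none

   integer, parameter :: nx = 4             ! illustrative size, not MONC's
   real(kind=8), parameter :: dx = 100.0d0  ! illustrative grid spacing in metres
   integer :: ncid, x_dim, xn_dim, x_var, xn_var, i
   real(kind=8) :: x(nx), xn(nx)

   ! Assumption: xn holds cell centers and x holds positions half a cell further on
   xn = [((i - 0.5d0) * dx, i = 1, nx)]
   x  = [(i * dx, i = 1, nx)]

   call check(nf90_create('coords_sketch.nc', NF90_CLOBBER, ncid))
   call check(nf90_def_dim(ncid, 'x', nx, x_dim))
   call check(nf90_def_dim(ncid, 'xn', nx, xn_dim))

   ! A coordinate variable shares its name with the dimension it describes
   call check(nf90_def_var(ncid, 'x', NF90_DOUBLE, x_dim, x_var))
   call check(nf90_put_att(ncid, x_var, 'units', 'm'))
   call check(nf90_put_att(ncid, x_var, 'axis', 'X'))
   call check(nf90_def_var(ncid, 'xn', NF90_DOUBLE, xn_dim, xn_var))
   call check(nf90_put_att(ncid, xn_var, 'units', 'm'))
   call check(nf90_put_att(ncid, xn_var, 'axis', 'X'))

   ! Leave define mode, then write the positions
   call check(nf90_enddef(ncid))
   call check(nf90_put_var(ncid, x_var, x))
   call check(nf90_put_var(ncid, xn_var, xn))
   call check(nf90_close(ncid))

contains

   ! Hypothetical helper: stop with the library's message on any netCDF error
   subroutine check(status)
      integer, intent(in) :: status
      if (status /= nf90_noerr) then
         print *, trim(nf90_strerror(status))
         stop 1
      end if
   end subroutine check
end program staggered_coords_sketch
```

A field such as u would then be declared on whichever of the two dimensions matches its staggering, instead of every field sharing x.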

My suggestions:

1. I think the "size" variable in the MONC IO XML config files should be renamed to "grid", to make it explicit that it sets both the size and the position of the grid on which variables are defined.
2. I would add a grid="auto" option. In this case the IO server would expect grid information from MONC itself, and fail if it isn't received.
3. Grid information, when using the Cartesian grid, could be communicated through a 6-bit integer: a binary encoding of which dimensions are used and whether staggered or centered values are used for each. E.g. 110010 might mean "use the x and y dimensions" (xyz encoded as the first three bits, 110) and "use the centered grid in the x-direction and the staggered grid in y" (encoded as 010; the last bit, for z, would be ignored because z isn't used). This could be communicated through the same MPI datatype that I extended for the field meta-information, data_sizing_description_type, and would be used when grid="auto" is set in the XML config file. Variables with the xn,yn,zn,x,y,z grid positions would then automatically be written to every NetCDF file that MONC creates. This might not be the best approach, but I think that data_sizing_description_type.dim_sizes is inadequate as it stands because it doesn't communicate which of x, y and z are used, or whether each is staggered.
4. Grid information for variables on a non-Cartesian grid: I would suggest we survey what people need here. My gut feeling is that simply supporting 1D arrays with position information (in time or space…) might suffice, since I expect people mostly want to extract time series.

leifdenby commented 6 years ago

For my own future reference, here's how I would define the grid for individual variables:

module grid_definition
   implicit none
   private

   integer, parameter :: USE_X = 32        ! 100 000
   integer, parameter :: USE_Y = 16        ! 010 000
   integer, parameter :: USE_Z = 8         ! 001 000

   integer, parameter :: X_STAGGERED = 4   ! 000 100
   integer, parameter :: Y_STAGGERED = 2   ! 000 010
   integer, parameter :: Z_STAGGERED = 1   ! 000 001

   integer, parameter :: X_GRID_STAGGERED = USE_X + X_STAGGERED
   integer, parameter :: Y_GRID_STAGGERED = USE_Y + Y_STAGGERED
   integer, parameter :: Z_GRID_STAGGERED = USE_Z + Z_STAGGERED
   integer, parameter :: X_GRID_CENTERED = USE_X
   integer, parameter :: Y_GRID_CENTERED = USE_Y
   integer, parameter :: Z_GRID_CENTERED = USE_Z

   public X_GRID_STAGGERED, X_GRID_CENTERED
   public Y_GRID_STAGGERED, Y_GRID_CENTERED
   public Z_GRID_STAGGERED, Z_GRID_CENTERED
end module

program test
    use grid_definition
    implicit none

    integer :: var2d_grid

    ! define the grid for a 2D variable: centered in x, staggered in y
    ! (32 + 16 + 2 = 50, i.e. 110 010)
    var2d_grid = X_GRID_CENTERED + Y_GRID_STAGGERED
end program test
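
On the receiving side the integer has to be taken apart again. A minimal sketch of how that could look, assuming the same bit layout as above; the module name and the has_flag and describe_grid helpers are hypothetical, and the constants are repeated because the module above keeps the individual bits private:

```fortran
module grid_flags_sketch
   implicit none
   private

   ! Same bit layout as above: xyz "use" bits, then xyz "staggered" bits
   integer, parameter :: USE_X = 32, USE_Y = 16, USE_Z = 8
   integer, parameter :: X_STAGGERED = 4, Y_STAGGERED = 2, Z_STAGGERED = 1

   public :: USE_X, USE_Y, USE_Z, X_STAGGERED, Y_STAGGERED, Z_STAGGERED
   public :: has_flag, describe_grid

contains

   ! True if every bit set in flag is also set in grid
   pure logical function has_flag(grid, flag)
      integer, intent(in) :: grid, flag
      has_flag = iand(grid, flag) == flag
   end function has_flag

   ! Lists the dimensions in use and how each is positioned
   function describe_grid(grid) result(text)
      integer, intent(in) :: grid
      character(len=:), allocatable :: text
      integer, parameter :: use_bits(3) = [USE_X, USE_Y, USE_Z]
      integer, parameter :: stag_bits(3) = [X_STAGGERED, Y_STAGGERED, Z_STAGGERED]
      character(len=1), parameter :: names(3) = ['x', 'y', 'z']
      integer :: i

      text = ''
      do i = 1, 3
         ! Staggering bits of unused dimensions are ignored
         if (.not. has_flag(grid, use_bits(i))) cycle
         if (has_flag(grid, stag_bits(i))) then
            text = text // names(i) // '(staggered) '
         else
            text = text // names(i) // '(centered) '
         end if
      end do
      text = trim(text)
   end function describe_grid
end module grid_flags_sketch

program decode_demo
   use grid_flags_sketch
   implicit none

   integer :: var2d_grid

   var2d_grid = ior(ior(USE_X, USE_Y), Y_STAGGERED)   ! 110 010
   print *, describe_grid(var2d_grid)                 ! x(centered) y(staggered)
end program decode_demo
```

Combining flags with ior rather than + gives the same value here, but stays correct if the same flag is accidentally included twice.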

stevenleeds commented 6 years ago

I was just thinking, why not simply use booleans?

leifdenby commented 6 years ago

> I was just thinking, why not simply use booleans?

It's just easier to create the data structure to store and send one integer instead of six booleans I think :) But maybe you've thought of a better way of doing it. How would you do it?

stevenleeds commented 6 years ago

I was just thinking that, compared to the total amount of meta-information, even six integers (0/1) would be small (but maybe smaller data formats are possible, like booleans; see https://www.unidata.ucar.edu/software/netcdf/netcdf/netCDF-external-data-types.html). The main advantage of using different variables rather than a single integer is that it keeps the grid information explicit. I tend to prefer ease of use over a small efficiency gain.
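
For comparison, a minimal sketch of the explicit form, with a conversion to and from the packed integer so either could be used where it is more convenient. The type and procedure names are hypothetical, and the bit positions are assumed to match the layout in the module above:

```fortran
module grid_description_sketch
   implicit none
   private

   ! Hypothetical explicit form: one flag per dimension and per staggering choice
   type, public :: grid_description
      logical :: use_x = .false., use_y = .false., use_z = .false.
      logical :: x_staggered = .false., y_staggered = .false., z_staggered = .false.
   end type grid_description

   public :: pack_grid, unpack_grid

contains

   ! Collapse the six flags into the 6-bit integer (x use bit is the highest)
   pure integer function pack_grid(g)
      type(grid_description), intent(in) :: g
      pack_grid = merge(32, 0, g%use_x) + merge(16, 0, g%use_y) &
                + merge(8, 0, g%use_z) + merge(4, 0, g%x_staggered) &
                + merge(2, 0, g%y_staggered) + merge(1, 0, g%z_staggered)
   end function pack_grid

   ! Recover the six flags; btest counts bit positions from 0 at the low end
   pure function unpack_grid(code) result(g)
      integer, intent(in) :: code
      type(grid_description) :: g
      g%use_x = btest(code, 5)
      g%use_y = btest(code, 4)
      g%use_z = btest(code, 3)
      g%x_staggered = btest(code, 2)
      g%y_staggered = btest(code, 1)
      g%z_staggered = btest(code, 0)
   end function unpack_grid
end module grid_description_sketch

program pack_demo
   use grid_description_sketch
   implicit none

   type(grid_description) :: g

   ! 2D variable: centered in x, staggered in y
   g%use_x = .true.
   g%use_y = .true.
   g%y_staggered = .true.
   print *, pack_grid(g)   ! 50, i.e. 110 010
end program pack_demo
```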

leifdenby commented 6 years ago

> The main advantage of using different variables rather than a single integer is that it keeps the grid information explicit. I tend to prefer ease of use over a small efficiency gain.

Yes :) What I'm describing above is not what to store in the netCDF file. That would have to be CF-compliant (and so have different variables for each coordinate). It is instead how to communicate from the MONC worker to the MONC IO server what the grid is.

Leeds-MONC / monc

Correct grid information in netCDF output #8