New format proposal - Githubissues

erget / subsampled-coordinates

Repository for storing CDL demonstrating subsampled coordinates in CF-netCDF

Apache License 2.0

New format proposal #10

Open oceandatalab opened 4 years ago

oceandatalab commented 4 years ago

I open this issue to discuss the new format proposed by Anders in https://github.com/cf-convention/discuss/issues/37#issuecomment-679290210

Here is a copy of the CDL for the VIIRS example:

dimensions:
  // VIIRS M-Band (750 m resolution imaging) 
  m_track = 768 ;
  m_scan = 3200 ;
  m_channel = 16 ;

  // VIIRS I-Band (375 m resolution imaging)
  i_track = 1536 ;
  i_scan = 6400 ; 
  i_channel = 5 ;

  // Tie points and interpolation zones (shared between VIIRS M-Band and I-Band)
  tp_track = 96 ;
  tp_scan = 205 ;
  track_interpolation_zone = 48 ;
  scan_interpolation_zone = 200 ;

variables:
  // VIIRS M-Band 
  float m_radiance(m_track, m_scan, m_channel) ;

  // VIIRS I-Band 
  float i_radiance(i_track, i_scan, i_channel) ;

  // Tie point based interpolation container for location, supporting both VIIRS M-Band and I-Band
  char tp_interpolation ;
    tp_interpolation : dimensions = "tp_track (tie_point) - m_track (location) - i_track (location)  tp_scan (tie_point) - m_scan (location) - i_scan (location)" ; // association of grid dimensions and definition of grid functions (location or cell boundary)
    tp_interpolation : offset = "m_track - tp_track = 0.5   m_scan - tp_scan = 0.5   i_track - tp_track = 0.5   i_scan - tp_scan = 0.5" ; // definition of grid offset in units of cells

    tp_interpolation : interpolation_indices = "m_track: m_track_indices  m_scan:m_scan_indices  i_track: i_track_indices  i_scan:i_scan_indices" ; // associate dimensions with indices
    tp_interpolation : interpolation_name = "bi_quadratic" ;
    tp_interpolation : interpolation_coefficients = "expansion_coefficient_track alignment_coefficient_track expansion_coefficient_scan alignment_coefficient_scan" ;
    tp_interpolation : interpolation_flags = "interpolation_zone_flags" ;
    tp_interpolation : location_tie_points = "lat lon" ;

  // Interpolation indices
  int m_track_indices(tp_track) ;
  int m_scan_indices(tp_scan) ;
  int i_track_indices(tp_track) ;
  int i_scan_indices(tp_scan) ;

  // Tie points
  float lat(tp_track, tp_scan) ;
    lat : standard_name = "latitude" ;
    lat : units = "degrees_north" ;
  float lon(tp_track, tp_scan) ;
    lon : standard_name = "longitude" ;
    lon : units = "degrees_east" ;

  // Interpolation coefficients and flags
  short expansion_coefficient_track(track_interpolation_zone, tp_scan) ;
  short alignment_coefficient_track(track_interpolation_zone, tp_scan) ;
  short expansion_coefficient_scan(tp_track, scan_interpolation_zone) ;
  short alignment_coefficient_scan(tp_track, scan_interpolation_zone) ;
  byte interpolation_zone_flags(track_interpolation_zone, scan_interpolation_zone) ;
    interpolation_zone_flags:valid_range = 1b, 7b ;
    interpolation_zone_flags:flag_masks = 1b, 2b, 4b ;
    interpolation_zone_flags:flag_meanings = "location_use_cartesian  sensor_direction_use_cartesian  solar_direction_use_cartesian" ;

And the relevant comments from the meeting minutes https://github.com/cf-convention/discuss/issues/37#issuecomment-680038841:

CDL approach

@AndersMS proposes using implicit references, as outlined in the NUG, to associate field constructs with coordinates. This utilises shared dimensions to imply connections. In our case, the field construct shares the full resolution dimension with the interpolation container variable, whereas the interpolation container variable shares the compacted dimension with the tie points. @erget posits that this has no impact on the CF Data Model because the full set of coordinates can be reconstructed for every data point, so that the lack of explicitly encoded coordinates in the netCDF file can be described logically as an encoding issue. @oceandatalab would prefer explicit to implicit references. Also noted: explicit references would require potentially cumbersome updates and store information redundantly. There is a trade-off here between explicitness and conciseness. In all cases, the proposed approach allows coordinates to be reconstructed without having to unpack a complete field construct, thus fulfilling one of the use cases that we had not known exactly how to address before.

AndersMS commented 4 years ago

Thank you for opening this issue Sylvain.

Three things I wanted to add:

1. The SGRID convention also covers 3D meshes and boundaries and includes 3D examples. I believe the scheme proposed would work for that as well. We could do a simple CDL example to verify.
2. In this presentation and video from Ryan Abernathey, prepared for the CF Convention meeting earlier this year, he brings up two points that we could consider:
   - How do we describe periodic (i.e. “wraparound”) dimensions? Does this matter?
   - What about cubed-sphere type multi-faceted grids?
3. We should do a CDL example that combines the mesh notation and the subsampling notation.

oceandatalab commented 4 years ago

Being able to offer a unified solution for both the subsampled coordinates and the adjacent cells boundary issues would indeed be great, but I think our priority should be to decide how interpolation information should be defined in CDL, so if possible I'd like to keep the focus of this issue on formatting choices for subsampled coordinates, at least until we find an agreeable compromise.

After taking some time to read the CDL example and digging a little in the documentation, here are my comments :

1. The NUG (NetCDF Users Guide) states that coordinate variables are one-dimensional variables that share the same name as their dimension. According to the NUG and the CF conventions, these variables are implicitly recognized as coordinates for data variables that share the single dimension of the coordinate variable. But there is no such variable in the CDL example: lon and lat (once they have been restored to their full resolution) are auxiliary coordinate variables (i.e. they contain coordinate data but they have multiple dimensions and may have an arbitrary name) that must be explicitly identified as coordinates by each data variable using the coordinates attribute (see 3rd paragraph of http://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/cf-conventions.html#coordinate-system).
2. If the idea was to extend the implicit mapping of coordinate variables to auxiliary coordinate variables by simply using shared dimensions, then I'm afraid it might lead both people and software to see implicitly defined relations that don't actually exist. For example, let's say the CDL example also contains a data variable holding the along-track mean value of a hypothetical measurement for each scan and each channel of the M-band. It would be represented in the CDL as:
   ```
   float mean_alongtrack_dummy(m_scan, m_channel) ;
   ```
   It shares dimensions with m_radiance, so following an implicit rule of coordinate mapping based on shared dimensions this variable could be wrongly interpreted as an auxiliary coordinate variable for m_radiance. In order to avoid this kind of confusion you really have to add the coordinates attribute on your data variables.
3. There might be a CDL syntax that I am not aware of, but I have never seen dimensions attached to a variable using a dimensions attribute, so it might be a stretch to say that tp_interpolation shares full resolution dimensions with the field construct and compressed dimensions with tie points [see edit1 footnote]. It would probably be more accurate to say that tp_interpolation defines the mapping between full resolution and compressed dimensions. Aside from that, I kinda like the concise definition for multi-mapping tie point dimensions with full resolution dimensions using a single attribute; I would just change the syntax to make it clear that this is a mapping (maybe tp_dim1:dim1A,dim1B tp_dim2:dim2A,dim2B).
4. Conventions should make interpretation of data more straightforward, but I think that this CDL adds one relatively complex preliminary step even before implicit relations between variables can be established: the header must be read and fully parsed to identify container variables that describe an interpolation (there is only tp_interpolation in the example but there could be more), and a virtual representation of the variables targeted by the interpolation container variables must be created in memory to associate them with the full-resolution version of their dimensions. Only then can the implicit mapping based on shared dimensions, which I still think is a bad idea, occur.
5. The tp_interpolation container variable loses its genericity / reusability due to the existence of the location_tie_points attribute. In order to reuse tp_interpolation for other variables and still use this kind of notation, you will have to define attributes for every possible coordinate variable that people may want to subsample in the future: location_tie_points, time_tie_points, sun_angle_tie_points, viewing_angle_tie_points, depth_tie_points, beam_angle_tie_points, etc. Referencing the interpolation container variable in subsampled variables not only removes the preliminary step that I mentioned in 4., it also makes the container reusable for any number of subsampled variables without having to introduce new keywords/attributes in the conventions each time users need a coordinate that we did not think of.
6. As you probably have guessed, I really like the way you describe tp_interpolation : offset: a single glance is enough to understand, without a doubt, how tie point and full resolution dimensions relate to each other.
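To make the concern in point 2 concrete, here is a minimal Python sketch. The rule it implements ("any variable whose dimensions are a subset of the data variable's dimensions is a coordinate candidate") is an assumption made purely for illustration and is not defined by the NUG or CF. The names `lat_full` and `lon_full` are hypothetical stand-ins for lat and lon after restoration to full resolution.

```python
# Dimensions of a few variables, as tuples of dimension names.
# lat_full / lon_full are hypothetical names for the restored coordinates;
# mean_alongtrack_dummy is the hypothetical data variable from point 2.
VARIABLE_DIMS = {
    "m_radiance": ("m_track", "m_scan", "m_channel"),
    "lat_full": ("m_track", "m_scan"),
    "lon_full": ("m_track", "m_scan"),
    "mean_alongtrack_dummy": ("m_scan", "m_channel"),
}


def implicit_candidates(data_var, variable_dims=VARIABLE_DIMS):
    """Naive rule (illustrative assumption): every other variable whose
    dimensions are all dimensions of data_var is a coordinate candidate."""
    target = set(variable_dims[data_var])
    return sorted(
        name
        for name, dims in variable_dims.items()
        if name != data_var and set(dims) <= target
    )


def explicit_coordinates(attributes):
    """Names listed in a CF-style ``coordinates`` attribute."""
    return attributes.get("coordinates", "").split()


# The naive rule cannot tell the data variable apart from real coordinates.
print(implicit_candidates("m_radiance"))
# The explicit attribute lists only what the producer intended.
print(explicit_coordinates({"coordinates": "lat_full lon_full"}))
```

With the naive rule, `mean_alongtrack_dummy` is returned alongside the two real coordinates, whereas the explicit attribute lists only the intended ones.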

I understand that you want to make a concise CDL that is easier to read for users and I agree that this is a goal that we must strive for. But in my mind, these files are mostly read by software, not people, so the CDL should aim to be machine-readable first, and then as human-readable as possible. So from my point of view, explicitly defining relations between variables is perfectly acceptable even if it introduces a few redundancies and produces a slightly more verbose CDL.

edit1: probably not relevant as this statement is only in the minutes of the meeting, not in your initial explanation

AndersMS commented 4 years ago

Thank you very much for the thorough review and commenting – this is highly valuable feedback.

I have provided replies to your points following your numbering scheme. The replies provide clarifications, corrections and propose changes in response to your comments.

I hope that with these clarifications, corrections and changes, the proposal has become better and hopefully sufficiently concise to be machine readable.

Reply to 1:
You are right, I used the term “coordinate variable” incorrectly in the text; it should be “auxiliary coordinate variable”. I will correct that.

I agree that once restored to their full resolution, lat and lon must be listed in each data variable using the coordinates attribute. I propose that we add the coordinates attribute in the same manner in the compact file. Would you agree?

Reply to 2:

This needs a more concise explanation. I prefer the term associated, as mapping in CF is used for geographical mapping. An association would require that one dimension from each of the dimension sets defined in the dimensions attribute is shared.

So if we have

tp_interpolation : dimensions = "tp_track (tie_point) - m_track (location) - i_track (location) tp_scan (tie_point) - m_scan (location) - i_scan (location)"

it would require (m_track, m_scan) or (i_track, i_scan) to be shared. In the present example, the cross combinations (m_track, i_scan) and (i_track, m_scan) are not relevant, but they are relevant and valid in the adjacent cells boundary issue.

That would prevent the wrong interpretation of your example.

In the proposed scheme we would never need to associate data variables with data variables or coordinate type variables with coordinate type variables.

A user wanting to fully expand a compact file could ignore the data variables and simply search for all container variables containing a dimensions attribute and then expand all coordinate type variables listed within the container variable.

If a user is only interested in expanding coordinate type variables for a particular data variable, then he would find the relevant container variable based on matching dimensions and then expand all coordinate type variables listed within the container variable matching the dimensions of the data variable.

Both of these procedures would prevent the wrong interpretation of your example.
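A minimal Python sketch of the association rule described above. The two dimension sets are taken from the dimensions attribute of tp_interpolation; the function name `is_associated` and the reading "at least one dimension from every set" are assumptions made for illustration.

```python
# One set per grid direction (track, scan), as listed in the
# `dimensions` attribute of tp_interpolation.
DIMENSION_SETS = [
    {"tp_track", "m_track", "i_track"},
    {"tp_scan", "m_scan", "i_scan"},
]


def is_associated(variable_dims, dimension_sets=DIMENSION_SETS):
    """True if the variable uses at least one dimension from every set
    (hypothetical helper, not part of any convention)."""
    dims = set(variable_dims)
    return all(dims & dim_set for dim_set in dimension_sets)


print(is_associated(("m_track", "m_scan", "m_channel")))  # m_radiance
print(is_associated(("tp_track", "tp_scan")))             # lat, lon
print(is_associated(("m_scan", "m_channel")))             # mean_alongtrack_dummy
```

Under this reading, mean_alongtrack_dummy is rejected because it has no track dimension, which is why it would not be mistaken for a coordinate of m_radiance.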

I agree that once restored to their full resolution, lat and lon must be listed in each data variable using the coordinates attribute. I propose that we also add the coordinates attribute in the same manner in the compact file.

Reply to 3: We can discuss the name of the attribute. I just chose “dimensions”, but it might not be the best choice.

I would be happy with the syntax you propose: tp_dim1:dim1A,dim1B tp_dim2:dim2A,dim2B

It would then be like:

tp_interpolation : dimensions = "tp_track (tie_point): m_track (location), i_track (location) tp_scan (tie_point): m_scan (location), i_scan (location)"
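For illustration, the attribute value in this syntax could be parsed along the following lines. This is only a sketch: the function name `parse_dimensions` and the returned structure are assumptions, and the syntax itself is still under discussion.

```python
import re

# One token is "name (role)" optionally followed by ":" or ",".
TOKEN = re.compile(r"(\w+)\s*\((\w+)\)\s*([:,]?)")


def parse_dimensions(value):
    """Map each tie point dimension to its (dimension, role) targets.

    A token followed by ":" starts a new entry; every other token is a
    target of the entry that was started last.
    """
    mapping = {}
    targets = None
    for name, role, separator in TOKEN.findall(value):
        if separator == ":":
            targets = mapping.setdefault(name, [])
        elif targets is None:
            raise ValueError(f"{name!r} appears before any tie point dimension")
        else:
            targets.append((name, role))
    return mapping


value = (
    "tp_track (tie_point): m_track (location), i_track (location) "
    "tp_scan (tie_point): m_scan (location), i_scan (location)"
)
for tie_point_dim, targets in parse_dimensions(value).items():
    print(tie_point_dim, "->", targets)
```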

Reply to 4:
Two alternative options for expanding a compact file are listed under point 2. All variables that need expansion are explicitly listed in the container variable, and so are the method and the dimensions of both compacted and expanded coordinate type variables. It is all local and reasonably straightforward.

Reply to 5:
I would support changing:

   interpolation:location_tie_points = "lat lon" ;
   interpolation:sensor_direction_tie_points = "sen_azi_ang sen_zen_ang" ;
   interpolation:solar_direction_tie_points = "sol_azi_ang sol_zen_ang" ;

to something like interpolation:tie_points = "location(lat, lon), sensor_direction(sen_azi_ang, sen_zen_ang), solar_direction(sol_azi_ang, sol_zen_ang)"

Reply to 6:
Thank you!

oceandatalab commented 4 years ago

Concise CDL is more readable for humans, for machines we need to create explicit links between information (better self-description / automatic discoverability), so it is likely that machine-readability results in a more verbose CDL.

We agree on points 1., 2., 3. and 6., but 4. and 5. are still a major issue for me.

Point 1 Yes, the lack of references to auxiliary coordinate variables in data variables was one of the main problems I had with this CDL. Since adding these references is ok with you, this problem is fixed.

Let us also not forget that we cannot use the coordinates attribute name to reference subsampled/compacted auxiliary coordinate variables otherwise we break backward compatibility (see https://github.com/erget/subsampled-coordinates/issues/6#issuecomment-634039460). We can just come up with another attribute name.

Point 2 I mentioned this simply to show that another problem would arise if the relation between data variables and auxiliary coordinate variables was implicit: there is no syntax difference between data and auxiliary coordinate variables, except that auxiliary coordinate variables are referenced in the coordinates attribute of at least one data variable, so without a coordinates-like attribute referencing auxiliary coordinate variables, data variables could be wrongly interpreted as auxiliary coordinate variables.

But here again, since we agree on following the NUG and having auxiliary coordinate variables referenced in a coordinates-like attribute of data variables, the problem is solved.

Point 3 Yes, we can discuss how we name things or the syntax we use in attributes later on, what matters is that we converge on the structure of the CDL.

Point 4 That is where we have different opinions. For me this is not straightforward because the tp_interpolation container variable is not referenced anywhere. For me it should be possible to get all the related information by following references starting from the variable you are actually interested in.

If I am interested in m_radiance, following the hints provided by the CDL syntax I reach a dead end: m_radiance -> m_radiance:coordinates -> lat -> ?

If I only want to read lat and lon to have the footprint of the file content, the CDL of these variables doesn't tell me that they are subsampled/compacted. So I only get a very low resolution footprint.

Point 5 The thing is, I am not sure I see the point of defining coordinate pairs like that.

We already have (here I assume there is only one zone, but it remains true for multiple zones):

- the size of the input grid (tie points dimensions)
- the size of the output grid (full-resolution dimensions)
- coefficients describing grid deformation (alignment/expansion)
- coefficients describing the offset between input and output grid points
- an interpolation method (bilinear, etc.)
- values attached to each point of the input grid

Unless the formula is more complicated than I think and introduces some dependencies between variables, the full resolution lon can be computed from the compacted lon and the aforementioned parameters, without reading the lat variable at all.

And I don't think the container variable dedicated to interpolation is the right place to define coordinate subsystems (if this is something needed, then it should go in another container variable, maybe the "domain" container variable mentioned in @davidhassell's data model).

But maybe I consider this to be an issue just because since the beginning I see the interpolation container variable as the description of a generic mathematical transform, so I have always been reluctant to add terms related to a specific set of variables (coordinates in this case) in its attributes.

Point 6 :)

And I add a new potential problem (edge case but we need to consider it if we are to define rules that users must follow to be compliant with the conventions):

Point 7 What would happen if the file contains several interpolation containers and the lat/lon variables are listed in more than one?

How do I choose the interpolation container variable that I should use to reconstruct lat/lon in my "file footprint" use case (i.e. without using a data variable)?
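The ambiguity raised in Point 7 can at least be detected mechanically. The sketch below assumes each container has already been reduced to the list of tie point variable names it references; `other_interpolation` and `depth` are hypothetical names used only to trigger the edge case.

```python
from collections import defaultdict


def find_ambiguous_tie_points(containers):
    """Return {variable: [containers]} for every tie point variable that
    is listed by more than one interpolation container."""
    owners = defaultdict(list)
    for container, tie_points in containers.items():
        for variable in tie_points:
            owners[variable].append(container)
    return {var: names for var, names in owners.items() if len(names) > 1}


containers = {
    "tp_interpolation": ["lat", "lon"],
    # Hypothetical second container that also lists lat and lon.
    "other_interpolation": ["lat", "lon", "depth"],
}
print(find_ambiguous_tie_points(containers))
```

A reader that only wants the file footprint could refuse to reconstruct lat/lon whenever this returns a non-empty result, unless the conventions define a rule for choosing a container.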

AndersMS commented 3 years ago

Reply to 1: I discussed the use of a coordinates or a subsampled_coordinates attribute for the subsampled coordinates with Daniel. We find the use of a coordinates attribute misleading, similarly to your analysis in #6 (comment). We are also not too keen on introducing a new subsampled_coordinates attribute, and would as an alternative propose that we reintroduce the direct reference to the interpolation container variable in the data variable, like:

variables:
  // VIIRS M-Band
    float m_radiance(m_track, m_scan, m_channel) ;
    m_radiance : interpolation = "tp_interpolation time_interpolation" ;

From the interpolation container variable, the software that implements the new CF conventions would have full access to all subsampled coordinates and the interpolation methods for expanding to full resolution.
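As a rough sketch of that lookup, assuming the file header has already been read into a plain dictionary of attributes per variable: the `HEADER` layout and both function names are assumptions made for illustration, while the attribute values are copied from the example in this comment.

```python
import re

# Hypothetical in-memory view of the header: {variable: {attribute: value}}.
HEADER = {
    "m_radiance": {"interpolation": "tp_interpolation time_interpolation"},
    "tp_interpolation": {
        "interpolation_name": "bi_quadratic",
        "tie_points": "location(lat, lon)  sensor_direction(sen_azi_ang, sen_zen_ang)"
                      "  solar_direction(sol_azi_ang, sol_zen_ang)",
    },
    "time_interpolation": {"interpolation_name": "bi_linear", "tie_points": "time(t)"},
}


def tie_point_variables(value):
    """Extract variable names from a 'group(a, b) group(c)' style list."""
    names = []
    for group in re.findall(r"\(([^)]*)\)", value):
        names.extend(n.strip() for n in group.split(",") if n.strip())
    return names


def subsampled_variables_for(data_var, header=HEADER):
    """Follow data_var:interpolation to each container and collect the
    interpolation method and the tie point variables it lists."""
    result = {}
    for container in header[data_var].get("interpolation", "").split():
        attrs = header[container]
        result[container] = (
            attrs["interpolation_name"],
            tie_point_variables(attrs["tie_points"]),
        )
    return result


for container, (method, names) in subsampled_variables_for("m_radiance").items():
    print(container, method, names)
```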

The expansion process would then generate the coordinates attribute in the full data variables as needed. Note that in my understanding the coordinates attribute is only for listing the spatial and time auxiliary coordinates. So the direction auxiliary coordinates would not be in there. Similarly, proper coordinates of the form lat(lat) would also not go in there, like those in our example https://github.com/erget/subsampled-coordinates/tree/master/NDVI_lat_lon_Example. Is that in line with your understanding?

Reply to 2: See proposal under 1.

Reply to 3: I propose splitting the dimensions attribute into a dimensions attribute and a grids attribute; see the updated example included below. It looks clearer to me and would better support the mesh issue 5. Let me know what you think.

Reply to 4: We propose re-introducing the tp_interpolation container variable in the data variable, as discussed under 1 and as shown in the updated example included below.

Reply to 5: Two comments:

- You do actually get a very different result when interpolating between two longitudes at the equator compared to interpolating between the same two longitudes close to the North Pole. So you do need both. The same applies to the direction coordinates.
- The value of the interpolation:tie_points attribute is mainly read by the interpolation method. So I guess another mathematical transform could possibly accept a slightly different layout, for example without the brackets creating pairs of coordinates?

So I would suggest to stay with this notation for now, but we could return to this later in the process.

Reply to 7: It is ok to have multiple containers. They would typically differ in their set of dimensions. However, I think they would not be permitted to reuse the same variable name if they are in the same name space. That would give NetCDF errors both for the compact coordinate variables and the full resolution coordinate variables.

Updated example, including the proposed changes as well as the direction and time coordinates that were left out of the shortened example:

dimensions:
  // VIIRS M-Band (750 m resolution imaging) 
  m_track = 768 ;
  m_scan = 3200 ;
  m_channel = 16 ;

  // VIIRS I-Band (375 m resolution imaging)
  i_track = 1536 ;
  i_scan = 6400 ; 
  i_channel = 5 ;

  // Tie points and interpolation zones (shared between VIIRS M-Band and I-Band)
  tp_track = 96 ;
  tp_scan = 205 ;
  track_interpolation_zone = 48 ;
  scan_interpolation_zone = 200 ;

  // Time, stored at scan-start and scan-end of each scan
  time_scan = 2;

variables:
  // VIIRS M-Band 
  float m_radiance(m_track, m_scan, m_channel) ;
    m_radiance : interpolation = "tp_interpolation time_interpolation" ;

  // VIIRS I-Band 
  float i_radiance(i_track, i_scan, i_channel) ;
    i_radiance : interpolation = "tp_interpolation time_interpolation" ;

  // Spatial grids and interpolation, supporting both VIIRS M-Band and I-Band
  char tp_interpolation ;
    tp_interpolation : dimensions = "tp_track : m_track : i_track   tp_scan : m_scan : i_scan" ; // association of dimensions
    tp_interpolation : grids = "(tp_track, tp_scan) (tie_point)   (m_track, m_scan) (location)   (i_track, i_scan) (location)" ; // definition of grids and grid functions
    tp_interpolation : offsets = "m_track - tp_track = 0.5   m_scan - tp_scan = 0.5   i_track - tp_track = 0.5   i_scan - tp_scan = 0.5" ; // definition of grid offsets in units of cells

    tp_interpolation : tie_points = "location(lat, lon)  sensor_direction(sen_azi_ang, sen_zen_ang)  solar_direction(sol_azi_ang, sol_zen_ang)" ;
    tp_interpolation : interpolation_indices = "m_track: m_track_indices  m_scan: m_scan_indices  i_track: i_track_indices  i_scan: i_scan_indices" ; // associate dimensions with indices
    tp_interpolation : interpolation_name = "bi_quadratic" ;
    tp_interpolation : interpolation_coefficients = "expansion_coefficient_track  alignment_coefficient_track expansion_coefficient_scan alignment_coefficient_scan" ;
    tp_interpolation : interpolation_flags = "interpolation_zone_flags" ;

  // Interpolation indices
  int m_track_indices(tp_track) ;
  int m_scan_indices(tp_scan) ;
  int i_track_indices(tp_track) ;
  int i_scan_indices(tp_scan) ;

  // Tie points
  float lat(tp_track, tp_scan) ;
    lat : standard_name = "latitude" ;
    lat : units = "degrees_north" ;
  float lon(tp_track, tp_scan) ;
    lon : standard_name = "longitude" ;
    lon : units = "degrees_east" ;
  float sen_azi_ang(tp_track, tp_scan) ;
    sen_azi_ang : standard_name = "sensor_azimuth_angle" ;
    sen_azi_ang:units = "degrees" ;
  float sen_zen_ang(tp_track, tp_scan) ;
    sen_zen_ang : standard_name = "sensor_zenith_angle" ;
    sen_zen_ang : units = "degrees" ;
  float sol_azi_ang(tp_track, tp_scan) ;
    sol_azi_ang : standard_name = "solar_azimuth_angle" ;
    sol_azi_ang : units = "degrees" ;
  float sol_zen_ang(tp_track, tp_scan) ;
    sol_zen_ang : standard_name = "solar_zenith_angle" ;
    sol_zen_ang : units = "degrees" ;

  // Interpolation coefficients and flags
  short expansion_coefficient_track(track_interpolation_zone, tp_scan) ;
  short alignment_coefficient_track(track_interpolation_zone, tp_scan) ;
  short expansion_coefficient_scan(tp_track, scan_interpolation_zone) ;
  short alignment_coefficient_scan(tp_track, scan_interpolation_zone) ;
  byte interpolation_zone_flags(track_interpolation_zone, scan_interpolation_zone) ;
    interpolation_zone_flags:valid_range = 1b, 7b ;
    interpolation_zone_flags:flag_masks = 1b, 2b, 4b ;
    interpolation_zone_flags:flag_meanings = "location_use_cartesian  sensor_direction_use_cartesian  solar_direction_use_cartesian" ;

  // Time interpolation
  char time_interpolation ;
    time_interpolation : interpolation_name = "bi_linear" ;
    time_interpolation : tie_points = "time(t)" ;

  double t(tp_track, time_scan) ;
    t:long_name = "time" ;
    t:units = "days since 1990-1-1 0:0:0" ;

AndersMS commented 3 years ago

I forgot to comment on:

Point 5. ...And I don't think the container variable dedicated to interpolation is the right place to define coordinates subsystems (if this is something needed, then it should go in another container variable, maybe the "domain" container variable mentioned in @davidhassell data model).

I agree, we could consider splitting it into two container variables; let's discuss. The information fits nicely together and I couldn't think of a use case where one would be used separately from the other. So I kept it as one container for now.

oceandatalab commented 3 years ago

Point 1 / Point 2 / Point 4 Well... This is a big step backwards from my point of view. And a disappointment because my use case (access to lat/lon without needing a data variable) has once again been ignored.

I already explained time and time again why I am against referencing the interpolation container variable from data variables and why I think referencing it from the subsampled variables makes more sense and is a superior choice from a technical point of view. There have been many discussions about this but no one has even tried to provide a logical or technical justification to convince me otherwise.

I just don't understand why you absolutely refuse to put a reference to the interpolation container variable in subsampled coordinate variables. If it is just a matter of redundancy then I am afraid we will never agree.

With the subsampling mechanism, coordinates can take four forms:

- coordinate variables: they have one dimension and share its name
- auxiliary coordinate variables: they don't share the name of any dimension and may have one or multiple dimensions
- subsampled coordinate variables: same as coordinate variables, but subsampled
- subsampled auxiliary coordinate variables: same as auxiliary coordinate variables, but subsampled

The NUG already defines how the CDL handles non-subsampled coordinates:

- coordinate variables are implicitly linked to data variables through the dimension name
- auxiliary coordinate variables are listed in the coordinates attribute attached to data variables: it contains a list of variable names that are therefore interpreted as auxiliary coordinate variables

The subsampled_coordinates attribute has the same syntax as the coordinates attribute, hence it will be familiar to users and easy to document. If backward compatibility was not an issue we would not even need to add a new attribute and subsampled variables would be handled in the same way as their non-subsampled counterparts.
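A small Python sketch of how a reader could apply these rules side by side. The in-memory layout, the function name `resolve_coordinates` and the presence of a one-dimensional `m_channel` variable are assumptions for illustration; subsampled_coordinates is the attribute name proposed in this thread, not an existing CF attribute.

```python
def resolve_coordinates(name, variables):
    """Classify the coordinates of one data variable.

    variables: {variable: {"dims": tuple of names, "attrs": dict}}
    """
    var = variables[name]
    # Coordinate variables: 1-D variables named after one of the dimensions.
    implicit = [
        dim for dim in var["dims"]
        if dim in variables and variables[dim]["dims"] == (dim,)
    ]
    # Auxiliary coordinates: listed explicitly in the coordinates attribute.
    auxiliary = var["attrs"].get("coordinates", "").split()
    # Subsampled coordinates: same syntax, proposed new attribute.
    subsampled = var["attrs"].get("subsampled_coordinates", "").split()
    return {"coordinate": implicit, "auxiliary": auxiliary, "subsampled": subsampled}


variables = {
    "m_radiance": {
        "dims": ("m_track", "m_scan", "m_channel"),
        "attrs": {"subsampled_coordinates": "lat lon"},
    },
    # Hypothetical coordinate variable, not present in the CDL example.
    "m_channel": {"dims": ("m_channel",), "attrs": {}},
    "lat": {"dims": ("tp_track", "tp_scan"), "attrs": {}},
    "lon": {"dims": ("tp_track", "tp_scan"), "attrs": {}},
}
print(resolve_coordinates("m_radiance", variables))
```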

I am really surprised that you find it confusing because no one objected when we talked about this in https://github.com/erget/subsampled-coordinates/issues/6#issuecomment-634039460 and it has been mentioned on multiple occasions during the past meetings.

Regarding your question on direction variables, I think they are auxiliary coordinate variables and should be treated as such. If you limit coordinates to space and time, then your solution becomes even more confusing because the container variable does not say which of the expanded variables are coordinates.

Point 3 Ok for splitting information between dimensions and grids attributes.

Point 5

- Ok, I imagined interpolation zones would be local enough to ignore this kind of issue, but you are right, we cannot make assumptions on the size of the zones so lat and lon need each other during their reconstruction.
- I used the formula_terms attribute in my examples for passing arguments to the interpolation method as it is flexible and it is already defined in CF conventions.
- If I remember correctly you provided an implementation of your interpolation method at some point but I cannot find it anymore. If possible, could you add a link or a reference somewhere in this repository please? I think it could help me (and possibly others) understand (and not forget) what parameters you need.
- Definitions of coordinate subsystems such as location(lat,lon) or sensor_direction(sen_azi_ang, sen_zen_ang) have intrinsically nothing to do with interpolation. They can potentially be used by some interpolation methods, and you explained that it is the case for your method. But semantically they definitely belong to a domain container and the CDL should reflect that fact.

Point 7 If this version of the CDL is the one proposed to the CF convention eventually, make sure you mention this constraint in the proposal.

AndersMS commented 3 years ago

Thank you for your clear and accurate comments and suggestions.

You are right; the coordinates attribute is for listing all auxiliary coordinates, not only the spatial and time auxiliary coordinates, as I incorrectly suggested.

Actually, the CF Convention permits including in the coordinates attribute both coordinates and auxiliary coordinates, see chapter 5, 5th paragraph of the CF Conventions document. That is convenient for us, as we would like to list both coordinates and auxiliary coordinates requiring interpolation.

I would like to propose the following updated overall approach:

1. That we include references to the container variable in both the data variables and the coordinate variables.
2. That we move all attributes describing the grid and interpolation scheme to the container variable, to avoid redundancy and the risk of inconsistencies.

The proposal is an attempt to reconcile your comments and the comments of @davidhassell here and here.

Both for a human and a computer reading the file, references to the container variable in both the data variables and the coordinate variables would mark these variables as being part of an interpolation construct, requiring special attention.

Regarding the second point, you had reservations regarding the notation:

tp_interpolation : tie_points = "location(lat, lon) sensor_direction(sen_azi_ang, sen_zen_ang) solar_direction(sol_azi_ang, sol_zen_ang)"

If we require all listed coordinates to have the standard_name attribute defined, then we do not need the parenthesis construction, and we can simplify the list to

tp_interpolation : tie_points = "lat, lon, sen_azi_ang, sen_zen_ang, sol_azi_ang, sol_zen_ang"

The coordinate variables can then be paired via their standard_name attribute. For example, the variables sol_azi_ang and sol_zen_ang are linked through their standard names solar_azimuth_angle and solar_zenith_angle.
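One way such pairing could work is sketched below. The lookup table `STANDARD_NAME_GROUPS` and the group labels are assumptions made for illustration; nothing in CF currently defines this grouping.

```python
# Hypothetical lookup table: which standard names belong together.
STANDARD_NAME_GROUPS = {
    "latitude": "location",
    "longitude": "location",
    "sensor_azimuth_angle": "sensor_direction",
    "sensor_zenith_angle": "sensor_direction",
    "solar_azimuth_angle": "solar_direction",
    "solar_zenith_angle": "solar_direction",
}


def group_tie_points(tie_points, standard_names):
    """Group a flat, comma-separated tie point list by standard_name.

    standard_names: {variable: value of its standard_name attribute}
    """
    groups = {}
    for name in (n.strip() for n in tie_points.split(",")):
        group = STANDARD_NAME_GROUPS[standard_names[name]]
        groups.setdefault(group, []).append(name)
    return groups


standard_names = {
    "lat": "latitude",
    "lon": "longitude",
    "sen_azi_ang": "sensor_azimuth_angle",
    "sen_zen_ang": "sensor_zenith_angle",
    "sol_azi_ang": "solar_azimuth_angle",
    "sol_zen_ang": "solar_zenith_angle",
}
flat = "lat, lon, sen_azi_ang, sen_zen_ang, sol_azi_ang, sol_zen_ang"
print(group_tie_points(flat, standard_names))
```

The flat list then carries the same information as the parenthesised notation, provided every listed variable has a standard_name that the reader knows how to group.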

After this change, the list effectively becomes our coordinates-like attribute.

On another topic, I realised that we do not need the dimensions attribute in the interpolation container. I had suggested:

tp_interpolation : dimensions = "tp_track : m_track : i_track tp_scan : m_scan : i_scan"

for association of the dimensions and

tp_interpolation : grids = "(tp_track, tp_scan) (tie_point) (m_track, m_scan) (location) (i_track, i_scan) (location)"

for definition of grids and grid functions.

However, the association of dimensions is clear from the grids attribute alone: the first dimensions of all the grids listed are associated with each other, and the second dimensions of all the grids listed are associated with each other, etc. By associated, we mean that they are along the same grid direction, say scan or track.
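A minimal sketch of how that positional association could be derived from the grids attribute. The function names are hypothetical, and the sketch assumes all grids list their dimensions in the same order.

```python
import re

# "(dim, dim, ...) (grid_function)" pairs, as in the grids attribute.
GRID = re.compile(r"\(([^)]*)\)\s*\((\w+)\)")


def parse_grids(value):
    """Return a list of (dimension tuple, grid function) pairs."""
    return [
        (tuple(d.strip() for d in dims.split(",")), function)
        for dims, function in GRID.findall(value)
    ]


def associated_dimensions(grids):
    """Group the n-th dimension of every grid together."""
    if len({len(dims) for dims, _ in grids}) > 1:
        raise ValueError("all grids must have the same number of dimensions")
    return list(zip(*(dims for dims, _ in grids)))


grids = parse_grids(
    "(tp_track, tp_scan) (tie_point) (m_track, m_scan) (location) "
    "(i_track, i_scan) (location)"
)
print(associated_dimensions(grids))
```

The result contains one tuple per grid direction, which is the same information the separate dimensions attribute used to carry.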

In the following, I have adjusted a couple of the attribute names to be better suited for the dual use of the mesh issue 5, and the subsample coordinates issue 37. Please consider the names as proposals for further discussion/agreement.

Finally, I would like to suggest that we have two container variables, one that is called grid and one that is called interpolation. The container variable interpolation is an extension of grid, in that it includes all the attributes of grid.

The container variable grid contains all attributes required in support of the use cases discussed under issue 5. The container variable interpolation contains all attributes required in support of the use cases discussed under this issue 37, or combined application of the use cases of the issues 5 and 37.

| Attribute | Presence in `grid` container | Presence in `interpolation` container | Function |
| --- | --- | --- | --- |
| `grids` | Required | Required | Defines the grid dimensions, the grid functions and the association of the grid dimensions. |
| `grid_offsets` | Required if grid offsets are present | Required if grid offsets are present | Defines the grid offsets in units of cells. |
| `grid_coordinates` | Required | Required | Lists the coordinates and auxiliary coordinates of the grids. |
| `interpolation_name` | - | Required | Name of interpolation method. |
| `interpolation_indices` | - | Required | Indices of interpolation tie points in full resolution grid. |
| `interpolation_coefficients` | - | Presence is method dependent | Interpolation coefficient variables supporting the interpolation method. |
| `interpolation_flags` | - | Presence is method dependent | Interpolation flag variables supporting the interpolation method. |
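The unconditional "Required" entries of the table lend themselves to a simple check. The sketch below is only an illustration: the function name and the representation of a container as a plain dictionary of attributes are assumptions, and the conditional rows (grid_offsets, coefficients, flags) are deliberately not checked.

```python
# Attributes marked as unconditionally "Required" in the table above.
GRID_REQUIRED = {"grids", "grid_coordinates"}
INTERPOLATION_REQUIRED = GRID_REQUIRED | {
    "interpolation_name",
    "interpolation_indices",
}


def missing_attributes(attrs, kind):
    """Return the required attributes that a container variable lacks.

    attrs: dict of attribute name -> value for the container variable.
    kind:  "grid" or "interpolation".
    """
    required = INTERPOLATION_REQUIRED if kind == "interpolation" else GRID_REQUIRED
    return sorted(required - set(attrs))


# A container that only carries a method name and an old-style tie_points list.
incomplete = {"interpolation_name": "bi_linear", "tie_points": "time(t)"}
print(missing_attributes(incomplete, "interpolation"))
# ['grid_coordinates', 'grids', 'interpolation_indices']
```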

Further to Point 5: I suggest you have a word with @Lucile. From earlier meetings I know that she is aware of the technique. As one approaches the polar regions, you convert the polar coordinates to rectangular coordinates, say (x,y) or (x,y,z), and perform the interpolation in these coordinates. This coordinate transformation mathematically requires both coordinates.

Further to Point 7: Our current interpolation container attributes permit generating multiple full resolution coordinate sets based on the same tie point set. The VIIRS M-Band and I-Band case is an example. One could imagine a user who would like to split M-Band and I-Band into two separate interpolation containers, but referencing the same tie point variables.

That would appear to be a natural use case. However, it would require that the proposed attribute in the tie point coordinate variables, referencing the interpolation container, must permit the inclusion of multiple interpolation containers. But that would be consistent with the use of the attribute inside the data variables, where we also permit multiple interpolation containers to be referenced :

m_radiance : interpolation = "tp_interpolation time_interpolation" ;

If interpolation containers based on the same tie point coordinate variables would generate full resolution coordinates on the same dimensions, I think we would run into conflicts of having multiple definitions of longitude and latitude on the same grid. So that should be forbidden I guess.

Updated Example

Here is the VIIRS M- and I-Band example updated accordingly. I will prepare additional examples once I have the feedback/comments on the updated approach and this example.

dimensions :
  // VIIRS M-Band (750 m resolution imaging) 
  m_track = 768 ;
  m_scan = 3200 ;
  m_channel = 16 ;

  // VIIRS I-Band (375 m resolution imaging)
  i_track = 1536 ;
  i_scan = 6400 ; 
  i_channel = 5 ;

  // Tie points and interpolation zones (shared between VIIRS M-Band and I-Band)
  tp_track = 96 ;
  tp_scan = 205 ;
  track_interpolation_zone = 48 ;
  scan_interpolation_zone = 200 ;

  // Time, stored at scan-start and scan-end of each scan
  time_scan = 2;

variables:
  // VIIRS M-Band 
  float m_radiance(m_track, m_scan, m_channel) ;
    m_radiance : interpolation = "tp_interpolation time_interpolation" ;

  // VIIRS I-Band 
  float i_radiance(i_track, i_scan, i_channel) ;
    i_radiance : interpolation = "tp_interpolation time_interpolation" ;

  // Coordinate grids and interpolation, supporting both VIIRS M-Band and I-Band
  char tp_interpolation ;
    tp_interpolation : grids = "(tp_track, tp_scan) (tie_point)   (m_track, m_scan) (location)   (i_track, i_scan) (location)" ;  // Defines the grid dimensions, the grid functions and the order and association of the grid dimensions.
    tp_interpolation : grid_offsets = "m_track - tp_track = 0.5   m_scan - tp_scan = 0.5   i_track - tp_track = 0.5   i_scan - tp_scan = 0.5" ;  // Defines the grid offsets in units of cells.
    tp_interpolation : grid_coordinates = "lat, lon, sen_azi_ang, sen_zen_ang, sol_azi_ang, sol_zen_ang" ;  // Lists the coordinates and auxiliary coordinates of the grids.

    tp_interpolation : interpolation_name = "bi_quadratic" ;
    tp_interpolation : interpolation_indices = "m_track: m_track_indices  m_scan: m_scan_indices  i_track: i_track_indices  i_scan: i_scan_indices" ; // associate dimensions with indices
    tp_interpolation : interpolation_coefficients = "expansion_coefficient_track  alignment_coefficient_track expansion_coefficient_scan alignment_coefficient_scan" ;
    tp_interpolation : interpolation_flags = "interpolation_zone_flags" ;

  // Interpolation indices
  int m_track_indices(tp_track) ;
  int m_scan_indices(tp_scan) ;
  int i_track_indices(tp_track) ;
  int i_scan_indices(tp_scan) ;

  // Tie points
  float lat(tp_track, tp_scan) ;
    lat : standard_name = "latitude" ;
    lat : units = "degrees_north" ;
    lat : interpolation = "tp_interpolation" ;
  float lon(tp_track, tp_scan) ;
    lon : standard_name = "longitude" ;
    lon : units = "degrees_east" ;
    lon : interpolation = "tp_interpolation" ;
  float sen_azi_ang(tp_track, tp_scan) ;
    sen_azi_ang : standard_name = "sensor_azimuth_angle" ;
    sen_azi_ang : units = "degrees" ;
    sen_azi_ang : interpolation = "tp_interpolation" ;
  float sen_zen_ang(tp_track, tp_scan) ;
    sen_zen_ang : standard_name = "sensor_zenith_angle" ;
    sen_zen_ang : units = "degrees" ;
    sen_zen_ang : interpolation = "tp_interpolation" ;
  float sol_azi_ang(tp_track, tp_scan) ;
    sol_azi_ang : standard_name = "solar_azimuth_angle" ;
    sol_azi_ang : units = "degrees" ;
    sol_azi_ang : interpolation = "tp_interpolation" ;
  float sol_zen_ang(tp_track, tp_scan) ;
    sol_zen_ang : standard_name = "solar_zenith_angle" ;
    sol_zen_ang : units = "degrees" ;
    sol_zen_ang : interpolation = "tp_interpolation" ;

  // Interpolation coefficients and flags
  short expansion_coefficient_track(track_interpolation_zone, tp_scan) ;
  short alignment_coefficient_track(track_interpolation_zone, tp_scan) ;
  short expansion_coefficient_scan(tp_track, scan_interpolation_zone) ;
  short alignment_coefficient_scan(tp_track, scan_interpolation_zone) ;
  byte interpolation_zone_flags(track_interpolation_zone, scan_interpolation_zone) ;
    interpolation_zone_flags : valid_range = 1b, 7b ;
    interpolation_zone_flags : flag_masks = 1b, 2b, 4b ;
    interpolation_zone_flags : flag_meanings = "location_use_cartesian  sensor_direction_use_cartesian  solar_direction_use_cartesian" ;

  // Time interpolation
  char time_interpolation ;
    time_interpolation : interpolation_name = "bi_linear" ;
    time_interpolation : tie_points = "time(t)" ;

  double t(tp_track, time_scan) ;
    t : long_name = "time" ;
    t : units = "days since 1990-1-1 0:0:0" ;
    t : interpolation = "time_interpolation" ;

AndersMS commented 3 years ago

@davidhassell; FYI, I just posted the above, where I reference your comments.

AndersMS commented 3 years ago

I just realised that in the example contained in the above comment, the interpolation of time (char time_interpolation) no longer fits the proposed scheme of things, as it doesn't have the attributes required by the table. We will need a bit more thinking regarding the interpolation of time....

oceandatalab commented 3 years ago

Thank you for taking the time to design this new iteration Anders, it is much more in line with what we need for our use case. I will start numbering new topics (from 8) because there are many subjects mentioned in your message.

Point 1 / Point 2 / Point 4 Including a reference to the interpolation container variable in both the data variable and the subsampled variables may be overkill, but I can live with it :) This is an acceptable compromise from my point of view as there is a clear thread of information to follow, whether you start from the data variable or from the subsampled coordinate variable.

Point 5 I am well aware of this technique and used it several times in the past when dealing with data close to the poles, but I vaguely remember that during a meeting you said that your interpolation method might (conditionally) remove the need to switch to another projection. Since there is no hint about reprojecting spatial coordinates in the description of the interpolation container variable, I thought that maybe you had validated this hypothesis and managed to interpolate directly in lat/lon with sufficient precision, in which case you would have been able to interpolate lat and lon independently within small interpolation zones, hence my initial comment. But I was not sure about it and I wanted to check whether the attributes of the container variable conveyed enough information, which is why I asked if you had a reference implementation of your method. If I understood you correctly this time, your method includes an optional reprojection step, in which case I think the interpolation container variable needs an attribute to define the projection method to use before/after interpolating values, maybe a : grid_mapping?

Point 7 It might be confusing to have the same notation for attaching multiple interpolation containers to a data variable:

m_radiance : interpolation = "tp_interpolation time_interpolation" ;

and for defining alternative grids for subsampled coordinate variables:

lat : interpolation = "tp_interpolation_mband tp_interpolation_iband";

as the meaning would not be the same. Maybe keep the space separator for the first case and prefer a pipe `|` separator to denote alternatives?

m_radiance : interpolation = "tp_interpolation time_interpolation" ;
...
lat : interpolation = "tp_interpolation_mband | tp_interpolation_iband";
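To show that the two separators can be told apart mechanically, here is a small Python sketch. The function name and the nested-list return shape are assumptions for illustration; the pipe separator itself is only a suggestion at this stage.

```python
import re


def parse_interpolation(value):
    """Split an interpolation attribute into groups of alternatives.

    Space-separated entries are all applied; entries joined by a pipe are
    alternatives. Returns a list of lists of container names.
    """
    # Remove optional whitespace around pipes so that plain whitespace
    # only separates independent entries.
    compact = re.sub(r"\s*\|\s*", "|", value.strip())
    return [token.split("|") for token in compact.split()]


print(parse_interpolation("tp_interpolation time_interpolation"))
# [['tp_interpolation'], ['time_interpolation']]

print(parse_interpolation("tp_interpolation_mband | tp_interpolation_iband"))
# [['tp_interpolation_mband', 'tp_interpolation_iband']]
```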

Point 8 Regarding the use of : standard_name to link coordinate variables together instead of using parenthesis, I think that is a good idea, an elegant solution and it might entice people to follow good practice (i.e. actually provide the : standard_name attribute).

Point 9 About removing the : dimensions attribute because information is already available in the : grids attribute: I agree with you, having both is probably redundant.

Point 10 Defining the interpolation container as an extension of grid container makes sense and looks perfectly fine to me.

New discussion topics below:

Point 11 As data variables and subsampled coordinate variables may all have a : interpolation attribute, the only way to differentiate data variables from subsampled coordinate variables is that the latter are listed in : grid_coordinates attribute of the interpolation container variable. So the proposal must make it clear that all subsampled coordinate variables related to an interpolation container must appear in the : grid_coordinates attribute, otherwise the CDL should be considered invalid / non-compliant.

Point 12 This one is related to Point 5 and is probably the last aspect that might be a problem for me: the interpolation container variable does not state explicitly which variables are always required by the interpolation method and which variables are only needed if you want to un-compact them.

In your latest example, let's say I just want to restore lon to its full resolution for M band, I follow the : interpolation attribute, reach the tp_interpolation container variable and then I have two possible interpretations of the CDL:

1. Nothing in the CDL clearly says that I need lat to interpolate lon, so I might try to load these variables
   - lon,
   - expansion_coefficient_track,
   - alignment_coefficient_track,
   - expansion_coefficient_scan,
   - alignment_coefficient_scan,
   - interpolation_zone_flags,
   - m_scan_indices,
   - m_track_indices

   and then pass them to the interpolation method: it won't work because apparently I needed lat too, but this dependency was not explicitly described in the CDL.
2. I consider all variables listed in : grid_coordinates as dependencies, so I load the following variables
   - lat,
   - lon,
   - sen_azi_ang,
   - sen_zen_ang,
   - sol_azi_ang,
   - sol_zen_ang,
   - expansion_coefficient_track,
   - alignment_coefficient_track,
   - expansion_coefficient_scan,
   - alignment_coefficient_scan,
   - interpolation_zone_flags,
   - m_scan_indices,
   - m_track_indices

   and then pass them to the interpolation method: it will work, but it will perform useless computations and use up memory for variables that I was not interested in (the solar and sensor angles).

AndersMS commented 3 years ago

Glad we are converging :)

Point 1 / Point 2 / Point 3 Good!

Point 5 The references for the method that we developed at EUMETSAT for VIIRS are these:

Compact VIIRS SDR Product Format User Guide, http://www.eumetsat.int/website/wcm/idc/idcplg?IdcService=GET_FILE&dDocName=PDF_DMT_708025&RevisionSelectionMethod=LatestReleased&Rendition=Web.

A tie-point zone group compaction schema for the geolocation data of S-NPP and NOAA-20 VIIRS SDRs to reduce file sizes in memory-sensitive environments, https://doi.org/10.1016/j.acags.2020.100025.

You are right, as part of our present work on issue 37, I discovered an improvement to the method compared to what we did for VIIRS. This improvement will mean that we can move the latitudes where we switch from interpolation directly in latitude/longitude to interpolation in rectangular coordinates further North and South, without losing accuracy. However, there will still be regions around the poles where we need the rectangular coordinates for interpolation; they are just significantly smaller. The gain is computational speed. I hope we can determine and validate these latitude limits as part of our further work, possibly with the help of you and Lucile @oceandatalab.

So far I was thinking of the choice of interpolation coordinates and any need for conversion to and from these coordinates as a method internal thing. However, in the case of the proposed bi_quadratic interpolation method, the

interpolation_zone_flags : flag_meanings = "location_use_cartesian sensor_direction_use_cartesian solar_direction_use_cartesian"

actually indicates the coordinates to be used (from the example in our current thread). This simplifies the code required for expansion and ensures that all users expand the data in a consistent manner.

Point 7 I would say that in both cases the interpolation attribute contains a list of interpolation containers conforming to the same definition. I agree that the meaning is slightly different. I guess you could say that a data variable makes use of one or more interpolation containers, whereas a coordinate variable is used by one or more interpolation containers. But I think this is clear from the context and that we can keep the identical notation.

Another choice we could make would be to say that a coordinate variable can only be used by one interpolation container, but I am afraid we could block other use cases by doing so.

In the case of the VIIRS I- and M-band, the recommended and cleanest solution is to use a single interpolation container, as it avoids duplication of several shared attributes.

Point 8 Good!

Point 9 Good!

Point 10 Good!

Point 11 I fully agree, we need to be accurate on that.

Maybe something like:

All coordinates of a data variable, which are stored as subsampled coordinates, must be listed in the grid_coordinates attribute of one and only one of the interpolation containers named in the data variable’s interpolation attribute;
All subsampled coordinate variables must have an interpolation attribute naming the interpolation containers that make use of the variable and these interpolation containers must list the subsampled coordinate in their grid_coordinates attribute;

where coordinates refer to both coordinates and auxiliary coordinates.
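A compliance checker could test the two rules roughly as follows. This is a sketch under assumptions: variables and containers are modelled as plain dictionaries of attributes rather than read through the NetCDF API, and the function names are hypothetical.

```python
def listed(container_attrs):
    """Names listed in a container's grid_coordinates attribute."""
    raw = container_attrs.get("grid_coordinates", "")
    return [name.strip() for name in raw.split(",") if name.strip()]


def check_data_variable(name, variables, containers):
    """Return rule violations for one data variable as a list of messages."""
    problems = []
    owner = {}  # coordinate name -> container that already lists it
    for cname in variables[name].get("interpolation", "").split():
        for coord in listed(containers[cname]):
            # Rule 1: a coordinate is listed by one and only one container.
            if coord in owner:
                problems.append(f"{coord}: listed in both {owner[coord]} and {cname}")
            owner[coord] = cname
            # Rule 2: the coordinate variable names the container in return.
            back_refs = variables.get(coord, {}).get("interpolation", "").split()
            if cname not in back_refs:
                problems.append(f"{coord}: no interpolation attribute naming {cname}")
    return problems


containers = {
    "tp_interpolation": {"grid_coordinates": "lat, lon"},
    "time_interpolation": {"grid_coordinates": "time"},
}
variables = {
    "m_radiance": {"interpolation": "tp_interpolation time_interpolation"},
    "lat": {"interpolation": "tp_interpolation"},
    "lon": {},  # deliberately missing the back reference
    "time": {"interpolation": "time_interpolation"},
}
print(check_data_variable("m_radiance", variables, containers))
# ['lon: no interpolation attribute naming tp_interpolation']
```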

Point 12. That is an interesting point and will probably need a bit more discussion.

So far, we always expanded all coordinates.

Possibly, it could work in the following way:

When calling the interpolation method, you specify the coordinates to be expanded, for example just longitude. The algorithm would then load, via the NetCDF API, the latitude and longitude tie points and the interpolation flags, coefficients etc. needed. It then interpolates longitude (and latitude if needed for coordinate conversion/interpolation) and writes, again via the NetCDF API, just the longitude to a location specified by the caller.
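The "what do I need to load" step of such a call could be sketched like this in Python. Everything here is an assumption for illustration: the `COUPLED` table (which coordinates must be interpolated together), the function name, and the container being passed as a dictionary of attribute strings instead of being read through the NetCDF API.

```python
import re

# Assumption: standard names that have to be interpolated together, for
# example because a conversion to cartesian coordinates needs both.
COUPLED = [
    {"latitude", "longitude"},
    {"sensor_azimuth_angle", "sensor_zenith_angle"},
    {"solar_azimuth_angle", "solar_zenith_angle"},
]


def variables_to_load(targets, dims, container, standard_names):
    """List the variables needed to expand `targets` onto the grid `dims`."""
    coords = [n.strip() for n in container["grid_coordinates"].split(",")]
    wanted = {standard_names[t] for t in targets}
    needed = set(wanted)
    for group in COUPLED:
        if group & wanted:
            needed |= group  # pull in the partner coordinate
    tie_points = [c for c in coords if standard_names[c] in needed]
    # "dimension: index_variable" pairs from interpolation_indices.
    index_of = dict(re.findall(r"(\w+)\s*:\s*(\w+)", container["interpolation_indices"]))
    indices = [index_of[d] for d in dims]
    support = (container.get("interpolation_coefficients", "").split()
               + container.get("interpolation_flags", "").split())
    return tie_points + support + indices


container = {
    "grid_coordinates": "lat, lon, sen_azi_ang, sen_zen_ang, sol_azi_ang, sol_zen_ang",
    "interpolation_indices": "m_track: m_track_indices  m_scan: m_scan_indices  "
                             "i_track: i_track_indices  i_scan: i_scan_indices",
    "interpolation_coefficients": "expansion_coefficient_track alignment_coefficient_track "
                                  "expansion_coefficient_scan alignment_coefficient_scan",
    "interpolation_flags": "interpolation_zone_flags",
}
standard_names = {
    "lat": "latitude", "lon": "longitude",
    "sen_azi_ang": "sensor_azimuth_angle", "sen_zen_ang": "sensor_zenith_angle",
    "sol_azi_ang": "solar_azimuth_angle", "sol_zen_ang": "solar_zenith_angle",
}
needed = variables_to_load(["lon"], ("m_track", "m_scan"), container, standard_names)
print(needed)
```

With this approach a request for lon alone pulls in lat but leaves the sensor and solar angles untouched.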

But there are probably many variations of this.

Normally the processing would have to be done segment by segment, as the full data set might be too large to have in memory. Also it should be done in a way suitable for multithreaded processing.

So all in all a complex matter.

AndersMS commented 3 years ago

In the example of the updated proposal, the time interpolation details were not fully aligned with the updated proposal.

In the existing VIIRS product format, time is provided once at the start of each scan and once at the end of each scan. Recall that VIIRS has 16 sensors in the M-Band and 32 sensors in the I-Band, making simultaneous observations aligned in the track direction, for each position in the scan motion.

Here are two alternative ways that we could do time interpolation in the case of the VIIRS imager example – I propose we support both ways. Probably nobody really needs time stamps at the level of image pixels, but here it is - just to demonstrate the capabilities.

Bilinear Interpolation If identical time stamps are stored at two tie points at the scan start and identical time stamps are stored at two tie points at the scan end, as shown in the figure, one can interpolate using a bilinear method. Using the two identical time stamps will ensure that each set of the 16 pixels in the M-Band and 32 pixels in the I-Band, aligned in the track direction, will get the same time stamps.

The CDL would be like:

dimensions :

  // VIIRS M-Band (750 m resolution imaging) 
  m_track = 768 ;
  m_scan = 3200 ;

  // VIIRS I-Band (375 m resolution imaging)
  i_track = 1536 ;
  i_scan = 6400 ;

  // Tie points (shared between VIIRS M-Band and I-Band)
  tp_track = 96 ;  // twice each scan
  tp_scan = 205 ;

  // Time, stored twice at scan-start and twice at scan-end of each scan
  time_scan = 2;   // scan-start and scan-end 

variables:

  // Coordinate grids and interpolation, supporting both VIIRS M-Band and I-Band time stamp
  char time_interpolation ;
    time_interpolation : grids = "(tp_track, time_scan) (tie_point)   (m_track, m_scan) (location)   (i_track, i_scan) (location)" ;
    time_interpolation : grid_coordinates = "time" ;
    time_interpolation : interpolation_name = "bi_linear" ;
    time_interpolation : interpolation_indices = "m_track: m_track_indices  m_scan: m_scan_indices  i_track: i_track_indices  i_scan: i_scan_indices" ; 

  // Interpolation indices
  int m_track_indices(tp_track) ;
  int m_scan_indices(time_scan) ;
  int i_track_indices(tp_track) ;
  int i_scan_indices(time_scan) ;

  // Time tie-points
  double time(tp_track, time_scan) ;
    time : long_name = "time" ;
    time : units = "days since 1990-1-1 0:0:0" ;
    time : interpolation = "time_interpolation" ;
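The bilinear scheme described above can be sketched in plain Python for a single scan. This is a simplified illustration: the function names are hypothetical, and placing the pixel centres at (index + 0.5) / n between the tie points is an assumption, since the real mapping is carried by the interpolation index variables.

```python
def bilinear(corners, u, v):
    """Bilinear interpolation inside one tie point cell.

    corners: ((t00, t01), (t10, t11)), first index along track, second along scan.
    u, v:    fractional position along track and along scan, both in [0, 1].
    """
    (t00, t01), (t10, t11) = corners
    first_row = t00 + (t01 - t00) * v
    second_row = t10 + (t11 - t10) * v
    return first_row + (second_row - first_row) * u


def expand_scan_times(corners, n_track, n_scan):
    """Time stamps for the n_track x n_scan pixels of one scan."""
    return [
        [bilinear(corners, (i + 0.5) / n_track, (j + 0.5) / n_scan)
         for j in range(n_scan)]
        for i in range(n_track)
    ]


# Identical time stamps at both track positions: scan start 0.0, scan end 1.0.
corners = ((0.0, 1.0), (0.0, 1.0))
times = expand_scan_times(corners, n_track=16, n_scan=4)
print(times[0])                 # [0.125, 0.375, 0.625, 0.875]
print(times[0] == times[15])    # True
```

Because the two tie point rows hold identical values, every detector row in the track direction receives the same time stamps, as intended.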

Step-Linear Interpolation If we just want to store one time stamp at the scan start and one time stamp at the scan end, as shown in the figure, one must use a step interpolation function in the track direction and linear interpolation in the scan direction to achieve the same results as above. The step interpolation maintains the same value until the next time stamp is encountered.

The CDL would be like:

dimensions :

  // VIIRS M-Band (750 m resolution imaging) 
  m_track = 768 ;
  m_scan = 3200 ;

  // VIIRS I-Band (375 m resolution imaging)
  i_track = 1536 ;
  i_scan = 6400 ; 

  // Time, stored once at scan-start and once at scan-end of each scan
  time_track = 48 ; //  each scan
  time_scan = 2;  // scan-start and scan-end 

variables:

  // Coordinate grids and interpolation, supporting both VIIRS M-Band and I-Band time stamp
  char time_interpolation ;
    time_interpolation : grids = "(time_track, time_scan) (tie_point)   (m_track, m_scan) (location)   (i_track, i_scan) (location)" ;
    time_interpolation : grid_coordinates = "time" ;
    time_interpolation : interpolation_name = "step_linear" ; // Step interpolation in first dimension, linear interpolation in second dimension.
    time_interpolation : interpolation_indices = "m_track: m_track_indices  m_scan: m_scan_indices  i_track: i_track_indices  i_scan: i_scan_indices" ; 

  // Interpolation indices
  int m_track_indices(time_track) ;
  int m_scan_indices(time_scan) ;
  int i_track_indices(time_track) ;
  int i_scan_indices(time_scan) ;

  // Time tie-points
  double time(time_track, time_scan) ;
    time : long_name = "time" ;
    time : units = "days since 1990-1-1 0:0:0" ;
    time : interpolation = "time_interpolation" ;
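For comparison, a step-linear expansion could look like the following Python sketch. The function name is hypothetical, and the pixel centres are again assumed to sit at (index + 0.5) / n_scan between scan start and scan end.

```python
def expand_step_linear(times, rows_per_scan, n_scan):
    """Step interpolation along track, linear interpolation along scan.

    times:         list of (scan_start, scan_end) tuples, one per scan.
    rows_per_scan: number of detector rows sharing one scan (16 for M-Band).
    n_scan:        number of pixels along the scan direction.
    """
    expanded = []
    for start, end in times:
        row = [start + (end - start) * (j + 0.5) / n_scan for j in range(n_scan)]
        # Step function: the same row is repeated until the next scan starts.
        expanded.extend(list(row) for _ in range(rows_per_scan))
    return expanded


times = [(0.0, 1.0), (2.0, 3.0)]
full = expand_step_linear(times, rows_per_scan=16, n_scan=4)
print(len(full))      # 32
print(full[0])        # [0.125, 0.375, 0.625, 0.875]
print(full[16])       # [2.125, 2.375, 2.625, 2.875]
```

With time_track = 48 and 16 rows per scan this yields the 768 M-Band rows of the example, using half the number of stored time stamps of the bilinear variant.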

Comments welcome! :-)

davidhassell commented 3 years ago

Hello,

Could someone explain the need for the grid offsets? Given that a tie points variable contains cell locations and cell bounds for each given tie point cell, I can't make the connection with an offset. You put the tie point coordinates in the right place, and simply apply the specified interpolation technique, no?

Thanks.

davidhassell commented 3 years ago

As @AndersMS has referenced (for which thanks), I have my arguments for not referencing the interpolation container on the coordinate variables, these arguments still apply even if the interpolation container is also on the data variable, but with the added disadvantage of increasing the chance of inconsistencies.

AndersMS commented 3 years ago

@davidhassell

Yes, I will try to explain the offsets.

We have two tie point schemes, tentatively called Cell Centred Tie Points and Offset Tie Points, see slide included below.

The first is the most straight forward.

The second uses tie points that are offset from the original grid points. It permits supporting multiple full resolution grids from the same tie point set. An example is VIIRS I-Band (375 m) and M-Band (750 m).

AndersMS commented 3 years ago

@davidhassell

In the process of recomputing the full resolution grid, the offsets are taken into account when interpolating between the tie points. Within an interpolation zone, the first full resolution grid point is typically (0.5, 0.5) cells away from the tie-point.
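As a one-dimensional illustration of how the offset enters the computation, the Python sketch below expands the interval between two tie points into 16 M-Band cells and 32 I-Band cells. It uses plain linear interpolation as a stand-in for the bi_quadratic method, and the function name and sample values are made up for the example.

```python
def expand_1d(tp_a, tp_b, n_cells, offset=0.5):
    """Linearly interpolate n_cells values between two tie points.

    offset is the distance, in units of cells, from the first tie point
    to the first full resolution grid point.
    """
    return [tp_a + (tp_b - tp_a) * (k + offset) / n_cells for k in range(n_cells)]


# The same pair of tie points serves both resolutions.
m_band = expand_1d(10.0, 26.0, n_cells=16)
i_band = expand_1d(10.0, 26.0, n_cells=32)

print(m_band[0], m_band[-1])   # 10.5 25.5
print(i_band[0], i_band[-1])   # 10.25 25.75
```

Neither grid has a point at the tie point positions 10.0 and 26.0, which is what the 0.5 cell offsets express.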

It makes it look a bit like the mesh issue 5, which is why we think there could be a potential synergy with that issue.

davidhassell commented 3 years ago

Re. a separate domain variable.

That issue is independent of subsampled coordinates. The subsampled coordinates question should be solved without it, and when (not if!) a domain variable is created for CF, it can be used with subsampled coordinates. The domain variable will have to work for all cell-locating metadata.

(When the domain variable does get discussed, I shall be suggesting that it can not be used to replace the metadata that is currently placed on data variables, i.e you can put a domain variable in a file, but you may not reference that domain variable from a data variable. Doing so would profoundly break backwards compatibility.)

davidhassell commented 3 years ago

Thanks, @AndersMS - I knew I'd seen a picture like that somewhere before! However, why not just move the tie points, rather than having to specify an offset?

AndersMS commented 3 years ago

@davidhassell

Imagine you have a tie point zone containing 16x16 VIIRS M-band pixels and 32x32 I-Band pixels, then none of those pixels have co-located pixel centres.

So there is nowhere to put the shared tie-point, other than somewhere offset.

AndersMS commented 3 years ago

@davidhassell

In this case an M-band pixel contains exactly four I-band pixels. So the M-band pixel centre is at the I-Band pixel boundary crossing.
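The geometry can be verified with exact fractions. In the sketch below, positions are expressed along one axis of a tie point zone of unit length, holding 16 M-Band or 32 I-Band pixels; the helper names are hypothetical.

```python
from fractions import Fraction


def centres(n):
    """Pixel centre positions along one axis of a zone of unit length."""
    return [Fraction(2 * k + 1, 2 * n) for k in range(n)]


def boundaries(n):
    """Pixel boundary positions along one axis of a zone of unit length."""
    return [Fraction(k, n) for k in range(n + 1)]


m_centres = centres(16)
i_centres = centres(32)
i_bounds = boundaries(32)

# No M-Band pixel centre coincides with an I-Band pixel centre ...
print(set(m_centres) & set(i_centres))            # set()
# ... but every M-Band centre lies on an I-Band pixel boundary.
print(all(c in i_bounds for c in m_centres))      # True
# Distance from the zone edge to the first centre, in units of own cells.
print(m_centres[0] * 16, i_centres[0] * 32)       # 1/2 1/2
```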

AndersMS commented 3 years ago

Regarding issue 5, the main potential for synergy would not be the subsampling itself, but could be the notation we are proposing (the grid part, but not the interpolation part here) to deal with associating dimensions and grids, including the offset grids, such as the cell grid and the boundary grid of issue 5.

AndersMS commented 3 years ago

@davidhassell See some further thoughts on issue 5 in this comment.

davidhassell commented 3 years ago

Hi @AndersMS, why can't you have two tie point variables - one for the M band data variables and one for the I band data variables, if I understand correctly? This is what is currently done for, e.g., staggered grids.

I know that using the SGRID technique could save a little bit of space, but perhaps it is better to do it the current way for simplicity, and any succinctness that can be gained from a parameterised staggering can be implemented as an enhancement at a later stage. I am not very familiar with SGRID and so am not aware of how widely used and stable it is as a standard (I know that UGRID is very close (for some time!) to being accepted into CF, though).

AndersMS commented 3 years ago

Hi @davidhassell ,

having two separate sets of tie points (each of 8 variables) for M- and I-band would be a possibility, taking up a bit more space. With the current proposal, we would then also have two interpolation containers, so all in all a bit more to keep consistent. We would also lose the explicit relationship between the M- and I-band grids - the two bands cover separate sets of wavelength channels and are from time to time utilised together.

During the past work we have also looked at instrument channels that are offset in the focal plane, leading to an offset in the geolocation data, see slide 12 here. We will be looking into addressing that with offsets.

So we would like to integrate the idea of offsets from the beginning. And we wished at least to check if there would be a potential for synergy with issue 5. If the notations of the two were well aligned, the issue 5 grids could even be compressed with the subsampling of this issue 37.

AndersMS commented 3 years ago

If there are no offsets involved, the attribute can be left out...

AndersMS commented 3 years ago

Hi Sylvain @oceandatalab

If you are looking at the paper that I referenced above, please note that for our ongoing work, I am proposing a simplified approach for deriving the pixel expansion coefficient (section 2.3.2) and the pixel alignment coefficient (section 2.3.3). So you don't need to spend time on that.

Rather than the somewhat complex geometrical derivation, I will be proposing that we derive the coefficients directly from the location data available in the full resolution product. You can think of it as a curve fitting to the actual data using our bi-quadratic method.

This is also what will bring the improvement discussed under Point 5 here. The method will in part absorb the errors introduced by using latitude/longitude as interpolation variables at higher latitudes. Further to this, it will absorb some of the errors that are introduced by assuming the Earth to be spherical when converting to and from cartesian coordinates locally within the tie point interpolation zone, but this is less significant than the former improvement.

Also note that the way we are storing the tie point zones and the mapping to the full resolution using the interpolation indices variables is more elegant and versatile than what we did for VIIRS.

AndersMS commented 3 years ago

Hi @davidhassell

You mention the domain variable in your comment above.

Would there be somewhere where I could read about the idea of the domain variable or could you possibly share with us your thoughts on how we could benefit from it?

Thank you in advance.

davidhassell commented 3 years ago

Hi @AndersMS,

> Would there be somewhere where I could read about the idea of the domain variable or could you possibly share with us your thoughts on how we could benefit from it?

In short, I don't think that the existence of a domain variable would help this proposal! It may be that if such a thing were to exist it could provide an alternative encoding, but the encoding based around data variables still needs to exist.

When the time comes (spoiler alert!) I will propose a "domain variable" that is essentially a data variable without a data array. Its role is to act as a container for location metadata that (like a data variable) allows the metadata to be treated holistically.

It would be a scalar variable of arbitrary type that may have many of the special attributes that a data variable has (e.g. coordinates and grid_mapping attributes), with the same meanings; but will not support attributes that only describe the physical nature of the data (such as cell_methods and ancillary_variables attributes).

In a dataset, a domain variable could exist with or without the presence of data variables. If data variables are present, it would, in my view, not be allowed for a data variable to reference a domain variable to describe its domain, i.e. the only way to attach a domain to a data variable is the current, implicit way via dimension names, coordinates attributes, etc. To replace the current framework with a reference to a domain variable would be to introduce a massive backwards incompatibility, for which there would have to be a compelling use case (which I haven't seen yet, but I've not talked to many people about this).

If this were to be an acceptable proposal, then enhancements to CF wouldn't really have to worry about it - a proposal would make changes to the structure of a data variable as normal, and the domain variable would simply inherit those with an identical syntax.

Hope that helps, David

davidhassell commented 3 years ago

Hi @AndersMS, further to your recent comments on offsets (https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-686550222), you mention that you would need "two interpolation containers".

I presume that each data variable would only need one interpolation container, though. Also, (and I need to go back and check this) I think we had in earlier versions of this proposal that the tie-point specific variables and attributes would live on the data variable, just for this reason - so that one interpolation container could be used for many data variables, just like grid mapping variables can be. I'll see if I can find reference to that ....

AndersMS commented 3 years ago

Yes, in the earlier version we had the resolution specific interpolation_indices and interpolation_offsets attributes in the data variables m_radiance and i_radiance respectively. All shared tie point coordinate variables and interpolation coefficients and flags were referenced from the container variable. The shared tie point coordinate sets and interpolation coefficients and flags are made possible through the offset attribute.

In the current version, we have all attributes down in the shared container and just the reference to the container in the data variable. I am starting to like that approach compared to the earlier version, as everything is gathered in the container variable. That would be closer to the grid mapping pattern and any data variable on either M-Band or I-Band resolution can reference that single shared container via a single attribute, like i_radiance : interpolation = "tp_interpolation".

Yes, if we do not have the offset attribute, all data variables on the M-Band resolution (like m_radiance) would reference a single shared container and all data variables on the I-Band resolution (like i_radiance) would reference another single shared container. Those two containers would each reference their own tie point coordinate variables and interpolation coefficients and flags.

Hope that is clear...

AndersMS commented 3 years ago

Hi @davidhassell

Thank you for the detailed explanation of the domain variable, this was very helpful and interesting! I fully appreciate that this would not impact our current proposal.

Just to check if I understood the idea, would this classical example:

dimensions:
  rlon = 128 ;
  rlat = 64 ;
  lev = 18 ;
variables:
  float T(lev,rlat,rlon) ;
    T:long_name = "temperature" ;
    T:units = "K" ;
    T:coordinates = "lon lat" ;
    T:grid_mapping = "rotated_pole" ;
  char rotated_pole ;
    rotated_pole:grid_mapping_name = "rotated_latitude_longitude" ;
    rotated_pole:grid_north_pole_latitude = 32.5 ;
    rotated_pole:grid_north_pole_longitude = 170. ;
  float rlon(rlon) ;
    rlon:long_name = "longitude in rotated pole grid" ;
    rlon:units = "degrees" ;
    rlon:standard_name = "grid_longitude";
  float rlat(rlat) ;
    rlat:long_name = "latitude in rotated pole grid" ;
    rlat:units = "degrees" ;
    rlat:standard_name = "grid_latitude";
  float lev(lev) ;
    lev:long_name = "pressure level" ;
    lev:units = "hPa" ;
  float lon(rlat,rlon) ;
    lon:long_name = "longitude" ;
    lon:units = "degrees_east" ;
  float lat(rlat,rlon) ;
    lat:long_name = "latitude" ;
    lat:units = "degrees_north" ;

in the world of domain variables be something like:

dimensions:
  rlon = 128 ;
  rlat = 64 ;
  lev = 18 ;
variables:

  char myDomain ;
    myDomain:coordinates = "lat lon rlat rlon" ;
    myDomain:grid_mapping = "rotated_pole" ;
  char rotated_pole ;
    rotated_pole:grid_mapping_name = "rotated_latitude_longitude" ;
    rotated_pole:grid_north_pole_latitude = 32.5 ;
    rotated_pole:grid_north_pole_longitude = 170. ;

  float T(lev,rlat,rlon) ;
    T:long_name = "temperature" ;
    T:units = "K" ;
    T:coordinates = "lon lat" ;
  float rlon(rlon) ;
    rlon:long_name = "longitude in rotated pole grid" ;
    rlon:units = "degrees" ;
    rlon:standard_name = "grid_longitude";
  float rlat(rlat) ;
    rlat:long_name = "latitude in rotated pole grid" ;
    rlat:units = "degrees" ;
    rlat:standard_name = "grid_latitude";
  float lev(lev) ;
    lev:long_name = "pressure level" ;
    lev:units = "hPa" ;
  float lon(rlat,rlon) ;
    lon:long_name = "longitude" ;
    lon:units = "degrees_east" ;
  float lat(rlat,rlon) ;
    lat:long_name = "latitude" ;
    lat:units = "degrees_north" ;

where the grid_mapping is applied between (lat, lon) and (rlat, rlon), even if T does not have the attribute T:grid_mapping = "rotated_pole" ; ?

davidhassell commented 3 years ago

Hi @AndersMS

Re. The domain variable example - looks good, but not quite what I had in my mind - the data variable T should be the same in both cases:

dimensions:
  rlon = 128 ;
  rlat = 64 ;
  lev = 18 ;
variables:

  char myDomain ;
    myDomain:coordinates = "lat lon rlat rlon" ;
    myDomain:grid_mapping = "rotated_pole" ;
  char rotated_pole ;
    rotated_pole:grid_mapping_name = "rotated_latitude_longitude" ;
    rotated_pole:grid_north_pole_latitude = 32.5 ;
    rotated_pole:grid_north_pole_longitude = 170. ;
  float T(lev,rlat,rlon) ;
    T:long_name = "temperature" ;
    T:units = "K" ;
    T:coordinates = "lon lat" ;
    T:grid_mapping = "rotated_pole" ;
  float rlon(rlon) ;
    rlon:long_name = "longitude in rotated pole grid" ;
    rlon:units = "degrees" ;
    rlon:standard_name = "grid_longitude";
  float rlat(rlat) ;
    rlat:long_name = "latitude in rotated pole grid" ;
    rlat:units = "degrees" ;
    rlat:standard_name = "grid_latitude";
  float lev(lev) ;
    lev:long_name = "pressure level" ;
    lev:units = "hPa" ;
  float lon(rlat,rlon) ;
    lon:long_name = "longitude" ;
    lon:units = "degrees_east" ;
  float lat(rlat,rlon) ;
    lat:long_name = "latitude" ;
    lat:units = "degrees_north" ;

So, the presence of a domain variable does not serve to enhance our understanding of the data variable, rather it's there as a convenience to those who want to access the domain independently.

AndersMS commented 3 years ago

@davidhassell Thank you for explaining your thoughts. That would make sense, so T and myDomain must be kept consistent.

davidhassell commented 3 years ago

Hi @AndersMS

Re interpolation containers and offsets (https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-687215480), thanks, I think I'm clear now. The approach you are now suggesting is reminiscent of geometry containers (CF section 7.5), which also reference coordinate-like variables that do not span the data variable's dimensions.

I haven't yet understood how a single interpolation container can apply to two data variables, each with a different grid. How do you know how to (or whether you should) apply the given offset information? I haven't fully understood the proposed syntax (e.g. "m_track - tp_track = 0.5 m_scan - tp_scan = 0.5 i_track - tp_track = 0.5 i_scan - tp_scan = 0.5")

Another related thought is that an offset of 0.5 needs clarifying. For example, is it half way between the mid points of adjacent full resolution cells, as defined by their bounds? I don't think that we can rely on the tie point coordinates being in those locations.

AndersMS commented 3 years ago

@davidhassell If you will be in the teleconference starting in a few minutes, I will attempt to explain the points you raise. If you are not, I will write after the teleconference.

davidhassell commented 3 years ago

Hi @AndersMS - many thanks for your presentation at the telco - very informative! It's helped coalesce my thoughts.

I'm concerned that we've drifted off into a "not very CF" view of the world.

I think we can have:

shared tie-points between different resolution data (E.g. M and I band)
a single non-complicated interpolation container
enhanced data discovery.

My main points of concern are:

I don't find it useful to define items/relationships in the interpolation container that are not all used by each parent data variable. E.g. all of the "i" grid stuff is not relevant to the "m" grid stuff, and vice versa. I'm not sure it is a good idea to have to pick out by inspection the parts that are relevant to each data variable. Imagine if you were allowed to name unrelated coordinate variables in the coordinates attribute.
I don't see what the "grids" attribute adds for us. Happy to be informed.
The tie points need to be grouped, e.g. we don't want to bi-linearly interpolate lat with sen_zen_ang
Explicit is better than implicit, if possible. I.e. there is a lot of merit in being able to see which coordinates a data variable has just by looking at its attributes, rather than a 2-step redirection process. This is not to say that all misdirection is bad! But a lot of data discovery can take place if you know that there are, say, lats and lons, regardless of their values.

So ....

dimensions :
  // VIIRS M-Band (750 m resolution imaging) 
  m_track = 768 ;
  m_scan = 3200 ;
  m_channel = 16 ;

  // VIIRS I-Band (375 m resolution imaging)
  i_track = 1536 ;
  i_scan = 6400 ; 
  i_channel = 5 ;

  // Tie points and interpolation zones (shared between VIIRS M-Band and I-Band)
  tp_track = 96 ;
  tp_scan = 205 ;
  track_interpolation_zone = 48 ;
  scan_interpolation_zone = 200 ;

  // Time, stored at scan-start and scan-end of each scan
  time_scan = 2;

variables:
  // VIIRS M-Band 
  float m_radiance(time_scan, m_track, m_scan, m_channel) ;
    m_radiance:interpolation_tie_points = "m_track: m_scan: (lat lon) (sen_azi_ang  sen_zen_ang)
                                       (sol_azi_ang sol_zen_ang) time_scan: m_track: t" ;
    m_radiance:interpolation_tie_point_indices = "m_track: m_scan_indices m_track: m_track_indices
                                               time_scan: m_track: t_indices" ;
    m_radiance:interpolation = "m_track: m_scan: tp_interpolation time_scan: m_track: time_interpolation" ;

  // VIIRS I-Band 
  float i_radiance(i_track, i_scan, i_channel, time_scan) ;
    i_radiance:interpolation_tie_points = "i_track: i_scan: (lat lon) (sen_azi_ang  sen_zen_ang)
                                    (sol_azi_ang sol_zen_ang) time_scan: i_track: t" ;
    i_radiance:interpolation_tie_point_indices = "i_track: i_scan_indices i_track: i_track_indices" ;
    i_radiance:interpolation = "i_track: i_scan: tp_interpolation time_scan: i_track: time_interpolation" ;

  // Coordinate grids and interpolation, supporting both VIIRS M-Band and I-Band
  char tp_interpolation ;
    tp_interpolation : offsets = "tp_track: 0.5 tp_scan: 0.5" ;
    tp_interpolation : interpolation_name = "bi_quadratic" ;
    tp_interpolation : interpolation_coefficients = "expansion_coefficient_track  
                                       alignment_coefficient_track  expansion_coefficient_scan
                                       alignment_coefficient_scan" ;
    tp_interpolation : interpolation_flags = "interpolation_zone_flags" ;

  // Time interpolation
  char time_interpolation ;
    time_interpolation : interpolation_name = "bi_linear" ;

  // Interpolation indices
  int m_track_indices(tp_track) ;
  int m_scan_indices(tp_scan) ;
  int i_track_indices(tp_track) ;
  int i_scan_indices(tp_scan) ;

  // Tie points
  float lat(tp_track, tp_scan) ;
    lat : standard_name = "latitude" ;
    lat : units = "degrees_north" ;
    lat : interpolation = "tp_interpolation" ;
  float lon(tp_track, tp_scan) ;
    lon : standard_name = "longitude" ;
    lon : units = "degrees_east" ;
    lon : interpolation = "tp_interpolation" ;
  float sen_azi_ang(tp_track, tp_scan) ;
    sen_azi_ang : standard_name = "sensor_azimuth_angle" ;
    sen_azi_ang : units = "degrees" ;
    sen_azi_ang : interpolation = "tp_interpolation" ;
  float sen_zen_ang(tp_track, tp_scan) ;
    sen_zen_ang : standard_name = "sensor_zenith_angle" ;
    sen_zen_ang : units = "degrees" ;
    sen_zen_ang : interpolation = "tp_interpolation" ;
  float sol_azi_ang(tp_track, tp_scan) ;
    sol_azi_ang : standard_name = "solar_azimuth_angle" ;
    sol_azi_ang : units = "degrees" ;
    sol_azi_ang : interpolation = "tp_interpolation" ;
  float sol_zen_ang(tp_track, tp_scan) ;
    sol_zen_ang : standard_name = "solar_zenith_angle" ;
    sol_zen_ang : units = "degrees" ;
    sol_zen_ang : interpolation = "tp_interpolation" ;

  // Interpolation coefficients and flags
  short expansion_coefficient_track(track_interpolation_zone, tp_scan) ;
  short alignment_coefficient_track(track_interpolation_zone, tp_scan) ;
  short expansion_coefficient_scan(tp_track, scan_interpolation_zone) ;
  short alignment_coefficient_scan(tp_track, scan_interpolation_zone) ;
  byte interpolation_zone_flags(track_interpolation_zone, scan_interpolation_zone) ;
    interpolation_zone_flags : valid_range = "1b, 7b" ;
    interpolation_zone_flags : flag_masks = "1b, 2b, 4b" ;
    interpolation_zone_flags : flag_meanings = "location_use_cartesian  sensor_direction_use_cartesian  solar_direction_use_cartesian" ;

  double t(tp_track, time_scan) ;
    t : long_name = "time" ;
    t : units = "days since 1990-1-1 0:0:0" ;
    t : interpolation = "time_interpolation" ;

Notes:

Each data variable explicitly tells which types of coordinates it has
The interpolation container is bound to particular tie point dimensions. Tie point dimensions are related to data variable dimensions by the interpolation_tie_point_indices attribute of the data variable. (There is some misdirection here, but that's OK by me, because you only have to do it if you're serious about recreating the full resolution coordinates - and by that stage that's the least of your worries!)
interpolation container offsets need only be defined in terms of the tie point variable dimensions
I hope I've got the time stuff right - I was taking my cues from the fact the time tie point variable is two dimensional. Not sure what the time tie point indices are (perhaps they were missing from the CDL I copied? Or perhaps they're not necessary because we assume that they're at the "end points"?)
The "offset" attribute of the interpolation container is optional and defaults to 0
all of the data variable's interpolation_* attributes relate stuff to dimensions of the data variable, using the established notation described by cell methods for (multiple) dimensions.
I can't think of another example in CF of grouping elements within a string so I made up, for now, the "(lat lon)" syntax of the interpolation_tie_points attribute
The penalty (if you view it as such) for all this is that all the data variables have two extra attributes, but as alluded to, this is very useful for data discovery, as well as for software trying to make head or tail of this all.
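
As an illustration of the notation described in these notes, here is a minimal Python sketch of how software might split an interpolation_tie_points string into its dimension lists and tie point groups. The function name parse_tie_points and the returned structure are assumptions made for this example only; the "(lat lon)" grouping syntax is itself provisional, as noted above.

```python
import re


def parse_tie_points(attr):
    """Split a cell-methods-like string into (dimensions, groups) entries.

    Tokens ending in ":" are read as data variable dimensions, a
    parenthesised token is read as a group of tie point variables, and a
    bare token is read as a group containing a single variable.
    """
    entries = []
    dims, groups = [], []
    for token in re.findall(r"\([^()]*\)|\S+", attr):
        if token.endswith(":"):
            # A dimension that follows some groups starts a new entry.
            if groups:
                entries.append((tuple(dims), groups))
                dims, groups = [], []
            dims.append(token[:-1])
        elif token.startswith("("):
            groups.append(tuple(token[1:-1].split()))
        else:
            groups.append((token,))
    if dims or groups:
        entries.append((tuple(dims), groups))
    return entries


m_radiance_tie_points = (
    "m_track: m_scan: (lat lon) (sen_azi_ang  sen_zen_ang) "
    "(sol_azi_ang sol_zen_ang) time_scan: m_track: t"
)
parsed = parse_tie_points(m_radiance_tie_points)
```

For the m_radiance string this yields two entries: one for (m_track, m_scan) with three groups, and one for (time_scan, m_track) with the single variable t.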

I hope that this makes sense - keen to send it off in case you get a chance to look at it tonight (those in Europe!).

Let's discuss!

AndersMS commented 3 years ago

@davidhassell , thank you, I will take a look straight away...

oceandatalab commented 3 years ago

The first thing that strikes me is... that it is still not possible to reconstruct lat, lon and other coordinates without the m_radiance or i_radiance variables as the mapping between subsampled and full dimensions is defined in the :interpolation_* attributes that you tied to the data variables.

But I agree that trying to define several grids in the same container variables may be confusing and only saves a little space at the cost of an increased complexity.

I think having a container for the I band and another container for the M band would solve all the issues: the :interpolation_* would go back in these container variables so data variables would only need the :interpolation attribute and the coordinates variables would also have access to all the information they need for their reconstruction. Each interpolation container variable could also have the :grid_coordinates attribute that Anders had in his previous example, so data variables would have an indirect access to their subsampled coordinates variables expressed in a backward-compatible way with the same syntax as the :coordinates attribute.

Regarding your comment on lat not needing sen_zen_ang, I also agree that we need a way to say which variables need each other during the interpolation (that is also related to the discussion subject named Point12)

AndersMS commented 3 years ago

Hi @davidhassell

This is a very nice proposal in that it is easier to read and understand, technically simpler with fewer redirections, while maintaining the proposed set of features.

I agree that we do not need the grids attribute, if it is acceptable to associate the tie point dimensions with the full resolution dimensions via the interpolation_tie_point_indices attribute. In the mesh variable issue, there are several exchanges on how to establish an appropriate association of dimensions, including some of your own comments. That was the background for making the association more explicit. But your proposal is fine with me, especially with the m_track: m_scan notation for pairing dimensions at the same resolution.

Grouping the tie points makes it more structured and readable. If you go all the way back here, we had

tp_interpolation : tie_points = "location(lat, lon) sensor_direction(sen_azi_ang, sen_zen_ang) solar_direction(sol_azi_ang, sol_zen_ang)"

which is similar to what you propose. But then we argued that we could pair the coordinates via their standard_name, like sensor_azimuth_angle with sensor_zenith_angle, and removed the grouping.

I agree we can get away with a single pair of offsets, they can be shared between the bands. We may have to refine this or find an alternative mechanism when we look into instruments where different channels have different offsets.

I like having references to the full resolution dimensions only in the data variables and references to the tie point dimensions only in the interpolation container variable. It separates the two domains nicely.

The attribute interpolation_tie_points could alternatively be called interpolation_coordinates.

Nice with the link to the CF cell method notation, that was also a point brought up by @oceandatalab I think.

The self-containedness of the interpolation container variable is no longer there, as pointed out by @oceandatalab, I am not sure how to deal with that. Introducing an interpolation container for I-band and one for M-band would lead to some redundancy and would not give the same ease of data discovery.

@oceandatalab : Consider the following line of argumentation:

Considering that there could be many data variables and many related coordinate variable sets in a single NetCDF CF file, one could argue that it would be natural to start at the data variable end. Through the data variable attributes, one can establish the data type and where the time coordinates can be found. This context information would add value to an examination of the coordinate data variables and make it possible to distinguish between different coordinate data variable sets, if there are multiple sets in the file.
Reading the data variable attributes can be done efficiently and without reading into memory the data variables themselves.

Would that possibly justify the detour via the data variables, even if the main interest is to examine the coordinate data variables?

oceandatalab commented 3 years ago

I think we have a different understanding of what "discoverability" means.

Introducing an interpolation container for I-band and one for M-band is better for data discoverability because it makes the CDL declarative, no logic/choice/interpretation involved, a robot just has to follow the track of attributes to get all the metadata related to a variable.

It also makes documentation for the convention a lot easier to write, and certainly a lot easier to understand too, which is an important aspect that I think we have to take into account in our discussions.

If you keep the :interpolation_tie_points on the data variables, then you will have to repeat it for all the variables that are defined on the same grid, whereas you would only have to define it once in the container variable. Given all our previous discussions and your insistence on minimizing redundancy, I fail to see how this is more acceptable than having two container variables.

As for requiring a data variable to reconstruct coordinates, you are making assumptions on how people will use the data, on what entry point they will choose when reading the file and more generally on what they are interested in. In my opinion that is not a good approach: the CDL should only describe data in a non-opinionated / neutral way, otherwise you are just telling potential users that your files are not meant for them if they come up with a use case that you did not think of. If I read a CF file that follows the current CF convention, I can get lat and lon directly without worrying about the relation they have with data variables: why would you make files that follow future CF convention more cumbersome to use?

So no, from my point of view it doesn't justify the detour. I thought we had finally converged on this question...

davidhassell commented 3 years ago

Lots of points here - thanks. I'll try to answer them in a series of comments (with no particular grouping) so I don't get confused ...

@oceandatalab (https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-689034991)

The first thing that strikes me is... that it is still not possible to reconstruct lat, lon and other coordinates without the m_radiance or i_radiance variables as the mapping between subsampled and full dimensions is defined in the :interpolation_* attributes that you tied to the data variables.

I think that this is beyond the scope of this proposal. When the domain variable is implemented in CF, you will be able to reconstruct the full coordinates in the absence of a data variable (see https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-687165568).

I think having a container for the I band and another container for the M band would solve all the issues: the :interpolation_* would go back in these container variables so data variables would only need the :interpolation attribute and the coordinates variables would also have access to all the information they need for their reconstruction. Each interpolation container variable could also have the :grid_coordinates attribute that Anders had in his previous example, so data variables would have an indirect access to their subsampled coordinates variables expressed in a backward-compatible way with the same syntax as the :coordinates attribute.

For me, this is an example of "not very CF". The logical nature of the interpolation is the same for both the I band and M band variables, so there should be a single container for describing it. This is similar to the grid mapping case: the grid mapping is defined once, but can be used for data on all staggers of an Arakawa C grid, each of which have different coordinates.

davidhassell commented 3 years ago

@AndersMS (https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-689133230)

Grouping the tie points makes it more structured and readable. If you go all the way back here, we had

tp_interpolation : tie_points = "location(lat, lon) sensor_direction(sen_azi_ang, sen_zen_ang) solar_direction(sol_azi_ang, sol_zen_ang)"

which is similar to what you propose. But then we argued that we could pair the coordinates via their standard_name , like sensor_azimuth_angle with sensor_zenith_angle, and removed the grouping.

Sorry - I had missed that. I'm not so keen on relying on standard names for identifying the groupings because standard names are optional (even for lon and lat coordinates), and the pairing would have to be defined in a controlled vocabulary (so that software can work it out) - and that CV could never be exhaustive.

I agree we can get away with a single pair of offsets, they can be shared between the bands. We may have to refine this or find an alternative mechanism when we look into instruments where different channels have different offsets.

I would say that in this latter case, a new interpolation container would be required, as the nature of the interpolation has changed (by virtue of the offsets changing).

If you keep the :interpolation_tie_points on the data variables, then you will have to repeat it for all the variables that are defined on the same grid, whereas you would only have to define it once in the container variable. Given all our previous discussions and your insistence on minimizing redundancy, I fail to see how this is more acceptable than having two container variables.

Yes, that's right, which is the part of my proposal that increases verbosity. However, this is no different, I feel, to listing variables with the coordinates attribute on every data variable, which is standard practice. It also makes it easier to "see at a glance" which coordinates a data variable has. This last point is subjective, but I'm thinking of library software and archive retrieval processes - "get me all of the data with solar_azimuth_angle coordinates" is facilitated by putting them on the data variable, and the software involved does not need to know about interpolation containers to do it (it does need to know about the interpolation_tie_points attribute, but that is a much easier modification for existing software)
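
To make the retrieval argument concrete, here is a rough Python sketch of the "get me all of the data with solar_azimuth_angle coordinates" query, working only from attributes and never touching an interpolation container. The in-memory dictionary standing in for a file's metadata, and the helper names tie_point_names and find_data_variables, are assumptions for illustration only.

```python
def tie_point_names(attr):
    """Return the variable names listed in an interpolation_tie_points string."""
    tokens = attr.replace("(", " ").replace(")", " ").split()
    # Tokens ending in ":" are dimension names, everything else is a variable.
    return [token for token in tokens if not token.endswith(":")]


def find_data_variables(variables, standard_name):
    """Names of data variables with a tie point variable of the given standard_name."""
    hits = []
    for name, attrs in variables.items():
        listed = tie_point_names(attrs.get("interpolation_tie_points", ""))
        if any(variables.get(v, {}).get("standard_name") == standard_name
               for v in listed):
            hits.append(name)
    return hits


# Hypothetical attribute dictionary mimicking part of the CDL in this thread.
variables = {
    "m_radiance": {
        "interpolation_tie_points":
            "m_track: m_scan: (lat lon) (sol_azi_ang sol_zen_ang)",
    },
    "i_radiance": {
        "interpolation_tie_points":
            "i_track: i_scan: (lat lon) (sol_azi_ang sol_zen_ang)",
    },
    "quality_flag": {},
    "lat": {"standard_name": "latitude"},
    "lon": {"standard_name": "longitude"},
    "sol_azi_ang": {"standard_name": "solar_azimuth_angle"},
    "sol_zen_ang": {"standard_name": "solar_zenith_angle"},
}

matches = find_data_variables(variables, "solar_azimuth_angle")
```

Here matches contains m_radiance and i_radiance but not quality_flag, which has no interpolation_tie_points attribute.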

The attribute interpolation_tie_points could alternatively be called interpolation_coordinates.

Absolutely. I make little claim that my suggested names are suitable or meaningful.

The self-containedness of the interpolation container variable is no longer there, as pointed out by @oceandatalab, I am not sure how to deal with that. Introducing an interpolation container for I-band and one for M-band would lead to some redundancy and would not give the same ease of data discovery.

I think that this will be solved by the future domain variable.

The main reason why the domain variable hasn't been implemented in the last couple of years is, I think, due to a lack of a clear use case with vocal supporters. This could be that use case, and we could be those supporters! It's a separate issue, but there's nothing stopping us proposing it now ....

davidhassell commented 3 years ago

@oceandatalab (https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-689171635)

I think we have a different understanding of what "discoverability" means.

Introducing an interpolation container for I-band and one for M-band is better for data discoverability because it makes the CDL declarative, no logic/choice/interpretation involved, a robot just has to follow the track of attributes to get all the metadata related to a variable.

It also makes documentation for the convention a lot easier to write, and certainly a lot easier to understand too, which is an important aspect that I think we have to take into account in our discussions.

You are absolutely right that there are different types of discoverability - the ones I had in mind are described in my previous comment (https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-689406387). I disagree that what you propose eases understanding - storing all of the domain information outside of the variable to which it applies makes it harder to see the domain of the data. I think the discovery you have in mind will be solved by the proposed domain variable, i.e. a well defined container that describes the grid using the same rules as all other grid discovery, but without the baggage of a data array and attributes to describe its physical nature.

If you keep the :interpolation_tie_points on the data variables, then you will have to repeat it for all the variables that are defined on the same grid, whereas you would only have to define it once in the container variable. Given all our previous discussions and your insistence on minimizing redundancy, I fail to see how this is more acceptable than having two container variables.

Sorry - I answered this mistaking it for Anders' comment. What I wrote in the previous post was:

Yes, that's right, which is the part of my proposal that increases verbosity. However, this is no different, I feel, to listing variables with the coordinates attribute on every data variable, which is standard practice. It also makes it easier to "see at a glance" which coordinates a data variable has. This last point is subjective, but I'm thinking of library software and archive retrieval processes - "get me all of the data with solar_azimuth_angle coordinates" is facilitated by putting them on the data variable, and the software involved does not need to know about interpolation containers to do it (it does need to know about the interpolation_tie_points attribute, but that is a much easier modification for existing software)

As for requiring a data variable to reconstruct coordinates, you are making assumptions on how people will use the data, on what entry point they will choose when reading the file and more generally on what they are interested in. In my opinion that is not a good approach: the CDL should only describe data in a non-opinionated / neutral way, otherwise you are just telling potential users that your files are not meant for them if they come up with a use case that you did not think of. If I read a CF file that follows the current CF convention, I can get lat and lon directly without worrying about the relation they have with data variables: why would you make files that follow future CF convention more cumbersome to use?

I thoroughly sympathise with this approach, but unfortunately that is not how CF was conceived. In CF the data variable is king. The domain variable is the answer to this! It won't demote the data variable, but will allow for the holistic grouping of coordinates (and other cell information) in an independent abstract manner. Remember that the domain variable will apply to all grids, not just the ones under discussion here, so it is a complete solution to the issue that also affects other communities.

So no, from my point of view it doesn't justify the detour. I thought we had finally converged on this question...

I apologise for not being around for the weeks when this was being discussed. These are themes that I have been discussing on the CF issue and in the off-list discussions for some time, and I wish I could have mentioned them sooner, if only I had not been away.

I think that we would benefit from another viewpoint from someone outside of the remote sensing community. I will see if I can recruit someone.

AndersMS commented 3 years ago

@davidhassell

Grouping the tie points makes it more structured and readable. If you go all the way back here, we had

tp_interpolation : tie_points = "location(lat, lon) sensor_direction(sen_azi_ang, sen_zen_ang) solar_direction(sol_azi_ang, sol_zen_ang)"

which is similar to what you propose. But then we argued that we could pair the coordinates via their standard_name , like sensor_azimuth_angle with sensor_zenith_angle, and removed the grouping.

Sorry - I had missed that. I'm not so keen on relying on standard names for identifying the groupings because standard names are optional (even for lon and lat coordinates), and the pairing would have to be defined in a controlled vocabulary (so that software can work it out) - and that CV could never be exhaustive.

1) It is a bit more complicated than that. When we interpolate lat/lon or zen/azi angles, we have to revert to interpolation in cartesian coordinates when getting close to the singular points. That would be close to lat=+/-90 for lat/lon and close to zen=0 for zen/azi. So we need to know what pair of coordinates is what and we need to know which of the two coordinates are lat and lon and zen and azi, respectively. Couldn't we require that tie points have standard name attributes, considering that we are introducing a new scheme? If not, we would need something like:

tp_interpolation : tie_points = "location(lat, lon) sensor_direction(sen_azi_ang, sen_zen_ang) solar_direction(sol_azi_ang, sol_zen_ang)"

2) I understand that standard names are optional. In general, how would a tool find out what is lat and lon, if no standard names are attached?
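
To illustrate point 1), here is a minimal Python sketch of why the two members of a pair must be known: near lat = +/-90, interpolating lat and lon independently gives a misleading result, whereas converting the pair to cartesian coordinates, interpolating there and converting back does not. This sketch uses plain linear interpolation between two tie points on a unit sphere, which is a simplification chosen for the example and not the bi_quadratic scheme of the proposal; all function names are hypothetical.

```python
import math


def to_cartesian(lat_deg, lon_deg):
    """Unit vector for a (lat, lon) position given in degrees."""
    lat, lon = math.radians(lat_deg), math.radians(lon_deg)
    return (math.cos(lat) * math.cos(lon),
            math.cos(lat) * math.sin(lon),
            math.sin(lat))


def to_lat_lon(x, y, z):
    """(lat, lon) in degrees for a cartesian vector of any non-zero length."""
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    lon = math.degrees(math.atan2(y, x))
    return lat, lon


def interpolate_cartesian(p0, p1, s):
    """Interpolate between two (lat, lon) tie points at fraction s in [0, 1]."""
    a, b = to_cartesian(*p0), to_cartesian(*p1)
    x, y, z = (u + s * (v - u) for u, v in zip(a, b))
    return to_lat_lon(x, y, z)


# Two tie points on opposite sides of the north pole.
p0, p1 = (89.0, 0.0), (89.0, 180.0)

# Interpolating lat and lon independently stays at latitude 89.
naive_mid = ((p0[0] + p1[0]) / 2, (p0[1] + p1[1]) / 2)

# Interpolating in cartesian space passes over the pole, as expected.
cartesian_mid = interpolate_cartesian(p0, p1, 0.5)
```

The same argument would apply to zen/azi pairs close to zen = 0, with the zenith and azimuth angles taking the roles of the colatitude and the longitude.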

davidhassell commented 3 years ago

Hi @AndersMS ,

1) OK, I hadn't appreciated that. I'll have a think about it.

2) For lat and lon, having units of degrees_east or degrees_north (or equivalent strings) is sufficient for identification (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#latitude-coordinate)
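
A short Python sketch of that units-based identification follows. The unit spellings are the ones the CF conventions list as acceptable for latitude and longitude; the function name coordinate_type and the use of a plain attribute dictionary are assumptions for this example.

```python
# Unit strings that CF accepts for latitude and longitude coordinates.
LATITUDE_UNITS = {"degrees_north", "degree_north", "degree_N",
                  "degrees_N", "degreeN", "degreesN"}
LONGITUDE_UNITS = {"degrees_east", "degree_east", "degree_E",
                   "degrees_E", "degreeE", "degreesE"}


def coordinate_type(attrs):
    """Classify a variable as latitude or longitude from its units alone."""
    units = attrs.get("units")
    if units in LATITUDE_UNITS:
        return "latitude"
    if units in LONGITUDE_UNITS:
        return "longitude"
    return None


lat_type = coordinate_type({"units": "degrees_north"})
lon_type = coordinate_type({"units": "degrees_east"})
angle_type = coordinate_type({"units": "degrees"})
```

Note that this only works for lat and lon: the zenith and azimuth angle variables in the example all have units of "degrees", so their roles cannot be recovered from units in the same way.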

I'm pushed for time for the rest of today, but will try to join in any discussion towards the end of the day. Thanks

AndersMS commented 3 years ago

Hi @davidhassell

I agree we can get away with a single pair of offsets, they can be shared between the bands. We may have to refine this or find an alternative mechanism when we look into instruments where different channels have different offsets.

I would say that in this latter case, a new interpolation container would be required, as the nature of the interpolation has changed (by virtue of the offsets changing).

Thinking a bit more about it, I am not sure we can get away with a single pair of offsets:

The offset is measured in units of full-resolution cells, i.e. I-band or M-band full-resolution cells respectively. This would indicate that it should be an attribute of the data variable and not of the proposed simplified interpolation container, which is on the tie point dimensions.
The fact that the numerical offset value is 0.5 for both VIIRS bands is a coincidence. Our colleagues at SMHI in Sweden applied the VIIRS tie point scheme to the Aqua MODIS instrument. MODIS has three resolutions: 1 km, 500 m and 250 m. If one chose the corner of the 1 km pixel as the first tie point, then the offsets would be (0.5, 0.5) for 1 km, (0.5, 1.0) for 500 m and (0.5, 2.0) for 250 m, see this graphical representation of the MODIS grids.
The best would be if we could find a single concept for the offsets, which can be applied to all channels, to sets of channels, like the M-Band and I-Band in our example, or to individual channels, like in the GCOM-W AMSR2 example brought in by @TomLav. I will take a look at the GCOM-W AMSR2 example this afternoon, to see if things could be arranged to work together.
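To illustrate why the offsets have to be tied to a resolution, here is a minimal sketch using the MODIS values quoted above. It rests on several assumptions that are not stated in the thread: that each pair is ordered (track, scan), that the nominal cell sizes are exactly 1 km, 500 m and 250 m, and that the centre of a full-resolution cell lies at `(index + offset)` cells from the first tie point. The names `MODIS_OFFSETS`, `CELL_SIZE_KM` and `cell_centre_km` are hypothetical.

```python
# Offsets of the full-resolution grid relative to the first tie point, in units
# of full-resolution cells. Values are taken from the comment above; the
# (track, scan) ordering is an assumption.
MODIS_OFFSETS = {"1km": (0.5, 0.5), "500m": (0.5, 1.0), "250m": (0.5, 2.0)}

# Nominal cell sizes in km (assumed, for illustration only).
CELL_SIZE_KM = {"1km": 1.0, "500m": 0.5, "250m": 0.25}


def cell_centre_km(resolution, track_index, scan_index):
    """Distance (track, scan) in km from the first tie point to a cell centre.

    Assumes the centre of cell i lies (i + offset) full-resolution cells
    away from the first tie point along each dimension.
    """
    offset_track, offset_scan = MODIS_OFFSETS[resolution]
    size = CELL_SIZE_KM[resolution]
    return ((track_index + offset_track) * size, (scan_index + offset_scan) * size)


for resolution in MODIS_OFFSETS:
    print(resolution, cell_centre_km(resolution, 0, 0))
```

The same offset number means a different physical distance at each resolution, which is the reason a single shared pair of offsets on the tie point dimensions may not be enough.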

AndersMS commented 3 years ago

Hi @davidhassell and @oceandatalab ,

@davidhassell wrote:

I think that we would benefit from another viewpoint from someone outside of the remote sensing community. I will see if I can recruit someone.

Although he is not outside of the remote sensing community, you could consider approaching Martin Raspaud from SMHI, he was the one to apply the VIIRS scheme to MODIS. And he has been involved in the CF Conventions work, as you probably know.

AndersMS commented 3 years ago

@davidhassell

In this comment, you wrote:

Re. The domain variable example - looks good, but not quite what I had in my mind - the data variable T should be the same in both cases:

meaning that you expect all attributes of the domain variable to be copied in the related data variables.

Considering that @oceandatalab commented:

Given all our previous discussions and your insistence on minimizing redundancy

and considering that we already have several interpolation related attributes proposed for the data variable:

Could we permit new schemes that are domain variable aware to not copy all the attributes on both the domain and the data variable?

Like in:

variables:

  // VIIRS M-Band Domain
   char  m_domain ;
    m_domain:interpolation_tie_points = "m_track: m_scan: (lat lon) (sen_azi_ang  sen_zen_ang)
                                       (sol_azi_ang sol_zen_ang) time_scan: m_track: t" ;
    m_domain:interpolation_tie_point_indices = "m_track: m_scan_indices m_track: m_track_indices
                                               time_scan: m_track: t_indices" ;
    m_domain:interpolation = "m_track: m_scan: tp_interpolation time_scan: m_track: time_interpolation" ;

  // VIIRS M-Band 
  float m_radiance(time_scan, m_track, m_scan, m_channel) ;
    m_radiance:interpolation_tie_points = "m_track: m_scan: (lat lon) (sen_azi_ang  sen_zen_ang)
                                       (sol_azi_ang sol_zen_ang) time_scan: m_track: t" ;

That would minimize redundancy and give @oceandatalab data-variable-free access, so to speak. I kept the interpolation_tie_points attribute in both the data and the domain variable for reasons of data discovery.
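For readers who want to see how a tool might consume such an attribute, here is a rough sketch of a parser for the `interpolation_tie_points` string used in the example. The syntax is only a proposal in this thread, so the grammar below is an assumption: tokens ending in `:` are dimension names, consecutive dimension names form one key, and the tokens that follow (single names or parenthesised groups) are the tie point variables for that key. The function name `parse_tie_points` is hypothetical.

```python
import re


def parse_tie_points(attr):
    """Map each tuple of dimension names to its list of tie point variable groups.

    Assumed grammar: "dim: dim: (var var) var dim: dim: var ...".
    """
    # A token is either a parenthesised group or a run of non-blank characters.
    tokens = re.findall(r"\([^)]*\)|\S+", attr)
    result = {}
    dims = []
    groups = None
    for token in tokens:
        if token.endswith(":"):
            if groups is not None:
                # A dimension after variables starts a new entry.
                dims = []
                groups = None
            dims.append(token[:-1])
        else:
            if groups is None:
                groups = result.setdefault(tuple(dims), [])
            groups.append(tuple(token.strip("()").split()))
    return result


attr = ("m_track: m_scan: (lat lon) (sen_azi_ang  sen_zen_ang) "
        "(sol_azi_ang sol_zen_ang) time_scan: m_track: t")
for dims, groups in parse_tie_points(attr).items():
    print(dims, groups)
```

Because the attribute is a plain string, the same parsing would work whether it is read from the domain variable or from the data variable.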

oceandatalab commented 3 years ago

Hi guys,

Most of my previous comments (and arguments) are to be taken in the context of reading coordinate variables independently from data variables because, well, that is my use case. That is why I push for having all the information for performing the interpolation directly in the interpolation container (which is the only variable referenced by coordinate variables).

@davidhassell (https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-689393530)

I think that this is beyond the scope of this proposal. When the domain variable is implemented in CF, you will be able to reconstruct the full coordinates in the absence of a data variable (see #10 (comment)).

This is acceptable if the coordinate variables have a reference to the domain container variable instead of a reference to the interpolation container; otherwise the problem remains the same. As a side note, it would mean that I'm screwed if the proposal for the domain variable and the proposal for interpolation are not both integrated in the CF conventions at the same time.

For me, this is an example of "not very CF". The logical nature of the interpolation is the same for both the I band and M band variables, so there should be a single container for describing it. This is similar to the grid mapping case: the grid mapping is defined once, but can be used for data on all staggers of an Arakawa C grid, each of which has different coordinates.

My comment was about addressing both my use case and your concern about interpolation container variables including information that only apply to some data variables (https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-689016228):

I don't find it useful to define items/relationships in the interpolation container that are not all used by each parent data variable. E.g. all of the "i" grid stuff is not relevant to the "m" grid stuff, and vice versa. I'm not sure it is a good idea to have to pick out by inspection the parts that are relevant to each data variable. Imagine if you were allowed to name unrelated coordinate variables in the coordinates attribute.

For my use case, :interpolation_tie_point_indices and :interpolation_tie_points, i.e. band-specific parameters, would need to move back to the interpolation container variable. Since band-specific information would be back in the interpolation container, it would justify (in my opinion) the split into two container variables, otherwise you would have to interleave I-band and M-band values in the attributes (your remark).

@davidhassell (https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-689417044)

You are absolutely right that there are different types of discoverability - the ones I had in mind are described in my previous comment (#10 (comment)). I disagree that what you propose eases understanding - storing all of the domain information outside of the variable to which it applies makes it harder to see the domain of the data. I think the discovery you have in mind will be solved by the proposed domain variable, i.e. a well defined container that describes the grid using the same rules as all other grid discovery, but without the baggage of a data array and attributes to describe its physical nature.

This is again related to the choice of attributes attached to the interpolation container variable (see previous comment): splitting into two containers would definitely improve readability if a single container means that you need to interleave the I-band and M-band information in the attributes.

Sorry - I answered this mistaking it for Anders' comment. What I wrote in the previous post was:

Yes, that's right, which is the part of my proposal that increases verbosity. However, this is no different, I feel, to listing variables with the coordinates attribute on every data variable, which is standard practice. It also makes it easier to "see at a glance" which coordinates a data variable has. This last point is subjective, but I'm thinking of library software and archive retrieval processes - "get me all of the data with solar_azimuth_angle coordinates" is facilitated by putting them on the data variable, and the software involved does not need to know about interpolation containers to do it (it does need to know about the interpolation_tie_points attribute, but that is a much easier modification for existing software).

I agree, this is akin to what we did in previous iterations with the :subsampled_coordinates attribute.

I thoroughly sympathise with this approach, but unfortunately that is not how CF was conceived. In CF the data variable is king. The domain variable is the answer to this! It won't demote the data variable, but will allow for the holistic grouping of coordinates (and other cell information) in an independent abstract manner. Remember that the domain variable will apply to all grids, not just the ones under discussion here, so it is a complete solution to the issue that also affects other communities.

cf. the first comment of this message.

@davidhassell (https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-689444977)

For lat and lon, having units of degrees_north or degrees_east (or equivalent strings) is sufficient for identification (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#latitude-coordinate)

Good to know, I did not spot this when I read the conventions. I think I have seen NetCDF files that contain variables representing directional data that use these units so they could be misinterpreted as lon/lat.
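As a rough illustration of both the rule and the concern, here is a minimal sketch of units-based identification. The unit strings are the ones listed by the CF conventions for latitude and longitude coordinates. The variable dictionary is made up for the example: `wind_dir` is a hypothetical mislabelled directional variable, and the `degrees` units on `sen_azi_ang` are an assumption.

```python
# Unit strings accepted by CF for latitude and longitude coordinates.
LATITUDE_UNITS = {"degrees_north", "degree_north", "degree_N",
                  "degrees_N", "degreeN", "degreesN"}
LONGITUDE_UNITS = {"degrees_east", "degree_east", "degree_E",
                   "degrees_E", "degreeE", "degreesE"}


def classify_by_units(variables):
    """Return {name: "latitude" | "longitude"} based on the units attribute only.

    `variables` maps a variable name to a dict of its attributes.
    """
    kinds = {}
    for name, attrs in variables.items():
        units = attrs.get("units")
        if units in LATITUDE_UNITS:
            kinds[name] = "latitude"
        elif units in LONGITUDE_UNITS:
            kinds[name] = "longitude"
    return kinds


variables = {
    "lat": {"units": "degrees_north"},
    "lon": {"units": "degrees_east"},
    "sen_azi_ang": {"units": "degrees"},      # assumed units, not matched
    "wind_dir": {"units": "degrees_east"},    # hypothetical mislabelled variable
}
print(classify_by_units(variables))
```

With units as the only criterion, the hypothetical `wind_dir` variable is classified as a longitude, which is the kind of misinterpretation mentioned above.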