erget / subsampled-coordinates

Repository for storing CDL demonstrating subsampled coordinates in CF-netCDF
Apache License 2.0
0 stars 3 forks source link

New format proposal #10

Open oceandatalab opened 4 years ago

oceandatalab commented 4 years ago

I open this issue to discuss the new format proposed by Anders in https://github.com/cf-convention/discuss/issues/37#issuecomment-679290210

Here is a copy of the CDL for the VIIRS example:

dimensions :
  // VIIRS M-Band (750 m resolution imaging) 
  m_track = 768 ;
  m_scan = 3200 ;
  m_channel = 16 ;

  // VIIRS I-Band (375 m resolution imaging)
  i_track = 1536 ;
  i_scan = 6400 ; 
  i_channel = 5 ;

  // Tie points and interpolation zones (shared between VIIRS M-Band and I-Band)
  tp_track = 96 ;
  tp_scan = 205 ;
  track_interpolation_zone = 48 ;
  scan_interpolation_zone = 200 ;

variables:
  // VIIRS M-Band 
  float m_radiance(m_track, m_scan, m_channel) ;

  // VIIRS I-Band 
  float i_radiance(i_track, i_scan, i_channel) ;

  // Tie point based interpolation container for location, supporting both VIIRS M-Band and I-Band
  char tp_interpolation ;
    tp_interpolation : dimensions = "tp_track (tie_point) - m_track (location) - i_track (location)  tp_scan (tie_point) - m_scan (location) - i_scan (location)"  // association of grid dimensions and definition of grid functions (location or cell boundary)
    tp_interpolation : offset = "m_track - tp_track = 0.5   m_scan - tp_scan = 0.5   i_track - tp_track = 0.5   i_scan - tp_scan = 0.5"  // definition of grid offset in units of cells

    tp_interpolation : interpolation_indices = "m_track: m_track_indices  m_scan:m_scan_indices  i_track: i_track_indices  i_scan:i_scan_indices" ; // associate dimensions with indices
    tp_interpolation : interpolation_name = "bi_quadratic" ;
    tp_interpolation : interpolation_coefficients = "expansion_coefficient_track alignment_coefficient_track expansion_coefficient_scan alignment_coefficient_scan" ;
    tp_interpolation : interpolation_flags = "interpolation_zone_flags" ;
    tp_interpolation : location_tie_points = "lat lon" ;

  // Interpolation indices
  int m_track_indices(tp_track) ;
  int m_scan_indices(tp_scan) ;
  int i_track_indices(tp_track) ;
  int i_scan_indices(tp_scan) ;

  // Tie points
  float lat(tp_track, tp_scan) ;
    lat : standard_name = "latitude" ;
    lat : units = "degrees_north" ;
  float lon(tp_track, tp_scan) ;
    lon : standard_name = "longitude" ;
    lon : units = "degrees_east" ;

  // Interpolation coefficients and flags
  short expansion_coefficient_track(track_interpolation_zone, tp_scan) ;
  short alignment_coefficient_track(track_interpolation_zone, tp_scan) ;
  short expansion_coefficient_scan(tp_track, scan_interpolation_zone) ;
  short alignment_coefficient_scan(tp_track, scan_interpolation_zone) ;
  byte interpolation_zone_flags(track_interpolation_zone, scan_interpolation_zone) ;
    interpolation_zone_flags:valid_range = "1b, 7b" ;
    interpolation_zone_flags:flag_masks = "1b, 2b, 4b" ;
    interpolation_zone_flags:flag_meanings = "location_use_cartesian  sensor_direction_use_cartesian  solar_direction_use_cartesian" ;

And the relevant comments from the meeting minutes https://github.com/cf-convention/discuss/issues/37#issuecomment-680038841:

CDL approach

@AndersMS proposes using implicit references as outlined in NUG to associate field constructs with coordinates. This utilised shared dimensions to imply connections. In our case, the field construct shares the full resolution dimension with the interpolation container variable, whereas the interpolation container variable shares the compacted dimension with the tie-points. @erget posits that this has no impact on the CF Data Model because the full set of coordinates can be reconstructed for every data point, so that the lack of explicitly encoded coordinates in the netCDF file can be described logically as an encoding issue. @oceandatalab would prefer explicit to implicit references. Also noted: Explicit references would require potentially cumbersome updates and store information redundantly. There is a trade-off here between explicitness & conciseness. In all cases, the proposed approach allows coordinates to be reconstructed without having to unpack a complete field construct, thus fulfilling one of the use cases that we had not known exactly how to address before.

davidhassell commented 4 years ago

Hi @AndersMS (https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-689482178)

Thinking a bit more about it, I am not sure we can get away with a single pair of offsets:

The offset is measured in units cells of the full resolution, so I- or M-band full resolution cells respectively, which would indicate that it should be an attribute of the data variable and not of the proposed simplified interpolation container, which is on the tie point dimensions. The fact that the numerical offset value is 0.5 for both VIIRS bands is a coincidence. Our colleagues at SMHI in Sweden applied the VIIRS tie point scheme for the Aqua MODIS instrument. MODIS has three resolutions: 1km, 500m and 250m. If one would choose the corner of the 1 km pixel as the first tie point, then the offsets would be for 1 km: (0.5, 0.5) 500 m: (0.5, 1.0) and 250 m: (0.5, 2.0), see this graphical representation of the MODIS grids. The best would be if we could find a single concept for the offsets, which can be applied to all channels, a sets of channels, like the M-Band and I-Band in our example, or to individual channels, like in the GCOM-W AMSR2 example brought in by @TomLav. I will take a look at the GCOM-W AMSR2 example this afternoon, to see if things could be arranged to work together.

Thanks for this example. I think I've got it now.I'm going to be deeply unpopular and say that for the three offsets case there should be three separate interpolation variables.

A feature of this proposal was that is was not to be remote-sensing specific, i.e. it could be used for any compression needs if the desired. I think that combining features such as as multiple offsets, only a subset of which apply to certain data parent variables is a complication that is not required.

In fact, it makes more sense to me to put the offsets back on to the data variable (even more unpopular, I know!) . Then we have, even more so than before, one nice and simple interpolation container ("function") which has its "input" parameters defined on a data-variable-by-data-variable basis. then the interpolation_offsets data variable attribute would relate data variable dimensions to offsets: "i_track: 0.5 i_scan: 0.2" ;.

In all of this there is minimal encoding complexity and no redundancy, which I think are important considerations.

Another thing to perhaps (?) consider is that whilst we know that the I_band and M_band variables are related, CF nor generic software does. Relating them via a listing of offsets to dimensions which are related to the data variable dimension in the interpolation container seems to complicated. This case is a bit similar, I think, to the case of orthogonal wind components, each on a different grid stagger. Connecting physically related data variables (be they vector components, real and imaginary parts of complex numbers, or MODIS resolutions) is, as yet, an unsolved problem - and not for lack of trying (@TomLav having made valiant efforts in the past) - and I suggest that we steer well clear of that in this proposal.

davidhassell commented 4 years ago

Hi @Anders, @oceandatalab (replying to https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-689514953 - about putting some of the tie point variable references on the domain variable)

Firstly, it may be that the community is OK with replacing domain-type attributes on a data variable with a reference to a domain variable. It's not my current preference, but that is what the discussion is for.

I don't see the advantage in moving some attributes, whose values are data variable specific, to a domain variable - you'd just need a new, incomplete domain variable for each data variable with different dimensions, and we're jumping the gun by making assumptions about the nature of the domain variable. We should proceed as if it doesn't exist, but enage in the conversation on its implementation so that it can meet this (by then?) exiting use case.

davidhassell commented 4 years ago

Hi @oceandatalab and all (https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-689525487)

Most of my previous comments (and arguments) are to be taken in the context of reading coordinate variables independently from data variables because, well, that is my use case. That is why I push for having all the information for performing the interpolation directly in the interpolation container (which is the only variable referenced by coordinate variables).

I would not support referencing the interpolation container from coordinate variables. This would require a change to the CF data model to somehow support coordinate constructs existing independently of data variables, and being linked to other coordinate constructs. That is the purpose of the abstract domain construct (to become the domain variable) - to bind coordinate (and other related) constructs together in a meaningful manner. This is not to say that this is the only way in which this could be done, but it is the way that CF has chosen to do it.

I'm all for starting the domain variable issue as soon as possible (next week, even).

I'm happy to ask Martin Raspaud if he'd like to take a look, and I've arranged to talk to Jonathan Gregory next week. Where do people think he best place to drop in to the thread is? Here, in the CF issue https://github.com/cf-convention/discuss/issues/37, somewhere else?

oceandatalab commented 4 years ago

@davidhassell https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-689733370

I don't see the advantage in moving some attributes, whose values are data variable specific, to a domain variable - you'd just need a new, incomplete domain variable for each data variable with different dimensions

The attributes would not be data variable specific but grid specific. You would need a domain variable for the M-band grid and another one for the I-band grid. If there can only be one domain variable per file and it applies to all grids, then it is not as useful as I thought for the I-band/M-band case.

We should proceed as if it doesn't exist

I agree, discussions about the domain variable proposal should belong to an dedicated issue and the interpolation proposal should not make any assumption on the availability of this kind of variable.

@davidhassell https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-689738229

I would not support referencing the interpolation container from coordinate variables.

The CDL you proposed yesterday does reference the interpolation container from coordinate variables (https://github.com/erget/subsampled-coordinates/issues/10#issuecomment-689016228). ;)

Maybe I can convince you otherwise, please consider this: let's ignore the "coordinate" aspect of the proposal temporarilly, how would you define a subsampled data variable (which in my mind would be a totally valid use case for the interpolation method too)? I think that's quite difficult without an attribute that refers to the interpolation container variable.

If you agree with that, why would you proceed differently just because the subsampled variable is actually a coordinates variable?

Where do people think he best place to drop in to the thread is? Here, in the CF issue cf-convention/discuss#37, somewhere else?

Tricky question ^_^ We have at least three approaches for the CDL, information are all over the place and it would be necessary to read all the comments in order to understand where we're at and how we got there.

Maybe:

davidhassell commented 4 years ago

@oceandatalab

The CDL you proposed yesterday does reference the interpolation container from coordinate variables (#10 (comment)). ;)

You are right! I forgot to delete those attributes when I cut-and-pasted the original CDL to modify. Apologies. So we should have had, e.g.

 float sen_zen_ang(tp_track, tp_scan) ;
    sen_zen_ang : standard_name = "sensor_zenith_angle" ;
    sen_zen_ang : units = "degrees" ;

I won't redo the whole CDL, as there were other aspects which were not quite right, or under debate.

Sorry for the confusion.

AndersMS commented 4 years ago

Good morning @davidhassell

I wrote:

The offset is measured in units cells of the full resolution, so I- or M-band full resolution cells respectively, which would indicate that it should be an attribute of the data variable and not of the proposed simplified interpolation container, which is on the tie point dimensions.

and you wrote:

In fact, it makes more sense to me to put the offsets back on to the data variable (even more unpopular, I know!) . Then we have, even more so than before, one nice and simple interpolation container ("function") which has its "input" parameters defined on a data-variable-by-data-variable basis. then the interpolation_offsets data variable attribute would relate data variable dimensions to offsets: "i_track: 0.5 i_scan: 0.2" ;.

so we fully agree on this and it doesn't make you unpopular! I think you overlooked my argument.

In this case, there would one data variable for each of the MODIS 1km, 500m and 250m band, each with different dimensions. They would all reference and share the same interpolation container, which in turn would reference a single shared set of tie point coordinates.

That's very elegant! :-)

I will think a bit more a potential modification to support the GCOM-W AMSR2 example of @TomLav.

AndersMS commented 4 years ago

Regarding a potential future domain variable:

@davidhassell wrote: We should proceed as if it doesn't exist, but engage in the conversation on its implementation so that it can meet this (by then?) exiting use case.

@oceandatalab wrote I agree, discussions about the domain variable proposal should belong to an dedicated issue and the interpolation proposal should not make any assumption on the availability of this kind of variable.

My main interest in the domain variable was to see if it potentially could serve the coordinate variable use case brought up by @oceandatalab.

Clearly it is a separate issue, but the present issue could potentially benefit from it and we should engage in the conversation on its implementation.

davidhassell commented 4 years ago

Hi @AndersMS

In this case, there would one data variable for each of the MODIS 1km, 500m and 250m band, each with different dimensions. They would all reference and share the same interpolation container, which in turn would reference a single shared set of tie point coordinates.

Thanks for your patience - I'm gradually understanding the whole of the problem space.

I think that it's OK to reference the tie point dimensions from the interpolation container, but not the tie point variables themselves, which should be on the data variable. This is because There may be multiple tie point variables for each tie point dimension, not all of which are used by all data variables.

This also makes sense because the interpolation coefficients (e.g. expansion_coefficient_track need to be on the interpolation container, and they need to know the tie point dimension, too.

This does indeed make thing much nicer, as I think we can remove nearly all of the dimension mappings from the data variable:

(I'm just playing about with different data variable and interpolation container attribute names)

variables:
  // VIIRS M-Band 
  float m_radiance(time_scan, m_track, m_scan, m_channel) ;
    m_radiance : subsampled_coordinates = lat lon sen_azi_ang  sen_zen_ang
                                       sol_azi_ang sol_zen_ang  t” ;
    m_radiance : interpolation_offsets = "m _track: 0.5 m_scan: 0.5" ;
    m_radiance : subsample_indices = "m_track: m_scan_indices m_track: m_track_indices" ;
    m_radiance : interpolation = "tp_interpolation time_interpolation" ;

  // VIIRS I-Band 
  float i_radiance(i_track, i_scan, i_channel, time_scan) ;
    i_radiance : subsampled_coordinates = lat lon sen_azi_ang  sen_zen_ang
                                    sol_azi_ang sol_zen_ang sol_azi_ang sol_zen_ang t” ;
    i_radiance : subsample_indices = "i_track: i_scan_indices i_track: i_track_indices" ;
    i_radiance : interpolation_offsets = "i _track: 0.5 i_scan: 0.5" ;
    i_radiance : interpolation = "tp_interpolation time_interpolation" ;

  // Coordinate grids and interpolation, supporting both VIIRS M-Band and I-Band
char tp_interpolation ;
    tp_interpolation : subsampled_coordinates = "lat lon sen_azi_ang  sen_zen_ang
                                                    sol_azi_ang sol_zen_ang sol_azi_ang sol_zen_ang"
    tp_interpolation : interpolation_name = "bi_quadratic_1" ;
    tp_interpolation : interpolation_coefficients = "expansion_coefficient_track  
                                       alignment_coefficient_track  expansion_coefficient_scan
                                       alignment_coefficient_scan" ;
    tp_interpolation : interpolation_flags = "interpolation_zone_flags" ;

  // Interpolation indices
  int m_track_indices(tp_track) ;
  int m_scan_indices(tp_scan) ;
  int i_track_indices(tp_track) ;
  int i_scan_indices(tp_scan) ;

  // Tie points
  float lat(tp_track, tp_scan) ;
  float lon(tp_track, tp_scan) ;
  float sen_azi_ang(tp_track, tp_scan) ;
  float sen_zen_ang(tp_track, tp_scan) ;
  float sol_azi_ang(tp_track, tp_scan) ;
  float sol_zen_ang(tp_track, tp_scan) ;

  // Interpolation coefficients and flags
  short expansion_coefficient_track(track_interpolation_zone, tp_scan) ;
  short alignment_coefficient_track(track_interpolation_zone, tp_scan) ;
  short expansion_coefficient_scan(tp_track, scan_interpolation_zone) ;
  short alignment_coefficient_scan(tp_track, scan_interpolation_zone) ;
  byte interpolation_zone_flags(track_interpolation_zone, scan_interpolation_zone) ;
    interpolation_zone_flags : valid_range = "1b, 7b" ;
    interpolation_zone_flags : flag_masks = "1b, 2b, 4b" ;
    interpolation_zone_flags : flag_meanings = "location_use_cartesian  sensor_direction_use_cartesian  solar_direction_use_cartesian" ;

  // Time interploation
  char time_interpolation ;
    time_interpolation:tie+point_dimensions = "tp_track, time_scan" ;
    time_interpolation : interpolation_name = "bi_linear" ;
    time_interpolation : tie_points = "time(t)" ;

  double t(tp_track, time_scan) ;

For me this is good because the nature of the coordinates are signalled from the data variable (we know that the data has latitude and longitude cells be very simple inspection) and the encoding is just getting simpler. The mapping of data variable dimensions to tie point dimension is still possible, and not complicated for software, via the mapping on the data variable of data dimensions to the indices variables (e.g. "m_track: m_scan_indices m_track: m_track_indices" ;)

AndersMS commented 4 years ago

@davidhassell

I added the offsets attributes to your latest example CDL. Does that look ok to you?

On the names, I have slight preference for starting them all with interpolation_*, to signal that those attributes are all related to the same construct.

AndersMS commented 4 years ago

Hi @davidhassell

You wrote:

I hope I've got the time stuff right - I was taking my cues from the fact the time tie point variable is two dimensional. Not sure what the time tie point indices are (perhaps they were missing from the CDL I copied? Or perhaps they're not necessary because we assume that they're at the "end points"?)

I can understand if the time interpolation is not fully clear since my comment has got some inconsistencies.

I will come up with an improved example later.

davidhassell commented 4 years ago

@AndersMS

I added the offsets attributes to your latest example CDL. Does that look ok to you?

Thanks - looks good to me, I didn't intend to miss them out.

davidhassell commented 3 years ago

Hi all,

I've just had a chat with Jonathan Gregory, where I was bringing him up to speed on the where we're at. He liked our approach in terms of functionality and how it fits into CF logically, but questioned why the tie points are referenced from the interpolation variable (given that they have to be listed from the data variable in any event) and I admit to struggling to justify it!

That some combinations of tie points need to considered simultaneously didn't do it, as these will be described in the method, so listing them on the container is superfluous.

Giving the superset of tie point variables so that a domain-without-data can be stored also didn't seem right, as you can just create a domain variable that does provide the superset (see below).

In that light, connecting the tie points to the interpolation variable (as we have suggested in the past) worked well for us, using the same sort of syntax as used for connecting cell methods to dimensions.

I think this has zero redundancy, and, with the domain variable, provides the required functionality. How does that sound?

// VIIRS I-Band 
  float i_radiance(i_track, i_scan, i_channel, time_scan) ;
    i_radiance:tie_points = "lat: lon: tp_interpolation t: time_interpolation" ;  // connect tie point to interpolation
                                                                                  //variable using existing machinary
    i_radiance:tie-point_indices = "i_track: i_scan_indices i_track: i_track_indices" ;
    i_radiance:tie_point_offsets = "i _track: offset i_scan: offset" ;

// interpolation variable
char tp_interpolation ;
    tp_interpolation:interpolation_name = "bi_quadratic_1" ;
    tp_interpolation:interpolation_coefficients = "expansion_coefficient_track  
                                       alignment_coefficient_track  expansion_coefficient_scan
                                       alignment_coefficient_scan" ;
    tp_interpolation:interpolation_flags = "interpolation_zone_flags" ;

You'll still be able to connect the superset of tie points in a domain variable, if you want to:

// domain variable
char domain1 ;
    domain1:domain_dimensions = "i_track i_scan i_channel time_scan" ;
    domain1:tie_points = "lat: lon: sen_azi_ang: sen_zen_ang: sol_azi_ang: sol_zen_ang: tp_interpolation t: time_interpolation" ;
    domein1:tie_point_indices = "i_track: i_scan_indices i_track: i_track_indices" ;
    domain1:tie_point_offsets = "i _track: offset i_scan: offset" ;

Thanks, and looking forward to see what you think, David

AndersMS commented 3 years ago

Hi @davidhassell

That's also quite elegant.

Before discussing in detail, I would like to understand how our CDL example would look in the existing CF scheme, i.e. without our new tie points and interpolation.

When starting our work we said that we were compacting coordinates, see for for example this presentation.

Coordinates can include a wide range of geophysical quantities, as explained in the CF Convention Document, chaper 4:

The commonest use of coordinate variables is to locate the data in space and time, but coordinates may be provided for any other continuous geophysical quantity (e.g. density, temperature, radiation wavelength, zenith angle of radiance, sea surface wave frequency) or discrete category (see Section 4.5, "Discrete Axis", e.g. area type, model level number, ensemble member number) on which the data variable depends.

What I would like to understand is which of two following CDLs would be the correct "full resolution" version of our CDL example, the difference being in the coordinates attribute:

A:

dimensions :
  // VIIRS M-Band (750 m resolution imaging) 
  m_track = 768 ;
  m_scan = 3200 ;
  m_channel = 16 ;

variables:
  // VIIRS M-Band 
  float m_radiance(m_track, m_scan, m_channel) ;
    m_radiance : coordinates = "lat lon sen_azi_ang sen_zen_ang sol_azi_ang sol_zen_ang  t” ;

  float lat(m_track, m_scan) ;
  float lon(m_track, m_scan) ;
  float sen_azi_ang(m_track, m_scan) ;
  float sen_zen_ang(m_track, m_scan) ;
  float sol_azi_ang(m_track, m_scan) ;
  float sol_zen_ang(m_track, m_scan) ;

  double t(m_track, m_scan) ;
    t : long_name = "time" ;
    t : units = "days since 1990-1-1 0:0:0" ;

B:

dimensions :
  // VIIRS M-Band (750 m resolution imaging) 
  m_track = 768 ;
  m_scan = 3200 ;
  m_channel = 16 ;

variables:
  // VIIRS M-Band 
  float m_radiance(m_track, m_scan, m_channel) ;
    m_radiance : coordinates = "lat lon t” ;

  float lat(m_track, m_scan) ;
  float lon(m_track, m_scan) ;
  float sen_azi_ang(m_track, m_scan) ;
  float sen_zen_ang(m_track, m_scan) ;
  float sol_azi_ang(m_track, m_scan) ;
  float sol_zen_ang(m_track, m_scan) ;

  double t(m_track, m_scan) ;
    t : long_name = "time" ;
    t : units = "days since 1990-1-1 0:0:0" ;
erget commented 3 years ago

Hi @davidhassell , I also find this interesting but I think it's important that we find a way to address the goal of the subsampled coordinates work independently of the domain variable, which has not yet been implemented. My worry is that we'll end up proposing 2 solutions that are both complex, but in different ways. It feels at the moment like we may be drifting into the gravitational pull of multiple possible improvements.

FWIW I feel that the goal of this work is to find a way to provide only a limited subset of coordinates, which the user can interpolate in a known way in order to reconstruct the full set of coordinates. In the context of remote sensing, azimuth and bearing angles are coordinates; of course this need not be a requirement for every use case of this approach. Therefore @AndersMS I think example A is the correct one here ;) - the coordinates should in my opinion be linked explicitly with the data variables that call them up.

Although a domain variable may be extremely helpful in this regard, I would prefer to address that concern in a separate proposal so that it can be reviewed in light of its data formatting aspects, rather than in the context of coordinate interpolation.

ajelenak commented 3 years ago

I think both examples are correct. The B example provides the minimal number of (aux.) coordinates required for all domain axes of the m_radiance field. I think what goes in the coordinates attribute is up to the data creator as long as all the parent variable's dimensions are covered. NetCDF coordinate variables are excluded as they do not need to be explicitly listed.

davidhassell commented 3 years ago

Hi @erget

it's important that we find a way to address the goal of the subsampled coordinates work independently of the domain variable, which has not yet been implemented. My worry is that we'll end up proposing 2 solutions that are both complex, but in different ways. It feels at the moment like we may be drifting into the gravitational pull of multiple possible improvements.

I'm afraid that I don't think this approach will get us to where we want to be. A partial implementation of a domain variable will almost certainly not be compatible with the domain variable, when it arrives, and the domain variable will actually be easy to implement. This is because the data model already has an implicit, abstract domain construct, so the clear way forward is to formalise this to turn this into an explicit "top level" construct implemented by the fully general domain variable that has been described in this thread. This is quite easy, as the implicit domain is already well defined and accepted.

To make subsampled coordinates work independently of any data variable will require a change to the data model, however it is done, so I think that the domain variable option is the only one available to us.

I shall be proposing this domain variable, on the CF issue tracker, at the beginning of next week (or even Friday this week if I have time), so I don't think that we need to worry about a long period between the these two enhancements.

Does that sound OK?

@AndersMS I agree with what others have said: the data creator links all applicable coordinate variables to a data variable via the coordinates attribute. There is no minimum or maximum amount. So, assuming that the sensor angles apply to m_radiance, A is OK.

AndersMS commented 3 years ago

@davidhassell

I've just had a chat with Jonathan Gregory, where I was bringing him up to speed on the where we're at. He liked our approach in terms of functionality and how it fits into CF logically, but questioned why the tie points are referenced from the interpolation variable (given that they have to be listed from the data variable in any event) and I admit to struggling to justify it!

I am happy with having the tie points referenced from the data variable and not from the interpolation variable. I find your proposed syntax very elegant, combining the tie points and the applicable interpolation variable:

m_radiance:tie_points = "lat: lon: sen_azi_ang: sen_zen_ang: sol_azi_ang: sol_zen_ang: tp_interpolation t: time_interpolation" ;

I agree with what others have said: the data creator links all applicable coordinate variables to a data variable via the coordinates attribute. There is no minimum or maximum amount. So, assuming that the sensor angles apply to m_radiance, A is OK.

Good, then this example, which I just posted, should be ok? It lists all coordinates on both the m_radiance and i_radiance data variables, and get's the job done without the need for a domain variable.

What I believe @erget wishes, is a solution that would also work without a domain variable. That is what we have now with this example.

And the solution would benefit from a potential future domain variable.

Would that be a correct summary?

erget commented 3 years ago

@AndersMS yes, you summarised my worry about being able to propose a solution that works without a domain variable. However, @davidhassell you have done some to soothe my worried mind - if you will be proposing this very soon, I wouldn't mind taking advantage of it, but I want to separate the review of the 2 issues so that we get the right expertise and don't overload those who are looking at the proposal. I agree fully - using a domain variable instead of the domain variable is silly; I was thinking of using no domain variable. But if we have one soon, then all the better.

davidhassell commented 3 years ago

Hi @AndersMS & @erget

I think that the review of the two issues is sure to draw in others out of the woodwork, as there are other interested parties in a domain variable (e.g. those involved with streaming services, GIS applications, the data model itself, ...).

Even a solution that would also work without an explicit domain variable would necessitate a change to the current data model, as it is still logically combining domain information in the absence to be used in the absence of a data variable.

I will start drafting the domain variable issue now .... it would be useful if someone could supply a sentence or two describing your use case - thanks.

erget commented 3 years ago

@davidhassell here's a few words on the use case:

Short form

We propose an approach to reduce the volume of coordinate data by using parametrised interpolation on tie-points. For some remote sensing data products, this can lead to a 50x data volume reduction factor. Several such approaches exist, but they are only minimally standardised, so that specialised software is needed in order to use data whose coordinates has been thus reduced. Our proposal

  1. combines the strengths of existing approaches into a single algorithm, and
  2. can be encoded in a way that is compatible with the CF Conventions.

Long form

Increasing data volumes present a challenge for all types of Earth science data, whether it is derived from models, in-situ, or remote sensing observations. The sheer volume of the data required in order to study the Earth and its systems is one of the most difficult obstacles to overcome in order to produce scientific insights and policy recommendations.

Data reduction is one of the lowest-hanging fruits in the toolkit of big data processing approaches. The gains that can be made in this area while preserving the integrity of the data in question are potentiated each time the data transmitted, stored, and processed. The number of use cases in which this is the case grows as data products become more accessible through the opening of innovative new data services and the availability of novel processing techniques, such as the application of artificial intelligence, which make it possible to exploit data products in new disciplines.

Although these savings are relevant for all types of data, they are of particular interest in the area of remote sensing. This is because in remote sensing data products are often produced such that a multitude of coordinate metadata is required to interpret the observations in question. In contrast to e.g. model data, the position of an observing platform is generally not static, and coordinates are unique to a given observation. Thus they and cannot be re-used across different data granules of the same type. Therefore the full set of coordinate data needed to interpret any given data point often includes

  • latitude
  • longitude
  • satellite zenith angle
  • satellite azimuth angle
  • solar zenith angle
  • satellite azimuth angle

In some cases moon zenith and azimuth angles are also needed for the data to be utilised. It is for this reason that coordinate data in Earth observation data products can comprise up to 90% of a given data product, meaning that90% of the product’s data volume is devoted to providing metadata that is needed to interpret the "essential" 10% of observational data.

Thus it is unsurprising that several methods already exist to reduce the volume of coordinate data in remote sensing data products. For some data products, this can lead to a 50x data volume reduction factor. Despite the existence of several useful approaches, however, these approaches remain only minimally standardised, so that specialised software is often needed in order to make use of the data. The Earth sciences community and other scientific communities that make use of Earth observation data would benefit greatly from a common approach to reducing coordinate data volumes while preserving the integrity and interpretability of the data products.

We propose a standardised approach that

  1. combines the strengths of existing approaches into a single algorithm, and
  2. can be encoded in a way that is compatible with the CF Conventions.
davidhassell commented 3 years ago

Thanks, @erget, this is great - thanks.

davidhassell commented 3 years ago

Are @ajelenak's super-sampled coordinates still a use case? You could argue that it is still a form of compression, as, assuming we need the high resolution coordinates anyway, it still allows us to not store the target domain coordinates.

As far as a I can tell, what we have for subsampled coordinates will work OK for super-sampled ones, but a difference in the interpretation of the tie point indices: If the tie point coordinate dimension is larger than its target dimension, then the tie point indices relate to the tie point dimension (rather than the target dimension, which is the subsampled case).

I have tried this out with some CDL, but it sounds plausible ....

erget commented 3 years ago

I don't see a reason to exclude this use case.

davidhassell commented 3 years ago

Good, me neither. It just needs a to be addressed in the detail, that's all (I should have said "I have not yet tried this out with some CDL", and that is still True!)

AndersMS commented 3 years ago

It think that would work and would be a nice feature.

We then have to be careful with the definition of the offsets. I would suggest that they should be measured in unit of cells of the finer resolution, if the the two are different.

davidhassell commented 3 years ago

I've made a start at CF-ising some of our ideas:

[[compression-by-interpolation, Section 8.3, "Compression by Coordinate Interpolation"]]
=== Compression by Coordinate Interpolation

For some applications the coordinates of a data variable can require
considerably more storage than the data itself. Space may be saved in
the netCDF file by the storing coordinates at a lower resolution than
the data which they describe. The uncompressed coordinate and
auxiliary coordinate variables can be reconstituted by interpolation
from the lower resolution coordinate values, to the domain of the
data. This process will likely result in losses in accuracy (as
opposed to precision) in the uncompressed variables, due to rounding
errors in the interpolation calculations, but it assumed that these
will be small enough to not be of concern to user of the uncompressed
dataset.

The lower resolution coordinates are stored in a __tie point
variable__. This terminology is chosen to ackowledge that, whilst the
values of a tie point variable may be a subset of the uncompressed
coordinate values, they can also be different to some or all of them.

The method used to uncompress the tie point variables is described by
an __interpolation variable__ that acts as a container for the
attributes that define the interpolation technique that should be
used. The variable should be a scalar (i.e. it has no dimensions) of
arbitrary type, and the value of its single element is immaterial.

This obviously just a work in progress, and I realise that I will rapidily run out of language when the text moves from the general overview (started above) to the specifics ... but it's a start ...

davidhassell commented 3 years ago

Hi - I was wondering about how we can tell which interpolation zone dimensions correspond to which tie point dimensions.

This matters because we need to know the dimension order of the coefficicent variables relative to the tie point variables.

In https://github.com/cf-convention/discuss/issues/37#issuecomment-693319829 we have:

// Tie points and interpolation zones (shared between VIIRS M-Band and I-Band)
  tp_track = 96 ;  // 48 VIIRS scans
  tp_scan = 205 ;
  track_interpolation_zone = 48 ;
  scan_interpolation_zone = 200 ;

But if tp_track and tp_scan have the same size, and track_interpolation_zone and scan_interpolation_zone also have the size, then are we in trouble? E.g.

  tp_track = 96 ;
  tp_scan = 96 ;
  track_interpolation_zone = 48 ;
  scan_interpolation_zone = 48 ;
AndersMS commented 3 years ago

Hi @davidhassell

Hi - I was wondering about how we can tell which interpolation zone dimensions correspond to which tie point dimensions.

Would a solution be to change:

  // Interpolation variable
  char tp_interpolation ;
    tp_interpolation:interpolation_name = "bi_quadratic_1" ;
    tp_interpolation:interpolation_coefficients = "expansion_coefficient_track alignment_coefficient_track expansion_coefficient_scan alignment_coefficient_scan" ;
    tp_interpolation:interpolation_flags = "interpolation_zone_flags" ;
  // Interpolation variable
  char tp_interpolation ;
    tp_interpolation:interpolation_name = "bi_quadratic_1" ;
    tp_interpolation:interpolation_coefficients = "(tp_track, tp_scan): expansion_coefficient_track alignment_coefficient_track expansion_coefficient_scan alignment_coefficient_scan" ;
    tp_interpolation:interpolation_flags = "(tp_track, tp_scan): interpolation_zone_flags" ;

using a notation similar to the one for the tie_point_indicesand tie_point_offsetattributes?

davidhassell commented 3 years ago

Hi @AndersMS

Good idea. I'm inclined (a minor change) to drop the parentheses, though and re-use the "colon" notation that is already standard, in this proposal and elsewhere in CF:

// Interpolation variable
  char tp_interpolation ;
    tp_interpolation:interpolation_name = "bi_quadratic_1" ;
    tp_interpolation:interpolation_coefficients = "tp_track: tp_scan: expansion_coefficient_track alignment_coefficient_track expansion_coefficient_scan alignment_coefficient_scan" ;
    tp_interpolation:interpolation_flags = "tp_track: tp_scan: interpolation_zone_flags" ;

We'd need to state in the conventions that the order of dimensions of each coefficient variable must be a) the same and b) match the order given (i.e. "tp_track: tp_scan:" forces t the variables to have dimensions (tp_track, tp_scan)).

I'm happy insisting on a dimension order for these variable, as other "related" variables in CF already have the same restriction (e.g. coordinate bounds variables must have te same dimension order as their parent coordinates).

AndersMS commented 3 years ago

Hi @davidhassell

Good. The "colon" notation is fine with me, in particular if this is in line with standard notation in CF.

The sentence

(i.e. "tp_track: tp_scan:" forces t the variables to have dimensions (tp_track, tp_scan)).

is probably a bit too strong.

In our example we have:

  // Interpolation coefficients and flags
  short expansion_coefficient_track(track_interpolation_zone, tp_scan) ;
  short alignment_coefficient_track(track_interpolation_zone, tp_scan) ;
  short expansion_coefficient_scan(tp_track, scan_interpolation_zone) ;
  short alignment_coefficient_scan(tp_track, scan_interpolation_zone) ;

So the notation either associates dimensions (track_interpolation_zonewith tp_trackand scan_interpolation_zonewith tp_scan) or state that they are identical.

Would you agree?

davidhassell commented 3 years ago

Ah! I see, now. That's good for me.

AndersMS commented 3 years ago

It is then equivalent to the way we associate tie point dimensions with target dimensions, in this example m_track with tp_track:

m_radiance: tie_point_indices = "m_track: m_track_indices  m_scan: m_scan_indices  time_scan: m_time_scan_indices" ;

int m_track_indices(tp_track) 
davidhassell commented 3 years ago

Hello, I was reading slide 14 of the presentation that describes interpolation zones and interpolation groups. On the right hand side of the slide it says "Interpolation within each interpolation zone using the interpolation method", but do you not interpolate zones within a group simultaneously? E.g. in this example, zones A and B would be interpolated simultaneously using the six leftmost tie points:

.     .     ..     .
   A     B      C
.     .     ..     .

This is not an encoding question, rather on interpretation of the final product.

Thanks.

davidhassell commented 3 years ago

Another question - what about periodic coordinates, such as longitude, that wrap-around the whole Earth. This is never an issue for remote sensing, I imagine. But it could come up in general. The issue is for the interpolation implementation to know that it has to account for tie points either side of the the "edges" of the earth (seen as a rectangular array).

I think that it may be enough to say in the text that if the first and last tie points (or their appropriate bounds) along the tie point dimension are same modulo 360, then wrap-around interpolation should be used.

Thanks, David

AndersMS commented 3 years ago

Hi @davidhassell,

You wrote:

Hello, I was reading slide 14 of the presentation that describes interpolation zones and interpolation groups. On the right hand side of the slide it says "Interpolation within each interpolation zone using the interpolation method", but do you not interpolate zones within a group simultaneously?

There are two levels to this:

The main purpose of the groups is to safeguard against geometric discontinuities in the grid that may be present even if the data is stored in a single two dimensional array.

Did that answer the question?

davidhassell commented 3 years ago

Hi @AndersMS, re. interpolation zones:

Thanks - I think I get it now. It might be better, in that case, if it was stated that each zone's interpolation must be carried independently of the others, just to remove any ambiguities that could arise from perhaps, negative offsets (I'm wholly sure that this is a problem, but it seems like it could at least cause confusion).

AndersMS commented 3 years ago

Hi @davidhassell,

You wrote:

what about periodic coordinates, such as longitude, that wrap-around the whole Earth.

I does actually come up in remote sensing also. There are several ways to deal with it.

In the VIIRS example we set the interpolation_zone_flags to "location_use_cartesian" for all interpolation zones that are spanning the date line (180 deg longitude). The cartesian coordinates doesn't have the wrap-around issue. This approach would be consistent with the style of putting complexity and decision making on the data provider side and keeping the user side algorithms as simple as possible.

Another method that could be applied by a method is to check if differences in tie point longitudes are larger than 180 degrees, and if so, subtract/add 360 degrees from those longitudes before interpolating.

Finally, we could also consider a notation to mark a dimension that have wrap-around behaviour. See also the 3rd bullet on this slide from the mesh discussion at the 2020 CF meeting.

AndersMS commented 3 years ago

Hi @davidhassell, regarding interpolation zones:

It might be better, in that case, if it was stated that each zone's interpolation must be carried independently of the others

Thank you for the proposal, I agree and I will include that in the text.

Yes, a negative offset could be confusing, but it would not change the above principle of independent interpolation.

davidhassell commented 3 years ago

Yes, a negative offset could be confusing, but it would change the above principle of independent interpolation.

Just to be sure, are you saying that if the offset is negative then the zone interpolations are not independent?

AndersMS commented 3 years ago

No, it was a typo, I have corrected it now:

Yes, a negative offset could be confusing, but it would not change the above principle of independent interpolation.

AndersMS commented 3 years ago

The offsets are geometric offsets and they don't change the zone and group logic.

davidhassell commented 3 years ago

Thanks the wrap-around information, @AndersMS. I need to think a bit more to try to understand these situations better - I'm mainly thinking of the case when the first and last tie point longitudes (i.e. at either ends of the tie point dimension) are in fact adjacent in physical space. But I'm finishing for the day, now.....

davidhassell commented 3 years ago

Hello,

In the general case when spherically regridding 2-D lats and lons to another set of 2-D lats and lons, it is necessary to know which axis is "X" and which is "Y". Their order in the arrays doesn't matter, as long as their identities are known.

I presume that this extends to, say scan and track dimensions - you need to know which is which. Is that right?

davidhassell commented 3 years ago

Hi,

In the example CDL the lat and lon tie point variables are two dimensional (X and Y). Could they also have, say, a time dimension? In that case how would we know which of the three dimensions are the two dimensions that need to horizontally interpolated?

Also, if time interpolation is in play, we'd need to know which is the time dimension, and would have to add lat: lon: time: time_interpolation, I presume (?).

I'm sure that this is possible, but I can't quite see it this morning.

Thnaks, David

AndersMS commented 3 years ago

Hi @davidhassell

You are asking very good questions!

In the general case when spherically regridding 2-D lats and lons to another set of 2-D lats and lons, it is necessary to know which axis is "X" and which is "Y". Their order in the arrays doesn't matter, as long as their identities are known.

I presume that this extends to, say scan and track dimensions - you need to know which is which. Is that right?

In our original method for VIIRS data (https://doi.org/10.1016/j.acags.2020.100025) we do indeed need to know what is scan and track. The reason is that we model the scanning geometry of the instrument.

In the improved method I am proposing for our present work, we do not need to know what is track and scan, as we are only looking at the lat/lon locations, without assumptions on how they were generated from a scanning instrument.

In response to your question I have looked a bit into general purpose bi-linear and bi-quadratic (and in principle higher order) interpolation methods. They are all invariant to the order of dimensions, so they yield the same interpolation results if you apply them to a (X, Y) grid and a (Y, X) grid, apart from the results being swapped accordingly. So you wouldn't need to know.

I have to check if the bi-quadratic method that I explained recently (slide 11-14 here), actually has this property too. I will come back to you on that.

So if we limit ourselves to general purpose interpolation methods, we could do without having to associate the dimensions of the grid with particular dimension of the interpolation method.

However, if we want to remain open for interpolation methods that rely on dimension like scan/track, then we would need a CDL notation to cover that case. Probably an attribute in the interpolation variable would do.

Possibly my preference would be that, for now, we limit ourselves to general purpose interpolation methods, not needing this mapping.

What do you think?

AndersMS commented 3 years ago

Hi @davidhassell

In the example CDL the lat and lon tie point variables are two dimensional (X and Y). Could they also have, say, a time dimension? In that case how would we know which of the three dimensions are the two dimensions that need to horizontally interpolated?

In the example you reference we already have time as an auxiliary coordinate variable. I guess you are now referring to having time as a coordinate variable. So you would have a lat/lon grid per time value, each of which would need separate interpolation? I guess that would work, but all the values stored in the interpolation variable would need that time dimension as well.

Also, if time interpolation is in play, we'd need to know which is the time dimension, and would have to add lat: lon: time: time_interpolation, I presume (?).

Can we actually have time twice, both as a coordinate variable and an auxiliary coordinate variable? I am not sure I fully understand the case.

I guess that if we require our coordinates to have standard name (as we said a method could be specified to require), then that would let us tell what is time and what is spatial coordinates.

erget commented 3 years ago

@AndersMS @davidhassell I agree with Anders' comment that

Possibly my preference would be that, for now, we limit ourselves to general purpose interpolation methods, not needing this mapping.

As long as we don't have a concrete use case in mind I would prefer to focus on what we know we need to do. Forward compatibility is good, but it looks like this will be solvable if and when we need to solve it, and like we don't need to solve special-case interpolation methods right now. What do you think?

AndersMS commented 3 years ago

@erget @davidhassell

but it looks like this will be solvable if and when we need to solve it, and like we don't need to solve special-case interpolation methods right now

I would agree that it could be added later and I would support to let it rest for now.

However, it would probably be good to discuss the topic briefly now, in order to demonstrate that we would have an elegant way to add this later.

Correcting what I wrote above:

Probably an attribute in the interpolation variable would do.

I think that attribute would have to be on the data variable, as the association of dimensions would be different for different resolution, say VIIRS I and M band. For the VIIRS example, it could be something like:

 m_radiance:tie_point_dimensions = "m_track: track  m_scan: scan" ;
....
 i_radiance:tie_point_dimensions = "i_track: track  i_scan: scan" ;

where trackand scanare dimension names known by the interpolation method "bi_quadratic_1".

davidhassell commented 3 years ago

Hi @AndersMS, @erget, et al.,

Thanks for taking me up on these questions. Your answers have helped me a lot:

With regards the (X,Y)/(Y,X) I think you're right and we can ignore it. For unstructured grids (such as tripolar which, deep down, I was fixating on) you do need to know, but you can't subsample and interpolate those grids anyway, so they are not a valid use case. So, I think we're OK on that front, with no changes.

With the time stuff, what I was getting was if you had 3-D lats and lons: (T, Y, X). In this case we need to know in any event which dimension is time - whether time is being interpolated or not. However, I think this example does not relate to track/scan case, as for that we have lat(track, scan), lon(track, scan), time(track, scan), i.e. no 1-D time axis. Is that right? (I hadn't fully appreciated this a few days ago - sorry!)

Can we solve all this by saying that the tie_point variables must only span the dimensions that are being interpolated, with no others? It would follow then that the interpolation method cannot be "compound", i.e. cannot be the union of two independent interpolations. Non-compound interpolations are good, because we avoid trouble relating to dimension identification, order of application, etc.

I think that this constraint would be OK for the general case. Would it be also OK for the remote sensing use cases?

All the best, David

AndersMS commented 3 years ago

Hi @davidhassell

For unstructured grids (such as tripolar which, deep down, I was fixating on) you do need to know, but you can't subsample and interpolate those grids anyway, so they are not a valid use case. So, I think we're OK on that front, with no changes.

I would find it interesting to see how well we could interpolate a tripolar grid. But it is not a high priority to try that out right now and you might well be right that it wouldn't work.

Just to keep the matter semi-open, did you like the notation proposed above?