New Attributes for Data Variables?

erget / subsampled-coordinates

Repository for storing CDL demonstrating subsampled coordinates in CF-netCDF

Apache License 2.0

New Attributes for Data Variables? #6

Open ajelenak opened 4 years ago

ajelenak commented 4 years ago

The purpose of this issue is to determine what new attributes for data variables are needed. These variables in CF hold scientific data discretized within a domain and are represented by the Field construct in the CF data model.

There seems to be enough agreement for a new attribute, similar to the grid_mapping attribute. One proposed name for it is tie_point_interpolation. Its value is the name of a container variable which describes the interpolation method for computing coordinate data at the same domain resolution as the field construct to which this attribute is assigned.

Are any additional new attributes needed?

Whenever a field construct depends on multidimensional (rank > 1) coordinates, or a dimension (rank = 1) coordinate is named differently than its dimension, such variables must be listed in a coordinates attribute. This means that every subsampled coordinate will have to be included in this attribute, or in a new one with the same role as the coordinates attribute.

The following short example illustrates using just the coordinates attribute assigned to the swath_data variable (a field construct):

dimensions :
    time = UNLIMITED;
    scan = 512;
    band = 5;
    press = 15;
    sub_time = 100;
    sub_scan = 64;

variables :
    float swath_data(time, scan, press, band);
        swath_data : coordinates = "time lat lon";

    float lat(sub_time, sub_scan);
        lat : standard_name = "latitude";
        lat : units = "degrees_north";

    float lon(sub_time, sub_scan);
        lon : standard_name = "longitude";
        lon : units = "degrees_east";

    double time(sub_time);
        time : standard_name = "time";
        time : units = "<units> since <datetime string>";
        time : calendar = "gregorian";

If using a new attribute, the above would become:

dimensions :
    time = UNLIMITED;
    scan = 512;
    band = 5;
    press = 15;
    sub_time = 100;
    sub_scan = 64;

variables :
    float swath_data(time, scan, press, band);
        swath_data : subsampled_coordinates = "time lat lon";

    float lat(sub_time, sub_scan);
        lat : standard_name = "latitude";
        lat : units = "degrees_north";

    float lon(sub_time, sub_scan);
        lon : standard_name = "longitude";
        lon : units = "degrees_east";

    double time(sub_time);
        time : standard_name = "time";
        time : units = "<units> since <datetime string>";
        time : calendar = "gregorian";

The new attribute, here named subsampled_coordinates, is to be used only for subsampled coordinates that otherwise qualify for inclusion in the coordinates attribute. One reason for the new attribute is that none of the time, lat, or lon coordinates depends on any of swath_data's dimensions. So far in CF, variables listed in the coordinates attribute have always shared at least one dimension with the data variable.
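The shared-dimension observation can be checked mechanically. The following is a minimal Python sketch, assuming the file's variables are summarized as a plain dictionary of dimension tuples. The function name `unshared_coordinates` and the dictionary layout are illustrative only and are not part of CF or of any netCDF library.

```python
def unshared_coordinates(var_dims, data_var, coord_names):
    """Return the listed coordinates that share no dimension with data_var.

    var_dims maps each variable name to a tuple of its dimension names
    (a hypothetical, hand-built summary of the CDL, not a library object).
    """
    data_dims = set(var_dims[data_var])
    return [name for name in coord_names
            if not data_dims.intersection(var_dims[name])]


# Dimensions of the variables in the swath example above.
var_dims = {
    "swath_data": ("time", "scan", "press", "band"),
    "lat": ("sub_time", "sub_scan"),
    "lon": ("sub_time", "sub_scan"),
    "time": ("sub_time",),
}

# All three coordinates are reported, since none spans a swath_data dimension.
print(unshared_coordinates(var_dims, "swath_data", ["time", "lat", "lon"]))
```

A reader that assumes every listed coordinate spans at least one dimension of the data variable would find nothing to match in this example, which is the concern raised above.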

erget commented 4 years ago

Prima facie this makes sense to me but of course these would somehow need to be linked to the container variable so that it's clear how to bring the subsampled coordinates into full resolution.

ajelenak commented 4 years ago

It is simpler to just keep using the coordinates attribute. The presence of an interpolation container attribute could serve as a hint that some of the variables listed in coordinates might be subsampled. Those variables will have a new attribute (subsample_dimension, interpolation_dimension; name TBD) that declares the domain axis their subsampled data applies to. We need to verify whether this approach would be acceptable for CF.

oceandatalab commented 4 years ago

I agree that reusing the coordinates attribute is probably the right approach (if it is acceptable for the CF), otherwise it would become difficult to handle variables with a mix of full and subsampled coordinates.

Adding a subsampled_dimension attribute on subsampled coordinate variables to indicate the dimensions they should expand to would be in line with the issue regarding reusability of interpolation containers https://github.com/erget/subsampled-coordinates/issues/5

oceandatalab commented 4 years ago

Just to complete this with information discussed a few minutes ago:

As explained by @davidhassell during the meeting, reusing coordinates would break backwards compatibility for software that only supports older versions of the CF convention.

So in order to keep this compatibility and still be able to define subsampled variables as coordinates, maybe a solution would be to have both:

a new standard attribute named subsampled_coordinates
a coordinates attribute containing only full coordinate variables so older software would still be able to read the data variables without error but would only be able to locate them with coordinates provided on full dimensions.

For example (simplistic, there would obviously be better ways to describe this kind of data):

dimensions :
    time = UNLIMITED;
    lat = 720;
    lon = 1440;
    sub_lat = 10;
    sub_lon = 20;
    aux = 15;

variables :
    float grid_data(time, lat, lon, aux);
        grid_data : coordinates = "time";
        grid_data : subsampled_coordinates = "time lat lon";

    float lat(sub_lat);
        lat : standard_name = "latitude";
        lat : units = "degrees_north";
        lat : interpolation = "interpolation_doesnotmatter"; 

    float lon(sub_lon);
        lon : standard_name = "longitude";
        lon : units = "degrees_east";
        lon : interpolation  = "interpolation_doesnotmatter";

    double time(time);
        time : standard_name = "time";
        time : units = "<units> since <datetime string>";
        time : calendar = "gregorian";

    char interpolation_doesnotmatter;
        interpolation_doesnotmatter : description = "not the subject of this issue";

Software that does not support new CF versions would:

read the grid_data variable without error as the CDL remains valid for previous versions of the CF convention
detect the availability of a coordinates standard attribute
read coordinate variables listed in the coordinates attribute (so, exclusively non-subsampled variables)
locate data on the time axis (but not on lat or lon)

Software that implements new CF conventions would:

read the grid_data variable without error
detect that there are both coordinates and subsampled_coordinates standard attributes
use the subsampled_coordinates attribute since it provides at least as much information as the coordinates attribute, but potentially more
read coordinate variables listed in subsampled_coordinates, whether they are subsampled or not,
expand subsampled coordinate variables using the designated interpolation method
locate data on the time, lat and lon axes.

This method adds an overhead (one additional attribute for each data variable linked to a set of coordinates) that could disappear once/if backward compatibility is discarded in later versions of the CF convention.
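The two reader behaviours listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: attributes are represented as a plain dictionary, and the function name `coordinate_names` and the flag `supports_subsampling` are hypothetical, not part of any existing CF software.

```python
def coordinate_names(attrs, supports_subsampling):
    """Return the coordinate variable names a reader would use.

    attrs is a dictionary of the data variable's attributes.
    supports_subsampling is True for software that implements the
    proposed subsampled_coordinates attribute (an assumed name).
    """
    if supports_subsampling and "subsampled_coordinates" in attrs:
        # New software prefers the richer attribute.
        return attrs["subsampled_coordinates"].split()
    # Older software only knows about the coordinates attribute.
    return attrs.get("coordinates", "").split()


# Attributes of grid_data from the example above.
grid_data_attrs = {
    "coordinates": "time",
    "subsampled_coordinates": "time lat lon",
}

legacy = coordinate_names(grid_data_attrs, supports_subsampling=False)
modern = coordinate_names(grid_data_attrs, supports_subsampling=True)
print(legacy)  # ['time']
print(modern)  # ['time', 'lat', 'lon']
```

The legacy reader never sees the subsampled variables, so it cannot misinterpret them; it simply locates the data on fewer axes.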

ajelenak commented 4 years ago

I have added an example that is in line with the above summary.

davidhassell commented 4 years ago

Hi, I think that there will be resistance to referencing the interpolation container from the subsampled coordinates, rather than from the data variable. I think that this is preferable:

dimensions :
    time = UNLIMITED;
    lat = 720;
    lon = 1440;
    sub_lat = 10;
    sub_lon = 20;
    aux = 15;

variables :
    float grid_data(time, lat, lon, aux);
        grid_data : coordinates = "time";
        grid_data : subsampled_coordinates = "lat lon";
        grid_data : interpolation  = "interpolation_doesnotmatter";

    float lat(sub_lat);
        lat : standard_name = "latitude";
        lat : units = "degrees_north";

    float lon(sub_lon);
        lon : standard_name = "longitude";
        lon : units = "degrees_east";

    double time(time);
        time : standard_name = "time";
        time : units = "<units> since <datetime string>";
        time : calendar = "gregorian";

    int tie_points_lon(sub_lon) ;
        tie_points_lon:interpolation_dimension = "lon" ;

    int tie_points_lat(sub_lat) ;
        tie_points_lat:interpolation_dimension = "lat";

    char interpolation_doesnotmatter;
        interpolation_doesnotmatter : description = "not the subject of this issue";

The reasons for this are that

this is how other container variables are encoded (not a particularly strong argument on its own, but it gives consistency);
the variable lon is only a subsampled coordinate in the context of the data variable;
the variable lon cannot apply the interpolation independently because it doesn't know about the other, linked subsampled coordinates, lat in this case.

oceandatalab commented 4 years ago

Numbering my comments so it is easier to reply:

1. lon and lat are independent in your example, otherwise their dimensions would be (sub_lat, sub_lon)
2. Supposing lat and lon depend on each other and we want to keep the interpolation container variable as reusable as possible (like a function), then we need an attribute materializing this dependency (function arguments). For example:

```
float lat(sub_lat, sub_lon);
    lat : standard_name = "latitude";
    lat : units = "degrees_north";
    lat : interpolation = "interp_bilinear_container";
    lat : interpolation_terms = "v1 : lat v2 : lon";

float lon(sub_lat, sub_lon);
    lon : standard_name = "longitude";
    lon : units = "degrees_east";
    lon : interpolation = "interp_bilinear_container";
    lon : interpolation_terms = "v1 : lat v2 : lon";

char interp_bilinear_container;
    interp_bilinear_container : standard_name = "bilinear";
```

3. Let's say the `time` variable is also subsampled, it does not depend on `lat` or `lon` so there are two independent interpolations to perform (one for `lat`/`lon` and one for `time`).

My understanding is that with your approach it would either mean that:
 -  the `interpolation` attribute accepts several values, but in that case you need to define which coordinate variables are targeted by each interpolation method (keeping in mind that the container variable cannot refer to other variables in order to remain generic/reusable), so you need more interpolation-related attributes on each data variable using these coordinates.
 - or there is a single value in the `interpolation` attribute, but in that case the method described in the container variable has to handle the interpolation of all subsampled coordinates (i.e. `time`, `lat` and `lon`) altogether, which adds complexity in the definition of the interpolation container variable.

4. I understand that mimicking the behavior of existing constructs can facilitate acceptance of the proposal, but for me it makes much more sense to keep the interpolation-related attributes in the subsampled coordinate variables: they are the ones we "compressed" and that need to be reconstructed, not the data variables that reference the coordinate variables.

davidhassell commented 4 years ago

> Let's say the time variable is also subsampled, it does not depend on lat or lon so there are two independent interpolations to perform (one for lat/lon and one for time).

Good point!

This is easily dealt with in the same way that different coordinate variables can have different grid_mappings (http://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#grid-mappings-and-projections):

dimensions :
    time = UNLIMITED;
    lat = 720;
    lon = 1440;
    sub_lat = 10;
    sub_lon = 20;
    aux = 15;

variables :
    float grid_data(time, lat, lon, aux);
        grid_data : coordinates = "time";
        grid_data : subsampled_coordinates = "lat lon time";
        grid_data : interpolation  = "interpolation_XY: lat lon interpolation_T: time";

    float lat(sub_lat);
        lat : standard_name = "latitude";
        lat : units = "degrees_north";

    float lon(sub_lon);
        lon : standard_name = "longitude";
        lon : units = "degrees_east";

    double time(time);
        time : standard_name = "time";
        time : units = "<units> since <datetime string>";
        time : calendar = "gregorian";

    int tie_points_lon(sub_lon) ;
        tie_points_lon:interpolation_dimension = "lon" ;

    int tie_points_lat(sub_lat) ;
        tie_points_lat:interpolation_dimension = "lat";

    char interpolation_XY;
        interpolation_XY : description = "not the subject of this issue";

    char interpolation_T;
        interpolation_T : description = "not the subject of this issue";
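To make the grid_mapping-like syntax of the interpolation attribute concrete, here is a small Python sketch of how a reader might split such a value into containers and the coordinates they apply to. The function name `parse_interpolation` is hypothetical, and the treatment of a single bare name (one container for all subsampled coordinates) is an assumption borrowed from how grid_mapping behaves, not something agreed in this thread.

```python
def parse_interpolation(value):
    """Map each interpolation container name to its coordinate names.

    Accepts either a single container name, or the extended form
    "container_a: coord1 coord2 container_b: coord3".
    A bare name maps to an empty list, read here as "applies to all
    subsampled coordinates" (an assumption for this sketch).
    """
    tokens = value.split()
    if not any(token.endswith(":") for token in tokens):
        if len(tokens) != 1:
            raise ValueError("expected a single container name: %r" % value)
        return {tokens[0]: []}

    mapping = {}
    current = None
    for token in tokens:
        if token.endswith(":"):
            # A token ending in ':' starts a new container entry.
            current = token[:-1]
            mapping[current] = []
        elif current is None:
            raise ValueError("coordinate listed before any container: %r" % value)
        else:
            mapping[current].append(token)
    return mapping


parsed = parse_interpolation("interpolation_XY: lat lon interpolation_T: time")
print(parsed)  # {'interpolation_XY': ['lat', 'lon'], 'interpolation_T': ['time']}
```

With this layout the container variables stay free of references to specific coordinates, and the association lives entirely on the data variable.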

AndersMS commented 4 years ago

Often the lat/lon or scan/track interpolation has to be done first, and time and viewing angles depend on this first interpolation. So, for efficiency, it would also be good to have a way to bundle these in a single interpolation container.

I would support adding references to the index mapping variables in the data variable to improve re-usability of the interpolation container.

Here is a VIIRS example with two different data variable resolutions, M-Band at 750m and I-Band at 375m. I left out all the viewing angles and interpolation coefficients for clarity:


dimensions :

    // VIIRS M-Band
    m_track = 768 ;
    m_scan = 3200 ;
    m_channel = 16 ;

    // VIIRS I-Band
    i_track = 1536 ;
    i_scan = 6400 ;
    i_channel = 5 ;

    // Tie points
    tp_track = 96 ;
    tp_scan = 205 ;

    // Time, stored at scan-start and scan-end of each scan
    time_scan = 2 ;

variables :

    // VIIRS M-Band
    float m_radiance(m_track, m_scan, m_channel) ;
        m_radiance : interpolation = "interpolation_all" ;
        m_radiance : interpolation_indices = "m_track_indices m_scan_indices" ;

    int m_track_indices(tp_track) ;
        m_track_indices : interpolation_dimension = "m_track" ;

    int m_scan_indices(tp_scan) ;
        m_scan_indices : interpolation_dimension = "m_scan" ;

    // VIIRS I-Band
    float i_radiance(i_track, i_scan, i_channel) ;
        i_radiance : interpolation = "interpolation_all" ;
        i_radiance : interpolation_indices = "i_track_indices i_scan_indices" ;

    int i_track_indices(tp_track) ;
        i_track_indices : interpolation_dimension = "i_track" ;

    int i_scan_indices(tp_scan) ;
        i_scan_indices : interpolation_dimension = "i_scan" ;

    // Reusable interpolation container, shared by VIIRS M-Band and I-Band

    char interpolation_all ;
        interpolation_all : tie_point_interpolation_name = "bi_quadratic_method1" ;
        interpolation_all : location_tie_points = "lat lon" ;
        interpolation_all : time_interpolation_name = "bi_linear" ;
        interpolation_all : time = "t" ;

    // Tie points

    float lat(tp_track, tp_scan) ;
        lat : standard_name = "latitude" ;
        lat : units = "degrees_north" ;

    float lon(tp_track, tp_scan) ;
        lon : standard_name = "longitude" ;
        lon : units = "degrees_east" ;

    double t(tp_track, time_scan) ;
        t : long_name = "time" ;
        t : units = "days since 1990-1-1 0:0:0" ;
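As a rough illustration of what the index variables enable, the Python sketch below expands tie point values along one dimension to full resolution. It rests on several assumptions that are not settled in this thread: that an index variable holds the zero-based positions of the tie points along the dimension named in its interpolation_dimension attribute, that those positions are strictly increasing and cover the first and last index, and that plain linear interpolation is used instead of the bi-quadratic method named in the container. The function name `expand_linear` is hypothetical.

```python
def expand_linear(tie_values, tie_indices, size):
    """Linearly interpolate tie point values onto indices 0..size-1.

    tie_values  -- values stored at the tie points
    tie_indices -- assumed zero-based positions of those tie points
                   along the full-resolution dimension (strictly increasing)
    size        -- length of the full-resolution dimension
    """
    if len(tie_values) != len(tie_indices) or len(tie_values) < 2:
        raise ValueError("need at least two tie points with matching indices")

    expanded = []
    segment = 0
    for i in range(size):
        # Move to the tie point interval that contains index i.
        while segment < len(tie_indices) - 2 and i > tie_indices[segment + 1]:
            segment += 1
        i0, i1 = tie_indices[segment], tie_indices[segment + 1]
        v0, v1 = tie_values[segment], tie_values[segment + 1]
        weight = (i - i0) / (i1 - i0)
        expanded.append(v0 + weight * (v1 - v0))
    return expanded


# Three tie points expanded onto a dimension of length 5.
print(expand_linear([10.0, 20.0, 40.0], [0, 2, 4], 5))
# [10.0, 15.0, 20.0, 30.0, 40.0]
```

Because the function only takes values, positions, and a target size, the same routine could serve both the M-Band and I-Band variables with their respective index variables, which is the kind of reuse the shared container is meant to allow.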

ajelenak commented 4 years ago

I think that either the interpolation_indices attribute (new name: _compactindices?) or something like compacted_coordinates = "lat lon time" (new name!) will convey the same information.

The m_radiance data variable depends on the m_track and m_scan dimensions which are mentioned in the interpolation_dimension (new name: compacted_dimension?) attributes of the m_track_indices and m_scan_indices variables. They share the same tp_track and tp_scan dimensions as the lat, lon, and t compacted coordinates.

AndersMS commented 4 years ago

I would agree that compacted_dimension is nice and descriptive. However, by comparison to the grid mapping terminology

    grid_mapping = ...
    grid_mapping_name = ...

I think I prefer that all names start with interpolation_ for the attributes in the data set:

    interpolation = ....
    interpolation_indices = ...

and the attributes of the indices variable:

   interpolation_dimension = ....

This makes them appear as part of the same concept, which they are.

What do you think?