covjson / specification

CoverageJSON specification
https://covjson.org/spec/

Move rangeAxisOrder from Domain to Range? #56

Closed: letmaik closed this issue 8 years ago

letmaik commented 8 years ago

I know we have discussed this before, but the issue came up again while I was writing some examples and text for the cookbook. I noticed how strange it is that a stand-alone Domain object (see playground) is forced to include "rangeAxisOrder". This is not only irrelevant for the domain itself, it also hinders reuse of the Domain document in contexts other than CoverageJSON (for example, someone might reuse Domain but have a different outer Coverage structure, or not use coverages at all).

I thought about moving this property to the Coverage level, but that doesn't make much sense either: when the domain and ranges are referenced by URLs, there would be a weird "rangeAxisOrder" property referring to axis names that are not visible yet. However, by the time you load a range you have already loaded the domain, so having the axis order in the range makes more sense. With "rangeAxisOrder" in each range object you could argue that data is repeated, but I think this is OK: the axis order is only necessary when there are multiple axes with more than one element (grid axes), and those ranges are big anyway, so this property will not make much difference. You could even extend range objects to optionally be stand-alone nd-arrays by including a "shape": [20,10] property, making them very reusable for other purposes.

I know that I was against this at the beginning for simplicity reasons, but I think my concerns were overblown, and the advantages of doing it the other way outweigh them by far. I looked at what I would have to change in my library that exposes covjson as a JS API, and the changes are so minimal and trivial that I don't think it is an issue in general. In fact, it will make things more flexible, also thinking ahead to different range encodings like sparse ndarray encodings where you would not need rangeAxisOrder. You could even mix both types together and it would make sense (e.g. a dense encoding for measurements and a sparse encoding for nodata flags, if the amount of nodata is small for example).

letmaik commented 8 years ago

I think it would be actually quite nice if we can make the range objects more reusable, and I think I would start by renaming the "type" from "Range" to something more general, for example:

{
  "type" : "NdArray",
  "dataType": "float",
  "axisNames": ["y", "x"],
  "shape": [2, 2],
  "values" : [
    17.3, 18.2,
    18.1, 19.4
  ]
}

When used within a coverage, the constraint would be that the axis names of the ndarray must correspond to the domain axes, and the shape must match.
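
For illustration, a minimal sketch of how a client might read a value from such a self-describing object using only "axisNames" and "shape", assuming values are stored in row-major order over the axes as in the examples in this thread (the helper name is made up, not part of the spec):

// Sketch only: generic row-major lookup into an NdArray-style range object.
const range = {
  type: 'NdArray',
  dataType: 'float',
  axisNames: ['y', 'x'],
  shape: [2, 2],
  values: [17.3, 18.2, 18.1, 19.4]
}

// Compute the flat index for named indices, e.g. {y: 1, x: 0},
// assuming row-major order over axisNames; missing axes default to 0.
function flatIndex (range, indices) {
  let offset = 0
  for (let i = 0; i < range.axisNames.length; i++) {
    offset = offset * range.shape[i] + (indices[range.axisNames[i]] || 0)
  }
  return offset
}

range.values[flatIndex(range, { y: 1, x: 0 })] // 18.1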

What sparse ndarrays could look like:

{
  "type" : "SparseNdArray",
  "dataType": "float",
  "axisNames": ["y", "x"],
  "shape": [2, 2],
  "values" : [
    0, 0, 17.3, // y, x, value
    1, 1, 19.4
  ]
}
jonblower commented 8 years ago

I like this approach. Questions/comments are:

  1. Given the addition of axisNames and shape, this means that a client could make a simple visualisation of the range without any knowledge of the domain, which could be quite handy, particularly for quick-checks and debugging.
  2. Would the "SparseNdArray" actually be an array of arrays, rather than a 1d array?
  3. One possible objection to removing rangeAxisOrder from the domain is that it might make certain optimisations difficult. With a large range, the client may wish to choose from a range of possible data-reading (subsetting) strategies. The choice of strategy may depend on the axis order, and the client may not want to access the whole range to find this. At the risk of introducing another layer of complexity, I wonder if it might be possible for a client to download the metadata for the range without the values array?
letmaik commented 8 years ago

To the second question, I thought of both ways, but in the end a 1D array is more space efficient in terms of character count, and you can handle it as a normal nd array if you already have that machinery. For the example above, you access the sparse entries by creating a view on that 1D array with axes "itemnumber" and "itemmembers"; the shape would be [item count, original axis count + 1]. So, for the 10th item, you have y = sparse[9, 0], x = sparse[9, 1] and val = sparse[9, 2]. I think JavaScript engines would also be more efficient at handling a single big 1D array compared to nested arrays, but that's just speculation and it may change with new optimizations. So maybe this is premature optimization on my part; there's still the space efficiency though. By the way, the same argument applies to dense ndarrays of course: those could also be encoded as actual nested arrays.
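
A minimal sketch of that view, assuming the SparseNdArray example above (each item stores its members in the order y, x, value):

// Sketch only: read sparse entries from the 1D values array of the
// SparseNdArray example above, where each item stores [y, x, value].
const values = [0, 0, 17.3, 1, 1, 19.4]
const axisCount = 2                    // ["y", "x"]
const itemSize = axisCount + 1         // axis indices plus the value itself
const itemCount = values.length / itemSize

// Equivalent to viewing the flat array with shape [itemCount, itemSize].
function getItem (i) {
  const offset = i * itemSize
  return { y: values[offset], x: values[offset + 1], value: values[offset + 2] }
}

getItem(1) // { y: 1, x: 1, value: 19.4 }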

To the third comment, I think you have to be more concrete. The axis ordering is just an encoding detail in my opinion. If you need a subset, then you don't think about that ordering, as you are at a different abstraction level. I can't think of a case where I would do my server queries differently based on some axis ordering.

letmaik commented 8 years ago

As a test I implemented this in covjson-reader in a backwards-compatible way (supporting both formats), and updated the covjson playground.

To try it, you can paste in:

{
  "type" : "Coverage",
  "profile" : "GridCoverage",
  "domain" : {
    "type" : "Domain",
    "profile" : "Grid",
    "axes": {
      "x": { "values": [-10,-5,0] },
      "y": { "values": [40,50] },
      "z": { "values": [5] },
      "t": { "values": ["2010-01-01T00:12:20Z"] }
    },
    "referencing": [{
      "components": ["y","x","z"],
      "system": {
        "type": "GeodeticCRS",
        "id": "http://www.opengis.net/def/crs/EPSG/0/4979"
      }
    }, {
      "components": ["t"],
      "system": {
        "type": "TemporalRS",
        "calendar": "Gregorian"
      }
    }]
  },
  "parameters" : {
    "ICEC": {
      "type" : "Parameter",
      "unit" : {
        "symbol" : "fraction"
      },
      "observedProperty" : {
        "label" : {
          "en": "Sea Ice Concentration"
        }
      }
    }
  },
  "ranges" : {
    "ICEC" : {
      "type" : "NdArray",
      "dataType": "float",
      "axisNames": ["t","z","y","x"],
      "shape": [1, 1, 2, 3],
      "values" : [ 0.5, 0.6, 0.4, 0.6, 0.2, null ]
    }
  }
}

Or paste just the domain object without axis order, which now works in the playground:

{
  "type" : "Domain",
  "profile" : "Grid",
  "axes": {
    "x" : { "start": 7, "stop": 14, "num": 4 },
    "y" : { "start": 54, "stop": 48, "num": 4 }
  },
  "referencing": [{
    "components": ["x","y"],
    "system": {
      "type": "GeodeticCRS",
      "id": "http://www.opengis.net/def/crs/OGC/1.3/CRS84"
    }
  }]
}

There are several smaller things we would have to change in the spec, and this also affects the common profiles: the Domain profile can no longer enforce an axis ordering, so this has to be done by the coverage profiles instead (GridCoverage rather than Grid). It's just a matter of moving the constraints around in the document.

I would say that axisNames and shape is required in all cases, even for domains with only non-varying axes. Do you think that's ok?

jonblower commented 8 years ago

I would say that axisNames and shape is required in all cases, even for domains with only non-varying axes. Do you think that's ok?

Yes, I think it's good to be explicit about these.

Referring to an earlier comment:

To the third comment, I think you have to be more concrete. The axis ordering is just an encoding detail in my opinion. If you need a subset, then you don't think about that ordering, as you are at a different abstraction level. I can't think of a case where I would do my server queries differently based on some axis ordering

Fair enough. On the server side, if you make some assumptions about the order in which data are stored on disk you can make queries faster by requesting data in contiguous order (because "hopping around" on the disk is inefficient). Sometimes it's more efficient to request a large block of contiguous data rather than several small blocks. Knowing the axis order in which data are stored on disk helps to make this decision. But it's entirely possible that the axis order on disk won't correspond to the order in the JSON document, so I'm happy to forget that comment!

letmaik commented 8 years ago

I changed the spec accordingly now (see 2916560baa43fabf63a0b5286c02b21da8978947). Also, I've restructured the common profiles specification: https://github.com/Reading-eScience-Centre/coveragejson/blob/master/profiles.md Have a look and let me know if anything seems wrong or weird.

jonblower commented 8 years ago

A couple of comments about the profiles.md document:

  1. I'm not clear on the difference between the "Grid" and "GridCoverage" profiles
  2. When you say, "A coverage with GridCoverage profile must have NdArray objects with axisNames of either "t","z","y","x"..." do you mean that the axisNames must appear in that order, or can they be permuted?
  3. At the risk of opening a can of worms, I'm still not quite comfortable with the use of the word "profile" here. I can see the logic, but it could get confused with "vertical profile". Also, to my mind, a "profile" (in the sense of a specification) denotes a larger thing, e.g. a specialisation of a specification that is targeted to a particular community. I think in our cases these "profiles" are more like "feature types" or "subtypes".
letmaik commented 8 years ago

Hmm, let's see.

  1. "Grid" is specific to Domain objects only and can only be used on them. On the other hand, "GridCoverage" is for Coverage objects and says that the domain of such a coverage must have a "Grid" profile and a specific axis ordering in the range NdArray objects.
  2. Yes, I mean that the axisNames values must appear in that order; I should probably rephrase that to make it clearer, right?
  3. Let's do that in another issue: https://github.com/Reading-eScience-Centre/coveragejson/issues/58
jonblower commented 8 years ago
  1. Oops, sorry about that, I was being thick. It's too early for me to spot that sort of thing...
  2. Do the axes have to be in that order? Usually they are (it's a CF recommendation rather than a stipulation) and I can see that it could be helpful to mandate this, but sometimes data files are not stored in this way, and forcing a CovJSON server to reorder them might be inefficient. It's a balance I guess between flexibility and simplicity.
letmaik commented 8 years ago

Well, if we just recommend an ordering, is there still a point? Clients couldn't depend on it; they would have to check whether the order follows the recommendation and otherwise fail if they are hard-coded for that order. The balance is potentially between more work for servers and more work for clients. But on the other hand, my libraries completely abstract over that and support whatever order is given. Not sure.

In the end I see the following scenarios for simple hard-coded clients that would rely on such a fixed, guaranteed order:

  1. The domain is very simple, either just an X/Y grid or a single axis as for trajectories or time series. Then a hard-coded client which uses a given grid coverage (for example, displaying hourly updated wind speeds on a map) would not bother with being generic but just access values like values[i*nx + j], where i is the y-axis index and j the x-axis index (see the sketch after this list). In the trajectory or time series case, just values[i].
  2. The client wants to do calculations between the range values of two parameters of a grid coverage. For example, two bands of a satellite image to derive NDVI. In that case, the client could rely on a fixed axis order and hence value order and just do the calculations on the raw 1D arrays. Of course, it is highly unlikely anyway that the range values of two parameters in one grid would be stored in a different order. So this is probably not a real concern, just something that would not be forbidden if the order is just recommended.
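
A minimal sketch of both scenarios, assuming a fixed ["y","x"] axis order; redRange and nirRange are made-up stand-ins for two range objects of the same grid coverage, not anything defined by the spec:

// Sketch only: a hard-coded client assuming a fixed ["y", "x"] axis order.
const redRange = { axisNames: ['y', 'x'], shape: [2, 2], values: [0.10, 0.12, 0.11, 0.13] }
const nirRange = { axisNames: ['y', 'x'], shape: [2, 2], values: [0.40, 0.42, 0.38, 0.45] }
const nx = redRange.shape[1]

// Scenario 1: direct access, i = y-axis index, j = x-axis index.
const valueAt = (i, j) => redRange.values[i * nx + j]
valueAt(1, 0) // 0.11

// Scenario 2: per-cell calculation across two parameters (e.g. NDVI),
// relying on both ranges sharing the same axis order and value order.
const ndvi = redRange.values.map((red, k) => {
  const nir = nirRange.values[k]
  return (nir - red) / (nir + red)
})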

Maybe as a compromise we could say that the orderings are recommended but there can't be different axis orderings for different parameters of a single coverage? Then clients could still do the second scenario without worrying, and the first scenario is probably fine too. Maybe we're overthinking this.

jonblower commented 8 years ago

Well, if we just recommend an ordering is there still a point then?

Yes - a simple client could choose to support only the recommended ordering (in the knowledge that it should work with most datasets) but fail otherwise. That's not unusual in the CF world, where scientists write "good but not bomb-proof" scripts.

But on the other hand, my libraries completely abstract over that and support whatever order is given.

Right, so it's certainly possible to support arbitrary order. How hard was this to achieve?

Maybe as a compromise we could say that the orderings are recommended but there can't be different axis orderings for different parameters of a single coverage?

Yes. (I don't see this as a compromise but an entirely sensible constraint.) Does this imply that the axisNames property ought to be promoted from the individual range objects to the top-level ranges object? That way it would be explicit that all ranges must share this ordering.

letmaik commented 8 years ago

But on the other hand, my libraries completely abstract over that and support whatever order is given.

Right, so it's certainly possible to support arbitrary order. How hard was this to achieve?

Well, the client interface now operates exclusively via axis names. To achieve that, I first used a JS library called ndarray which simply provides a view on a 1D array together with a shape, in terms of a get(i,j,k,...) function (very efficient, uses dynamically compiled code). That freed me from having to do index calculations myself. On top of that get function I implemented a function where you give indices not in a specific order but by name, so get({x: i, y: j, ...}), based on a mapping from axis number to axis name. Doing that both efficiently and generically was a bit tricky. First I had this one:

function createRangeGetFunction (ndarr, axisOrder) {
  const axisCount = axisOrder.length
  // Return a getter that maps named indices, e.g. {x: 1, y: 0},
  // to positional indices in the order given by axisOrder
  // (missing axes default to index 0).
  return obj => {
    let indices = new Array(axisCount)
    for (let i=0; i < axisCount; i++) {
      indices[i] = axisOrder[i] in obj ? obj[axisOrder[i]] : 0
    }
    return ndarr.get(...indices)
  }
}

However, creating temporary arrays like that every time you fetch a value is too inefficient. To fix that, I dynamically compile a function from source code once (per range):

function createRangeGetFunction (ndarr, axisOrder) {
  // Build the argument list for ndarr.get() as a source string, e.g.
  // "'y' in obj ? obj['y'] : 0,'x' in obj ? obj['x'] : 0"
  let ndargs = ''
  for (let i=0; i < axisOrder.length; i++) {
    if (ndargs) ndargs += ','
    ndargs += `'${axisOrder[i]}' in obj ? obj['${axisOrder[i]}'] : 0`
  }
  // Compile the getter once so no temporary index array is created per call.
  let fn = new Function('ndarr', `return function ndarrget (obj) { return ndarr.get(${ndargs}) }`)(ndarr)
  return fn
}

So when you call that, you essentially get back something like:

function ndarrget(obj) {
  return ndarr.get(obj['x'], obj['y'])
}

which is twice as fast on Chrome.
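
For context, a minimal usage sketch (assuming the ndarray npm package mentioned above; the sample data is made up):

// Sketch only: using createRangeGetFunction with the ndarray package.
const ndarray = require('ndarray')

// 2x2 grid, axis order ["y", "x"], values in row-major order.
const ndarr = ndarray(new Float64Array([17.3, 18.2, 18.1, 19.4]), [2, 2])
const get = createRangeGetFunction(ndarr, ['y', 'x'])

get({ y: 1, x: 0 }) // 18.1
get({ x: 1 })       // 18.2 (missing axes default to index 0)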

Since we now have a fully self-describing NdArray (with shape and axis names), it may be worth putting this functionality into a small library so that people can get started quicker without having to use all the other stuff we have (like forcing them to use the complete JS coverage data API that you get when using the covjson-reader library).

Maybe as a compromise we could say that the orderings are recommended but there can't be different axis orderings for different parameters of a single coverage?

Yes. (I don't see this as a compromise but an entirely sensible constraint.) Does this imply that the axisNames property ought to be promoted from the individual range objects to the top-level ranges object? That way it would be explicit that all ranges must share this ordering.

I wouldn't do that, since the axisNames property is specific to the NdArray encoding and there may be other encodings which are defined differently. Also, the NdArray object would then no longer be self-contained.

jonblower commented 8 years ago

it may be worth putting this functionality into a small library

Sounds good, if you have time. Sounds like a useful reusable bit of code.

I wouldn't do that since the axisNames property is specific for that NdArray encoding and there may be other encodings which are defined differently.

So you envisage that in the same coverage there may be a mixture of range types (NdArray, SparseArray)? I guess this could happen - it may be better to use a sparse array for things like masks but NdArrays for dense data.

(Which makes me wonder, should we call NdArray a DenseArray? I think the sparse arrays are also logically nd, right?)

the NdArray object wouldn't be self-contained anymore.

Yes, that's a good argument for keeping the axisNames property in the range object. I guess that a validator utility could check that the different range objects use consistent axis orders.

letmaik commented 8 years ago

So you envisage that in the same coverage there may be a mixture of range types (NdArray, SparseArray)? I guess this could happen - it may be better to use a sparse array for things like masks but NdArrays for dense data.

Yes, but that's not really what I was getting at. Even without a mixture, a new range encoding could be invented in the future which doesn't need an ordered axis names array. So the point is that axisNames is specifically a necessity of the NdArray encoding we currently have. And sure, since we don't have anything else it is a bit redundant to repeat it in each range, but this also keeps ranges flexible and self-contained, and I think it will cause less trouble in the future. In terms of bytes we shouldn't worry about it, especially when gzipping.

(Which makes me wonder, should we call NdArray a DenseArray? I think the sparse arrays are also logically nd, right?)

Well, I thought about that, and yes, sparse arrays are also nd arrays by nature, that is, if you actually regard a sparse encoding as an nd array again. You could also regard it as a simple mapping from multiple axis indices to values (a function), then the array term disappears.

I just think that dense NdArrays will be a lot more common and will be the default case, so I didn't want to make them sound too complicated. You can always have a SparseNdArray even though it is not exactly right from an interface/inheritance point of view; I think it really doesn't matter. And I like the fact that "Nd" is in the name, so if we rename, I would go for DenseNdArray and SparseNdArray. But as I said, if you don't regard a sparse encoding as an nd array, you could also name it something different like .... ok, can't think of anything good, StructuredFlatArray...

jonblower commented 8 years ago

OK that's all fine I think. I don't feel strongly about the nomenclature of the arrays, but we could just call them "Array" and "SparseArray", given that it's probably clear that the data they represent are multidimensional. But I guess "NdArray" helps to disambiguate a bit.

letmaik commented 8 years ago

OK, I would leave it as NdArray then, sounds a bit more hip.

What about the axis ordering then? Should it be a recommendation only? I guess it's fine.

It would read like that then:

A coverage with GridCoverage profile should have NdArray objects with "axisNames" of either ["t","z","y","x"], ["z","y","x"], ["t","y","x"], or ["y","x"], depending on which axes are defined in the domain.

Does putting the axis names in an array make it clear that the axis order is meant to be like that?

jonblower commented 8 years ago

I think the order of axes should be a recommendation (not mandatory), which would mean that your text would need to be edited to make this clear. I think that if you put the axis names in an array it looks like the order is mandated.

letmaik commented 8 years ago

I changed it, and mention the order recommendation only once at the beginning:

The axis ordering in "axisNames" of NdArray objects should follow the order "t", "z", "y", "x", "composite", leaving out all axes that do not exist.