Subsetting - Githubissues

letmaik commented 9 years ago

The API currently does not support loading a subset of either domain or range.

For ranges, adding support would be easy, similar to the DAP standard. The client would load the full domain and then loads a range subset (implicitly based on actual coordinates) by array indices constraints (e.g. every 5th grid cell within indices 0 to 10000). Handling indices is possible since the domain is loaded and defines the shape and semantics of the axes which then easily allows client-side subsetting like "get me everything in that geographical bounding box". This could look like cov.loadRange('salinity', {xIdx: [0,10000, 5], yIdx:[0,10000,5], zIdx: [0,0], tIdx: [null,null]}).

However, for cases like very long time series, the domain itself is already too large to load at once. Creating a JS API for subsetting by domain coordinates is tricky though since domains are kept generic and nothing is known before-hand about them. For primitive-value axes this may not be a big problem as it means just exposing the axis extents (e.g. 0-100meters depth) and possibly array shape before loading the domain. Then this could look like cov.loadDomain({z: [0,100]}). However, for more complex domain types, like CoverageJSON's MultiPolygon, it is not clear how a subsetting call would look like, probably this is possible by index only, which would not be very useful.

An alternative could be to implement subsetting at a separate layer specific to the used format/protocol/domain type, and each such subsetted coverage data could then be exposed as a standard Coverage object which doesn't know anything about subsetting. The advantage is that the Coverage API stays simple. The disadvantage is that it prevents building generic and efficient mapping clients for arbitrary coverages, since a single logical coverage would then be split up into subsetted coverages and cannot be handled as a single entity by clients.

Questions are:

Did I miss anything? Is generic coordinate-based domain subsetting possible after all?
Should index-based range subsetting be added without adding coordinate-based domain subsetting?

Further notes: The Web Coverage Service standard defines domain and range subsetting differently. Range subsetting is purely the selection of the parameter(s). Everything else is domain subsetting and this happens with actual coordinates (not indices) which are either numbers or strings (it mentions dates as example). This is based on the fact that clients have fetched the GetCapabilities document which describes the domain, axes, axis names, CRS etc.

Possible solution: Range subsetting is as in WCS, so just cov.loadRange(paramKey) as is now. For domain subsetting, a new method cov.subset(subsetSpec) is added which returns a new Coverage object with the domain subsetted to the given subset specification object. The format of subsetSpec is defined by the domainType, which can also define that subsetting is not supported. For the CoverageJSON domain types except polygon* the subsetSpec format could just be {'y': [-90,0], 'x': [-10,10]} with values in native crs coordinates (corresponds to WCS "DimensionTrim"), and for selecting a single axis element: {'z': -50} (corresponds to WCS "DimensionSlice"). For the MultiPolygon domain, we could either say that subsetting is not possible, or base it on indices, like {'polygonIdx': [0,4,7,8,9]} which in most cases would mean that the client would first load the full domain with loadDomain() and then do subsetting based on the indices.

This solution seems to be in line with WCS and coverages of such a service could be wrapped with our API.

What about OPeNDAP? I see OPeNDAP just as a supplier for the range values, which means that subsetting would only work on indices. As in the MultiPolygon example above this would internally mean first loading the full domain and then subsetting it (=querying OPeNDAP). I see no way around it, except having some intermediate service which can translate coordinates to indices so that coordinates can be used directly for subsetting (without having to load the domain first).

An issue with the above solution is that subsetting is based on CRS coordinates. That means that the CRS info must be available without doing a loadDomain() first. Which is tricky since CRS is a concept tightly coupled to the domain type. And some domains may have multiple CRSs for different axes.

jonblower commented 9 years ago

Not sure I've fully digested this but here are some first thoughts:

Long timeseries of data are likely (but not guaranteed) to be equally spaced in time. We could compress the domain by allowing a start/stop/period syntax.
Could we allow both subsetting by index (OPeNDAP-style) and by coordinate (WCS-style)? There are pros and cons to both:
- WCS-style is arguably simpler for the client but there are issues around floating-point comparison and partial-cell selection - also you don't know exactly how much data you might be getting back from the request. What do you get back from the request - a full Coverage or just a range array? The range array alone may not be understandable on its own I guess.
- OPeNDAP-style will work for any Coverage in the same way, but the client has to do more work to understand the form of the domain. But you have very fine control over what data are selected (you know exactly what to expect from the response and it can just be a range array).

letmaik commented 9 years ago

I agree that OPeNDAP-style is very useful when actually implementing clients and when the domain can be loaded upfront, especially thinking of grids with several time and/or vertical coordinates. With WCS, raw network data would always have to include the subsetted domain, yes, there is some overhead.

Your time series compression idea may actually be good for semi-periodic time series. So, you have a list of start/stop/period definitions, which is useful, when the period in the time series switched at some point. I think I could live with that.

I think that I favour the OPeNDAP way, especially because it works for any domain type, and it doesn't depend on CRS definitions. It pushes work onto the client, but they have to understand the domain and axes anyway, so this is not a big concern I think.

jonblower commented 9 years ago

Your time series compression idea may actually be good for semi-periodic time series. So, you have a list of start/stop/period definitions

Yes - in fact ncWMS can already do this and there is Java code in ncWMS/EDAL for inspecting a list of DateTimes and returning a set of start/stop/period definitions.

I agree that OPeNDAP-style is slightly favoured, and we should implement this first. I guess we should allow for implementing the WCS way too in future - they shouldn't be exclusive. (I guess WCS subsets coverages, OPeNDAP subsets ranges.)

letmaik commented 9 years ago

In WCS the domain is exactly defined by DescribeCoverage metadata (with the gml domain types). So, since you need that data anyway to figure out what the coverage exactly is and do the subset query, I think this fits with our API. A WCS wrapper would spit out Coverage objects and when you do loadDomain() it would load DescribeCoverage. Then, if you do loadRange with subsetting by index, it could somehow map the indices to coordinates (possibly with some buffer around the extents due to the issues you mentioned), send the query, and extract the exact data again. It may be a bit idealistic though, since WCS will generally not have easy formats to digest, except.... it could return CoverageJSON :p The other way around, subsetting the domain by coordinates would return a new Coverage object which fits WCS better since then you get a fresh domain based on that subset. For OPeNDAP this works as well. It's really just the WCS subset by index which is tricky.

I'm not sure whether range subsetting should return the subsetted domain as well. Implementations will get more complicated if you just get the subsetted range values and you would have to manage the context (subsetted domain) yourself. I have to think about all that a bit more.

letmaik commented 9 years ago

Some more thoughts... I think whether you subset by index or coordinate doesn't make a conceptual difference, it is always domain subsetting, not range subsetting. The range has as "axis" really just the parameters on which you could subset. Sure, you can extract some subset of the range values, but what does that mean? The link to the domain gets lost and has to be reestablished.

So, I think domain subsetting should always return a new subsetted Coverage object. In implementations you would then typically keep the original Coverage object, load its domain, and if the corresponding ranges would be too huge for direct consumption (multiplication of all .shape numbers > x), then you could request a domain subsetted Coverage object which you would keep in addition and which may represent the current subset that is displayed. So, for big grids this could be a single time/vertical step.

I propose to have subset() for domain subsetting, and keep loadRange(paramKey) for range subsetting.

letmaik commented 9 years ago

What about (within Coverage):

subsetByIndex(constraints)

If defined, allows to subset the coverage by domain indices. If this function is not defined, then this operation is not supported.

Parameters

constraints - An object which describes the subsetting constraints. Every property of it refers to an axis name as defined in Domain.names, and its value must either be an integer, an array of integers, or an object with start, stop, and optionally step (defaults to 1) properties whose values are integers. All integers must be non-negative, step must not be zero. A simple integer constrains the axis to the given index, an array to a list of indices, and a start/stop/step object to a range of indices: If step=1, this includes all indices starting at start and ending at stop (exclusive); if step>1, all indices start, start + step, ..., start + (q + r - 1) step where q and r are the quotient and remainder obtained by dividing stop - start by step.

Example

var subset = cov.subsetByIndex({t: 4, z: {start: 10, stop: 20}, x: [0,1,2] })
console.log(subset.bbox)

The identifiers of the axes should probably be exposed as well, even if they are known anyway by an implementation by knowing the domain type, but they provide context for humans and make it more obvious how the index-based subset operation is connected to the domain dimensions. This should go into the Domain object which already has shape, and could be names: ["t","z","y","x"] in the same order as shape. The coordinate-based one may instead rely on custom fields and is less rigorous in its definition for arbitrary domain types.

EDIT: subsetByIndex has to return a Promise of course, since it can be asynchronous...

letmaik commented 9 years ago

One thing to consider when returning Coverage objects for domain subsetting is that constructing the new domain may actually be a challenging operation. For example, suppose there is a regular grid domain type, or similarly for the time series with multiple periods, then this might change the domain type (regular grid becomes a rectilinear grid if there's no common spacing), or restructure domains (two periods in a time series could become different numbers of periods). The possible change of the domain type worries me a bit.

If you would attach the index subsetting directly to the range loading operation and just return the values, then the client would be responsible to do the domain reconstruction by itself, implicitly more or less.

jonblower commented 9 years ago

Makes sense to me. I think the change of domain type is a possible technical challenge but is conceptually fine. It's certainly nice to be able to work with complete Coverage objects as much as possible.

A common use case might be to extract a timeseries or vertical profile from a grid. How might this work, in terms of domain types? The result could be a PointSeries or Profile, or it could be a grid with a set of 1-length dimensions.

Maybe subsetByIndex returns a domain of the same type (i.e. a grid, even if it's a different kind of grid), but there could be another operation subsetToPointSeries or something like that (on appropriate coverages only), where it is more obvious that the domain type could change. Could be worth talking to @guygriffiths about what EDAL does here.

letmaik commented 9 years ago

Given our discussions over at https://github.com/Reading-eScience-Centre/coveragejson/issues/24 it is obvious that it depends on how you define your domain types (general Grid with horizontal part (that can be Curvilinear, regular etc.) vs specific grids for each subtype: RegularGrid, CurvilinearGrid) whether it is possible to get the same domain type back on subsetting. With our new general Grids, it would again be a Grid, but I think this should not be enforced in the API since it may be impossible sometimes. Instead, I propose to add a new method "castTo(domainType)" (maybe different name) which returns a new Coverage object with the domain having the requested type, or if not possible, throws an exception. With that, you don't have to add methods like "subsetToPointSeries" as well.

jonblower commented 9 years ago

Sounds reasonable, although it would be good to explore when you might subset a Grid to get something other than a grid. For example, you might extract a trajectory from a grid. What would happen in this case? Presumably you could not get a Grid back from such an operation? subsetToTrajectory might make a little more sense then, maybe.

letmaik commented 9 years ago

Well, how would you define the subset parameters for Grid -> Trajectory? Probably a list of coordinates to extract from the Grid. It seems to me this is a non-standard and more complex operation, and maybe out of scope for such a generic API. I'll have to think about it a little more. But something like subsetToTrajectory feels wrong as it includes a specific coverage type.

jonblower commented 9 years ago

I think the subset parameters for Grid -> Trajectory would be the domain of the trajectory. There is a valid use case for this - extracting data from a numerical model along the line of a ship track. It could be important for WP9. So I think it's something we should support, although I don't know what the best mechanism is.

letmaik commented 9 years ago

Ok, this type of subsetting is then based on actual coordinates of the other domain, and then the question of what you pick comes up again. For example, if your grid is denser than the trajectory, and the trajectory has bounds given for its positions, then if you say you want to get a trajectory out of the grid, which grid domain elements would it have? I think this is more complex than such a generic API should do. It is so special that I would outsource that to external libraries providing such functions. This API is the absolute core and shouldn't be too complex, otherwise it gets hard for library authors to support it. I know it can be an optional operation, still, I think it doesn't directly belong here. As they say, KISS.

jonblower commented 9 years ago

But you'll need functionality like this somewhere in order to do intercomparisons. The most common case is that the trajectory is made up of points (I can't think of a situation where it would have bounds) and then you would just sample the grid at these points.

Maybe the API could support a method like grid.extractPoints(points) that a higher-level library could use to extract points and assemble a trajectory coverage.

letmaik commented 8 years ago

I'll close this since basic subsetting (subsetByIndex, subsetByValue, and the latter for collections as well) is supported now. If we need advanced subsetting operations that should be part of this API then this should go into a new issue.

Reading-eScience-Centre / coverage-jsapi

Subsetting #3

subsetByIndex(constraints)

Parameters

Example