Closed letmaik closed 8 years ago
This would also make it easier to release CoverageJSON more quickly, and let CoverageCBOR (?) follow if there is a need for it and we've proven that it actually performs better in certain situations (see #41).
If you think it's better to separate them then this is fine with me, particularly since the real-world benefits of CBOR are not entirely clear yet. But how much difference do you think there will be between the versions? Would they not be very similar, apart from some differences in array encoding?
They would be very similar yes, and I wouldn't copy-paste everything, just describe the differences, which are currently: structural+data type (range encoding, e.g. offset/factor), or just data type (typed array encoding in domain without change of structure).
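To make the "structural + data type" difference concrete, here is a minimal sketch of an offset/factor range encoding. The field names `values`, `offset` and `factor`, and the convention `decoded = stored * factor + offset`, are assumptions for illustration and are not taken from the spec.

```javascript
// Hypothetical encoded range: small integers plus a linear transform.
// Assumed convention: decoded = stored * factor + offset.
const encodedRange = {
  values: new Int16Array([0, 15, 127, -40]),
  offset: 273.15,
  factor: 0.1
};

// Decode into a Float64Array so the result stays a typed array.
function decodeRange(range) {
  const out = new Float64Array(range.values.length);
  for (let i = 0; i < out.length; i++) {
    out[i] = range.values[i] * range.factor + range.offset;
  }
  return out;
}

// Encode by rounding to the nearest representable step; this is where
// the encoding becomes lossy if the data is finer than `factor`.
function encodeRange(values, offset, factor) {
  const stored = new Int16Array(values.length);
  for (let i = 0; i < values.length; i++) {
    stored[i] = Math.round((values[i] - offset) / factor);
  }
  return { values: stored, offset, factor };
}

const decoded = decodeRange(encodedRange);
const roundTrip = decodeRange(encodeRange([273.15, 274.68], 273.15, 0.1));
```

In this sketch the value 274.68 comes back as roughly 274.65, which is the kind of precision loss an offset/factor encoding can introduce.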
So the main changes are in the range, and perhaps in the domain (e.g. curvilinear grids that benefit from a more efficient encoding of coordinate values)?
Do you think we could restrict this just to the range values? This would simplify and isolate the changes. It's not clear to me that the domain encoding will benefit greatly from CBOR.
Another example would be if the domain contains big explicit numeric axes, like rectilinear ones. Also, the bounds currently have to be given explicitly if they are given (no regular encoding).
One of the requirements of a CoverageCBOR format would be that arrays can be encoded as typed arrays (an extension of CBOR not widely supported yet). With this requirement, I don't see a reason why the use of typed arrays should be restricted to a particular area. The only one I can think of is that typed arrays in JS don't have all the functions of normal arrays yet and browser support varies, and this could lead to subtle bugs/exceptions. That's probably a good reason for restriction though... OK, I agree, since the main savings can be made in the range data, we should focus on that first.
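A few of the differences between typed arrays and normal arrays that can cause such subtle bugs can be sketched as follows. This is only an illustration of standard JavaScript behaviour, not of anything CoverageJSON-specific.

```javascript
const plain = [1.5, 2.5];
const typed = new Float32Array([1.5, 2.5]);

// Array.isArray is false for typed arrays, so type checks can misfire.
const looksLikeArray = Array.isArray(typed);

// map() on a normal array can change the element type ...
const plainLabels = plain.map(v => 'v' + v);
// ... but on a typed array it returns the same typed array type, so the
// strings are coerced to numbers and silently become NaN.
const typedLabels = typed.map(v => 'v' + v);

// JSON.stringify serialises a typed array as an object with index keys.
const asJson = JSON.stringify(typed);

// Typed arrays have a fixed length: there is no push().
const hasPush = typeof typed.push === 'function';

// Converting explicitly restores normal array behaviour.
const safe = Array.from(typed);
```

A client that accepts both kinds of arrays would probably want to normalise them at one place, for example with `Array.from`, rather than rely on them behaving identically.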
Yes, although we could benefit from some efficiency in encoding the domain, I feel that these benefits would probably be quite marginal (even for bounds, these are only 1D axes). The largest "domain" objects are likely to be the curvilinear coordinates, but we are deferring this for now. I would suggest that we focus on the range encoding, where the range is encoded in a separate document.
Why do you say "in a separate document"? I don't think we should differentiate whether it's embedded or not.
On 28.01.2016 at 09:16, Jon Blower wrote:
Yes, although we could benefit from some efficiency in encoding the domain, I feel that these benefits would probably be quite marginal (even for bounds, these are only 1D axes). The largest "domain" objects are likely to be the curvilinear coordinates, but we are deferring this for now. I would suggest that we focus on the range encoding, where the range is encoded in a separate document.
— Reply to this email directly or view it on GitHub https://github.com/Reading-eScience-Centre/coveragejson/issues/44#issuecomment-176077479.
Maybe I haven't understood CBOR properly, but I thought that a document would either be encoded entirely in text-JSON or binary-JSON. So a text-JSON document could hold the domain, parameters etc but contain a link to a binary-JSON document holding the range. I didn't think it was possible to embed a binary-JSON object in a text-JSON document but I may be wrong.
No you're right. But sometimes it makes sense to have the complete coverage including domain in a single document, which can also be at the discretion of the server because it may have a rule to include small ranges directly to save the client from launching another request.
The more important thing here though is that you mention mixing JSON and CBOR. Although that's possible, I think if JSON is offered, then any links to ranges must also be offered in at least JSON. Whether a CBOR document is accessible at the same URL via content-negotiation doesn't matter. And I think this is a crucial discussion point when splitting it into two formats (JSON and CBOR) since they should be independent from each other in my opinion.
On 28.01.2016 at 09:24, Jon Blower wrote:
Maybe I haven't understood CBOR properly, but I thought that a document would either be encoded entirely in text-JSON or binary-JSON. So a text-JSON document could hold the domain, parameters etc but contain a link to a binary-JSON document holding the range. I didn't think it was possible to embed a binary-JSON object in a text-JSON document but I may be wrong.
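Right - but if the ranges are small and it makes sense to encode the whole coverage in a single document, then there's no real need for CBOR, right? I would think the most useful cases are:
1. Whole coverage offered as text-JSON document
2. Coverage (without ranges) offered as text-JSON, links to range objects also offered as text-JSON
3. Coverage (without ranges) offered as text-JSON, links to range objects offered as CBOR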
As you say, conneg can distinguish between 2 and 3. I don't think it's particularly useful to offer a whole coverage as CBOR, although it would be possible of course.
Sure, if the coverages are small, then there's no reason that the data provider offers CBOR at all.
Hm... I went back and forth while writing this comment; it's not that straightforward actually. There are different things to consider.
Possible solutions: A) The easy way would be to force JSON throughout if the media type is application/prs.coverage+json, including ranges. And the same for CBOR. Then each format is independent, the client knows what to expect, and there can be implementations focused on each individually.
B) CoverageJSON does not restrict the range at a referenced URL to any particular format, but instead a mapping from media type to URL has to be explicitly included so that the client can explicitly select between formats. This would allow custom multi-dimensional array encodings, e.g. if the range is a 2D categorical grid with less than 256 categories, you could use a monochrome PNG where alpha=nodata (and this would then need a custom media type and not just image/png).
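As a rough sketch of the PNG idea in option B, the following packs a categorical grid into interleaved gray and alpha samples, where alpha 0 marks missing data. It only builds the raw pixel buffer; turning that into an actual PNG file would need an encoder, which is not shown. The function name `toGrayAlpha` and the use of `null` as the no-data marker are assumptions.

```javascript
// Pack a 2D categorical grid (flattened, row-major) into interleaved
// 8-bit gray + alpha samples, the pixel layout of a grayscale-with-alpha image.
// `nodata` is a hypothetical marker value for missing cells.
function toGrayAlpha(categories, nodata) {
  const pixels = new Uint8Array(categories.length * 2);
  for (let i = 0; i < categories.length; i++) {
    const c = categories[i];
    if (c === nodata) {
      pixels[2 * i] = 0;       // gray value is irrelevant here
      pixels[2 * i + 1] = 0;   // alpha 0 marks "no data"
    } else {
      if (!Number.isInteger(c) || c < 0 || c > 255) {
        throw new RangeError('category ' + c + ' does not fit into 8 bits');
      }
      pixels[2 * i] = c;       // category index stored as gray level
      pixels[2 * i + 1] = 255; // fully opaque marks valid data
    }
  }
  return pixels;
}

const grid = [0, 3, null, 255];
const pixels = toGrayAlpha(grid, null);
```

Because the meaning of gray level and alpha is a convention layered on top of the image, a client could not infer it from `image/png` alone, which is why a custom media type would be needed.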
Option B would mean a restructuring of how ranges are linked. On the upside it may be more compliant with the JSON vision of the OGC, but I haven't fully understood that yet. For example, what WCS's DescribeCoverage is supposed to return, and if this is identical to the first part of a multipart GetCoverage request. And whether DescribeCoverage includes links to the range (I think not, which may be a problem).
On 28/01/2016 09:48, Jon Blower wrote:
Right - but if the ranges are small and it makes sense to encode the whole coverage in a single document, then there's no real need for CBOR, right? I would think the most useful cases are:
- Whole coverage offered as text-JSON document
- Coverage (without ranges) offered as text-JSON, links to range objects also offered as text-JSON
- Coverage (without ranges) offered as text-JSON, links to range objects offered as CBOR
As you say, conneg can distinguish between 2 and 3. I don't think it's particularly useful to offer a whole coverage as CBOR, although it would be possible of course.
I don't think we need a dependency on CBOR in the CovJSON spec. We could say that offering all data as text-JSON is the minimum that all servers should offer. Offering range data as other formats is optional.
So, how does a provider advertise the formats that ranges can be served in? I guess your Option B is the solution - if we use conneg then a client would have to interrogate the range endpoint(s) to find the supported formats.
There's a potential issue (also faced by WCS and other related things) that some formats naturally encode more than the range. A NetCDF file, for instance, encodes the whole coverage. We could:
We could even consider restricting the range encoding to a fixed list, for which we can easily specify the behaviour. (e.g. a PNG file might not make sense for the range of a timeseries coverage, but CBOR or JSON would always work).
Maybe the range formats need to be inherently 1D?
Some more thoughts on WCS / OGC model:
A GeoTIFF can be used as rangeset in WCS, where channel order is defined by the order of the rangeType elements (our parameters). This is currently incompatible with CoverageJSON for two reasons:
So, just saying, if CoverageJSON should maybe become the JSON serialization for DescribeCoverage and the first part of a multi-part GetCoverage, then those things have to be addressed.
Good point. Is it a big task to enable multi-parameter ranges in CovJSON?
Well, first we would have to include some kind of optional "order": 0 (1, 2..) property within a parameter if one of the offered range formats depends on the order. Then we would have to add a new root level CovJSON document type "RangeSet". And the reason I didn't want the latter initially is that you then have random local keys (the parameter keys) in your "standalone" document that don't have any meaning by themselves. But thinking about it... it's just what it is, a collection of arrays indexed by some keys without meaning.
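A client could use such an "order" property roughly like this to map band indices of an order-dependent format to parameter keys. The parameter objects below are stripped down to the one property under discussion, and the helper name `bandOrder` is hypothetical.

```javascript
// Hypothetical parameters object; only the optional "order" property
// comes from the discussion, everything else is illustrative.
const parameters = {
  wind_speed: { order: 1 },
  temperature: { order: 0 },
  humidity: {} // no "order": not part of any order-dependent format
};

// Return parameter keys sorted by "order", skipping those without one.
function bandOrder(params) {
  return Object.keys(params)
    .filter(key => Number.isInteger(params[key].order))
    .sort((a, b) => params[a].order - params[b].order);
}

// Band i of an order-dependent range set format maps to bands[i].
const bands = bandOrder(parameters);
```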
A harder issue is at what level you are describing the different range formats. In WCS, it's at the rangeset level. So we would not describe formats for individual ranges, but for the whole set of ranges. Let's write some JSON to make it more concrete:
"rangeSet": {
  "application/netcdf": "http://.../data.nc",
  "application/prs.coverage+json": {
    "type": "RangeSet",
    "temperature": "http://../temp.covjson" // or directly embedded
  },
  "application/prs.coverage+cbor": "http://../rangeset.covcbor"
}
I guess for netCDF ranges, the parameter key would also have to match the variable name.
Fixed some of the JSON, should be ok now. If the media type is a JSON variant, then it can be embedded, otherwise just a URL.
But there is still a problem with this. The structure above would not allow loading a single CBOR range without first loading the rangeset, when starting from a CovJSON coverage document.
And you're right about the issue that some formats encode more than the range. If you do a WCS GetCoverage with format CovCBOR, you would expect the complete coverage with domain and parameters right? And probably just a single rangeset of type CovCBOR? However, for DescribeCoverage, you would get a CovJSON document, and I'm not sure if this should contain the rangeSets structure above, and if it did, then from a WCS point of view, a range set is what you get with GetCoverage, so this would point to a full CovCBOR coverage and not just the rangeset. And then we're getting into a loop.
I think we have been looking at it from a slightly wrong angle.
All the coverage formats that I know do not allow linking to remote parts. They are standalone and have a certain degree of domain and parameter metadata. Which means there is also no mix of formats within one format. CoverageJSON is more flexible and does allow linking to remote ranges. In addition to that, we now wanted to introduce even more flexibility by allowing different formats for the remote ranges.
To put that into perspective, let's get back to WCS's DescribeCoverage. This is really just a description of the coverage domain, the parameters, and the available coverage formats. It does not have actual links to concrete formats of the coverage, which makes sense for WCS as it has a fixed query API. And especially, it does not have links to range sets for a given format.
I think we cannot use CovJSON as-is for DescribeCoverage, without changing WCS itself. However, we could derive a coverage description format based on CovJSON, which basically just includes the domain and parameters and in addition information about the available formats. It would not have "type": "Coverage" but "type": "CoverageDescription". Even though it is not directly CovJSON, reusing elements of it would provide some consistency if CovJSON were to become an actual WCS coverage format.
I think the WCS model makes sense from a semantic point of view. DescribeCoverage is just a format-independent abstract coverage description. And the coverage formats offered in it and retrievable with GetCoverage are coverages, not just range sets. So it makes sense to serve netCDF-CF, GeoTIFF etc. instead of TIFF without geo-referencing data.
Coming back to CovJSON, I think it would be wrong to serve a coverage format as a range set, such as a netCDF-CF file. Why? Because it would mean that we allow coverage formats and then we could link to a CovJSON coverage, which obviously would not make sense. So, you are right above that this should be disallowed and only actual range formats (like a CovJSON range document) can be linked to. I see the issue though, do you just strip metadata from a netCDF file or make the range variable one-dimensional? I think the main problem is that range-only formats are not defined, and WCS gets around that by not working on that level. OPeNDAP4 may be suitable as a range-only format.
Remote range sets with WCS (e.g. linked from a CovJSON coverage) are possible in theory but I'm not sure how much sense that makes, since you would need a separate server that serves those ranges and that's not really realistic for common WCS setups. That in turn would mean that range sets have to be embedded within the coverage, and for CovJSON this would mean a CovJSON range set and nothing else (no other formats). That could be a WCS-specific restriction/profile for CovJSON.
There is a big exception to all this which is the WCS multipart encoding, where the GetCoverage response would consist of a GML-encoded coverage where the range set contains an internal link to the second multipart part which is the range set. And that rangeset part is in a rangeset-only format, e.g. a plain tiff. And this is what Joan probably had in mind somehow when creating his first draft of CovJSON with the URLs for the ranges. However, since WCS specifically restricts this to exactly 2 parts (coverage without rangeset + rangeset) I don't know how you would be able to link to specific range files. I think the multipart encoding of WCS/GML would have to be extended to make that work.
Even though we won't use multipart in our projects, it looks like CovJSON is compatible with it (after introducing "RangeSet" as a root level document). The idea would be to have (next to GML) a CovJSON coverage encoding for the first part of a multipart, which then links to the other multipart part which is a CovJSON range set, and it would be important that this is not restricted to CovJSON range sets, but any format. So this is one more factor to consider when thinking about the "rangeSet" JSON structure.
WCS would only allow a single range set format in a response; a typical URL would look like ?request=GetCoverage&mediaType=multipart/related&format=image/tiff where, if the mediaType parameter exists, it must be multipart/related (and this currently forces a GML encoding for the coverage) and the format parameter determines the range set format. To support JSON, something would have to be changed to indicate that you want JSON and not GML. But that's another issue.
Conclusions...
request=GetCoverage&format=application/prs.coverage+json request.
Remaining problems:
"rangeSet" structure. I think the issue here is that making this possible would require defining yet another type of format, next to coverage and range set formats, namely range formats. We have that in CovJSON/CovCBOR, but range formats as such are not defined or mentioned in the OGC world.
Thanks, this is interesting. Regarding multipart WCS responses, I'm not sure this has been fully thought through in WCS yet. There have been attempts at allowing NetCDF and OPeNDAP as range formats, but these suffer from the above problem that they are really coverage formats.
My points/questions would be:
I don't understand the second sentence of point 2: "to allow single ranges or multiple ranges in the same rangeset document". Do you mean multiple formats of the same range? Because a rangeset can have multiple ranges anyway (in the same format). Also, how is this connected to DescribeCoverage? DescribeCoverage does not link to a range set, it just lists the available coverage formats without any links.
To point 3, no idea, I just checked WCS 2.0 and couldn't even find how formats are listed for a coverage at all. It only mentions a native format in that structure within DescribeCoverage:
<wcs:ServiceParameters>
  <wcs:CoverageSubtype>GridCoverage</wcs:CoverageSubtype>
  <wcs:nativeFormat>image/tiff</wcs:nativeFormat>
</wcs:ServiceParameters>
Later it says:
The encoding format in which the coverage will be returned is specified by the combination of format and mediaType parameter. Admissible values (i.e, formats supported) are those listed in the server’s Capabilities document. Default is the coverage’s Native Format.
So that means whatever coverages you offer within one WCS endpoint must be offered in all formats you list in the GetCapabilities document, and you would have to read that document to know them. So I guess I was wrong partly, maybe the formats got listed in WCS 1.x, since I saw that in examples around the web.
A WCS 2 GetCapabilities response has this:
<wcs:ServiceMetadata>
  <wcs:formatSupported>image/tiff</wcs:formatSupported>
</wcs:ServiceMetadata>
There doesn't seem to be any distinction between coverage and range set formats. I then would agree that the multipart thing is not fully thought through.
I think this is just too much pain to handle for us. I would say...
So let's get back to the example:
"rangeSet": {
  "application/netcdf": "http://.../data.nc",
  "application/prs.coverage+json": {
    "type": "RangeSet",
    "temperature": "http://../temp.covjson" // or directly embedded
  },
  "application/prs.coverage+cbor": "http://../rangeset.covcbor"
}
To solve the issue that you want to have access to single ranges for non-JSON formats, let's change what is allowed as value of a media type property and add an explicit "ranges" property to separate the keys more:
"rangeSet": {
  "application/netcdf": "http://.../data.nc",
  "application/prs.coverage+json": {
    "type": "RangeSet",
    "ranges": {
      "temperature": "http://../temp.covjson" // or directly embedded
    }
  },
  "application/prs.coverage+cbor": {
    "type": "RangeSet",
    "ranges": {
      "temperature": "http://../temp.covcbor" // cannot be embedded
    }
  }
}
Everything under a given media type must be in that format, no mixing. Also, note that when you decide to expose individual ranges (as for CovJSON and CovCBOR above) then you can't give a URL for the complete range set. If there is a need for that we can still add it later, but I think it's not necessary, since it would only be an optimization of the number of server requests, and with HTTP/2 (see also SPDY) this problem doesn't exist anymore anyway.
And you're right, for this to work we would have to define what a range set and a range look like in existing formats like netCDF, image formats etc. You mentioned an approach of a video encoding as range set format where you had, I think, tiles inside the video. For those kinds of advanced encodings you would need additional information about the layout that is not embedded and that we couldn't define statically once for such a format (which would just be video/mp4). Where would those parameters go? That is important when defining the above JSON structure.
I agree with your proposals (starting "I would say..." above). I also think we should remove CBOR from the CovJSON spec for now, and maybe allude to an "unsolved issue" around supported range formats.
I'm not sure whether using the MIME type as a key in the rangeSet is a good idea. It's quite an awkward string to use as a key and contains full stops, making it not suitable as a key in a JavaScript hashmap. Maybe it should look like this, at the expense of an extra parameter:
"rangeSet": {
  "netcdf": { // arbitrary string, intended for easy reading in code
    "type": "RangeSet",
    "mimeType": "application/netcdf",
    "location": "http://.../data.nc"
  },
  ...
}
or the list of formats could be an array?
I removed CBOR from the spec now and copied relevant parts into cbor.md.
Honestly, I don't see the point of having an arbitrary string as key there just so that one can write rangeSet.netcdf instead of rangeSet['application/netcdf']. The local key made sense for parameters since this is reused for associating a range. If your client is moderately clever then there will be an abstraction layer anyway around the range fetching which would pick the most suitable format automatically. Of course it doesn't help that we don't have a registered media type yet. Maybe we should do that first? Should be quite easy in the prs. tree.
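Such an abstraction layer could be sketched roughly as follows. It assumes the media-type-keyed "rangeSet" structure from earlier in this thread; the `example.org` URLs, the `supported` preference list and the helper name `pickFormat` are made up for illustration.

```javascript
// Range set keyed by media type, as in the examples in this thread.
// The URLs are placeholders.
const rangeSet = {
  'application/netcdf': 'http://example.org/data.nc',
  'application/prs.coverage+json': {
    type: 'RangeSet',
    ranges: { temperature: 'http://example.org/temp.covjson' }
  }
};

// Media types this hypothetical client can decode, most preferred first.
const supported = [
  'application/prs.coverage+cbor',
  'application/prs.coverage+json'
];

// Pick the first supported media type that the range set offers.
function pickFormat(rangeSet, supported) {
  for (const mediaType of supported) {
    // Bracket notation works for any string key, including "." and "/".
    if (Object.prototype.hasOwnProperty.call(rangeSet, mediaType)) {
      return { mediaType, entry: rangeSet[mediaType] };
    }
  }
  return null; // nothing usable offered
}

const choice = pickFormat(rangeSet, supported);
```

Application code would then only deal with `choice.entry` and never spell out a media type key itself.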
Some more WCS details: For GeoTIFF, the WCS encoding profile adds several query parameters for GetCoverage that influence how the TIFF is encoded (e.g. compression, JPEG quality (if compression=jpeg), and tiling). This is again at the coverage level. But since we may want to support alternative range set formats, this is relevant here as well. A typical use case would be that the client first fetches a low-quality JPEG-encoded TIFF file, and then on demand fetches the non-lossy variant.
So are we really targeting such use cases at the range set level? One problem I can see is that when you link to a coverage (via URI) as being an input to an operation like intercomparison (e.g. for provenance purposes), then you couldn't link to a specific range set encoding, but this is relevant in the case of lossy encodings, and in some cases a high-quality lossy encoding may be enough for a certain operation, but you still want to know that. On the other hand, a coverage URI should represent the coverage itself, meaning the domain, parameters, range values independent of format. So I would say that a lossy version of that is in effect a different logical coverage, derived from the original one, with a different URI. And this view conflicts a bit with just putting random formats into the range set construct. And this is equal to allowing CBOR with an offset/factor encoding (if this is lossy).
Not sure yet how to solve this.
How about we go back to the roots here...
Originally, we allowed a full coverage to be encoded as CBOR, not just the ranges. Why not do that? I see absolutely no problem in having the domain and parameters in CBOR as well, it is just another self-describing coverage format. From an implementation perspective, there is nearly no difference, since json = CBOR.decode(cbor) more or less. The problem of how to advertise different coverage formats is best handled at the HTTP level: Link headers, as in:
GET /cov
Accept: text/html, */*

HTTP/1.1 303 See Other
Location: /cov.html

GET /cov.html
Accept: text/html, */*

HTTP/1.1 200 OK
Content-type: text/html
Link: </cov.covjson>; rel="alternate"; type="application/prs.coverage+json"
Link: </cov.nc>; rel="alternate"; type="application/netcdf"

<html>..</html>
Similarly, when you start with Accept: application/prs.coverage+json you would get the Link headers pointing to alternative coverage formats.
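On the client side, reading such Link headers could look roughly like this. The parser below is a deliberately small sketch that only handles the simple quoted form shown in the example; it is not a complete RFC 8288 parser, and the function names are hypothetical.

```javascript
// Parse simple Link header values of the form
//   </cov.nc>; rel="alternate"; type="application/netcdf"
// Only quoted parameters without embedded quotes are handled.
function parseLink(value) {
  const match = /^\s*<([^>]*)>(.*)$/.exec(value);
  if (!match) return null;
  const link = { href: match[1] };
  const paramPattern = /;\s*([a-zA-Z]+)="([^"]*)"/g;
  let param;
  while ((param = paramPattern.exec(match[2])) !== null) {
    link[param[1]] = param[2];
  }
  return link;
}

// Map media type -> URL for all rel="alternate" links.
function alternates(linkHeaders) {
  const byType = {};
  for (const header of linkHeaders) {
    const link = parseLink(header);
    if (link && link.rel === 'alternate' && link.type) {
      byType[link.type] = link.href;
    }
  }
  return byType;
}

const formats = alternates([
  '</cov.covjson>; rel="alternate"; type="application/prs.coverage+json"',
  '</cov.nc>; rel="alternate"; type="application/netcdf"'
]);
```

With this, a client can look up `formats['application/netcdf']` to find the netCDF representation without the coverage document itself having to list formats.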
And I would really separate lossy from non-lossy versions here. The lossy version has to be linked somehow from the non-lossy version but it does not represent the same data.
Doing it like that would eliminate many pain points we are seeing. We could simply reuse OGC definitions of coverage formats and would not have to worry about defining our own range set formats (apart from within JSON of course).
Just in case we choose the flexible range formats variant, another suggestion, inspired by the Link type of ActivityStreams:
"rangeSet": [{
  "type": "RangeSet",
  "mediaType": "application/netcdf",
  // either or both "href" or "ranges" is required
  "href": "http://.../data.nc",
  "ranges": {
    "temperature": "http://../temp.nc",
    "wind_speed": "http://../wind.nc"
  }
}, {
  // fully embedded CovJSON
  "type": "RangeSet",
  "mediaType": "application/prs.coverage+json",
  "ranges": {
    "temperature": { ... }, // can also be linked with URL
    "wind_speed": { ... }
  }
}, {
  // external covcbor without href
  "type": "RangeSet",
  "mediaType": "application/prs.coverage+cbor",
  // could also have "href" but is not offered, just single ranges:
  "ranges": {
    "temperature": "http://../temp.covcbor",
    "wind_speed": "http://../wind.covcbor"
  }
}]
So, a RangeSet then is a special type of Link that allows embedding range data or linking to specific ranges in a range set.
For each media type we'd have to define the specific encoding (like WCS format encoding profiles).
Looks promising. An application might like an easy way to distinguish between fully-embedded CovJSON and ranges that are linked via URL. I guess they could do a test on cov.rangeSet[i].ranges.temperature to see if this is a string or an object, but that feels a bit hacky.
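One way to keep that string-or-object test from spreading through application code is to wrap it once. The sketch below assumes the array-style "rangeSet" proposal above; the embedded range's `values` field, the `example.org` URL and both helper names are illustrative only.

```javascript
// Array-style range set, following the "mediaType"/"ranges" proposal above.
const cov = {
  rangeSet: [{
    type: 'RangeSet',
    mediaType: 'application/prs.coverage+json',
    ranges: {
      temperature: { values: [280.1, 281.4] },       // embedded (shape is illustrative)
      wind_speed: 'http://example.org/wind.covjson'  // linked (placeholder URL)
    }
  }]
};

// Classify a range reference instead of repeating typeof checks everywhere.
function describeRange(ref) {
  if (typeof ref === 'string') return { kind: 'link', url: ref };
  if (ref !== null && typeof ref === 'object') return { kind: 'embedded', range: ref };
  throw new TypeError('range must be a URL string or an object');
}

// Find the range set entry for a media type, then classify one range.
function lookupRange(cov, mediaType, key) {
  const entry = cov.rangeSet.find(rs => rs.mediaType === mediaType);
  if (!entry || !entry.ranges || !(key in entry.ranges)) return null;
  return describeRange(entry.ranges[key]);
}

const t = lookupRange(cov, 'application/prs.coverage+json', 'temperature');
const w = lookupRange(cov, 'application/prs.coverage+json', 'wind_speed');
```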
Also, could the rangeSet be an object (keyed by some semi-human-readable strings) rather than an array?
Some more reflections...
OK, I agree. I don't think we want to encode ranges as NetCDF or JPEG, or things like that. And I don't think we need range set format flexibility now - maybe we'll find it's useful later, but I agree that it introduces too many complexities.
So are we saying that a CovJSON document (and ranges that are linked from it) must be entirely text-JSON or CBOR, but no mixes? I.e. we couldn't link a CBOR range from a text-JSON document? I think that's OK.
Or we could say that the URL in a linked range will always return the same format as the original document. The server may implement content negotiation to return a different format (e.g. CBOR), but that's for the client to negotiate.
So are we saying that a CovJSON document (and ranges that are linked from it) must be entirely text-JSON or CBOR, but no mixes? I.e. we couldn't link a CBOR range from a text-JSON document? I think that's OK.
I would say so, yes. More precisely, you have to link to a resource which offers a CovJSON representation. Other resource representations (e.g. HTML, or CovCBOR) may be offered as well at the given URL, but these are explicitly ignored by the CovJSON spec. If we define a separate format based on CovJSON called CovCBOR, then this would similarly say that the linked range resource has to offer a CovCBOR representation. Whether and how exactly those two formats are mixed via content negotiation I think is out of scope here. But since it is actually easy to support once conneg is available in the server I wouldn't worry too much. The main thing is: If you start with JSON, you are guaranteed to get JSON. Even if we just define CBOR for the ranges themselves (I'm not in favour of that), then effectively we require conneg (since there's just a single range URL) if people want to use that from CovJSON in addition (because their clients are smarter and know CBOR), and I think that's fine. So... let's stop worrying, and close this... I don't think we're blocking any roads here for the future; there are enough ways to extend this later.
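From the client's point of view, the negotiation described here could be sketched as below. The media type strings are the ones used in this thread (not registered types), and the q-value of 0.9 and the helper names are assumptions.

```javascript
const COVJSON = 'application/prs.coverage+json';
const COVCBOR = 'application/prs.coverage+cbor';

// Build an Accept header value. A client that only knows CovJSON asks for
// that alone; a CBOR-aware client prefers CBOR and falls back to JSON.
function acceptHeader(understandsCbor) {
  return understandsCbor ? COVCBOR + ', ' + COVJSON + ';q=0.9' : COVJSON;
}

// Request options for fetching a linked range (usable with fetch()).
function rangeRequest(understandsCbor) {
  return { method: 'GET', headers: { Accept: acceptHeader(understandsCbor) } };
}

const plainClient = rangeRequest(false);
const smartClient = rangeRequest(true);
```

A CBOR-aware client would pass `rangeRequest(true)` to `fetch` together with the range URL and then inspect the response's Content-Type to decide how to decode the body, while a plain CovJSON client keeps getting JSON from the same URL.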
Resolution:
Ok?
Yes, agreed.
I have the feeling that it's not the best idea to have the JSON and CBOR variants merged into a single specification and under the same name. It not only creates confusion but also bloats it up unnecessarily. Any opinions?