API: add support for querying combinations of data sets

janusw commented 3 years ago

Currently the opentopodata API can only query elevation data from a single, specific data set.

It would be nice if there was a way to query elevation data from a combination of data sets. E.g. a user might be interested in saying:

* "Give me the elevation data for a particular point. I don't care about the data set, just give me the 'best' one that you have for this location."
* "Give me the elevation data for a particular point. I'd prefer to get it from data set A, but if this one does not have it, data set B is also fine. If all fails, data set C is ok as well."

For the first case, the API could just be something like /v1/best-data?locations=lat,lon. Then it would be up to the server to determine which data set is most suitable. A priority ordering of the data sets could be given by the ordering in the config.yaml file. The data sets with the highest quality / resolution would come first, and the later ones would only be used if the highest-priority data sets return a null/nodata value.

For the second case, the API call could be /v1/any-data?locations=lat,lon&data=A,B,C. Here the data priority is not given by the server setup, but specified by the user. This is more dynamic, and possibly more performant than the first option, but less convenient.

One could also unify these two cases together with the current single-data queries in a new API v2, where the data set is given as an argument, e.g. /v2/elevation?locations=lat,lon&data=something. That would include:

* single-data queries (data=A), corresponding to the status quo (/v1/A?locations=...)
* multi-data queries (data=A,B,C), where the desired data sets are given in order of priority (as in option 2 above)
* queries without specifying a data set, where the server would choose the most appropriate one (as in option 1 above)

As an alternative, multi-data queries could be interpreted to return the elevation values from all three data sets at once, but I think returning only one elevation value from a priority list of data sets is the more important use case.
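To make the unified proposal concrete, here is a minimal Python sketch of how a server might resolve the `data` argument and pick the first non-null value. Everything in it is hypothetical: `CONFIGURED_DATASETS` stands in for the ordering in config.yaml, and `lookups` is a made-up mapping from data set name to a function that returns an elevation or `None`. It is not how opentopodata works internally.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for the data set ordering in config.yaml
# (highest priority first).
CONFIGURED_DATASETS = ["A", "B", "C"]

# Hypothetical lookup type: (lat, lon) -> elevation, or None for nodata.
Lookup = Callable[[float, float], Optional[float]]


def resolve_datasets(data_param: Optional[str]) -> List[str]:
    """Turn the `data` query argument into an ordered list of data set names."""
    if not data_param:
        # No data set given: use the server-defined priority order.
        return list(CONFIGURED_DATASETS)
    names = [name.strip() for name in data_param.split(",") if name.strip()]
    unknown = [name for name in names if name not in CONFIGURED_DATASETS]
    if unknown:
        raise ValueError("Unknown data set(s): " + ", ".join(unknown))
    return names


def first_valid_elevation(
    lookups: Dict[str, Lookup], names: List[str], lat: float, lon: float
) -> Tuple[Optional[str], Optional[float]]:
    """Return (data set, elevation) for the first data set with a non-null value."""
    for name in names:
        elevation = lookups[name](lat, lon)
        if elevation is not None:
            return name, elevation
    return None, None
```

With this shape, `data=A` reproduces the single-data query, `data=A,B,C` is the user-defined priority list, and omitting `data` falls back to the server's ordering.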

ajnisbet commented 3 years ago

This is a nice use case, and I really like the API of /v1/nzdem,srtm30m to specify multiple datasets.

There are a couple of reasons why I haven't added this to Open Topo Data; they both boil down to "it's difficult to automate accurately":

* Different datasets have different vertical datums. Most datasets don't include the specific datum used in the raster files, so you get jumps transitioning between datasets. This is particularly bad for the use case of layering hi-res urban lidar data over a base dataset, where the hi-res coverage is often patchy.
* Datasets handle NODATA and 0-elevation differently. For example, SRTM used 0-elevation to denote water, so you wouldn't be able to layer SRTM over a bathymetry dataset.

So there's a few ways to handle this:

* Build a new merged dataset using tools like `gdalwarp` and `gdal_calc` to manually handle differences in datum, resolution, projection, and NODATA handling for each dataset.
* Build a VRT, which supports some basic transformation options.
* Just use the Mapzen dataset, which is a careful combination of many different rasters.
* Handle it on the client: make a request for the first dataset, if it's `null` (or maybe 0) request the next one.
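The client-side option could look roughly like the Python sketch below. It is only an illustration: `fetch` is a hypothetical callable that performs one request against a single dataset and returns the elevation or `None`, and `zero_is_missing` is an invented parameter for naming the datasets whose 0 values should be treated as "no data" (as with SRTM over water).

```python
from typing import Callable, FrozenSet, Iterable, Optional, Tuple

# Hypothetical fetcher: (dataset, lat, lon) -> elevation, or None if null.
Fetcher = Callable[[str, float, float], Optional[float]]


def elevation_with_fallback(
    fetch: Fetcher,
    datasets: Iterable[str],
    lat: float,
    lon: float,
    zero_is_missing: FrozenSet[str] = frozenset(),
) -> Optional[Tuple[str, float]]:
    """Try each dataset in priority order and return the first usable value."""
    for dataset in datasets:
        value = fetch(dataset, lat, lon)  # one request per dataset tried
        if value is None:
            continue
        if value == 0 and dataset in zero_is_missing:
            # This dataset uses 0 as a placeholder, so keep looking.
            continue
        return dataset, value
    return None
```

Each dataset tried costs a separate round trip, so the worst case is one request per dataset in the list.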

janusw commented 3 years ago

> This is a nice use case, and I really like the API of /v1/nzdem,srtm30m to specify multiple datasets.
>
> There's a couple of reasons why I haven't added this to Open Topo Data, they both boil down to "it's difficult to automate accurately":

Well, one can certainly get into trouble when combining datasets with differing datums etc., so the first two options I proposed are maybe not such a good idea in general (at least they are not 'bulletproof' in this respect).

But I still think there would be some merit in being able to query multiple datasets at once ...

> So there's a few ways to handle this:
>
> * Build a new merged dataset using tools like `gdalwarp` and `gdal_calc` to manually handle differences in datum, resolution, projection, and NODATA handling for each dataset.
>
> * Build a VRT, which supports some basic transformation options.

Those two sound reasonable. I'll have to check if they are feasible for my use case.

> * Just use the Mapzen dataset, which is a careful combination of many different rasters.

This is not really an option for me, since I want to use data that is not included in Mapzen.

> * Handle it on the client: make a request for the first dataset, if it's `null` (or maybe 0) request the next one.

I actually considered this option, but the problem here is that each request comes with quite some network latency, so this will decrease the performance significantly. Having a way to query both datasets at once would really help here.

janusw commented 3 years ago

> As an alternative, multi-data queries could be interpreted to return the elevation values from all three data sets at once,

@ajnisbet: It seems you have implemented some sort of multi-dataset support in a960404718b1ac170c35c84051be032bc11259af, which at first glance looks a lot like the alternative mentioned above. Could you please comment on that?

ajnisbet commented 3 years ago

Yeah it's now my most requested feature! It's a work in progress.

There's 2 things I want to check:

1. whether in real life my concerns about mismatched datums are valid.
2. how to implement without harming performance too much

For 2, the main thing to avoid is reading files unnecessarily. And the next most important thing is to avoid doing coordinate transforms unnecessarily.

I'm trying something to exclude datasets where the point is outside the dataset bounds, and storing those bounds in wgs84 to avoid unneeded projection transforms. In the commit above, the user specifies the bounds, but it would be better to do this automatically.

The other approach would be to build a spatial index in wgs84 of all tiles for all datasets and use that to find the valid tiles. That would probably be more efficient, but it would massively increase startup time.
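As a rough illustration of the bounds idea (not the code from the commit), a Python sketch might keep one wgs84 bounding box per dataset and drop datasets that cannot contain the point before any file is opened. The class and function names here are hypothetical, and the sketch ignores bounding boxes that cross the antimeridian.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Wgs84Bounds:
    """Hypothetical axis-aligned bounding box in wgs84 degrees."""

    min_lat: float
    min_lon: float
    max_lat: float
    max_lon: float

    def contains(self, lat: float, lon: float) -> bool:
        return (
            self.min_lat <= lat <= self.max_lat
            and self.min_lon <= lon <= self.max_lon
        )


def candidate_datasets(
    bounds_by_name: Dict[str, Wgs84Bounds], lat: float, lon: float
) -> List[str]:
    """Keep only datasets whose bounds contain the point.

    Dict insertion order is preserved, so a priority ordering survives.
    """
    return [
        name for name, bounds in bounds_by_name.items() if bounds.contains(lat, lon)
    ]
```

Because the check happens entirely in wgs84, no projection transform or file read is needed for datasets that are filtered out.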

janusw commented 3 years ago

> Yeah it's now my most requested feature!

No surprise to me :)

Maybe it would be useful if your other requesters join this discussion and explain their use case ...?

> There's 2 things I want to check:
> 1. whether in real life my concerns about mismatched datums are valid.

Well, that strongly depends on which data you are combining, I guess. Some may be compatible with each other, others may not. If you want to offer such a feature, you cannot avoid this problem completely. To a certain extent, it's simply up to the user whether he uses this feature in a reasonable way.

One thing that I do not like about your approach is that it's less flexible than what I proposed in this issue: Every combination of data needs explicit support by the server.

That may be a good way to keep users from combining incompatible data sets, but just consider a scenario like this: There are three (or more) 'compatible' data sets, and the server contains a 'multi-data' target that contains all of them. But one user may only need data sets 1 and 2, another may need 2 and 3, etc. So you'd have to add all those combinations to the server separately, they all need a descriptive name, and it all becomes rather confusing and awkward. Or the users have to live with the performance penalty of always querying all three data sets.

IMHO, it would be much more elegant to let the user choose directly which data sets he wants, as proposed above, e.g. via a REST API call like: /v2/elevation?locations=lat,lon&data=A,B,C

> 2. how to implement without harming performance too much

Well, in fact a way to query multiple data sets in one call is already a performance improvement over querying the same data sets in several calls. Therefore I think any further performance improvements should be the subject of a separate issue, so as not to complicate the discussion here more than necessary.

janusw commented 3 years ago

> As an alternative, multi-data queries could be interpreted to return the elevation values from all three data sets at once,
>
> @ajnisbet: It seems you have implemented in a960404 some sort of multi-dataset support, which at first glance looks a lot like the alternative mentioned above. Could you please comment on that?

So, I have tested your implementation by now, and (contrary to my previous assumption, which was just based on a quick skim over the patch) it seems that it does not return the data for all datasets defined in the multi-data target, but instead gives only one value (where the child_datasets in the config define the priority of the datasets).

That's essentially the second option described in my original proposal:

"Give me the elevation data for a particular point. I'd prefer to get it from data set A, but if this one does not have it, data set B is also fine. If all fails, data set C is ok as well."

So, yeah, it's actually perfectly suited for my use-case, and it seems to work very solidly (judging by the light testing I have done so far).

Thanks a lot for taking up the suggestion! IMHO the current dev branch would be fully eligible to be promoted to a 1.5.0 release, with this great new feature (and the previous bugfixes!) 👍

janusw commented 3 years ago

From my side, the only thing that's missing here is a bit of documentation (I already figured out how to use the feature by myself, but in general and for other users it might be important to document that this feature even exists, and how to use it).

Once this is done, the issue can be closed IMHO :)

ajnisbet commented 3 years ago

Multiple datasets now live on the public api!

Both server-side and client-side multi-dataset specification is supported.

Some documentation here: https://www.opentopodata.org/notes/multiple-datasets/
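For the client-side form, a request URL can be assembled along the lines of this Python sketch. It follows the `/v1/nzdem,srtm30m` pattern mentioned earlier in the thread; the host name in `BASE_URL` and the helper name `multi_dataset_url` are assumptions for illustration, so check the linked documentation for the authoritative format.

```python
from typing import Sequence
from urllib.parse import urlencode

# Assumed host for the public API; replace with your own server if self-hosting.
BASE_URL = "https://api.opentopodata.org"


def multi_dataset_url(
    datasets: Sequence[str], lat: float, lon: float, base_url: str = BASE_URL
) -> str:
    """Build a URL that lists several datasets in priority order."""
    if not datasets:
        raise ValueError("at least one dataset name is required")
    dataset_path = ",".join(datasets)
    # Keep the comma between lat and lon readable instead of percent-encoding it.
    query = urlencode({"locations": f"{lat},{lon}"}, safe=",")
    return f"{base_url}/v1/{dataset_path}?{query}"
```

The function only builds the string; sending the request and handling the response is left to whatever HTTP client is already in use.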

ajnisbet / opentopodata

API: add support for querying combinations of data sets #23